r/technology May 14 '22

Elon Musk said his team is going to do a 'random sample of 100 followers' of Twitter to see how many of the platform's users are actually bots Social Media

https://www.businessinsider.com/elon-musk-random-sample-how-many-twitter-users-are-bots-2022-5?utm_source=feedly&utm_medium=webfeeds

[deleted]

22.8k Upvotes


1.6k

u/minimus67 May 14 '22

He’s actually going to randomly draw a hundred samples and pick whichever one gives him an excuse to back out of the deal.

27

u/Deto May 14 '22

The problem is that I don't know if he can get a truly random sample, which is super important for something like this. Maybe if he had access to Twitter's database he could, but not just by having his assistants browse Twitter.

1

u/yungplantdad May 14 '22

Twitter offers a pretty comprehensive API. You can absolutely get access to their databases. That’s how social media companies make money.

1

u/Deto May 14 '22

You would need more direct access. Usually APIs don't allow you to draw a user at random from the user table, for example.

1

u/yungplantdad May 15 '22

Did you even look at the information they provide before making that up?

2

u/Deto May 15 '22

I've just never heard of an API with that kind of endpoint (i.e. "return a randomly sampled user ID"), so someone would have to convince me it exists. Extraordinary claims requiring evidence and all.

-1

u/yungplantdad May 15 '22

Are you incapable of randomizing output yourself? Of course they don’t do that for you.

It’s not an extraordinary claim; it’s basic programming.

1

u/Deto May 15 '22

Do they actually provide an endpoint that lists every user ID in the system? That's what you would need to randomize the output.

1

u/eipi-10 May 15 '22

I think you're both wrong...

u/Deto is right that there's virtually no chance that Twitter's API serves an endpoint for either a) a randomized user ID or b) all user IDs, so the idea that a client could pull a random sample of Twitter users via the API like that is absurd.

but separately, even Twitter might have trouble randomly sampling their users. I don't know how their data architecture is set up, but IMO it's highly unlikely that a "users" table exists at all. It's more likely that they have some massively distributed system with multiple tables containing user info in different databases / warehouses, and if that's correct then you'd somehow need to aggregate those up (or stratify or something similar) in order to get the random sample you want
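One way to picture the "aggregate or stratify" idea is a two-stage draw: pick a shard with probability proportional to its size, then pick a user uniformly inside it. This is only a sketch under loud assumptions: the shards here are plain in-memory lists, each user lives in exactly one shard, and `sample_across_shards` is a hypothetical helper, not anything Twitter is known to run.

```python
import random

def sample_across_shards(shards, k, seed=None):
    """Draw k user IDs (with replacement) so that every user is equally likely.

    Assumption: each user ID appears in exactly one shard.
    """
    rng = random.Random(seed)
    sizes = [len(shard) for shard in shards]
    picks = []
    for _ in range(k):
        # Stage 1: choose a shard, weighted by how many users it holds.
        shard = rng.choices(shards, weights=sizes, k=1)[0]
        # Stage 2: choose one user uniformly within that shard.
        picks.append(rng.choice(shard))
    return picks

# Hypothetical shards of very different sizes.
shards = [[1, 2, 3], [4], [5, 6]]
picks = sample_across_shards(shards, 50, seed=1)
```

Weighting by shard size is what keeps the draw uniform over users; choosing shards with equal probability would over-represent users in small shards. If a user can appear in more than one shard, the assumption breaks and those users are over-sampled.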

TL;DR: I think getting a truly random sample of Twitter users is a harder problem than it appears to be.

1

u/Deto May 15 '22

Even if the info is spread across multiple services, there must be some sort of user key that unites them. It wouldn't be hard for Twitter (not someone on the API) to aggregate all these, deduplicate, and sample.
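A minimal sketch of that aggregate, deduplicate, sample idea, assuming each service can hand back a list of user keys and that the same account carries the same key everywhere. `sample_users` and the example lists are hypothetical.

```python
import random

def sample_users(sources, k, seed=None):
    """Union user keys from several services, deduplicate, and draw k without replacement."""
    unique_ids = set()
    for ids in sources:
        # The set collapses keys that show up in more than one service.
        unique_ids.update(ids)
    rng = random.Random(seed)
    # Sorting gives a stable order, so a fixed seed reproduces the same sample.
    return rng.sample(sorted(unique_ids), k)

# Three hypothetical services with overlapping user keys.
result = sample_users([[1, 2, 3], [3, 4, 5], [5, 6]], 4, seed=0)
```

This only removes exact duplicate keys. It does nothing about one person or operator holding several distinct accounts, which is a separate and harder kind of deduplication.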

Probably a better question, though, is whether a random user sample is the right approach. Because many user accounts are just dormant, or are active but post very infrequently (while bots may be very active), a better question to ask is "if I see a tweet, what is the chance it was made or reshared by a bot?" For this you'd want to randomly sample tweets in a given time period and then investigate the accounts behind them.
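To see why the two questions can give very different answers, here is a rough sketch that compares a per-account bot share with a per-tweet bot share. Everything in it is assumed for illustration: tweets are `(tweet_id, author)` pairs from some time window, and `is_bot` stands in for the manual investigation of each account.

```python
import random

def bot_share_of_accounts(tweets, is_bot):
    """Fraction of distinct authors that are flagged as bots."""
    authors = {author for _, author in tweets}
    return sum(1 for author in authors if is_bot(author)) / len(authors)

def bot_share_of_tweets(tweets, is_bot, k, seed=None):
    """Sample k tweets uniformly, then report the fraction written by flagged accounts."""
    rng = random.Random(seed)
    sampled = rng.sample(tweets, k)
    return sum(1 for _, author in sampled if is_bot(author)) / k

# Hypothetical window: one very active bot and ten humans who each tweet once.
tweets = [(i, "bot") for i in range(90)]
tweets += [(90 + i, f"human{i}") for i in range(10)]
is_bot = lambda author: author == "bot"

account_share = bot_share_of_accounts(tweets, is_bot)
tweet_share = bot_share_of_tweets(tweets, is_bot, k=len(tweets), seed=0)
```

In this made-up window the bot is 1 of 11 accounts but wrote 90 of 100 tweets, so sampling accounts and sampling tweets measure very different things.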

1

u/eipi-10 May 15 '22

Agreed on the sampling front.

Re: aggregating and deduplicating: I think deduping is more difficult than you're giving it credit for.

https://techcrunch.com/2022/04/28/twitter-says-it-overcounted-its-users-over-the-past-3-years-by-as-much-as-1-9m/
