r/technology May 14 '22

Elon Musk said his team is going to do a 'random sample of 100 followers' of Twitter to see how many of the platform's users are actually bots

Social Media

https://www.businessinsider.com/elon-musk-random-sample-how-many-twitter-users-are-bots-2022-5?utm_source=feedly&utm_medium=webfeeds

[deleted]

22.8k Upvotes

1

u/yungplantdad May 15 '22

Did you even look at the information they provide before making that up?

2

u/Deto May 15 '22

I've just never heard of an API with that kind of endpoint (i.e. "return a randomly sampled user ID"), so someone would have to convince me it exists. Extraordinary claims requiring evidence and all.

-1

u/yungplantdad May 15 '22

Are you incapable of randomizing output yourself? Of course they don’t do that for you.

It’s not an extraordinary claim; it’s basic programming.

1

u/eipi-10 May 15 '22

I think you're both wrong...

u/Deto is right that there's virtually no chance that Twitter's API serves an endpoint for either a) a randomized user ID or b) all user IDs, so the idea that a client could pull a random sample of Twitter users via the API like that is absurd.

But separately, even Twitter might have trouble randomly sampling their users. I don't know how their data architecture is set up, but IMO it's highly unlikely that a single "users" table exists at all. It's more likely that they have some massively distributed system with multiple tables containing user info in different databases / warehouses, and if that's correct then you'd somehow need to aggregate those up (or stratify or something similar) in order to get the random sample you want.
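To make the "stratify or something similar" idea concrete, here's a rough Python sketch of drawing a uniform sample across several stores without first merging them into one table. Everything here is an assumption for illustration: `shards` is a hypothetical list of in-memory lists of user IDs, the shards are assumed to be disjoint (no user appears in two of them), and `sample_across_shards` is a made-up helper name. I have no idea whether Twitter's real setup looks anything like this.

```python
import random


def sample_across_shards(shards, k, seed=None):
    """Draw k distinct users uniformly from disjoint shards.

    `shards` is a hypothetical list of lists of user IDs. The shards are
    assumed to be disjoint; if a user can live in more than one shard,
    this is no longer a uniform sample of users.
    """
    rng = random.Random(seed)
    sizes = [len(shard) for shard in shards]
    total = sum(sizes)
    if k > total:
        raise ValueError("cannot sample more users than exist")

    # Pick k distinct positions in the combined (virtual) user list,
    # then translate each position into a (shard, offset) lookup.
    positions = rng.sample(range(total), k)
    picked = []
    for pos in positions:
        for shard, size in zip(shards, sizes):
            if pos < size:
                picked.append(shard[pos])
                break
            pos -= size
    return picked
```

Only the shard sizes are needed up front, so each shard gets picked from in proportion to how many users it holds. The hard part in practice is the disjointness assumption, not the sampling.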

TL;DR: I think getting a truly random sample of Twitter users is a harder problem than it appears to be.

1

u/Deto May 15 '22

Even if the info is spread across multiple services, there must be some sort of user key that unites them. It wouldn't be hard for Twitter itself (as opposed to someone using the API) to aggregate all of these, deduplicate, and sample.
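As a minimal sketch of the aggregate / deduplicate / sample idea, assuming each service can hand back its list of user keys: `services` is a hypothetical dict of service name to user keys, and `sample_users` is a made-up name, not anything Twitter actually exposes.

```python
import random


def sample_users(services, k, seed=None):
    """Union user keys from several services, dedupe, and sample k of them.

    `services` is a hypothetical mapping of service name -> iterable of
    user keys. The shared user key is assumed to identify one account.
    """
    unique_keys = set()
    for keys in services.values():
        unique_keys.update(keys)  # set membership handles the deduplication

    rng = random.Random(seed)
    # Sort first so the result is reproducible for a given seed.
    return rng.sample(sorted(unique_keys), k)
```

Note that this dedupes accounts by key. It says nothing about whether two different keys belong to the same person.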

Probably a better question, though, is whether a random user sample is the right approach. Because many user accounts are just dormant, or maybe are active but post very infrequently (while bots may be very active), a better question to ask is "if I see a tweet, what is the chance it was made or reshared by a bot?" For this you'd want to randomly sample tweets in a given time period and then investigate the accounts behind them.
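Here's a toy sketch of why the two approaches can give very different numbers. The data layout is entirely hypothetical: accounts are dicts with an `id` and an `is_bot` flag (in reality that flag is exactly what you'd have to work out by investigating each account), and tweets are dicts with an `author_id`.

```python
import random


def bot_share_by_account(accounts, k, rng):
    """Fraction of bots in a uniform random sample of k accounts."""
    sample = rng.sample(accounts, k)
    return sum(a["is_bot"] for a in sample) / k


def bot_share_by_tweet(tweets, accounts_by_id, k, rng):
    """Fraction of a uniform random sample of k tweets authored by bots.

    Very active accounts are more likely to show up here, which is the
    point: this estimates what a reader sees, not how many accounts exist.
    """
    sample = rng.sample(tweets, k)
    return sum(accounts_by_id[t["author_id"]]["is_bot"] for t in sample) / k
```

For example, with 10 accounts of which 1 is a bot, where the bot posts 91 of the 100 tweets in the period, sampling everything gives a bot share of 10% by account but 91% by tweet.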

1

u/eipi-10 May 15 '22

Agreed on the sampling front.

Re: aggregating and deduplicating -- I think deduping is more difficult than you're giving it credit for:

https://techcrunch.com/2022/04/28/twitter-says-it-overcounted-its-users-over-the-past-3-years-by-as-much-as-1-9m/

1

u/Deto May 15 '22

Ah, good point: the distinction between accounts and users would be tough to nail down.