Just Another Day on Twitter: Maybe one of the last large Twitter datasets

This blog post discusses a recent large-scale Twitter dataset and its potential applications in research. The dataset consists of over 375 million Tweets collected over 24 hours in 2022 by 80 scholars. The data provides a valuable reference point for future research projects and presents exciting research opportunities, including examining harmful online content and identifying cultural differences. The full dataset is hosted in the GESIS Archive, and potential API limitations could restrict access to it in the future.

DOI: 10.34879/gesisblog.2023.65

Twitter has become the “Drosophila melanogaster” of the social sciences, one of the best-studied model organisms (systems) for investigating social phenomena.¹ Elon Musk’s turbulent and controversial acquisition of the social media platform has brought many changes. These changes affect not only ordinary users and how they perceive and interact on the platform, but also researchers. Twitter was known for its academic-friendly Application Programming Interface (API), which enabled easy and free access to the Tweet stream and historical Tweets, but recent announcements indicate that this will change fundamentally. When this blog post was written, Twitter had just announced a monthly fee of $100 for API access, and it remained to be seen how much access that fee would grant.²

The primary objective of this blog post is to emphasize the critical role that social media datasets play in research and to examine the possible consequences of data access restrictions imposed by the platforms for both current and future investigations. To illustrate this point, I will use the example of one of our most recent large-scale Twitter datasets that sought to provide an accurate representation of a full day (24 hours) on the platform.³

What are the characteristics of the dataset?

The dataset consists of over 375 million Tweets posted on the Twitter platform between September 20 and September 21, 2022. A group of 80 scholars coordinated to collect this massive amount of data.

Figure 1: Number of Tweets per minute over the collection period for the 24h dataset. Tweet bursts are apparent in the first minute of each hour.
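
Per-minute counts like those in Figure 1 can be approximated from Tweet IDs alone. The following is a minimal sketch, assuming that the IDs follow Twitter's Snowflake scheme, in which the bits above the lowest 22 hold the creation time in milliseconds relative to a fixed epoch. The function names are illustrative and not part of the dataset or its tooling.

```python
from collections import Counter
from datetime import datetime, timezone

# Assumption: Tweet IDs are Snowflake IDs with this epoch offset (in ms).
TWITTER_EPOCH_MS = 1288834974657


def tweet_id_to_datetime(tweet_id: int) -> datetime:
    """Recover the (UTC) creation time encoded in a Snowflake Tweet ID."""
    ms = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


def tweets_per_minute(tweet_ids) -> Counter:
    """Count Tweets per minute, keyed by the start of each minute (UTC)."""
    counts = Counter()
    for tweet_id in tweet_ids:
        created = tweet_id_to_datetime(int(tweet_id))
        counts[created.replace(second=0, microsecond=0)] += 1
    return counts
```

Sorting the resulting counter by key gives a time series that can be plotted directly. Tweets posted before the Snowflake scheme was introduced would not decode correctly with this approach.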

Why is it necessary to have 24 hours of data?

There is a plethora of datasets focusing on different aspects of Twitter as a “digital town square”, so one could ask why this new dataset is so vital for the research community. Although Twitter was, at least until recently, lauded for its permissive data collection policy, various potential biases originating from the API’s opaque sampling strategy have been observed.⁴ With this new data collection, we finally have a point of reference for estimating the quality and potential biases of other datasets and for answering questions of representativeness, access, and censorship.
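
To make the idea of a point of reference concrete, here is a small illustrative sketch of one possible comparison. It is not the authors' method. It assumes each Tweet in a sampled dataset and in the reference day carries some categorical label (for example a language code), and it measures how far the sample's label distribution deviates from the reference using the total variation distance. All names and the example labels are hypothetical.

```python
from collections import Counter


def shares(labels):
    """Return the relative frequency of each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot compute shares of an empty collection")
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(sample_labels, reference_labels):
    """0.0 means identical distributions, 1.0 means no overlap at all."""
    p = shares(sample_labels)
    q = shares(reference_labels)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


# Hypothetical example: a sample that over-represents English Tweets.
reference = ["en"] * 50 + ["ja"] * 30 + ["es"] * 20
sample = ["en"] * 70 + ["ja"] * 20 + ["es"] * 10
distance = total_variation_distance(sample, reference)
```

A value near zero would suggest that the sample mirrors the reference day for the chosen label, while larger values point to a skew worth investigating.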

Is the data complete?

Yes and no. We were interested in a stable representation of Twitter. In the first minutes after posting, fresh Tweets are often deleted, either by the platform for Terms of Service violations or by users who want to correct misspellings. We therefore gathered each Tweet 10 minutes after it was posted, which means that Tweets removed within that window are not part of the dataset.

What are the possible applications of the data?

The dataset, capturing a snapshot of a single day in 2022, has the potential to serve as a valuable point of reference for future research projects. Researchers can use it to track changes on the platform over time, especially since Elon Musk’s acquisition of the platform and subsequent modifications. In addition, the data itself presents exciting research opportunities, such as exploring and approximating the prevalence of harmful online content on the platform, identifying cultural and cross-lingual differences, and examining the contexts in which they occur. These are just a few of the many fascinating aspects that can be studied using the dataset.

Where is the data located?

The full dataset is hosted in the GESIS Archive.⁵ We share only the corresponding Tweet IDs to avoid violating Twitter’s Terms of Service. Currently, the IDs can be used to re-collect all or a subset of the Tweets via the Twitter API (we call this method rehydration).
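
Rehydration can be sketched roughly as follows. This is a minimal illustration, assuming the Twitter API v2 Tweet lookup endpoint, which at the time of writing accepted up to 100 comma-separated IDs per request and a bearer token for authentication. The endpoint, the limit, and the requested fields are assumptions that may no longer hold. The helper names are hypothetical. The code only builds the requests and does not send them.

```python
import urllib.parse
import urllib.request
from typing import Iterable, Iterator, List

# Assumptions: v2 Tweet lookup endpoint and its per-request ID limit.
LOOKUP_URL = "https://api.twitter.com/2/tweets"
MAX_IDS_PER_REQUEST = 100


def batched(ids: Iterable[str], size: int = MAX_IDS_PER_REQUEST) -> Iterator[List[str]]:
    """Split a stream of Tweet IDs into lists of at most `size` IDs."""
    batch: List[str] = []
    for tweet_id in ids:
        batch.append(str(tweet_id))
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def build_lookup_request(batch: List[str], bearer_token: str) -> urllib.request.Request:
    """Build (but do not send) one lookup request for a batch of IDs."""
    query = urllib.parse.urlencode(
        {"ids": ",".join(batch), "tweet.fields": "created_at,lang"}
    )
    return urllib.request.Request(
        f"{LOOKUP_URL}?{query}",
        headers={"Authorization": f"Bearer {bearer_token}"},
    )
```

Each request could then be sent with `urllib.request.urlopen` and the JSON response parsed, with pauses between requests to respect rate limits. Tweets that have since been deleted or made private would not be returned, so a rehydrated copy may be smaller than the original collection.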

What happens if Twitter now restricts API access?

If Twitter restricts API access, this will significantly impede other researchers who want to use our dataset. Because rehydrating Tweets would no longer be possible, the only remaining way to obtain the data would be to request a copy of the full dataset, not just the Tweet IDs, from those who originally collected it. However, sharing full datasets would violate Twitter’s Terms of Service, which creates a significant obstacle for researchers seeking access to the data.

Future Directions

For many scholars who have used Twitter as a model organism/system in the past, this is a turning point. Over the past decade, social media data has become an exceptional resource that supplements traditional survey data. Computational Social Science research depends largely on the availability of such data to investigate and understand harmful online communication, such as the proliferation of hate speech and the spread of misinformation and disinformation campaigns. With the introduction of API restrictions, researchers are hindered in conducting new research on unexplored datasets and limited in replicating experiments from previous studies. This ultimately exacerbates the existing skepticism toward social media analytics.⁶ At the same time, we as a research community can and should treat this situation as an opportunity to talk about our problematic dependence on the Twitter platform and to work towards solutions and suitable alternatives.

References

¹ Tufekci, Z. (2014, May). Big questions for social media big data: Representativeness, validity and other methodological pitfalls. In Eighth international AAAI conference on weblogs and social media.
² https://www.nature.com/articles/d41586-023-00460-z
³ Pfeffer, J., Matter, D., Jaidka, K., Varol, O., Mashhadi, A., Lasser, J., … & Morstatter, F. (2023). Just Another Day on Twitter: A Complete 24 Hours of Twitter Data. arXiv preprint arXiv:2301.11429.
⁴ Olteanu, A., Castillo, C., Diaz, F., & Kıcıman, E. (2019). Social data: Biases, methodological pitfalls, and ethical boundaries. Frontiers in Big Data, 2, 13.
⁵ https://doi.org/10.7802/2516
⁶ Assenmacher, D., Weber, D., Preuss, M., Calero Valdez, A., Bradshaw, A., Ross, B., … & Grimme, C. (2021). Benchmarking crisis in social media analytics: a solution for the data-sharing problem. Social Science Computer Review.