Tools & Practices

Investigating data available on social media platforms 

Although social media is an important “channel” for publishing journalistic investigations, it is also a rich source of information for all kinds of journalistic research. At the Dataharvest 2025 conference, we learned useful ways in which journalists can gather data from platforms for their investigations. 

Social media is undoubtedly another way for journalistic investigations to reach a wide audience. However, it can also serve as a useful research tool. Investigative journalists have found ways to produce stories by examining the data they extract from online platforms. At Dataharvest 2025, the European Investigative Journalism Conference, in Mechelen, Belgium, data journalism professionals presented tips and tools for accessing the hidden world of social media.

  1. Determine which platform best suits your research needs 

Each medium provides a different type of information, which may in turn change the nature of the investigation. On platforms such as X, Facebook and Instagram, posts are built around content such as photos and videos. TikTok posts are exclusively in video format. In addition to the content of the posts themselves, these media can also be searched for accompanying descriptions, comments and hashtags, as well as the public information and profiles of users who post or interact with the content in question. Telegram is different in that it is primarily a messaging platform, so the information it provides is different. It includes channels, chats and groups, and is of interest because many of these belong to actors that may have been banned or restricted on other platforms, such as extremist organizations, far-right groups and disinformation networks. The data may come from publicly available content or from sources accessed within these different environments.

  2. How long will you monitor the platforms? 

The frequency of content monitoring depends on the subject of the investigation and the characteristics of each platform. During the session “Effective Investigations on Telegram for Journalists,” Sayyara Mammadova, a journalist and researcher at the Atlantic Council’s Digital Forensic Research Lab (DFRLab), mentioned that when research focuses on specific user profiles, the analysis of older data can provide a more comprehensive picture of the activity and profile of the person under investigation. On the other hand, if the goal is to record posts from a specific time period, such as the British elections, then daily monitoring of content is necessary for the duration of that period.

  3. Access to data 

Perhaps the most difficult part of the process is gaining access to the information necessary for the investigation. Over the years, platforms have significantly restricted API (Application Programming Interface) services, which were used to search for and store information and statistics related to user traffic and habits on social media. (Editor’s note: an API, or Application Programming Interface, is the interface through which a program sends requests to an application or platform and receives the requested data and results in response.)

One of the most significant changes was made to Twitter (now X), which previously allowed access to its data for academic research purposes. In February 2023, under Elon Musk, the platform began charging a fee for access to its data, which significantly limited the ability to conduct academic as well as journalistic research. For journalists in particular, the platform’s website indicates that, although data access is offered through subscription packages, the subscriptions available to journalists do not appear to be comparable to those for academics and marketing professionals. Businesses on Meta platforms, such as Instagram and Facebook, can also access data via API for a fee.

However, on some of these platforms there are alternative ways to access the data. In the session “An Investigative Method to Measure Content on TikTok,” data journalists Carmen Aguilar Garcia and Zeke Hunter-Green from The Guardian suggested that participants should not focus on writing their own code, but rather use code that developers and others make freely available. “Join forces with others who want to do the same thing you do. Because automatic data extraction programs (scrapers) are constantly ‘breaking’, repairs can take a long time. Scrapers that are readily available tend to be up to date and work most of the time.”

They used one such scraper for their own research. TikTok presents difficulties when it comes to “scraping”. “We are investigating a very obscure algorithm that can change every day without us knowing how,” Aguilar Garcia said. With the help of the TikTokApi library in Python and Playwright (a browser automation library that controls browsers via code), data can be extracted with a lower chance of being detected by the platform.

The session “Protests, TikTok, and more: analyzing images and videos with AI,” by Jonathan Soma, Knight Professor of Professional Practice in Data Journalism and Director of the Data Degree Program at Columbia University, was a follow-up to his presentation at this year’s NICAR conference in March in Minnesota, where he presented ways to “scrape” X.  

One way to do this is to save .har and .warc files, which record a user’s interactions with a website. The .har files store information about the network requests the browser makes and the responses the server returns. The .warc files store web content as it was originally served and are used for archiving web pages. These files can be used to extract data such as the content of web pages at the time of scraping, API responses, tweets, user profiles, links to photos and videos, and more.

For .har files: 

  • In Chrome, follow these steps: right-click –> Inspect –> Network –> Download (↓), which exports the recorded requests as a .har file 

For .warc files:  

  • Use a tool like the WARC Data Extractor to extract tweet content and metadata, as well as embedded .json files. 

After extracting the data, it can be stored or converted into more readable formats, such as .csv or .json, for analysis. 
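As a rough illustration of that conversion step, the sketch below pulls JSON responses out of a parsed .har file and writes them as CSV text. It relies only on the standard HAR layout (log, entries, request, response, content); the function names and the sample data are hypothetical and are not taken from Soma's session.

```python
import base64
import csv
import io
import json


def json_responses_from_har(har: dict) -> list[dict]:
    """Return one row per JSON response recorded in a parsed HAR file."""
    rows = []
    for entry in har.get("log", {}).get("entries", []):
        content = entry.get("response", {}).get("content", {})
        if "json" not in content.get("mimeType", ""):
            continue  # skip images, HTML, scripts and other non-JSON bodies
        body = content.get("text") or ""
        if content.get("encoding") == "base64":
            body = base64.b64decode(body).decode("utf-8", errors="replace")
        try:
            json.loads(body)  # keep only bodies that really parse as JSON
        except json.JSONDecodeError:
            continue
        rows.append({
            "started": entry.get("startedDateTime", ""),
            "url": entry.get("request", {}).get("url", ""),
            "body": body,
        })
    return rows


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize the rows as CSV text, ready to be written to disk."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["started", "url", "body"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hand-made HAR structure for illustration only (not real platform data).
sample_har = {
    "log": {
        "entries": [
            {
                "startedDateTime": "2025-05-01T10:00:00.000Z",
                "request": {"url": "https://example.org/api/posts"},
                "response": {"content": {"mimeType": "application/json",
                                         "text": '{"id": 1}'}},
            },
            {
                "startedDateTime": "2025-05-01T10:00:01.000Z",
                "request": {"url": "https://example.org/logo.png"},
                "response": {"content": {"mimeType": "image/png", "text": ""}},
            },
            {
                "startedDateTime": "2025-05-01T10:00:02.000Z",
                "request": {"url": "https://example.org/api/users"},
                "response": {"content": {
                    "mimeType": "application/json; charset=utf-8",
                    "encoding": "base64",
                    "text": base64.b64encode(b'{"id": 2}').decode("ascii"),
                }},
            },
        ]
    }
}

rows = json_responses_from_har(sample_har)
print(rows_to_csv(rows))
```

For a real capture, the dictionary would come from loading the saved file with json.load instead of the hand-made sample.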

For Instagram, Soma recommended using the Instaloader tool to download large amounts of data from different profiles. The program is easy to install and can output a profile’s data with only a few lines of code: 

 
pip install instaloader

# Import 'instaloader' and 'getpass' in a Jupyter notebook
# ('getpass' is part of Python's standard library and needs no installation)
import instaloader
from getpass import getpass

# Log in to your profile in order to have access to more data
username = 'profile_name'
password = getpass("Enter Instagram password: ")

# Define the 'instaloader' session and log in with the credentials above
ig = instaloader.Instaloader()
ig.login(username, password)

# Add the name of the profile whose data you want to access
insta_page = input("Enter the name of the Instagram page: ")

# 'True' if you only want to download the profile picture
# 'False' if you want to download the rest of the data
ig.download_profile(insta_page, profile_pic_only=False)

More information is available on Instaloader’s GitHub page.

As for Telegram, Mammadova presented a more optimistic picture. Telegram itself allows users to save the data contained in channels through its desktop app, by following this path: Settings –> Advanced –> Export Telegram data and selecting both the data type and the format in which it will be saved on the computer (.html or .json). Furthermore, specific chat conversations can be saved by selecting “Export chat history”.
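Once a chat has been exported as .json, the messages can be flattened into rows for analysis. The following is a minimal sketch that assumes the layout commonly seen in Telegram Desktop exports: a top-level "messages" list whose items carry "id", "type", "date", "from" and "text", where "text" is either a string or a list mixing strings and formatting entities. These field names are assumptions and may differ between app versions; the function names and sample data are hypothetical.

```python
def flatten_text(text) -> str:
    """Join a 'text' field that is either a string or a list of strings/entity dicts."""
    if isinstance(text, str):
        return text
    parts = []
    for piece in text:
        # Entity dicts (links, bold text, mentions...) keep their text under "text".
        parts.append(piece if isinstance(piece, str) else piece.get("text", ""))
    return "".join(parts)


def messages_from_export(export: dict) -> list[dict]:
    """Return one flat row per regular message in a parsed export."""
    rows = []
    for msg in export.get("messages", []):
        if msg.get("type") != "message":
            continue  # skip service events such as channel creation or pins
        rows.append({
            "id": msg.get("id"),
            "date": msg.get("date"),
            "sender": msg.get("from"),
            "text": flatten_text(msg.get("text", "")),
        })
    return rows


# Hand-made sample in the assumed export layout (not real channel data).
sample_export = {
    "name": "Example channel",
    "type": "public_channel",
    "messages": [
        {"id": 1, "type": "service", "date": "2025-05-01T09:00:00",
         "action": "create_channel"},
        {"id": 2, "type": "message", "date": "2025-05-01T09:05:00",
         "from": "Example channel", "text": "Plain text post"},
        {"id": 3, "type": "message", "date": "2025-05-01T09:10:00",
         "from": "Example channel",
         "text": ["Read more: ", {"type": "link", "text": "https://example.org"}]},
    ],
}

rows = messages_from_export(sample_export)
for row in rows:
    print(row["date"], row["sender"], row["text"])
```

With a real export, the dictionary would be loaded with json.load from the .json file that Telegram writes into the export folder.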

As an alternative, data can be stored using the Telegram Tracker tool developed and maintained by DFRLab researcher Esteban Ponce de León.  

  4. What if the data has been deleted? 

In the workshop “More than just the Wayback Machine: how to investigate deleted and archived content,” journalists Jasmine Jacot-Descombes and Jean Ludwig, from the Swiss newspaper Neue Zürcher Zeitung, who specialize in open source intelligence (OSINT) research, presented traditional ways in which one can access deleted profiles, posts and comments. The workshop presenters mentioned searching online archives such as the Wayback Machine, Ghost Archive, Cyber Detective and Archive Today.

Results will appear when you type the link of the deleted post into a browser’s search bar. A few adjustments to the link can change the number of results: 

Original link (example):  

https://www.instagram.com/profile_name/p/C_btAIKOW2z

  • Remove the profile name from the link of the post you are searching for: 

https://www.instagram.com/p/C_btAIKOW2z

  • Add a slash (/) at the end of the URL 

https://www.instagram.com/profile_name/p/C_btAIKOW2z/

  • Replace the trailing slash (/) with an asterisk (*) to show more versions/saves of the page 

https://www.instagram.com/profile_name/p/C_btAIKOW2z*

  • Put the address of the internet archive you are using in front of the link and add a second asterisk right after it, in the position where the archive expects the capture date 

https://web.archive.org/web/*/https://www.instagram.com/profile_name* 

  • To limit the results to a specific time period, the date can be added before that asterisk in the form YYYY (year) MM (month) DD (day); a partial date also works, so 202305* covers May 2023 

https://web.archive.org/web/202305*/https://www.instagram.com/profile_name*  
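The link variations listed above can also be generated with a few lines of Python. This is only a sketch that reproduces the example URLs from this article; the function names are hypothetical, and the date check simply assumes that the archive accepts a YYYY, YYYYMM or YYYYMMDD prefix.

```python
ARCHIVE_PREFIX = "https://web.archive.org/web"
INSTAGRAM = "https://www.instagram.com"


def post_url_variants(profile_name: str, shortcode: str) -> list[str]:
    """Return the link variations worth trying for a single Instagram post."""
    with_profile = f"{INSTAGRAM}/{profile_name}/p/{shortcode}"
    return [
        with_profile,                    # original link
        f"{INSTAGRAM}/p/{shortcode}",    # profile name removed
        with_profile + "/",              # trailing slash added
        with_profile + "*",              # wildcard instead of the slash
    ]


def wayback_listing(target: str, date_prefix: str = "") -> str:
    """Build a Wayback Machine listing URL, optionally limited by date.

    date_prefix is '' for all captures, or digits in YYYY, YYYYMM
    or YYYYMMDD form.
    """
    if date_prefix and not (date_prefix.isdigit() and len(date_prefix) in (4, 6, 8)):
        raise ValueError("date_prefix must be YYYY, YYYYMM or YYYYMMDD")
    return f"{ARCHIVE_PREFIX}/{date_prefix}*/{target}"


variants = post_url_variants("profile_name", "C_btAIKOW2z")
for url in variants:
    print(url)

profile_wildcard = f"{INSTAGRAM}/profile_name*"
print(wayback_listing(profile_wildcard))            # every archived capture
print(wayback_listing(profile_wildcard, "202305"))  # captures from May 2023
```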

Often, the information is not limited to the official accounts of the persons under investigation. Examining as many sources as possible in depth is important. As Jacot-Descombes and Ludwig point out, research can focus on an individual’s comments and interactions, which often remain visible and revealing, even when there is no access to their profile. 

An effective way is to run a query in the format “site:instagram.com profile_name” on Google, Bing and other search engines, each of which will return different results. This method can be used for any platform, such as YouTube, X, Facebook, etc. “The results on Google do not show the internet in real time,” says Ludwig. “These are cached versions, so you may find outdated information.” 
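Because each engine returns different results, it can help to prepare the same “site:” query for several of them at once. The sketch below only builds the search links; the engine URL patterns are assumptions based on their commonly used query addresses, and the function name is hypothetical.

```python
from urllib.parse import quote_plus

# Assumed query URL patterns for each engine; adjust if an engine changes its format.
SEARCH_ENGINES = {
    "google": "https://www.google.com/search?q=",
    "bing": "https://www.bing.com/search?q=",
    "duckduckgo": "https://duckduckgo.com/?q=",
}


def site_search_urls(domain: str, term: str) -> dict[str, str]:
    """Return one search link per engine for the query 'site:<domain> <term>'."""
    query = quote_plus(f"site:{domain} {term}")
    return {name: prefix + query for name, prefix in SEARCH_ENGINES.items()}


urls = site_search_urls("instagram.com", "profile_name")
for engine, url in urls.items():
    print(engine, url)
```

The same call works for other platforms by swapping the domain, for example youtube.com or x.com.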

Not all content on X has been archived, making it difficult to search for certain threads. Users often reply to long threads with the word “unroll”, asking bots to compile them, and this habit can prove useful for research purposes. Searching for the word “unroll” in the replies is therefore an effective method for locating threads that may not have been archived elsewhere. Tools such as Thread Reader can simplify this process.

In the case of Telegram, Mammadova and the Bellingcat research team, in its Bellingcat Toolkit, recommend the TGStat tool (from a Russia-based company), which facilitates both access to deleted data and research into channels linked to the persons under investigation. Although it is largely free to use, there have been concerns about the security of user data on the app.

The main image depicts the session “More than just the Wayback Machine: how to investigate deleted and archived content” and was created during Dataharvest 2025 by Pieter Fannes.

Translated by: Evita Lykou