December 17, 2024
Limits on data scraping—Terrible news for social media research and platform accountability?
Early last year, following Elon Musk’s takeover, X (formerly Twitter) introduced paid API access tiers, dealing a deathblow to its long-standing role as a uniquely valuable resource for academic research.
APIs, or Application Programming Interfaces, act as bridges that connect two software applications, enabling data exchange. For instance, an API can be used to request data from X and store it in a structured format such as a CSV file. With the move to paid tiers, the previously free official Twitter/X API was discontinued, meaning data can no longer be collected from X at no cost.
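To make that workflow concrete, the sketch below shows the request-to-CSV pattern described above. It assumes a bearer token issued under one of X's access tiers and the X API v2 recent-search endpoint; the query, fields, and file name are purely illustrative.

```python
# A minimal sketch of the API-to-CSV workflow, assuming a bearer token and
# the X API v2 recent-search endpoint; quotas depend on the access tier.
import csv
import requests

BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # placeholder: issued with an API tier
SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"


def fetch_posts(query: str, max_results: int = 10) -> list[dict]:
    """Request recent posts matching a query and return them as dicts."""
    response = requests.get(
        SEARCH_URL,
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
        params={
            "query": query,
            "max_results": max_results,
            "tweet.fields": "created_at,author_id",
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("data", [])


def save_to_csv(posts: list[dict], path: str = "posts.csv") -> None:
    """Store the structured response in a CSV file for later analysis."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["id", "author_id", "created_at", "text"]
        )
        writer.writeheader()
        for post in posts:
            writer.writerow({k: post.get(k, "") for k in writer.fieldnames})


if __name__ == "__main__":
    save_to_csv(fetch_posts("public health"))
```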
This measure, alongside the introduction of ‘rate limits’ (caps on the number of tweets users could view per day), was ostensibly taken to prevent what Musk described in a July 2023 tweet as “...extreme levels of data scraping & system manipulation…”, which he claimed were slowing down the site and degrading the user experience. However, Musk provided no further evidence to support the claim of “extreme levels of data scraping”. X users pointed out that if such scraping did exist, it would likely show up as anomalous spikes in requests, making it straightforward to identify and ban the accounts requesting abnormally large amounts of data.
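As a rough illustration of the kind of detection those users alluded to, the sketch below flags accounts whose hourly request counts are statistical outliers. The log format, the z-score threshold, and the example data are assumptions made for illustration; they do not describe X's actual monitoring pipeline.

```python
# Flag accounts whose hourly request volume is far above the platform norm.
# Illustrative only: log format and threshold are assumptions.
from collections import Counter
from statistics import mean, stdev


def flag_request_spikes(request_log, z_threshold=3.0):
    """request_log: iterable of (account_id, hour_bucket) pairs, one per request.
    Returns account ids whose hourly request count is a statistical outlier."""
    hourly = Counter(request_log)  # requests per (account, hour) bucket
    counts = list(hourly.values())
    if len(counts) < 2:
        return set()
    mu, sigma = mean(counts), stdev(counts)
    if sigma == 0:
        return set()
    return {
        account
        for (account, hour), n in hourly.items()
        if (n - mu) / sigma > z_threshold
    }


if __name__ == "__main__":
    # 100 typical accounts making 20 requests in an hour, one making 5,000.
    log = [(f"user_{i}", "2023-07-01T10") for i in range(100) for _ in range(20)]
    log += [("scraper_bot", "2023-07-01T10")] * 5000
    print(flag_request_spikes(log))  # expected: {'scraper_bot'}
```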
Unreasonably priced paid API tiers
Nevertheless, Musk’s crackdown on data scraping continued, and there are now four API tiers: free, basic, pro, and enterprise. The free tier allows 100 reads per month, a negligible amount for researchers and academics looking to analyse posts on X for meaningful insights, patterns, and key data points. The most affordable paid tier offers only around “0.3 percent of [data] [third parties] previously had free access to in a single day”, while enterprise tier prices run upwards of $42,000 per month. It is also impossible to read posts without being logged into X.
X’s crackdowns on data scraping
To enforce these measures, X has regularly cracked down on third-party API access, blocking IP addresses and mass-suspending developer accounts, and has brought costly lawsuits against non-profits whose research uses data scraped from X without authorised access. While one might believe Musk’s policies aim to combat misuse and preserve platform integrity, the fact that X, a year after introducing paid API tiers, amended its terms of service to permit training its generative AI model on user posts without their consent makes such intentions harder to trust.
The value of data scraping to social media research
These recent and ongoing changes to X’s policies have created a worrying landscape for data accessibility in academic research. Academic researchers have long relied on X’s open model and public data to produce valuable medical research, including enriching datasets for cancer treatment methods and tracking mental health trends during public health emergencies such as COVID-19.
In addition to X, data scraped from other platforms has been used to collect evidence for studies that have benefited society, such as uncovering illegal markets for adopted children on Yahoo bulletin boards, revealing overlaps between law enforcement and extremist Facebook groups, and identifying problems with TikTok’s algorithm targeting youth. From a digital rights perspective, this model has improved understanding of the societal implications of harmful online content and given the rights community an opportunity to share policy recommendations. Without it, tracking social media companies’ adherence to their own content moderation rules for regulating harmful content becomes increasingly challenging.
To the dismay of researchers, X is not the only platform that restricts access to its API: Facebook, Instagram, and LinkedIn have long restricted access to user data.
The legal debate around scraping public data
Legally, social media companies like X and LinkedIn have come under fire for bringing claims against the scraping of public data on their platforms. In X Corp. v. Bright Data Ltd. (2024), X Corp asserted breach-of-contract and tort claims against Bright Data, a data scraping company, seeking to prevent it from extracting and copying public data from X and from selling tools that enable users to do the same.
The case was ultimately dismissed because none of the claims passed muster. In its ruling, the U.S. district court quoted the Ninth Circuit’s 2022 judgement in litigation between LinkedIn and hiQ Labs, a data analytics company that scraped publicly available LinkedIn profiles:
…giving social media companies ‘free rein to decide, on any basis, who can collect and use data - data that the companies do not own, that they otherwise make publicly available to viewers, and that the companies themselves collect and use - risks the possible creation of information monopolies that would disserve the public interest.’
HiQ Labs, Inc. v. LinkedIn Corp., 31 F.4th 1180, 1202 (9th Cir. 2022) qtd. in X Corp. v. Bright Data Ltd., C 23-03698 WHA, 2 (N.D. Cal. May. 9, 2024)
These recent court rulings highlight the risks posed to the public interest by the arbitrary crackdown on data scraping. They underscore that social media platforms hold “arm’s length”, non-exclusive rights to user data and have no written exclusive copyright licence over users’ works. Users hold the sole exclusive rights to their data, and only owners of exclusive rights can seek protection and remedies from courts. Litigation brought by X against the scraping of public data is therefore an overreach of its rights, as explained in the X Corp. v. Bright Data judgement:
…invoking state contract and tort law, X Corp. would entrench its own private copyright system that rivals, even conflicts with, the actual copyright system enacted by Congress. X Corp. would yank into its private domain and hold for sale information open to all, exercising a copyright owner's right to exclude where it has no such right.
X Corp. v. Bright Data Ltd., C 23-03698 WHA, 2 (N.D. Cal. May. 9, 2024)
Tools that bypass data scraping restrictions
Irrespective of legal woes and crackdowns, developers have started building third-party tools that allow users, including researchers, to ‘informally’ scrape data from social media platforms.
These tools sidestep the need for official API access by mimicking human browsing behavior to extract data directly from web pages (see the sketch below). Since many of them offer low-code and no-code solutions, they open up a practice that was previously limited to programmers with at least an intermediate grasp of Python or other programming languages; the layperson or researcher now also has a wealth of data at their fingertips. Additionally, in the case of X, some of these tools provide access to historic data going back to 2006, which even the official X API does not offer. At the same time, their unofficial status leaves them at constant risk of bans, and open to litigation, making them unreliable options for long-term research.
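For illustration, below is a minimal sketch of the browser-automation approach such tools typically rely on: drive a real (headless) browser, render a public page the way a human visitor would, and read content out of the rendered HTML. The URL and CSS selector are placeholders, and real tools layer pacing, session handling, and pagination on top of this.

```python
# A minimal sketch of browser-automation scraping using Playwright.
# Placeholder URL and selector; requires `pip install playwright`
# followed by `playwright install`.
from playwright.sync_api import sync_playwright


def scrape_public_page(url: str, post_selector: str) -> list[str]:
    """Render a public page in a headless browser and return visible post text."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        posts = [el.inner_text() for el in page.query_selector_all(post_selector)]
        browser.close()
    return posts


# Example with placeholder values:
# scrape_public_page("https://example.com/some-public-profile", "article")
```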
How platforms can address data scraping responsibly
As such, the most practical, reliable, and legal option would be for platforms to fulfil their duty to offer researchers above-board methods of continuing the essential work of knowledge production. Blanket bans on data scraping are not the answer; the context and intent behind each instance of scraping must be considered. It is true that the open web risks being ‘spidered’ by massive web crawlers that consume enormous bandwidth, overload servers, and are increasingly used to train Large Language Models (LLMs) without the consent of users or website owners. This is, and should be, a genuine cause of concern for tech companies and other stakeholders. At the same time, as discussed above, scraping specific sites and pages has time and again proven useful, valuable, and in need of protection.
For skeptics, then, the problem is not data scraping itself, but platforms evading accountability when it comes to distinguishing between use cases. Legitimate, positive use cases can be identified by looking at research objectives, the scale of data collection, and the safeguards in place to protect data and sites. By giving researchers authorised, free access for these legitimate use cases, platforms can reduce unofficial scraping on their sites while also upholding their duties towards the public good. Putting significant amounts of data behind unbreachable paywalls, regardless of its intended use, represents a fundamental failure of platform accountability in supporting essential research and knowledge production.
By Sara Imran, Research Associate, Digital Rights Foundation