Cybersixgill recently observed a member of a popular cybercrime forum advertising the data of over 2.6 million users of the Duolingo app. The personal information was harvested using an exposed API that allows anyone to retrieve users’ profile data and confirm whether email addresses are associated with Duolingo accounts. In addition to the ad for the Duolingo data, Cybersixgill also detected the same threat actor selling leaked data of other well-known platforms.
In August 2023, Cybersixgill observed a member of a leading English-language cybercrime forum advertising user data scraped from Duolingo, one of the world’s top language-learning platforms. The data allegedly relates to 2,698,306 Duolingo users and contains logins, names, email addresses, and other internal information related to the platform. While some of the data is public, other profile information is private and could be used in phishing attacks or combined with additional data in identity theft and financial fraud schemes.
According to open-source intelligence (OSINT) news reports, threat actors scraped the leaked Duolingo data using an exposed application programming interface (API) that circulated in March 2023 on the clear web. While the API has legitimate research uses, threat actors can also use it to (1) determine whether email addresses are associated with Duolingo accounts, and (2) retrieve JSON output that contains users’ public profile data.
Threat actors reportedly used the API to feed the scraper millions of email addresses, which were likely sourced from earlier breaches. After confirming that addresses corresponded with Duolingo accounts, threat actors used the emails to generate the dataset of public and non-public user information. According to an OSINT news source, a threat actor advised those seeking to leverage the data in phishing attacks to monitor specific fields in results that signal higher levels of permissions for certain Duolingo users, since these accounts could be used for further malicious operations, such as device compromise and malware deployment.
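The bulk lookups described above depend on a single client being able to submit very large numbers of queries. As a rough illustration of one common countermeasure, the Python sketch below throttles lookups per client with a sliding window. It is a minimal sketch under stated assumptions, not a description of Duolingo's systems: the class name LookupRateLimiter, the client_id key, and the limits are all hypothetical.

```python
import time
from collections import defaultdict, deque


class LookupRateLimiter:
    """Sliding-window limiter for a hypothetical account-lookup endpoint.

    All names and limits here are illustrative assumptions.
    """

    def __init__(self, max_requests=5, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps each client identifier to the timestamps of its recent lookups.
        self._hits = defaultdict(deque)

    def allow(self, client_id, now=None):
        """Return True if this client may perform another lookup right now."""
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A limiter like this would not stop a patient or distributed scraper on its own, but it raises the cost of testing millions of addresses from a small number of clients.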
In January 2023, Cybersixgill observed a different threat actor on a now-shuttered forum advertising the scraped data of 2.6 million Duolingo users for a price of $1,500. While Cybersixgill did not observe any overlaps linking the two threat actors selling Duolingo data, it is possible that (1) it is the same data set, and (2) it is the same threat actor.
Historically, corporations have downplayed the value of scraped data on the grounds that it is publicly accessible and ostensibly poses less of a risk. In reality, scraped data can be combined with private information sourced from other locations to launch phishing campaigns and other social engineering attacks. Scraped content is thus far from harmless, as evidenced by threat actors’ interest in it on the underground.
Cybersixgill collected the forum post (Figure 1) on which the threat actor first advertised the Duolingo dataset in August 2023. While the seller initially omitted the price for the data, the post was later updated with a price significantly lower than the one demanded by the January 2023 seller.
In the post below, the data’s description listed the same 34 fields used to advertise the January 2023 Duolingo data, which indicates that both sellers might have used the same API and scraper and suggests that these might be the same data set. That said, the two sellers used different accounts and different contact methods.
Figure 1: Duolingo data advertised on a cybercrime forum
Turning to the seller’s other activity on the forum, this member has advertised data sets from other major breaches. This includes a 400,000-entry dataset from a 2022 breach of a gaming support website advertised in the post below (Figure 2). According to the seller, up to 2.6 million users were affected by the breach, the same number of victims listed in the Duolingo breach. The price was $5,000 but appeared to be negotiable.
In addition to leaked data, the seller posted on the forum about other scams and financial fraud. In general, the seller appears to be a highly active English-speaking threat actor heavily involved in the forum. The threat actor’s profile and other activity suggest that the Duolingo content is authentic and may be the same data posted in January 2023. Indeed, threat actors frequently reshare and resell data from previous breaches on the underground.
Figure 2: The Duolingo seller advertises data from another breach
To reiterate, the type of publicly accessible data leaked from Duolingo can pose a significant threat when combined with private data from other sources. Specifically, this combination can be used for phishing campaigns and other social engineering attacks, which is why it is bought and sold by cybercriminals.
In view of the demand for such information on underground markets and forums, and the threat posed by phishing attacks, all organizations should instruct employees not to click on links or attachments in suspicious emails. Specifically, users should double-check email senders’ identities before opening attachments or clicking links. They should also remain vigilant with regard to misspelled URLs to avoid entering credentials into fraudulent websites. Finally, organizations should instruct personnel to exercise additional caution when using MFA codes for corporate services.
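The advice about misspelled URLs can be partly automated. The Python sketch below flags links whose domain closely resembles, but does not match, a trusted domain. It is a minimal illustration under stated assumptions: the TRUSTED_DOMAINS allowlist, the function names, and the 0.8 similarity threshold are hypothetical, and the two-label domain extraction is deliberately naive (it mishandles suffixes such as .co.uk).

```python
from difflib import SequenceMatcher
from urllib.parse import urlsplit

# Hypothetical allowlist; an organization would supply its own trusted domains.
TRUSTED_DOMAINS = {"duolingo.com"}


def hostname_of(url):
    """Return the lowercase hostname of a URL, or an empty string if absent."""
    return (urlsplit(url).hostname or "").lower()


def registered_domain(host):
    """Naive approximation: keep only the last two labels of the hostname."""
    return ".".join(host.split(".")[-2:])


def looks_like_typosquat(url, threshold=0.8):
    """Flag URLs whose domain is similar, but not identical, to a trusted one."""
    domain = registered_domain(hostname_of(url))
    if not domain or domain in TRUSTED_DOMAINS:
        return False
    # Compare against each trusted domain using a simple similarity ratio.
    return any(
        SequenceMatcher(None, domain, trusted).ratio() >= threshold
        for trusted in TRUSTED_DOMAINS
    )
```

Under these assumptions, a link to duoling0.com would be flagged while a link to www.duolingo.com would not. A check of this kind supplements, rather than replaces, user vigilance, since attackers can also use domains that bear no resemblance to the legitimate one.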
 Scraping involves extracting data from websites or online platforms, often through their APIs, to retrieve specific information, with the API returning the requested data in JSON, XML, or other formats.
 Duolingo is a U.S.-based educational technology company that provides learning apps and language certification, serving over 74 million users each month worldwide.
 An application programming interface (API) is a set of protocols and tools that enables communication between software systems. Applications and computer systems widely use these interfaces to exchange data, allowing external partners or online apps to access internal information when valid authentication credentials are presented.