*In this guest post, Avi Chawla, Founder of Daily Dose of Data Science and author of AIport, spotlights Yambda-5B – a rare, production-scale recommender dataset newly open to public research – and shows how it complements classic datasets like MovieLens, Amazon, and Spotify by addressing limitations in scale, modality, and evaluation.*

Recommender systems thrive on data. However, the data used in academic research often looks nothing like the data that fuels real-world recommenders, because production data sits locked inside companies due to both its business value and privacy concerns.

Yandex recently published its 5-billion-event dataset, Yambda-5B, on Hugging Face, making it publicly available to anyone working on recommender algorithms. That prompted me to put together a short overview of the recommender system datasets openly available to researchers and developers. Below are the most noteworthy datasets in this field.

Over the years, the RecSys community has relied on a handful of public datasets as benchmarks. Each has contributed to research progress, but each comes with limitations. For instance:

- **MovieLens:** Contains user-provided movie ratings (1–5 stars) with timestamps. Its small scope (~10k movies total) made it great for early studies, but it's not representative of industrial-scale catalogs.
- **Netflix Prize:** ~100M movie ratings from Netflix's 2006–09 recommendation challenge. Despite its historic role in advancing recommender research, it covers only ~17k movies and uses only coarse date timestamps. It's also a one-time snapshot from the mid-2000s with no updates.
- **Yelp Open:** 8.6M reviews of local businesses (restaurants, shops, etc.) by 2.2M users. It's useful for experiments, but the data is extremely sparse and limited to a few cities. No standard train/test split is provided, so researchers devise their own evaluation schemes.
- **Last.fm (LFM-1B):** Approximately 1B music listening events ("scrobbles") from the Last.fm online music service. It was once widely used for music recommendation research. However, due to licensing restrictions, LFM-1B (and the even larger LFM-2B version) is no longer publicly accessible.
- **Criteo 1TB:** A terabyte of ad click logs (over 4 billion interactions). This dataset reflects true industry scale and is used to train click-through-rate models. But it doesn't resemble typical recommendation data: it has no meaningful user or item metadata (only hashed IDs) and no timestamps.
- **Spotify Million Playlist:** 1 million user-generated playlists (~66 million track entries) released for the RecSys Challenge 2018. This dataset is excellent for studying short-term preferences and sequence modeling, but it doesn't include long-term user histories or any explicit feedback.
- **Amazon Reviews:** 200M+ product reviews from Amazon across many categories. This dataset is rich in content and has been used for product recommendation and sentiment analysis research. But it's extremely sparse and has a long-tail distribution, i.e., most users and products have only a few interactions.
Yambda-5B mitigates these challenges by offering researchers large-scale, anonymized data from Yandex's music streaming service, including features such as an is_organic flag on every interaction and a Global Temporal Split (GTS) evaluation protocol. Let's examine the problems this dataset can help solve.

**Problem 1: Lack of real-world datasets**

Modern internet platforms log billions of user interactions every year, far beyond the size of classic academic datasets. An algorithm that looks SOTA on a million-rating dataset might break or underperform when faced with a billion-event stream.

Yambda-5B contains 4.79 billion user-item interactions – orders of magnitude more data than MovieLens or Netflix. And despite being extremely large, the dataset is accessible to different research budgets: Yandex released multiple versions – a 50-million-interaction sample, a 500-million sample, and the full 5 billion – so you can start small and scale up as needed.

**Problem 2: Privacy**

Sharing real user behavior logs is tricky even if you anonymize user IDs, since people can sometimes be re-identified from just a few unique preferences. A famous example is the Netflix Prize dataset of 100 million movie ratings: it was released for a competition, but researchers showed it was possible to de-anonymize and identify individual users by matching their ratings with public IMDb reviews. Netflix even canceled a follow-up contest in 2010 after a privacy lawsuit highlighted these risks.

Yambda-5B differs in this respect: unlike Netflix ratings, its listening histories and likes are not publicly visible anywhere, which leaves no public reference data to match against and makes the dataset inherently resistant to de-anonymization. The data is also rigorously safeguarded, minimizing the risk of sensitive data exposure.

**More key features**

Importantly, Yambda includes both implicit feedback (song listens, skips) and explicit feedback (track "likes" or "dislikes"), so models can learn from both passive behavior and active preferences.

Each interaction is labeled with an is_organic flag indicating whether the play was an organic user action or was triggered by the recommendation engine. This lets researchers separate natural listening behavior from recommendation-driven behavior, which is crucial for evaluating algorithmic impact.

Unlike most older datasets, Yambda provides precise timestamps for all events and comes with a global temporal split for model evaluation: you train on earlier interactions and test on a held-out set of later interactions. Evaluating on this time-based split (rather than random hold-outs) gives a more realistic measure of how a model might perform in an online setting.

Another unique aspect is that Yambda is multi-modal: it ships with precomputed audio embeddings for over 7.7 million tracks, enabling content-aware recommendation strategies out of the box. The release even includes baseline models and evaluation code, with metrics like NDCG@K and Recall@K reported, to help researchers get started and compare methods on a standard benchmark.

**Conclusion**

Historically, we haven't had many large-scale open datasets, which made it challenging to benchmark algorithms intended for real-world use. Yandex's Yambda-5B is a significant step toward bridging that gap, offering a web-scale dataset that academia can freely explore.

If you're interested in exploring Yambda-5B yourself, it's available on Hugging Face here: Hugging Face Yambda dataset.
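To give you a concrete starting point, here is a minimal loading sketch using the Hugging Face `datasets` library. The repo id `yandex/yambda` matches the release, but the sample directory names, event file names, and column names (`uid`, `item_id`, `timestamp`, `is_organic`) below are assumptions based on the dataset card at the time of writing – verify them on the card before running.

```python
# A minimal sketch, assuming the layout described on the Yambda dataset card:
# interactions stored as Parquet files grouped by sample size (50m / 500m / 5b),
# with columns such as uid, item_id, timestamp, and is_organic.
from datasets import load_dataset

# Start with the 50M sample; swap "flat/50m" for "flat/500m" or "flat/5b" to scale up.
ds = load_dataset(
    "yandex/yambda",              # repo id from the Hugging Face release
    data_dir="flat/50m",          # assumed sample/layout name
    data_files="listens.parquet", # one of several event files (listens, likes, ...)
    split="train",
)

df = ds.to_pandas()
print(df.shape)

# Separate organic plays from recommendation-driven ones (assumed column name).
organic = df[df["is_organic"] == 1]
recommended = df[df["is_organic"] == 0]
print(f"organic: {len(organic)}, recommendation-driven: {len(recommended)}")
```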
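And since the release is evaluated with a Global Temporal Split and metrics like Recall@K and NDCG@K, here is a small, self-contained sketch of that protocol on a generic interaction table. This is not Yandex's baseline code – just an illustration of the idea, assuming a DataFrame with `uid`, `item_id`, and `timestamp` columns.

```python
import numpy as np
import pandas as pd

def global_temporal_split(df: pd.DataFrame, test_frac: float = 0.1):
    """Split at one global time boundary: every test event happens
    after every training event, mimicking online deployment."""
    cutoff = df["timestamp"].quantile(1.0 - test_frac)
    return df[df["timestamp"] <= cutoff], df[df["timestamp"] > cutoff]

def recall_at_k(recommended: list, relevant: set, k: int) -> float:
    """Fraction of a user's held-out items found in the top-k list."""
    if not relevant:
        return 0.0
    return len(set(recommended[:k]) & relevant) / len(relevant)

def ndcg_at_k(recommended: list, relevant: set, k: int) -> float:
    """Binary-relevance NDCG: each hit earns 1/log2(rank + 1),
    normalized by the score of an ideal ordering."""
    dcg = sum(1.0 / np.log2(i + 2)
              for i, item in enumerate(recommended[:k]) if item in relevant)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / idcg if idcg > 0 else 0.0

# Toy usage with made-up events (timestamps are arbitrary integers).
events = pd.DataFrame({
    "uid":       [1, 1, 1, 2, 2, 2],
    "item_id":   [10, 11, 12, 10, 13, 14],
    "timestamp": [1, 2, 9, 3, 4, 10],
})
train, test = global_temporal_split(events, test_frac=0.3)
relevant = set(test.loc[test["uid"] == 1, "item_id"])  # user 1's future items
print(recall_at_k([12, 99, 10], relevant, k=3))  # 1.0 - the held-out item is found
print(ndcg_at_k([12, 99, 10], relevant, k=3))    # 1.0 - and it is ranked first
```

The single global cutoff is the key design choice: unlike per-user leave-one-out splits, it guarantees no test interaction predates any training interaction, so the model never "sees the future" during training.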
With resources like this becoming available, we can move closer to recommender models that truly translate from paper to production.

Thanks for reading!

*This post was written by Avi Chawla, Founder of Daily Dose of Data Science and author of AIport, specially for Turing Post.*