A machine learning dataset is a collection of instances that share a common set of attributes. It can serve as a training set, which is fed into a machine learning algorithm to fit a model, or as a test set, which is used to evaluate the trained model.

Machine learning algorithms learn by identifying trends and relationships in the data they are given and then making predictions from them. Accurate training data is therefore a prerequisite for accurate model performance.
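
To make the training/test distinction concrete, here is a minimal sketch of splitting one dataset into the two subsets. The function name `split_dataset`, the 80/20 ratio, and the toy records are illustrative assumptions rather than anything prescribed by a particular library.

```python
import random

def split_dataset(instances, test_fraction=0.2, seed=0):
    """Shuffle a copy of the instances and return (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(instances)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Toy dataset: every instance shares the same attributes (hypothetical values).
dataset = [{"height_cm": 150 + i, "weight_kg": 50 + i, "label": i % 2} for i in range(10)]
train, test = split_dataset(dataset, test_fraction=0.2, seed=42)
print(len(train), len(test))  # 8 2
```

Fixing the seed means the same instances land in the test set every time, so the model is never evaluated on data it was trained on.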

In this article, we list some of the best sources of public datasets for machine learning.

1. Bright Data

Bright Data Managed Service Overview

Bright Data provides public datasets for machine learning. It offers over 200 carefully curated datasets that can be used for AI or machine learning training. Instead of extracting the data yourself, you can access these ready-made datasets directly. The available data covers Amazon, LinkedIn, Instagram, CrunchBase, Zillow Real Estate, Google Maps, X, TikTok, Facebook, Shopee, Indeed, Walmart, YouTube, Glassdoor, Shein and other platforms.

These datasets come as video, images, audio, and text. With Bright Data's solutions, you can also search, crawl, and interact with the web without fear of being blocked, and the system is optimized for extracting text suitable for large language models (LLMs).

For any query, you can discover relevant data sources, crawl pages, extract content, and get LLM-ready output. You can also run an AI agent on a fully managed remote browser. Bright Data gives you unified access to structured and unstructured data, both historical and real-time, which simplifies the development of machine learning models.

FEATURES

  • Get clean data with a single API call.
  • Deploy dedicated data pipelines for your AI applications and agents.
  • Retrieve data from large web archives with billions of HTML pages.
  • Discover URLs to videos and pictures, as well as text in over 100 languages.
  • Use the Bright Data Model Context Protocol (MCP) server to enhance your AI models and agents.
  • Supports hosted and self-hosted MCP configurations via SSE, MCP, or a Node.js installation.
  • Output formats: JSON, Excel, CSV, Parquet, or custom.
PRICE

  • Datasets: starting at $2.50 per 1,000 records (100,000-record package).
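
As a quick worked example of per-record pricing, the sketch below estimates a purchase cost from the quoted starting rate. The function name `dataset_cost_usd` is hypothetical, and the rate is only the advertised starting price, so actual quotes may differ.

```python
def dataset_cost_usd(records: int, price_per_thousand: float = 2.50) -> float:
    """Estimate the cost of a dataset purchase billed per 1,000 records."""
    if records < 0:
        raise ValueError("records must be non-negative")
    return records / 1000 * price_per_thousand

# Cost of a 100,000-record package at the quoted starting rate.
print(dataset_cost_usd(100_000))  # 250.0
```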

2. Kaggle

Kaggle has a vast library of public datasets suited to machine learning. You can filter by topic or type, such as computer science, education, classification, computer vision, natural language processing (NLP), data visualization, and pre-trained models, or sort by relevance or current popularity.

Each dataset page is detailed: it describes what the dataset contains, what can be done with it, and who is likely to benefit from it. You can also see the dataset's authors, collaborators, coverage, citations, and other important details.

Kaggle also hosts related machine learning models, competitions, and discussions. Under Competitions, you can host a competition or take part in one to test your skills. It is one of the most interactive platforms providing public datasets for machine learning.

FEATURES

  • Download via kagglehub, the Kaggle CLI, cURL, or Croissant.
  • Download the dataset as a zip file or export its metadata in Croissant format.
  • Detailed dataset descriptions and information about contributors.
  • Access data through code.
PRICE

  • Depends on each dataset's license (for example, MIT).
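
As a rough sketch of the kagglehub download route, the snippet below wraps `kagglehub.dataset_download`, which downloads a dataset and returns the local path of its files. The wrapper name `download_kaggle_dataset` and the handle check are illustrative assumptions. The package has to be installed separately (`pip install kagglehub`), and some datasets may require Kaggle credentials.

```python
def download_kaggle_dataset(handle: str) -> str:
    """Download a Kaggle dataset and return the local folder path.

    `handle` has the form "owner/dataset-slug", as it appears in the dataset's URL.
    """
    owner, _, slug = handle.partition("/")
    if not owner or not slug:
        raise ValueError(f"expected 'owner/dataset-slug', got {handle!r}")
    # Imported lazily so this file loads even where kagglehub is not installed.
    import kagglehub  # third-party package
    return kagglehub.dataset_download(handle)
```

Calling `download_kaggle_dataset("owner/dataset-slug")` with a real handle copied from a dataset page would return the folder that contains the downloaded files.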

3. UC Irvine Machine Learning Repository

The UC Irvine Machine Learning Repository is another platform with a wide variety of public datasets. You can download these datasets or contribute your own. For each dataset, you can see its characteristics, feature types, subject area, number of instances, associated tasks, number of features, variable table, and creators.

Once logged in, you can also rate datasets. Data types include image, multivariate, sequential, spatiotemporal, tabular, text, and time series. The datasets cover a variety of disciplines including biology, business, climate, environment, engineering, games, health and medicine, law, physics, chemistry, and social sciences.

You can filter by keyword, data type, subject area, task, number of instances, number of features, feature type, and availability for Python import.

FEATURES

  • Download datasets or contribute your own.
  • Each dataset is described in detail to help users make informed decisions.
  • Easy-to-use platform.
PRICE

  • Depends on each dataset's license agreement.
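
For datasets that support Python import, one possible route is the `ucimlrepo` package and its `fetch_ucirepo` function. The sketch below is a thin wrapper around it; the wrapper name `load_uci_dataset` and the ID check are assumptions made for illustration, and the package has to be installed separately (`pip install ucimlrepo`).

```python
def load_uci_dataset(dataset_id: int):
    """Fetch a UCI dataset by its numeric repository ID; return (features, targets)."""
    if not isinstance(dataset_id, int) or dataset_id <= 0:
        raise ValueError("dataset_id must be a positive integer")
    # Imported lazily so this file loads even where ucimlrepo is not installed.
    from ucimlrepo import fetch_ucirepo  # third-party package
    repo = fetch_ucirepo(id=dataset_id)
    # Features and targets come back as pandas DataFrames.
    return repo.data.features, repo.data.targets
```

For instance, `load_uci_dataset(53)` would request the Iris dataset, whose repository ID is 53.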

4. Registry of Open Data on AWS

The Registry of Open Data on AWS helps people discover and share datasets that are available through AWS resources. Users can add datasets to the registry, or add examples of how to use them. The datasets are provided and maintained by third parties, not by AWS. Users therefore need to examine each dataset and determine how best to use it, what is and is not allowed, and which license applies.

The registry also welcomes projects built on the listed datasets, which can be featured in blog posts. For each dataset, you can find the license, update frequency, managing organization, documentation, citation guidance, contact, publications, tools and applications, and usage examples.

FEATURES

  • Has a vast library of public datasets for machine learning.
  • Provides detailed descriptions and usage examples for specific datasets.
  • Lets users add datasets to the registry.
  • Provides tools and services to help analyze and process data.
PRICE

  • Depends on each dataset's license agreement.

5. Microsoft Azure Open Datasets

If you are looking for public datasets for machine learning, Microsoft Azure Open Datasets is another option. You can use these datasets in machine learning workflows to improve prediction accuracy, and share datasets with a growing community of data scientists and developers. Azure also provides guidance on using open datasets to train machine learning models.

FEATURES

  • Has a vast library of public datasets for machine learning.
  • A range of open licenses are available that you can apply to your datasets.
  • You need to have an Azure account to use these open datasets.
PRICE

  • There are no additional fees for using the open datasets themselves. You only pay for the Azure services consumed when using them.

6. OpenML

OpenML is a global machine learning lab that makes machine learning research easy to access and reuse. It is a platform for sharing and accessing datasets, algorithms, and experiments. All datasets are uniformly formatted with consistent metadata and can be loaded directly into your preferred work environment.

Pipelines and models can be shared directly from your preferred machine learning libraries, and you can learn from millions of reproducible machine learning experiments, because OpenML records exactly which datasets and library versions were used.

Machine learning practitioners can share their work, data owners can share their data to challenge and collaborate with the machine learning community, and algorithm developers can integrate their tools with OpenML to import and export data and experiments.

FEATURES

  • AI-ready data.
  • Machine learning library integration.
  • Importing and exporting datasets, pipelines, and experiments is easy.
  • Machine learning data is well organized.
  • Datasets can be downloaded in XML, JSON, and Croissant formats.
PRICE

  • Depends on each dataset's license agreement.
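
To illustrate the XML and JSON access mentioned in the feature list, here is a small standard-library sketch that builds the URL of a dataset description and fetches it. It assumes OpenML's public REST API is served under https://www.openml.org/api/v1, with JSON responses under a `/json` prefix; the function names are hypothetical, and the endpoint layout should be checked against the current API documentation.

```python
import json
import urllib.request

# Assumed base URL of the public OpenML REST API.
OPENML_API = "https://www.openml.org/api/v1"

def dataset_description_url(data_id: int, fmt: str = "json") -> str:
    """Build the URL of a dataset description in JSON or XML form."""
    if fmt not in ("json", "xml"):
        raise ValueError("fmt must be 'json' or 'xml'")
    prefix = f"{OPENML_API}/json" if fmt == "json" else OPENML_API
    return f"{prefix}/data/{data_id}"

def fetch_dataset_description(data_id: int) -> dict:
    """Download and parse the JSON description of one dataset (needs network access)."""
    with urllib.request.urlopen(dataset_description_url(data_id), timeout=30) as resp:
        return json.load(resp)
```

For example, `fetch_dataset_description(61)` would request the metadata for dataset 61, which is the Iris dataset on OpenML.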

7. Sigma AI Open Datasets

Sigma AI Open Datasets is a collection of free, open-source datasets that you can use for machine learning experiments and projects. You can also contact the team to add a public machine learning dataset to the collection.

Finding datasets on the platform is straightforward: click an entry, filter on various parameters, and search the dataset for a word or phrase. When you are done, download the CSV file using the button in the lower right corner.

FEATURES

  • Searching and downloading datasets is very easy.
  • Can be downloaded in CSV file format.
  • Supports over 600 languages.
PRICE

  • Datasets: free; customization is available.

8. Allen AI Open Datasets for Machine Learning

AllenAI maintains a large collection of public datasets for training AI and machine learning models. With access to this data, users can study how the best models work and how to improve them.

All datasets were obtained ethically and are safe to use. On the Hugging Face platform, you can view AllenAI's dataset collections and team members, browse the latest updates, and find datasets by topic of interest.

AllenAI provides language models, multimodal models, evaluation frameworks, and open datasets, and this breadth makes it a go-to site for many people. Its datasets include WildChat, S2ORC, Self-instruct, Kiwi, Chime, Drop, and Qasper, among others.

FEATURES

  • Has a vast library of public datasets for machine learning.
  • Data is ethically sourced and safe to use.
  • Website navigation is very easy.
  • Has a reliable community to collaborate with.
PRICE

  • Depends on each dataset's license agreement.
  • Community-based.

9. Data.gov Open Data

Data.gov lists over 318,500 datasets. You can sort or filter by most viewed, recently added, organization, or geospatial data to find the dataset you want. Data.gov is the U.S. government's open data portal; it launched in 2009 with just 47 datasets and has since grown to more than 300,000.

The main goal of the site is to make this valuable data easily accessible. It covers categories such as local government, climate, seniors, energy, Arctic, water resources, human health, ecosystems, transportation, food resilience, and more. You can use the data to conduct research, develop web and mobile applications, design data visualizations, and more.

FEATURES

  • Data sets are clearly categorized and easily accessible.
  • Provides U.S.-based resources and data.
  • Anyone can access the platform and use the data as long as they comply with the terms of use.
  • Its records are ethically sourced.
  • Effective filtering and categorization.
PRICE

  • Free public access and use.
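
Searching the catalog can also be scripted. The sketch below assumes the Data.gov catalog exposes a CKAN-style search endpoint at `catalog.data.gov/api/3/action/package_search` and that responses follow CKAN's usual `result.results` layout. Both points, and the function names, are assumptions to verify against the current Data.gov developer documentation.

```python
import json
import urllib.parse
import urllib.request

# Assumed CKAN search endpoint of the Data.gov catalog.
CATALOG_API = "https://catalog.data.gov/api/3/action/package_search"

def search_url(query: str, rows: int = 5) -> str:
    """Build a catalog search URL for a free-text query."""
    params = urllib.parse.urlencode({"q": query, "rows": rows})
    return f"{CATALOG_API}?{params}"

def search_datasets(query: str, rows: int = 5) -> list:
    """Return the titles of matching datasets (needs network access)."""
    with urllib.request.urlopen(search_url(query, rows), timeout=30) as resp:
        payload = json.load(resp)
    return [pkg["title"] for pkg in payload["result"]["results"]]

print(search_url("sea level"))
```

Only the URL is built when the snippet runs; calling `search_datasets("sea level")` performs the actual request.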

10. Datarade.ai

Datarade.ai is another platform where you can obtain datasets for machine learning or AI training. Its search bar lets you look for any dataset type you want, such as machine learning datasets. Each dataset has a free sample preview, so users can check its contents before purchasing.

You can filter by free samples, attributes, data providers, country coverage, categories, and delivery methods. Datasets can be delivered through S3 buckets, email, SFTP, REST API, UI export, Feed API, SOAP API, streaming API, compressed files, Azure Blob Storage, Google Cloud Storage, Google BigQuery, Snowflake shares, Databricks Delta shares, FIX API, WebSocket, and more.

FEATURES

  • Has a huge library of machine learning data sets.
  • Datasets are described in detail.
  • Multiple delivery methods available.
PRICE

  • Datasets: customized pricing.
  • Depends on the license agreement.

11. Meta AI

Meta AI also provides a large number of datasets and benchmarks for training, evaluating, and testing AI and machine learning models, with the aim of advancing research in these fields. Its datasets are diverse and include FACET, the Ego TV dataset, the MMCSG dataset, a speech fairness dataset, daily conversations, Common Objects in 3D, Segment Anything, the DISC21 dataset, the Ego Objects dataset, the FLORES benchmark dataset, Ego4D, and many more. Which ones you use depends on your task and the resources you need.

FEATURES

  • Has a huge database of data sets.
  • Its goal is to ensure good collaboration and accelerate the development of artificial intelligence and machine learning.
  • Demos are available for users who want to experience the latest research breakthroughs first-hand.
PRICE

  • Subscription-based model.

The End

Most machine learning data sources offer rich and diverse data, which makes it easy to get the data you need quickly. The data comes from many fields and industries, so it covers a wide range of variables.

Most public dataset websites for machine learning are also user-friendly, so users, developers, and researchers alike can find what they need. Many of them offer community support as well, where people can join discussions, learn from others' experiences, and get help with projects.