Common Crawl Foundation
- United States
- commoncrawl.org
- NGO
About us
Common Crawl is a nonprofit 501(c)(3) organisation that crawls the web and freely provides its archives and datasets to the public. Founded in 2007 by Gil Elbaz, we maintain an open repository of web crawl data collected since 2008, totalling more than 10 petabytes. Crawls are published approximately once a month, each typically containing more than two billion web pages.
The dataset is hosted on Amazon Web Services through its Open Data Sponsorship Program and can be downloaded at no cost. It has been cited in over 12,000 research papers and has become one of the most widely used sources of training data for large language models.
Common Crawl is a member of the International Internet Preservation Consortium (IIPC) and a partner in the End of Term Web Archive, which preserves US federal government websites during presidential transitions.
Datasets
1 Dataset
| CommonLID | common-crawl-tou | mlu | NLP | tsv | 59.38 MB |