Common Crawl Foundation

United States
commoncrawl.org
NGO

About Common Crawl Foundation

Common Crawl is a nonprofit 501(c)(3) organisation that crawls the web and freely provides its archives and datasets to the public. Founded in 2007 by Gil Elbaz, it maintains an open repository of web crawl data collected since 2008, totalling more than 10 petabytes. Crawls are published approximately once a month, and each typically contains more than two billion web pages.

The dataset is hosted on Amazon Web Services through its Open Data Sponsorship Program and can be downloaded at no cost. It has been cited in over 12,000 research papers and has become one of the most widely used sources of training data for large language models.

Common Crawl is a member of the International Internet Preservation Consortium (IIPC) and a partner in the End of Term Web Archive, which preserves US federal government websites during presidential transitions.

Datasets

CommonLID

CommonLID is a community-created language identification (LID) benchmark.

License: common-crawl-tou

Locale: mlu

Task: NLP

Format: tsv

Size: 59.38 MB