StackOverflow Vector Dataset - Dart | Mozilla Data Collective

Dart StackOverflow Vector Dataset Datasheet

1. What This Dataset Is

This dataset is the Dart-specific vector shard of the Stack2Graph StackOverflow retrieval corpus. Each Mozilla Data Collective dataset repository contains exactly one language dataset.

It is optimized for dense+sparse retrieval, Qdrant restoration, and embedding-based RAG experiments.

It is used in the Stack2Graph project as the vector counterpart to the language-scoped RDF knowledge graph shards.

See the Stack2Graph repository for more details: https://github.com/tha-atlas/Stack2Graph

2. Repository Layout

dataset_manifest.json
question_metadata_*.parquet
chunk_records_*.parquet
question_records_*.parquet

dataset_manifest.json: language-scoped manifest for this dataset shard.
question_metadata_*.parquet: per-question metadata and retrieval bookkeeping.
chunk_records_*.parquet: chunk-level vector rows, present when parent-child indexing is enabled.
question_records_*.parquet: question-level vector rows, present when chunking is disabled or when question rows are exported alongside chunk data.
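
As a minimal sketch, a local copy of the repository could be sorted into the three Parquet families above using only the file-name patterns listed in this section. The function name `group_shard_files` and the idea of a flat local directory are illustrative assumptions, not part of Stack2Graph.

```python
from fnmatch import fnmatch
from pathlib import Path

# File-name patterns taken from the layout listed above.
PATTERNS = {
    "question_metadata": "question_metadata_*.parquet",
    "chunk_records": "chunk_records_*.parquet",
    "question_records": "question_records_*.parquet",
}


def group_shard_files(root):
    """Group the Parquet files directly under `root` by layout family."""
    groups = {family: [] for family in PATTERNS}
    for path in sorted(Path(root).iterdir()):
        for family, pattern in PATTERNS.items():
            if fnmatch(path.name, pattern):
                groups[family].append(path.name)
    return groups
```

An empty `chunk_records` or `question_records` list is not necessarily an error, since the row types present depend on the export mode.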

3. Data Model And Coverage

The dataset is derived from Stack Overflow questions selected for the Dart programming language. It contains the structured records needed to rebuild the Stack2Graph Qdrant collection for that language.

Coverage scope:

records are retained when their tags match the language-tag set supported by Stack2Graph
this repository contains only the Dart shard
the archive may contain both metadata-only and retrieval-ready vector rows depending on the export mode

4. Recommended Preprocessing

Read dataset_manifest.json first and use it as the source of truth for included Parquet files.
Load all Parquet shards for this repository into your vector indexing pipeline.
Rebuild or restore the Qdrant collection stackoverflow_dart_vector.
Preserve attribution and license metadata during downstream export.
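
The first step can be sketched as a consistency check between the manifest and the files on disk. The manifest schema is not described in this datasheet, so the sketch assumes only that Parquet files are named somewhere in the JSON as strings ending in `.parquet`, interpreted relative to the repository root. The helper names `parquet_names` and `check_manifest` are hypothetical.

```python
import json
from pathlib import Path


def parquet_names(node):
    """Recursively yield every string ending in '.parquet' inside parsed JSON."""
    if isinstance(node, str):
        if node.endswith(".parquet"):
            yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from parquet_names(value)
    elif isinstance(node, list):
        for item in node:
            yield from parquet_names(item)


def check_manifest(root):
    """Return (listed, missing) Parquet files for the shard stored under `root`."""
    root = Path(root)
    manifest_path = root / "dataset_manifest.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    # Assumption: file names in the manifest are relative to `root`.
    listed = sorted(set(parquet_names(manifest)))
    missing = [name for name in listed if not (root / name).exists()]
    return listed, missing
```

A non-empty `missing` list suggests an incomplete download; in that case it is safer to fetch the shard again than to index a partial set of files.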

5. Automatic Download And Vector DB Setup

You do not need to regenerate embeddings from GraphDB to use this dataset.

In the Stack2Graph repository, you can use the automation script python -m experiment.load_mdc_datasets_into_services --skip-kg to download dataset artifacts and prepare the vector database service state automatically.

Typical workflow:

Clone and configure Stack2Graph (.env with HF token, MDC dataset ids/config, and service paths).
Start required local services:

docker compose up -d

Run the loader script:

python -m experiment.load_mdc_datasets_into_services --skip-kg

For manual usage without automation, directly ingest the listed Parquet files into your vector database.
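
For the manual route, one way to prepare rows for a vector database is to split each row into an id, a vector, and a payload, then send them in small batches. The sketch below is only illustrative: the column names `id` and `dense_vector` are hypothetical placeholders (check the actual Parquet schema), rows are assumed to be plain dicts already read from Parquet, and sparse vectors are left out. The id / vector / payload shape mirrors how Qdrant represents a point.

```python
from itertools import islice

# Hypothetical column names: check the real Parquet schema before use.
ID_COLUMN = "id"
VECTOR_COLUMN = "dense_vector"


def to_point(row, id_column=ID_COLUMN, vector_column=VECTOR_COLUMN):
    """Split one row dict into the id / vector / payload parts of a point."""
    payload = {
        key: value
        for key, value in row.items()
        if key not in (id_column, vector_column)
    }
    return {
        "id": row[id_column],
        "vector": list(row[vector_column]),
        "payload": payload,
    }


def batched(rows, size):
    """Yield lists of at most `size` points so each upsert request stays small."""
    iterator = iter(rows)
    while True:
        batch = [to_point(row) for row in islice(iterator, size)]
        if not batch:
            return
        yield batch
```

Keeping every non-vector column in the payload is a deliberate choice here, so that attribution and license fields travel with each point.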

6. Quality Notes And Caveats

A Stack Overflow question may belong to multiple language shards when tagged with multiple languages.
Embeddings and sparse representations depend on the configured export pipeline and model versions.
As with any community-generated data, content may include noise, bias, and temporal drift.
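
The first caveat matters when several language shards are combined into one index, because the same question can then appear more than once. A minimal sketch of merging question-level rows follows, assuming each row is a dict carrying a question identifier under the hypothetical key `question_id`. The function name `merge_shards` is likewise illustrative.

```python
def merge_shards(shards, key="question_id"):
    """Merge question-level rows from several language shards.

    `shards` maps a language name to an iterable of row dicts. The first row
    seen for each question is kept, and every language in which the question
    appeared is recorded alongside it.
    """
    merged = {}
    for language, rows in shards.items():
        for row in rows:
            entry = merged.setdefault(row[key], {"row": row, "languages": []})
            if language not in entry["languages"]:
                entry["languages"].append(language)
    return merged
```

This applies to question-level rows only. Chunk-level rows legitimately repeat a question identifier within a single shard, so they would need a composite key.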

7. Intended Use

semantic retrieval and reranking
RAG and hybrid retriever experiments
vector database benchmarking and diagnostics
language-scoped developer tooling research

8. Limitations

Not a complete mirror of all Stack Overflow content.
Not all export modes include the same row types or chunk layouts.
Best used together with the Stack2Graph retrieval pipeline and Qdrant-compatible tooling.

9. Licensing And Attribution

This dataset inherits Stack Overflow source licensing and attribution requirements. Ensure compliant attribution and redistribution practices in all derived artifacts.

10. Suggested Citation

If you use this dataset, cite the Stack2Graph work:

Stack2Graph: A Structured Knowledge Representation of Stack Overflow Data for Retrieval-based Question Answering
