Task: RAG
Release Date: 6/23/2026
Format: Parquet
Size: 913.32 MB
Parquet export of the Stack2Graph Qdrant vector dataset for the Dart programming language. The archive contains dense/sparse vector shards and manifests.
Licensing
Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)
https://spdx.org/licenses/CC-BY-SA-4.0.html
Restrictions/Special Constraints
Use must comply with the source StackOverflow content license and attribution requirements.
Forbidden Usage
Do not use in ways that violate the source content license, privacy expectations, or applicable law.
Ethical Review
Built from publicly released StackOverflow data dumps; users remain responsible for compliant usage.
Intended Use
Semantic retrieval, KG entry-point finding, RAG experiments, and vector search research.
This dataset is the Dart-specific vector shard of the Stack2Graph StackOverflow retrieval corpus. Each Mozilla Data Collective dataset repository contains exactly one language dataset.
It is optimized for dense+sparse retrieval, Qdrant restoration, and embedding-based RAG experiments.
It is used in the Stack2Graph project as the vector counterpart to the language-scoped RDF knowledge graph shards.
See the Stack2Graph repository for more details: https://github.com/tha-atlas/Stack2Graph
dataset_manifest.json
question_metadata_*.parquet
chunk_records_*.parquet
question_records_*.parquet
dataset_manifest.json: language-scoped manifest for this dataset shard.
question_metadata_*.parquet: per-question metadata and retrieval bookkeeping.
chunk_records_*.parquet: chunk-level vector rows when parent-child indexing is enabled.
question_records_*.parquet: question-level vector rows, present when chunking is disabled or when question rows are exported alongside chunk data.
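As a minimal sketch of how the layout above can be inspected, the following Python snippet groups the Parquet shards in an extracted archive by the file-name patterns listed. The function name group_shards and the assumption that all shards sit in a single flat directory are illustrative and not part of the Stack2Graph tooling.

```python
from pathlib import Path

# File-name patterns taken from the file layout described above.
SHARD_PATTERNS = {
    "question_metadata": "question_metadata_*.parquet",
    "chunk_records": "chunk_records_*.parquet",
    "question_records": "question_records_*.parquet",
}


def group_shards(dataset_dir):
    """Return {kind: sorted list of shard paths} for an extracted archive.

    Assumption: all shards live directly in dataset_dir (no subfolders).
    """
    root = Path(dataset_dir)
    return {
        kind: sorted(root.glob(pattern))
        for kind, pattern in SHARD_PATTERNS.items()
    }
```

An empty list for chunk_records or question_records is expected for some export modes, since not every export contains both row types.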
The dataset is derived from Stack Overflow questions selected for the Dart programming language. It contains the structured records needed to rebuild the Stack2Graph Qdrant collection for that language.
Coverage scope:
records are retained when they match the Stack2Graph supported language-tag set
this repository contains only the Dart shard
the archive may contain both metadata-only and retrieval-ready vector rows depending on the export mode
Read dataset_manifest.json first and use it as the source of truth for included Parquet files.
Load all Parquet shards for this repository into your vector indexing pipeline.
Rebuild or restore the Qdrant collection stackoverflow_dart_vector.
Preserve attribution and license metadata during downstream export.
You do not need to regenerate embeddings from GraphDB to use this dataset.
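The first step above (treating dataset_manifest.json as the source of truth) could be sketched as follows. The manifest schema is not documented here, so this example makes no assumption about key names: it simply collects every string in the JSON that ends in .parquet and compares that set with the files on disk. The helper names parquet_names_in and check_manifest are hypothetical.

```python
import json
from pathlib import Path


def parquet_names_in(node):
    """Recursively collect every string ending in .parquet from parsed JSON.

    Assumption: the manifest mentions shard files by name somewhere,
    either as dictionary keys or as string values.
    """
    if isinstance(node, str):
        return {node} if node.endswith(".parquet") else set()
    if isinstance(node, dict):
        node = list(node.keys()) + list(node.values())
    if isinstance(node, list):
        found = set()
        for item in node:
            found |= parquet_names_in(item)
        return found
    return set()


def check_manifest(dataset_dir):
    """Compare manifest-listed Parquet files with those present on disk."""
    root = Path(dataset_dir)
    manifest_path = root / "dataset_manifest.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    # Compare by base name so relative paths in the manifest still match.
    listed = {Path(name).name for name in parquet_names_in(manifest)}
    on_disk = {path.name for path in root.glob("*.parquet")}
    return {
        "missing": sorted(listed - on_disk),
        "unlisted": sorted(on_disk - listed),
    }
```

A non-empty "missing" list suggests an incomplete download, while "unlisted" files should probably be left out of ingestion, since the manifest is meant to define what the shard contains.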
In the Stack2Graph repository, an automation script downloads the dataset artifacts and prepares the vector database service state automatically:
python -m experiment.load_mdc_datasets_into_services --skip-kg
Typical workflow:
Clone and configure Stack2Graph (.env with HF token, MDC dataset ids/config, and service paths).
Start required local services:
docker compose up -d
Run the loader script:
python -m experiment.load_mdc_datasets_into_services --skip-kg
For manual usage without automation, directly ingest the listed Parquet files into your vector database.
A Stack Overflow question may belong to multiple language shards when tagged with multiple languages.
Embeddings and sparse representations depend on the configured export pipeline and model versions.
As with any community-generated data, content may include noise, bias, and temporal drift.
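Because a question tagged with several languages can appear in more than one language shard, combining shards may produce duplicate rows. The sketch below shows one way to deduplicate rows that have already been loaded as dictionaries. The field name question_id is an assumption for illustration; check the actual Parquet schema for the real identifier column.

```python
def dedupe_rows(rows, key="question_id"):
    """Keep the first row seen for each key value, preserving input order.

    Assumption: every row is a dict carrying a question identifier under
    `key`; "question_id" is a hypothetical column name.
    """
    seen = set()
    unique = []
    for row in rows:
        identifier = row[key]
        if identifier not in seen:
            seen.add(identifier)
            unique.append(row)
    return unique


# Example: the same question exported in two hypothetical language shards.
combined = [
    {"question_id": 101, "shard": "dart"},
    {"question_id": 102, "shard": "dart"},
    {"question_id": 101, "shard": "other"},
]
deduplicated = dedupe_rows(combined)
```

Note that keeping only the first occurrence discards the information about which other shards contained the question, so record shard membership separately if it matters for your experiments.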
semantic retrieval and reranking
RAG and hybrid retriever experiments
vector database benchmarking and diagnostics
language-scoped developer tooling research
Not a complete mirror of all Stack Overflow content.
Not all export modes include the same row types or chunk layouts.
Best used together with the Stack2Graph retrieval pipeline and Qdrant-compatible tooling.
This dataset inherits Stack Overflow source licensing and attribution requirements. Ensure compliant attribution and redistribution practices in all derived artifacts.
If you use this dataset, cite the Stack2Graph work:
Stack2Graph: A Structured Knowledge Representation of Stack Overflow Data for Retrieval-based Question Answering