feat: split raw data storage into per-dataset SQLite files by astafan8 · Pull Request #8219 · microsoft/Qcodes

astafan8 · 2026-06-12T15:50:23Z

Summary

This PR implements split raw data storage for QCoDeS: an opt-in feature that writes raw measurement data (results table rows) into individual per-dataset SQLite files while keeping all metadata in the main database. The goal is to prevent the main DB file from growing excessively large as datasets accumulate, making metadata browsing and experiment management faster.

Motivation

The main QCoDeS SQLite database stores both metadata (experiments, runs, parameter layouts, dependencies) and raw measurement data (results tables) in a single file. Over time, this file can grow to many gigabytes, slowing down operations that only need metadata. By splitting the raw data into per-dataset files, the main DB stays lightweight while data integrity is preserved.

Design Decisions

Architecture: transparent routing via `_data_conn` property

- A single `_data_conn` property on `DataSet` is the routing point for all data read/write operations
- Returns the per-dataset raw data connection when split is enabled, otherwise falls back to `self.conn` (main DB)
- All write paths (`add_results`, `_BackgroundWriter`) and read paths (`get_parameter_data`, `DataSetCacheWithDBBackend`, `number_of_results`, `__len__`) go through this property
- Zero changes to the public `DataSet` API: all existing methods work identically

Config: follows existing export path pattern

- Two new config options in the `dataset` section of `qcodesrc.json`:
  - `raw_data_to_separate_db` (bool, default `false`)
  - `raw_data_path` (string, default `"{db_location}"`)
- Reuses `_expand_export_path()` from `export_config.py` for path expansion (e.g., `~/experiments.db` → `~/experiments_db/`)
- Pattern mirrors the existing `export_path` / `export_type` config approach
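
As a rough sketch of what the `{db_location}` expansion amounts to, based only on the `~/experiments.db` → `~/experiments_db/` example above: `expand_raw_data_path` is a hypothetical stand-in, not `_expand_export_path()` itself, and the dictionary only mirrors the assumed shape of the two new keys.

```python
from pathlib import Path

# Assumed shape of the new keys in the "dataset" section of qcodesrc.json.
dataset_config = {
    "raw_data_to_separate_db": True,
    "raw_data_path": "{db_location}",
}


def expand_raw_data_path(template: str, db_location: str) -> Path:
    """Hypothetical stand-in for the path expansion (not the QCoDeS helper)."""
    db_path = Path(db_location).expanduser()
    # experiments.db -> experiments_db, a folder next to the main DB file.
    folder = db_path.parent / db_path.name.replace(".", "_")
    return Path(template.replace("{db_location}", str(folder))).expanduser()


raw_folder = expand_raw_data_path(
    dataset_config["raw_data_path"], "~/experiments.db"
)
```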

Per-dataset files: lightweight, GUID-named

- Each file is named `<guid>.db` and contains only the results table + numpy type adapters
- No QCoDeS metadata schema in per-dataset files, so they stay minimal
- Path to the per-dataset file is persisted in run metadata (`raw_data_db_path` dynamic column) for automatic reconnection on `load_by_id()`
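
A minimal sketch of such a file, assuming a plain `sqlite3` database with a single results table. `create_minimal_results_db` and the GUID string are hypothetical; the numpy type adapters are omitted, and the identifier quoting follows the "quote column names" fix mentioned in a later commit.

```python
import sqlite3
import tempfile
from pathlib import Path


def create_minimal_results_db(
    folder: Path, guid: str, columns: dict[str, str]
) -> Path:
    """Create <guid>.db holding only a results table (illustrative only)."""
    folder.mkdir(parents=True, exist_ok=True)
    db_path = folder / f"{guid}.db"
    # Double-quote identifiers so parameter names that collide with SQL
    # keywords (e.g. "select") are still valid column names.
    column_sql = ", ".join(
        '"{}" {}'.format(name.replace('"', '""'), sql_type)
        for name, sql_type in columns.items()
    )
    conn = sqlite3.connect(db_path)
    try:
        with conn:
            conn.execute(
                f'CREATE TABLE "results" (id INTEGER PRIMARY KEY, {column_sql})'
            )
    finally:
        conn.close()
    return db_path


with tempfile.TemporaryDirectory() as tmp:
    path = create_minimal_results_db(
        Path(tmp), "hypothetical-guid-0001", {"select": "REAL", "voltage": "REAL"}
    )
    conn = sqlite3.connect(path)
    tables = [
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
    ]
    conn.close()
    file_name = path.name
```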

Empty results table kept in main DB

We considered removing the results table from the main DB entirely, but this would break:

- `_Subscriber` trigger creation (SQLite triggers require the table to exist)
- `__len__` / `number_of_results` before the dataset is started (when the raw data DB doesn't yet exist)
- Low-level query functions that inspect table structure via `PRAGMA TABLE_INFO`
- The `_check_if_table_found` logic used in `_get_datasetprotocol_from_guid` to distinguish `DataSet` vs `DataSetInMem`

Decision: keep the empty table schema (column definitions, no rows). The overhead is negligible and backward compatibility is fully preserved.
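
The first and third points can be reproduced with plain `sqlite3`. The trigger below writes to a hypothetical `insert_log` table purely for illustration; it is not the trigger that `_Subscriber` creates.

```python
import sqlite3

TRIGGER_SQL = """
CREATE TRIGGER notify_on_insert AFTER INSERT ON results
BEGIN
    INSERT INTO insert_log (row_id) VALUES (new.id);
END
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE insert_log (row_id INTEGER)")

# Without the results table, trigger creation fails outright.
try:
    conn.execute(TRIGGER_SQL)
    trigger_error = None
except sqlite3.OperationalError as exc:
    trigger_error = str(exc)

# With an empty table (schema only, no rows) the same statement succeeds,
# and PRAGMA table_info can still report the column layout.
conn.execute("CREATE TABLE results (id INTEGER PRIMARY KEY, x REAL, y REAL)")
conn.execute(TRIGGER_SQL)
column_names = [row[1] for row in conn.execute("PRAGMA table_info(results)")]
row_count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
```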

`get_parameter_data` bypass for raw data DB

- The standard `get_parameter_data()` in `queries.py` calls `get_rundescriber_from_result_table_name()`, which queries the `runs` table. This table doesn't exist in the raw data DB
- Solution: when `_raw_data_conn` is set, bypass the top-level function and call `get_shaped_parameter_data_for_one_paramtree()` directly with the already-held rundescriber

Subscriber triggers on data connection

- `_Subscriber.__init__` creates SQL triggers for real-time data callbacks
- Changed to use `_data_conn` instead of `self.conn` so triggers fire on the correct DB where data is actually inserted

BackgroundWriter support

- `_BackgroundWriter` maintains a `_raw_data_conns` dict keyed by file path
- Queue items include an optional `raw_data_path` key for routing
- Connections are lazily created and reused across datasets sharing the same raw data DB path
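
The lazy, path-keyed connection reuse might look roughly like this. `ConnectionRouter` and `conn_for` are hypothetical names for illustration; only `_raw_data_conns` and the `raw_data_path` key are taken from the description, and threading is left out entirely.

```python
import sqlite3
import tempfile
from pathlib import Path


class ConnectionRouter:
    """Toy model of per-path connection reuse (not the _BackgroundWriter code)."""

    def __init__(self, main_conn: sqlite3.Connection) -> None:
        self.main_conn = main_conn
        self._raw_data_conns: dict[str, sqlite3.Connection] = {}

    def conn_for(self, item: dict) -> sqlite3.Connection:
        raw_data_path = item.get("raw_data_path")
        if raw_data_path is None:
            # No routing key: the write goes to the main DB as before.
            return self.main_conn
        if raw_data_path not in self._raw_data_conns:
            # Open the connection only on first use of this path.
            self._raw_data_conns[raw_data_path] = sqlite3.connect(raw_data_path)
        return self._raw_data_conns[raw_data_path]

    def close(self) -> None:
        for conn in self._raw_data_conns.values():
            conn.close()
        self._raw_data_conns.clear()


with tempfile.TemporaryDirectory() as tmp:
    router = ConnectionRouter(sqlite3.connect(":memory:"))
    path_a = str(Path(tmp) / "a.db")
    first = router.conn_for({"raw_data_path": path_a})
    second = router.conn_for({"raw_data_path": path_a})
    fallback = router.conn_for({})
    reused = first is second
    used_main = fallback is router.main_conn
    open_count = len(router._raw_data_conns)
    router.close()
```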

Files Changed

| File | Type | Description |
| --- | --- | --- |
| `src/qcodes/dataset/raw_data_storage.py` | New | Helper module: `is_raw_data_storage_enabled()`, `get_raw_data_folder()`, `get_raw_data_db_path()`, `connect_to_raw_data_db()`, `create_raw_data_db()` |
| `tests/dataset/test_raw_data_storage.py` | New | 19 tests (7 unit + 10 integration + 2 non-interference) |
| `src/qcodes/dataset/data_set.py` | Modified | `_data_conn` property, `_raw_data_conn` attribute, routing in `__init__`, `_perform_start_actions`, `add_results`, `get_parameter_data`, `number_of_results`, `__len__`, `_BackgroundWriter`, `_get_datasetprotocol_from_guid` |
| `src/qcodes/dataset/data_set_cache.py` | Modified | `load_data_from_db()` uses `_data_conn` |
| `src/qcodes/dataset/subscriber.py` | Modified | Trigger creation uses `_data_conn` |
| `src/qcodes/configuration/qcodesrc.json` | Modified | Added config defaults |
| `src/qcodes/configuration/qcodesrc_schema.json` | Modified | Added config schema |
| `docs/dataset/introduction.rst` | Modified | New "Split Raw Data Storage" section |
| `docs/dataset/dataset_design.rst` | Modified | Design notes on split storage |
| `docs/examples/DataSet/Database.ipynb` | Modified | Config and usage documentation |

Verification

Own tests: 19/19 pass

- Unit tests for all helper functions (config reading, path generation, DB creation)
- Integration tests: write/read round-trip, cache, `load_by_id`, multiple datasets, background writer
- Non-interference: feature disabled → data goes to main DB as before

Full test suite with feature globally enabled: 1031 passed, 5 failed, 35 skipped

The 5 failures are all expected/explained:

| Test | Reason |
| --- | --- |
| `test_raw_data_conn_is_none` | Own test for disabled mode, overridden by global enable |
| `test_data_in_main_db` | Own test for disabled mode, overridden by global enable |
| `test_get_parameter_data` | Low-level query test calls `queries.get_parameter_data(ds.conn, ...)` directly, bypassing `DataSet` |
| `test_get_parameter_data_independent_parameters` | Same as above |
| `test_get_run_attributes` | Metadata assertion expects exact `{'foo': 'bar'}` but split adds `raw_data_db_path` |

Full test suite with feature disabled (default): all pass unchanged

Code quality

Ruff lint: ✅ all checks passed
Pyright type check: ✅ 0 errors, 0 warnings

Add optional configuration to write results-table data into individual per-dataset SQLite files (<guid>.db) while keeping all metadata in the main database. This keeps the main DB lightweight as it grows.

Config options (dataset section of qcodesrc.json):
- raw_data_to_separate_db (bool, default false)
- raw_data_path (string, default '{db_location}')

Implementation:
- New module: qcodes.dataset.raw_data_storage (helper functions)
- DataSet._data_conn property routes reads/writes to correct DB
- BackgroundWriter supports per-dataset raw data connections
- Subscriber triggers created on data connection for compatibility
- Per-dataset DB path persisted in run metadata for auto-reconnect
- Empty results table schema kept in main DB for compatibility
- 19 new tests, all existing tests pass unchanged
- Documentation added to dataset intro, design docs, and Database notebook

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

codecov · 2026-06-12T15:58:10Z

Codecov Report

❌ Patch coverage is 91.79104% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.11%. Comparing base (c81021e) to head (d897d95).
⚠️ Report is 63 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/qcodes/dataset/data_set.py | 87.50% | 6 Missing ⚠️ |
| src/qcodes/dataset/_raw_data_storage.py | 94.04% | 5 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #8219      +/-   ##
==========================================
- Coverage   71.02%   70.11%   -0.91%     
==========================================
  Files         301      302       +1     
  Lines       31888    32014     +126     
==========================================
- Hits        22647    22447     -200     
- Misses       9241     9567     +326


Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

- Rename raw_data_storage.py to _raw_data_storage.py (private module)
- Raise FileNotFoundError when per-dataset raw data file is missing on load instead of silently falling back to empty main DB table
- Quote column names in create_raw_data_db to handle SQL keyword names
- Close both _raw_data_conn and conn in all tests to prevent leaked file handles
- Add test_missing_raw_data_file_raises to verify the error behavior
- Fix import ordering (ruff I001) and raw regex pattern (ruff RUF043)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Implement a helper function that updates the raw_data_db_path metadata in the main database when individual raw data SQLite files have been moved to a new location. This mirrors the existing pattern used for exported netCDF files.

The function:
- Scans all datasets in the main DB that have raw_data_db_path metadata
- For each, checks if the corresponding .db file exists in the new folder
- Updates the stored path in the metadata to point to the new location
- Reports which datasets were updated and which were skipped (file missing)

Exposed as qcodes.dataset.update_raw_data_paths() for user convenience.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

astafan8 · 2026-06-30T16:36:23Z

Added: update_raw_data_paths helper

Added a utility function for users who move their per-dataset raw data files to a new location. This mirrors the pattern used for exported netCDF files.

Usage:

```python
from qcodes.dataset import update_raw_data_paths

update_raw_data_paths(
    db_path="/path/to/main_database.db",
    new_raw_data_folder="/new/location/of/raw_files/",
)
```

The function:

- Scans all datasets in the main DB that have `raw_data_db_path` metadata
- For each, checks if the corresponding `.db` file exists in the new folder
- Updates the stored path to point to the new location
- Logs warnings for any files not found in the new folder
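
The core of that scan can be sketched without a database, assuming the stored paths have already been read out as a mapping from run ID to path. `relocate_raw_data_paths` is a hypothetical helper for illustration, not the body of `update_raw_data_paths`, and it returns the skipped runs instead of logging them.

```python
import tempfile
from pathlib import Path


def relocate_raw_data_paths(
    stored_paths: dict[int, str], new_folder: Path
) -> tuple[dict[int, str], list[int]]:
    """Illustrative only: match stored file names against files in new_folder."""
    updated: dict[int, str] = {}
    skipped: list[int] = []
    for run_id, old_path in stored_paths.items():
        # Keep the <guid>.db file name, swap only the folder.
        candidate = new_folder / Path(old_path).name
        if candidate.is_file():
            updated[run_id] = str(candidate)
        else:
            skipped.append(run_id)  # file missing in the new location
    return updated, skipped


with tempfile.TemporaryDirectory() as tmp:
    new_folder = Path(tmp)
    (new_folder / "guid-a.db").touch()
    updated, skipped = relocate_raw_data_paths(
        {1: "/old/raw/guid-a.db", 2: "/old/raw/guid-b.db"}, new_folder
    )
    moved_name = Path(updated[1]).name
```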

4 tests added, documentation updated in introduction.rst and Database.ipynb.

jenshnielsen requested a review from Copilot June 24, 2026 12:56

Copilot started reviewing on behalf of jenshnielsen June 24, 2026 12:57 View session

jenshnielsen reviewed Jun 24, 2026


Comment thread src/qcodes/dataset/data_set.py Outdated

Copilot AI reviewed Jun 24, 2026

jenshnielsen reviewed Jun 24, 2026


Comment thread src/qcodes/dataset/_raw_data_storage.py

jenshnielsen requested a review from Copilot June 24, 2026 13:05

Copilot started reviewing on behalf of jenshnielsen June 24, 2026 13:06 View session

Copilot AI reviewed Jun 24, 2026


Comment thread src/qcodes/dataset/_raw_data_storage.py

Comment thread src/qcodes/dataset/data_set.py Outdated

Comment thread tests/dataset/test_raw_data_storage.py

Mikhail Astafev and others added 3 commits June 30, 2026 10:04

Add update_raw_data_paths documentation to intro.rst and Database.ipynb

d897d95

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

astafan8 wants to merge 4 commits into microsoft:main from astafan8:feature/split-raw-data-sqlite

3 participants
