feat: split raw data storage into per-dataset SQLite files #8219

Draft
astafan8 wants to merge 4 commits into microsoft:main from astafan8:feature/split-raw-data-sqlite
Conversation

@astafan8

Contributor

Summary

This PR implements split raw data storage for QCoDeS: an opt-in feature that writes raw measurement data (results table rows) into individual per-dataset SQLite files while keeping all metadata in the main database. The goal is to prevent the main DB file from growing excessively large as datasets accumulate, making metadata browsing and experiment management faster.

Motivation

The main QCoDeS SQLite database stores both metadata (experiments, runs, parameter layouts, dependencies) and raw measurement data (results tables) in a single file. Over time, this file can grow to many gigabytes, slowing down operations that only need metadata. By splitting the raw data into per-dataset files, the main DB stays lightweight while data integrity is preserved.

Design Decisions

Architecture: transparent routing via _data_conn property

  • A single _data_conn property on DataSet is the routing point for all data read/write operations
  • Returns the per-dataset raw data connection when split is enabled, otherwise falls back to self.conn (main DB)
  • All write paths (add_results, _BackgroundWriter) and read paths (get_parameter_data, DataSetCacheWithDBBackend, number_of_results, __len__) go through this property
  • Zero changes to public DataSet API — all existing methods work identically
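
The routing described above can be illustrated with a minimal sketch. `RoutedDataSet`, `add_result`, and the single-column `results` table are hypothetical stand-ins used only for illustration; they are not the QCoDeS `DataSet` implementation from this PR.

```python
import sqlite3


class RoutedDataSet:
    """Toy stand-in for DataSet (hypothetical, not the QCoDeS class)."""

    def __init__(self, conn, raw_data_conn=None):
        self.conn = conn                     # main DB: metadata
        self._raw_data_conn = raw_data_conn  # per-dataset DB, or None if split is off

    @property
    def _data_conn(self):
        # Single routing point: raw data DB when split is enabled, else the main DB.
        if self._raw_data_conn is not None:
            return self._raw_data_conn
        return self.conn

    def add_result(self, x):
        # Writes go through the routing property, inside a transaction.
        with self._data_conn as c:
            c.execute('INSERT INTO "results" ("x") VALUES (?)', (x,))

    def __len__(self):
        # Reads go through the same property.
        cur = self._data_conn.execute('SELECT COUNT(*) FROM "results"')
        return cur.fetchone()[0]


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE "results" ("x" REAL)')
    return conn


main_db, raw_db = make_db(), make_db()
split = RoutedDataSet(main_db, raw_data_conn=raw_db)
split.add_result(1.0)
rows_in_main = main_db.execute('SELECT COUNT(*) FROM "results"').fetchone()[0]
```

With split enabled, the row lands in the raw data DB and the main DB table stays empty; without a raw data connection the same methods operate on `self.conn`.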

Config: follows existing export path pattern

  • Two new config options in dataset section of qcodesrc.json:
    • raw_data_to_separate_db (bool, default false)
    • raw_data_path (string, default "{db_location}")
  • Reuses _expand_export_path() from export_config.py for path expansion (e.g., ~/experiments.db → ~/experiments_db/)
  • Pattern mirrors the existing export_path / export_type config approach
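
As a rough illustration of the two options and the `{db_location}` expansion, here is a sketch under stated assumptions. `expand_raw_data_path` is a hypothetical stand-in, not the real `_expand_export_path()`; it only reproduces the example mapping given above (the database file name with `.` replaced by `_`).

```python
from pathlib import Path

# The two new options as they would appear in the dataset section
# (shown here as a plain dict; values are examples).
dataset_config = {
    "raw_data_to_separate_db": True,
    "raw_data_path": "{db_location}",
}


def expand_raw_data_path(template: str, db_location: str) -> Path:
    """Hypothetical expansion: experiments.db -> experiments_db next to the DB."""
    db_path = Path(db_location).expanduser()
    db_folder = db_path.with_name(db_path.name.replace(".", "_"))
    return Path(template.replace("{db_location}", str(db_folder))).expanduser()


raw_folder = expand_raw_data_path(
    dataset_config["raw_data_path"], "/data/experiments.db"
)
```

A template without the placeholder, such as `/mnt/raw`, passes through unchanged in this sketch.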

Per-dataset files: lightweight, GUID-named

  • Each file is named <guid>.db and contains only the results table + numpy type adapters
  • No QCoDeS metadata schema in per-dataset files — they are minimal
  • Path to per-dataset file is persisted in run metadata (raw_data_db_path dynamic column) for automatic reconnection on load_by_id()
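
A minimal sketch of what such a file could look like, assuming a single table literally named `results` and a placeholder GUID. `create_minimal_raw_db` is a hypothetical helper for illustration, not the `create_raw_data_db()` from this PR, and it omits the numpy type adapters.

```python
import sqlite3
import tempfile
from pathlib import Path


def create_minimal_raw_db(folder: Path, guid: str, columns: dict) -> Path:
    """Create <guid>.db holding only a results table (illustrative helper)."""
    path = folder / f"{guid}.db"
    # Quote identifiers so parameter names that are SQL keywords still work.
    cols = ", ".join(f'"{name}" {sqltype}' for name, sqltype in columns.items())
    conn = sqlite3.connect(path)
    try:
        with conn:
            conn.execute(f'CREATE TABLE "results" (id INTEGER PRIMARY KEY, {cols})')
    finally:
        conn.close()
    return path


folder = Path(tempfile.mkdtemp())
guid = "00000000-0000-0000-0000-000000000000"  # placeholder, not a real GUID
raw_path = create_minimal_raw_db(folder, guid, {"select": "REAL", "y": "REAL"})

conn = sqlite3.connect(raw_path)
tables = [r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]
conn.close()
```

The file contains no `runs` or `experiments` tables, which is what keeps it minimal.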

Empty results table kept in main DB

  • We considered removing the results table from the main DB entirely, but this would break:
    • _Subscriber trigger creation (SQLite triggers require the table to exist)
    • __len__ / number_of_results before the dataset is started (when the raw data DB doesn't yet exist)
    • Low-level query functions that inspect table structure via PRAGMA TABLE_INFO
    • The _check_if_table_found logic used in _get_datasetprotocol_from_guid to distinguish DataSet vs DataSetInMem
  • Decision: keep the empty table schema (column definitions, no rows) — negligible overhead, full backward compatibility
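
The first and third points can be checked directly against SQLite. The sketch below uses an in-memory database and illustrative table and trigger names; it is not QCoDeS code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
trigger_sql = (
    'CREATE TRIGGER "sub" AFTER INSERT ON "results" '
    "BEGIN SELECT 1; END"
)

# Without the table, trigger creation is rejected by SQLite.
try:
    conn.execute(trigger_sql)
    trigger_without_table = True
except sqlite3.OperationalError:
    trigger_without_table = False

# PRAGMA table_info on a missing table returns no rows at all.
columns_before = conn.execute('PRAGMA table_info("results")').fetchall()

# An empty table (schema only, no rows) is enough for both operations.
conn.execute('CREATE TABLE "results" (id INTEGER PRIMARY KEY, "x" REAL)')
conn.execute(trigger_sql)
column_names = [row[1] for row in conn.execute('PRAGMA table_info("results")')]
row_count = conn.execute('SELECT COUNT(*) FROM "results"').fetchone()[0]
```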

get_parameter_data bypass for raw data DB

  • The standard get_parameter_data() in queries.py calls get_rundescriber_from_result_table_name() which queries the runs table — this table doesn't exist in the raw data DB
  • Solution: when _raw_data_conn is set, bypass the top-level function and call get_shaped_parameter_data_for_one_paramtree() directly with the already-held rundescriber
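
The reason for the bypass can be sketched as follows. Both functions are hypothetical stand-ins (they are not the functions in queries.py), and the `results` table with `x` and `y` columns is illustrative: a lookup that needs the runs table fails on a raw data DB, while a read that relies on an already-held description of the parameters works.

```python
import sqlite3

# A raw data DB in this sketch: results table only, no runs table.
raw_db = sqlite3.connect(":memory:")
raw_db.execute('CREATE TABLE "results" ("x" REAL, "y" REAL)')
raw_db.executemany('INSERT INTO "results" VALUES (?, ?)', [(0.0, 1.0), (1.0, 2.0)])


def describe_from_runs_table(conn, table_name):
    """Stand-in for a metadata lookup that queries the runs table."""
    row = conn.execute(
        "SELECT run_description FROM runs WHERE result_table_name = ?",
        (table_name,),
    ).fetchone()
    return row[0]


def read_with_held_description(conn, param_tree):
    """Stand-in for reading one parameter tree without touching metadata."""
    cols = ", ".join(f'"{name}"' for name in param_tree)
    rows = conn.execute(f'SELECT {cols} FROM "results" ORDER BY rowid').fetchall()
    return {name: [r[i] for r in rows] for i, name in enumerate(param_tree)}


try:
    describe_from_runs_table(raw_db, "results")
    lookup_failed = False
except sqlite3.OperationalError:  # no such table: runs
    lookup_failed = True

data = read_with_held_description(raw_db, ("y", "x"))
```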

Subscriber triggers on data connection

  • _Subscriber.__init__ creates SQL triggers for real-time data callbacks
  • Changed to use _data_conn instead of self.conn so triggers fire on the correct DB where data is actually inserted
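
A small sketch of why the trigger has to live where the rows are inserted. `attach_subscriber` and the `notify_subscriber` SQL function are illustrative names, not the `_Subscriber` implementation; the sketch only shows that a trigger in one SQLite database never sees inserts into another.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE "results" ("x" REAL)')
    return conn


def attach_subscriber(conn, callback):
    # Register a Python callback as an SQL function on this connection,
    # then call it from an AFTER INSERT trigger on the results table.
    conn.create_function("notify_subscriber", 1, callback)
    conn.execute(
        'CREATE TRIGGER "sub_insert" AFTER INSERT ON "results" '
        'BEGIN SELECT notify_subscriber(NEW."x"); END'
    )


main_db, raw_db = make_db(), make_db()
seen_on_main, seen_on_raw = [], []
attach_subscriber(main_db, seen_on_main.append)
attach_subscriber(raw_db, seen_on_raw.append)

# With split storage the insert happens in the raw data DB only.
with raw_db:
    raw_db.execute('INSERT INTO "results" ("x") VALUES (?)', (1.5,))
```

Only the callback attached to the raw data connection is invoked; the one attached to the main DB stays silent.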

BackgroundWriter support

  • _BackgroundWriter maintains a _raw_data_conns dict keyed by file path
  • Queue items include optional raw_data_path key for routing
  • Connections are lazily created and reused across datasets sharing the same raw data DB path
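
The lazy, path-keyed connection handling might look roughly like this. `RawConnCache` and `conn_for` are hypothetical names and the queue item is reduced to a plain dict; this is a sketch of the idea, not the `_BackgroundWriter` code.

```python
import sqlite3
import tempfile
from pathlib import Path


class RawConnCache:
    """Illustrative cache of raw data connections keyed by file path."""

    def __init__(self, main_conn):
        self.main_conn = main_conn
        self._raw_data_conns = {}

    def conn_for(self, item: dict):
        # Items without a raw_data_path key are written to the main DB.
        path = item.get("raw_data_path")
        if path is None:
            return self.main_conn
        # Open the connection on first use, then reuse it.
        if path not in self._raw_data_conns:
            self._raw_data_conns[path] = sqlite3.connect(path)
        return self._raw_data_conns[path]

    def close(self):
        for conn in self._raw_data_conns.values():
            conn.close()
        self._raw_data_conns.clear()


folder = Path(tempfile.mkdtemp())
main_conn = sqlite3.connect(":memory:")
cache = RawConnCache(main_conn)
raw_file = str(folder / "example.db")
first = cache.conn_for({"raw_data_path": raw_file})
second = cache.conn_for({"raw_data_path": raw_file})
fallback = cache.conn_for({})
```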

Files Changed

| File | Type | Description |
| --- | --- | --- |
| src/qcodes/dataset/raw_data_storage.py | New | Helper module: is_raw_data_storage_enabled(), get_raw_data_folder(), get_raw_data_db_path(), connect_to_raw_data_db(), create_raw_data_db() |
| tests/dataset/test_raw_data_storage.py | New | 19 tests (7 unit + 10 integration + 2 non-interference) |
| src/qcodes/dataset/data_set.py | Modified | _data_conn property, _raw_data_conn attribute, routing in __init__, _perform_start_actions, add_results, get_parameter_data, number_of_results, __len__, _BackgroundWriter, _get_datasetprotocol_from_guid |
| src/qcodes/dataset/data_set_cache.py | Modified | load_data_from_db() uses _data_conn |
| src/qcodes/dataset/subscriber.py | Modified | Trigger creation uses _data_conn |
| src/qcodes/configuration/qcodesrc.json | Modified | Added config defaults |
| src/qcodes/configuration/qcodesrc_schema.json | Modified | Added config schema |
| docs/dataset/introduction.rst | Modified | New "Split Raw Data Storage" section |
| docs/dataset/dataset_design.rst | Modified | Design notes on split storage |
| docs/examples/DataSet/Database.ipynb | Modified | Config and usage documentation |

Verification

Own tests: 19/19 pass

  • Unit tests for all helper functions (config reading, path generation, DB creation)
  • Integration tests: write/read round-trip, cache, load_by_id, multiple datasets, background writer
  • Non-interference: feature disabled → data goes to main DB as before

Full test suite with feature globally enabled: 1031 passed, 5 failed, 35 skipped

The 5 failures are all expected/explained:

| Test | Reason |
| --- | --- |
| test_raw_data_conn_is_none | Own test for disabled mode; overridden by global enable |
| test_data_in_main_db | Own test for disabled mode; overridden by global enable |
| test_get_parameter_data | Low-level query test calls queries.get_parameter_data(ds.conn, ...) directly, bypassing DataSet |
| test_get_parameter_data_independent_parameters | Same as above |
| test_get_run_attributes | Metadata assertion expects exact {'foo': 'bar'} but split adds raw_data_db_path |

Full test suite with feature disabled (default): all pass unchanged

Code quality

  • Ruff lint: ✅ all checks passed
  • Pyright type check: ✅ 0 errors, 0 warnings

Add optional configuration to write results-table data into individual
per-dataset SQLite files (<guid>.db) while keeping all metadata in the
main database. This keeps the main DB lightweight as it grows.

Config options (dataset section of qcodesrc.json):
  - raw_data_to_separate_db (bool, default false)
  - raw_data_path (string, default '{db_location}')

Implementation:
  - New module: qcodes.dataset.raw_data_storage (helper functions)
  - DataSet._data_conn property routes reads/writes to correct DB
  - BackgroundWriter supports per-dataset raw data connections
  - Subscriber triggers created on data connection for compatibility
  - Per-dataset DB path persisted in run metadata for auto-reconnect
  - Empty results table schema kept in main DB for compatibility
  - 19 new tests, all existing tests pass unchanged
  - Documentation added to dataset intro, design docs, and Database notebook

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@codecov

codecov Bot commented Jun 12, 2026

Codecov Report

❌ Patch coverage is 91.79104% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.11%. Comparing base (c81021e) to head (d897d95).
⚠️ Report is 63 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/qcodes/dataset/data_set.py | 87.50% | 6 Missing ⚠️ |
| src/qcodes/dataset/_raw_data_storage.py | 94.04% | 5 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8219      +/-   ##
==========================================
- Coverage   71.02%   70.11%   -0.91%     
==========================================
  Files         301      302       +1     
  Lines       31888    32014     +126     
==========================================
- Hits        22647    22447     -200     
- Misses       9241     9567     +326     

Comment thread src/qcodes/dataset/data_set.py Outdated

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Comment thread src/qcodes/dataset/_raw_data_storage.py

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Comment thread src/qcodes/dataset/_raw_data_storage.py
Comment thread src/qcodes/dataset/data_set.py Outdated
Comment thread tests/dataset/test_raw_data_storage.py
Mikhail Astafev and others added 3 commits June 30, 2026 10:04
- Rename raw_data_storage.py to _raw_data_storage.py (private module)
- Raise FileNotFoundError when per-dataset raw data file is missing
  on load instead of silently falling back to empty main DB table
- Quote column names in create_raw_data_db to handle SQL keyword names
- Close both _raw_data_conn and conn in all tests to prevent leaked
  file handles
- Add test_missing_raw_data_file_raises to verify the error behavior
- Fix import ordering (ruff I001) and raw regex pattern (ruff RUF043)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Implement a helper function that updates the raw_data_db_path metadata
in the main database when individual raw data SQLite files have been
moved to a new location. This mirrors the existing pattern used for
exported netCDF files.

The function:
- Scans all datasets in the main DB that have raw_data_db_path metadata
- For each, checks if the corresponding .db file exists in the new folder
- Updates the stored path in the metadata to point to the new location
- Reports which datasets were updated and which were skipped (file missing)

Exposed as qcodes.dataset.update_raw_data_paths() for user convenience.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@astafan8

Contributor Author

Added: update_raw_data_paths helper

Added a utility function for users who move their per-dataset raw data files to a new location. This mirrors the pattern used for exported netCDF files.

Usage:

```python
from qcodes.dataset import update_raw_data_paths

update_raw_data_paths(
    db_path="/path/to/main_database.db",
    new_raw_data_folder="/new/location/of/raw_files/",
)
```

The function:

  • Scans all datasets in the main DB that have raw_data_db_path metadata
  • For each, checks if the corresponding .db file exists in the new folder
  • Updates the stored path to point to the new location
  • Logs warnings for any files not found in the new folder
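
The core of these steps can be sketched without QCoDeS. `plan_path_updates` is a hypothetical helper that works on a plain mapping of run id to stored path; the actual function reads and writes this metadata in the main database, which is not shown here.

```python
import tempfile
from pathlib import Path


def plan_path_updates(stored_paths: dict, new_folder: Path):
    """Return (updated, skipped) for a mapping of run_id -> stored raw data path."""
    updated, skipped = {}, []
    for run_id, old_path in stored_paths.items():
        # Keep the file name (<guid>.db) and look for it in the new folder.
        candidate = new_folder / Path(old_path).name
        if candidate.is_file():
            updated[run_id] = str(candidate)
        else:
            skipped.append(run_id)  # file not found in the new folder
    return updated, skipped


new_folder = Path(tempfile.mkdtemp())
(new_folder / "guid-a.db").touch()  # only one of the two files was moved

stored = {1: "/old/place/guid-a.db", 2: "/old/place/guid-b.db"}
updated, skipped = plan_path_updates(stored, new_folder)
```

Run 1 gets a new path because its file exists in the new folder; run 2 is reported as skipped.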

4 tests added, documentation updated in introduction.rst and Database.ipynb.

3 participants