make downloader more robust #666

Open
jshook wants to merge 1 commit into main from download_defense

@jshook jshook commented May 14, 2026

Contributor

This does a couple of things to make downloads more defensive:

  1. Recognizes when multiple (dataset,facet) configs point to the same file and
    • for query vectors, allows it
    • for base vectors, throws an error (for now, this is not supported without windowed reads)
    • for ground truth vectors, throws an error
  2. Recognizes when different instances of the same logical (dataset,facet) config point at potentially different paths.
    • This is a hard error. Before, this was a first-finder-wins scenario, which is non-deterministic.
    • For scenarios where a single file is shared, the downloader is now safe under concurrent use.
    • Before, it was possible for two different threads to use the same physical output buffer, but only with S3.
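
To make the two checks above concrete, here is a minimal sketch of how such validation could look. It is not the code from this PR: `DownloadPlanValidator`, `Source`, and `validate` are hypothetical names, and the only details taken from the diff excerpt in this thread are the `"base"` and `"gt"` facet literals and the `owners.size() <= 1` shortcut.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: these class and method names are assumptions, not the PR's code.
final class DownloadPlanValidator {

    /** One configured source: a (dataset, facet) pair and the file it resolves to. */
    static final class Source {
        final String dataset;
        final String facet;
        final String path;

        Source(String dataset, String facet, String path) {
            this.dataset = dataset;
            this.facet = facet;
            this.path = path;
        }

        /** Logical identity of this config, independent of where it points. */
        List<String> key() {
            return Arrays.asList(dataset, facet);
        }
    }

    /** Throws IllegalStateException for either conflict described in the PR summary. */
    static void validate(List<Source> sources) {
        Map<List<String>, String> pathByKey = new LinkedHashMap<>();
        Map<String, List<Source>> ownersByPath = new LinkedHashMap<>();

        for (Source s : sources) {
            String previous = pathByKey.putIfAbsent(s.key(), s.path);
            if (previous == null) {
                // First sighting of this logical config: record it as an owner of the file.
                ownersByPath.computeIfAbsent(s.path, p -> new ArrayList<>()).add(s);
            } else if (!previous.equals(s.path)) {
                // Check 2: the same logical config resolves to different paths.
                throw new IllegalStateException("Conflicting paths for " + s.key()
                        + ": " + previous + " vs " + s.path);
            }
        }

        for (Map.Entry<String, List<Source>> entry : ownersByPath.entrySet()) {
            List<Source> owners = entry.getValue();
            if (owners.size() <= 1) continue;
            // Check 1: distinct configs may share a file only if none is base or gt.
            boolean involvesBaseOrGt = owners.stream()
                    .anyMatch(o -> "base".equals(o.facet) || "gt".equals(o.facet));
            if (involvesBaseOrGt) {
                throw new IllegalStateException("File " + entry.getKey()
                        + " is shared by " + owners.size() + " configs including base or gt");
            }
        }
    }
}
```

Running both checks before any transfer starts means a conflict is reported once, up front, rather than depending on which config happens to be resolved first.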

These are defensive changes which simply cause these types of errors to be surfaced. The bias here is to avoid surprises or non-deterministic results.
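
For the shared-file case, one common way to keep two threads from writing through the same output buffer is to let exactly one thread perform each transfer and have the others wait for its result. The sketch below illustrates that general pattern only. `SharedDownloads` and its `fetcher` function are hypothetical, and the thread does not show how the PR actually achieves this.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical sketch of per-path download de-duplication; not the PR's implementation.
final class SharedDownloads<T> {
    private final ConcurrentMap<String, CompletableFuture<T>> byPath = new ConcurrentHashMap<>();
    private final Function<String, T> fetcher; // assumed: performs the actual transfer

    SharedDownloads(Function<String, T> fetcher) {
        this.fetcher = fetcher;
    }

    /** Returns the result for remotePath, running the fetcher at most once per path. */
    T get(String remotePath) {
        CompletableFuture<T> mine = new CompletableFuture<>();
        CompletableFuture<T> existing = byPath.putIfAbsent(remotePath, mine);
        if (existing != null) {
            // Another thread owns this transfer; wait for it instead of starting a second one.
            return existing.join();
        }
        try {
            // Run the transfer outside any map operation so other paths are never blocked.
            mine.complete(fetcher.apply(remotePath));
        } catch (Throwable t) {
            mine.completeExceptionally(t);
            byPath.remove(remotePath, mine); // let a later call retry this path
        }
        // join() rethrows a failure wrapped in CompletionException.
        return mine.join();
    }
}
```

Because only the thread that wins `putIfAbsent` ever calls the fetcher for a given path, no two threads can be filling the same destination at once.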

@github-actions

Contributor

Before you submit for review:

  • Does your PR follow guidelines from CONTRIBUTIONS.md?
  • Did you summarize what this PR does clearly and concisely?
  • Did you include performance data for changes which may be performance impacting?
  • Did you include useful docs for any user-facing changes or features?
  • Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
  • Did you trigger and review regression testing results against the base branch via Run Bench Main?
  • Did you adhere to the code formatting guidelines (TBD)?
  • Did you group your changes for easy review, providing meaningful descriptions for each commit?
  • Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

@MarkWolters MarkWolters left a comment

Contributor

Approved, but I think we should try to refrain from hard-coding conventions in the future if possible (see comment).

    if (owners.size() <= 1) continue;

    boolean involvesBaseOrGt = owners.stream()
            .anyMatch(o -> "base".equals(o.facet) || "gt".equals(o.facet));

Contributor

Would it be possible to make these constants ("base", "gt") configurable? This works for our current naming standards, but hardcoding it here enforces a sort of invisible rule that's disconnected from the dataset creation process. I don't think it needs to change before merging, but it's something we might want to revisit for future flexibility.
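
As a rough illustration of what "configurable" could mean here, the set of facets that must not share a file could be passed in rather than written as literals. This is only a sketch: `SharedFilePolicy`, `defaults`, and `mayShare` are hypothetical names, and the defaults simply mirror the `"base"` and `"gt"` literals in the diff.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: a policy object replacing the hard-coded facet literals.
final class SharedFilePolicy {
    private final Set<String> exclusiveFacets;

    SharedFilePolicy(Collection<String> exclusiveFacets) {
        this.exclusiveFacets = Collections.unmodifiableSet(new HashSet<>(exclusiveFacets));
    }

    /** Defaults that mirror the literals in the current diff. */
    static SharedFilePolicy defaults() {
        return new SharedFilePolicy(Arrays.asList("base", "gt"));
    }

    /** True if configs with these facets may all point at one file. */
    boolean mayShare(Collection<String> ownerFacets) {
        if (ownerFacets.size() <= 1) return true;
        return ownerFacets.stream().noneMatch(exclusiveFacets::contains);
    }
}
```

With something like this, the check in the diff would reduce to a call such as `policy.mayShare(facetsOfOwners)`, and the facet list could come from whatever configuration the dataset tooling already uses.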

Contributor Author

I've had this come up in a couple of places before. The main issue is that if there is no standard, we have to wire the facets up everywhere, which lowers the utility of having an access layer at all.
What I have also done is allow all common "spellings" of a facet type to be recognized, to support cases which are unambiguous.

The same names are used canonically in upstream dataset creation (in some tools), with one favored as the generally prescribed name. We do need some type of naming consistency for facets that all the tools can agree on. Happy to make a change if we can address these other concerns too.
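
For readers unfamiliar with the "spellings" approach, a minimal sketch of alias normalization follows. Everything in it is an assumption made for illustration: the class `FacetNames`, the method `canonicalize`, and in particular the alias lists, which are invented and are not the spellings the project actually recognizes.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: the alias lists below are invented examples, not the project's.
final class FacetNames {
    private static final Map<String, String> CANONICAL = new HashMap<>();

    static {
        alias("base", "base", "base_vectors");
        alias("query", "query", "queries", "query_vectors");
        alias("gt", "gt", "ground_truth", "groundtruth");
    }

    private static void alias(String canonical, String... spellings) {
        for (String spelling : spellings) {
            CANONICAL.put(spelling, canonical);
        }
    }

    /** Maps a recognized spelling to its canonical facet name, or empty if unknown. */
    static Optional<String> canonicalize(String raw) {
        if (raw == null) return Optional.empty();
        String key = raw.trim().toLowerCase(Locale.ROOT).replace('-', '_');
        return Optional.ofNullable(CANONICAL.get(key));
    }
}
```

Returning an empty `Optional` for unknown names keeps the ambiguous cases visible to the caller instead of silently guessing, which fits the defensive intent of the PR.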
