Update for iprox downloads. by ypriverol · Pull Request #116 · PRIDE-Archive/pridepy

ypriverol · 2026-07-01T18:41:13Z

This pull request introduces significant improvements to the pridepy tool, focusing on enhanced download performance and expanded protocol support for bulk dataset retrieval, especially from iProX. The main changes add parallel file downloads, per-file HTTP range threading, and robust Aspera (fast transfer) support for iProX datasets, along with corresponding updates to CLI options and documentation.

Major feature additions and enhancements:

Parallel and segmented downloads:

Added support for downloading multiple files in parallel (--parallel-files/-w) and for using multiple threads per file (--threads/-t), allowing faster and more efficient bulk downloads. The total number of concurrent HTTP connections is capped for stability. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12]

Aspera protocol integration for iProX:

Implemented Aspera (--protocol aspera) as a fast transfer option for iProX datasets, including secure password handling and support for parallel Aspera transfers. The CLI and download logic now route iProX downloads via Aspera when requested and credentials are provided. [1] [2] [3] [4]

User interface and documentation updates:

CLI and usage documentation:

Updated CLI options to expose new parallelism and protocol flags, and expanded the documentation (docs/usage.md) with detailed instructions and examples for fast downloads using parallelism and Aspera. Also clarified iProX password handling for Aspera. [1] [2] [3] [4]
Updated repository enumeration documentation to clarify Aspera support and iProX credential requirements.

Internal API and codebase improvements:

Refactoring and extensibility:

Refactored internal APIs to propagate new parallelism and protocol parameters through client, provider, and transport layers, ensuring consistent handling and future extensibility. [1] [2] [3] [4] [5]
Added utility functions and improved error handling for parallel transfers and protocol selection. [1] [2]

These changes collectively make pridepy much more capable for power users and large-scale dataset retrieval, while keeping the tool user-friendly and robust.

Summary by CodeRabbit

New Features
- Added faster downloads with parallel file support and per-file segment threads.
- Added iProX Aspera download support, including credential prompts and environment-variable handling.
- Added a new download protocol option and support for preserving or flattening file paths.
Bug Fixes
- Improved handling of mixed-source ProteomeXchange downloads by skipping unsupported files when needed.
- Added safeguards to keep downloads inside the selected output folder and avoid duplicate or partial-file skips.
Documentation
- Expanded usage docs with examples and guidance for the new download options.

…ownload-px-raw-files (protocol/threads/parallel)

…-files Adds IproxProvider.aspera_download (ascp on port 33001, credentials via ASPERA_SCP_PASS env only) and routes ProteomeXchangeProvider through it when --protocol aspera is selected, gated on iProX-hosted URLs. Plumbs --iprox-user/--iprox-password (env: IPROX_USER/IPROX_ASPERA_PASSWORD) through Client.download_px_raw_files and the CLI. HTTP-parallel remains the default transport.

… host, honor -w) + concurrency cap + hidden password prompt

Faster iProX/ProteomeXchange downloads: expose -w parallel files, wire download-px-raw-files, opt-in iProX Aspera

coderabbitai · 2026-07-01T18:41:33Z

📝 Walkthrough

Walkthrough

This PR adds Aspera-based fast downloads for iProX datasets, introduces parallel_files, download_threads, and protocol parameters across the CLI, Client, and ProteomeXchangeProvider, clamps combined HTTP connection concurrency, and updates documentation and tests accordingly.

Changes

iProX Aspera and parallel download plumbing

Layer / File(s)	Summary
Aspera transfer implementation `pridepy/download/iprox.py`	Adds Aspera connection constants, `_ascp_binary()`, `_aspera_download_one()`, and `aspera_download()` for single and concurrent multi-file transfers via `ascp`, with credential validation, skip-if-exists logic, and failure aggregation.
ProteomeXchange routing and parameter propagation `pridepy/download/proteomexchange.py`	`download_from_accession_or_url` gains `parallel_files`, `download_threads`, `protocol`, and iProX credential parameters; routes to `IproxProvider.aspera_download` for iProX-hosted URLs when `protocol="aspera"`, otherwise forwards concurrency params to `download_files`.
Client API forwarding `pridepy/download/client.py`	`Client.download_px_raw_files` signature extended and delegates new parameters to `ProteomeXchangeProvider`.
HTTP connection concurrency cap `pridepy/download/transport.py`, `pridepy/tests/test_download_resilience.py`	Adds `MAX_TOTAL_HTTP_CONNECTIONS` and clamps `download_threads` when combined with `parallel_files` would exceed it, with a test validating the clamp and warning.
CLI options and wiring `pridepy/pridepy.py`, `pridepy/tests/test_cli_flatten.py`	Adds `--parallel-files/-w` across multiple download commands, refactors `download-px-raw-files` with `protocol`, `download_threads`, `--iprox-user`, and password env/prompt handling; tests updated to assert forwarded kwargs.
Aspera unit tests `pridepy/tests/test_iprox_aspera.py`	New test suite covering ascp command/env construction, credential errors, failure handling, skip logic, traversal safety, and ProteomeXchange aspera routing/flattening.
Documentation `docs/usage.md`	Documents new fast-download options, parallel/segmented download examples, and iProX Aspera usage with password handling guidance.

Estimated code review effort: 4 (Complex) | ~60 minutes

Possibly related PRs

PRIDE-Archive/pridepy#106: Builds directly on the download architecture for MassIVE/JPOST/iProX/ProteomeXchange that this PR extends with Aspera transfers and concurrency plumbing.
PRIDE-Archive/pridepy#93: Both propagate parallel_files through the CLI to Files.* download entrypoints on the same integration surface.
PRIDE-Archive/pridepy#110: Both modify HTTP threading in transport.py, extending/propagating download_threads through download_http_urls.

Suggested reviewers: chakrabandla

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Title check	❓ Inconclusive	The title is related to the change, but it is too vague to convey the main update in the PR.	Use a more specific title like "Add parallel and Aspera support for iProX downloads".

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch dev

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

Copilot

Pull request overview

This PR upgrades pridepy’s bulk-download capabilities by adding across-file parallelism (--parallel-files) and integrating iProX Aspera transfers for faster ProteomeXchange (PX) downloads, with corresponding CLI, transport-layer, and documentation updates.

Changes:

Add --parallel-files/-w plumbing through CLI → client → provider → transport, plus a combined HTTP connection cap.
Add iProX Aspera support (ascp command construction, credential handling, and PX routing when --protocol aspera is selected).
Expand/adjust tests and usage documentation for parallelism and Aspera.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
pridepy/tests/test_iprox_aspera.py	New tests for iProX Aspera command/env construction, routing, flattening, and safety checks.
pridepy/tests/test_download_resilience.py	Adds a regression test ensuring the combined connection cap clamps per-file threads.
pridepy/tests/test_cli_flatten.py	Updates CLI tests to assert `parallel_files` default/flag behavior across commands.
pridepy/pridepy.py	Adds CLI flags for `--parallel-files`, PX `--protocol`, and iProX credential handling/prompting.
pridepy/download/transport.py	Introduces `MAX_TOTAL_HTTP_CONNECTIONS` and clamps HTTP concurrency.
pridepy/download/proteomexchange.py	Routes PX downloads to iProX Aspera when requested; supports flattening in that branch.
pridepy/download/iprox.py	Implements iProX Aspera transfer execution (including parallel subprocess runs) and safe destination handling.
pridepy/download/client.py	Propagates new PX download parameters through the client API.
docs/usage.md	Documents new PX CLI options, parallelism guidance, and iProX Aspera password handling.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

    failed: List[str] = []
    workers = max(1, min(parallel_files, len(http_urls)))
+    if workers * download_threads > MAX_TOTAL_HTTP_CONNECTIONS:
+        clamped_threads = max(1, MAX_TOTAL_HTTP_CONNECTIONS // workers)
+        logging.warning(
+            "parallel_files (%d) x download_threads (%d) = %d exceeds the "
+            "combined connection cap of %d; reducing download_threads to %d "
+            "to keep peak HTTP connections bounded.",
+            workers,
+            download_threads,
+            workers * download_threads,
+            MAX_TOTAL_HTTP_CONNECTIONS,
+            clamped_threads,
+        )
+        download_threads = clamped_threads
    if workers > 1:


+    password = os.environ.get("IPROX_ASPERA_PASSWORD")
+    if protocol.lower() == "aspera" and not password:
+        password = click.prompt("iProX Aspera password", hide_input=True)
+


qodo-code-review · 2026-07-01T18:45:18Z

PR Summary by Qodo

Speed up PX/iProX downloads with parallelism and opt-in Aspera support

✨ Enhancement 🐞 Bug fix 🧪 Tests 📝 Documentation 🕐 40+ Minutes

AI Description

• Add across-file parallelism (-w) and per-file HTTP Range threading (-t) for bulk downloads.
• Support iProX Aspera transfers for PX datasets when --protocol aspera and credentials are
 provided.
• Clamp peak HTTP connections and harden Aspera path/host handling; document and test behavior.

Diagram

graph TD
  A["CLI (pridepy.py)"] --> B["Client.download_px_raw_files"] --> C["ProteomeXchangeProvider"]
  C --> D["HTTP/FTP transport"] --> F{{"download.iprox.org (HTTP)"}}
  C --> E["IproxProvider.aspera_download"] --> G{{"ascp subprocess (Aspera)"}}
  D --> H["Concurrency cap (MAX_TOTAL_HTTP_CONNECTIONS)"]

High-Level Assessment

The following are alternative approaches to this PR:

1. Async IO for HTTP downloads (aiohttp)

➕ Better scalability than threads for many small files
➕ Easier global concurrency limiting and cancellation
➖ Higher implementation complexity and refactor cost across transport/provider layers
➖ Multipart Range download and retry logic must be reimplemented carefully

2. Delegate HTTP downloads to aria2c

➕ Mature segmented + parallel downloads with robust retry/resume
➕ Built-in connection limiting and bandwidth controls
➖ Adds an external runtime dependency similar to ascp
➖ Harder to integrate with current progress/reporting and per-provider path logic

3. Introduce a formal Transport strategy interface

➕ Clear separation between protocol selection and provider listing logic
➕ Easier to add future protocols consistently (aspera/globus/s3)
➖ More up-front refactor with little immediate user-visible benefit
➖ Risk of churn across multiple providers and tests

Recommendation: The PR’s approach is appropriate for pridepy’s current architecture: it reuses existing HTTP Range logic, adds a simple global connection cap for safety, and uses the canonical ascp subprocess for Aspera while keeping credentials out of argv. Consider a transport strategy interface later if more protocol-specific behaviors continue to accumulate in providers.

Files changed (9) +805 / -10

Enhancement (4) +306 / -4

client.pyPlumb PX download protocol and concurrency options through Client facade +14/-1

Plumb PX download protocol and concurrency options through Client facade

• Extends Client.download_px_raw_files signature to accept protocol, parallel_files, and per-file download_threads plus optional iProX credentials. Forwards these parameters to ProteomeXchangeProvider for end-to-end handling.

pridepy/download/client.py

iprox.pyAdd iProX Aspera downloader with safe paths, skipping logic, and parallel transfers +134/-0

Add iProX Aspera downloader with safe paths, skipping logic, and parallel transfers

• Introduces an ascp-based Aspera download path for iProX URLs using ASPERA_SCP_PASS (env) and optional parallel_files subprocess concurrency. Uses _safe_join to prevent path traversal and improves skip-if-downloaded checks to avoid directory/zero-byte false skips.

pridepy/download/iprox.py

proteomexchange.pyRoute PX downloads to iProX Aspera when requested and validate host/paths +60/-3

Route PX downloads to iProX Aspera when requested and validate host/paths

• Extends download_from_accession_or_url to accept protocol, concurrency, and iProX credentials. When protocol=aspera, filters exact download.iprox.org hosts, warns on mixed-host datasets, honors flattening via flatten_relative_paths, and delegates to IproxProvider.aspera_download.

pridepy/download/proteomexchange.py

pridepy.pyExpose -w parallel files and PX protocol/credential options in CLI +98/-0

Expose -w parallel files and PX protocol/credential options in CLI

• Adds -w/--parallel-files to multiple download commands and introduces PX-specific options for --protocol, -t/--threads, and --iprox-user. Implements secure password prompting for iProX Aspera when IPROX_ASPERA_PASSWORD is not set.

pridepy/pridepy.py

Bug fix (1) +18 / -0

transport.pyClamp total HTTP connections for parallel files × per-file threads +18/-0

Clamp total HTTP connections for parallel files × per-file threads
• Adds MAX_TOTAL_HTTP_CONNECTIONS and clamps download_threads when parallel_files × download_threads would exceed the cap. Emits a warning explaining the reduction to keep peak HTTP connections bounded.
pridepy/download/transport.py

Tests (3) +428 / -5

test_cli_flatten.pyAdd CLI tests for -w parallel_files and PX protocol/threads plumbing +95/-5

Add CLI tests for -w parallel_files and PX protocol/threads plumbing

• Updates existing CLI flatten tests to assert parallel_files default propagation and adds new cases to validate -w behavior across commands. Adds a PX CLI test asserting protocol, threads, and parallel_files are forwarded correctly.

pridepy/tests/test_cli_flatten.py

test_download_resilience.pyTest connection-cap clamping for parallel HTTP Range downloads +33/-0

Test connection-cap clamping for parallel HTTP Range downloads
• Adds a regression test asserting parallel_files × download_threads is clamped to MAX_TOTAL_HTTP_CONNECTIONS and that a warning is logged. Verifies clamped per-file thread count is consistently applied.
pridepy/tests/test_download_resilience.py

test_iprox_aspera.pyAdd unit tests for iProX Aspera command/env behavior and PX routing +300/-0

Add unit tests for iProX Aspera command/env behavior and PX routing

• Introduces comprehensive tests for Aspera argv construction, env-only password handling, skip logic edge cases, traversal protection, parallel failure aggregation, and ProteomeXchangeProvider routing with exact-host validation and flattening behavior.

pridepy/tests/test_iprox_aspera.py

Documentation (1) +53 / -1

usage.mdDocument PX parallelism flags and iProX Aspera credential handling +53/-1

Document PX parallelism flags and iProX Aspera credential handling

• Adds CLI option documentation for --protocol, -w/--parallel-files, -t/--threads, and --preserve-structure. Includes usage examples for faster downloads and clarifies iProX Aspera password handling via env var or secure prompt.

docs/usage.md

coderabbitai

Actionable comments posted: 7

🧹 Nitpick comments (5)

docs/usage.md (2)
253-257: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Consider noting the internal concurrency cap.

transport.download_http_urls clamps download_threads when parallel_files × download_threads exceeds an internal cap (see MAX_TOTAL_HTTP_CONNECTIONS), so actual concurrency won't scale linearly past that point. A brief caveat here would prevent user confusion when e.g. -w 32 -t 32 doesn't yield 1024 real connections.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/usage.md` around lines 253 - 257, Update the “Fast downloads” section in
usage.md to mention the internal concurrency limit enforced by
transport.download_http_urls. Note that download_threads is clamped when
parallel_files × download_threads exceeds MAX_TOTAL_HTTP_CONNECTIONS, so actual
concurrency may stop scaling linearly; keep the guidance brief and tie it to the
existing parallel_files × threads explanation.
242-242: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Protocol table overstates globus/s3 support for download-px-raw-files.

For this command, only ftp and aspera are actually implemented — base.Provider.download_files (used for the non-aspera path) just warns and falls through to whatever URL scheme is in the PX XML if globus/s3 is requested; there's no genuine multi-protocol fallback here (unlike PRIDE's _protocol_sequence). Consider clarifying that globus/s3 aren't meaningfully supported for PX/iProX downloads to avoid confusing users.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/usage.md` at line 242, Clarify the `download-px-raw-files` protocol
documentation so it does not imply real `globus` or `s3` support. Update the
protocol table entry in `docs/usage.md` to reflect that the command only
meaningfully supports `ftp` and `aspera`, and note that the non-aspera path in
`base.Provider.download_files` does not implement a true fallback sequence for
other protocols.
pridepy/pridepy.py (1)
66-74: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Duplicate --parallel-files/-w option definition across 5 commands.

Same default, range, and help text repeated verbatim. Consider extracting a shared decorator (e.g. parallel_files_option = click.option("-w", "--parallel-files", "parallel_files", default=1, type=click.IntRange(1, 32), help=...)) and applying it via @parallel_files_option at each call site, so future tweaks to bounds/wording only need one edit.

Also applies to: 179-187, 377-385, 681-689, 783-791
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pridepy/pridepy.py` around lines 66 - 74, The same `--parallel-files/-w`
option is duplicated across multiple commands, so update `pridepy.py` to define
a shared `click.option` decorator (for example a reusable
`parallel_files_option`) with the common default, `IntRange`, and help text,
then apply it to each command that currently declares the option. Keep the
existing `parallel_files` parameter name and ensure all affected command
functions use the shared decorator so future changes only need one edit.
pridepy/download/proteomexchange.py (1)
199-238: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

iProX Aspera path has no bandwidth-throttling control.

IproxProvider.aspera_download accepts maximum_bandwidth (default "100M"), but neither download_from_accession_or_url nor Client.download_px_raw_files nor the download-px-raw-files CLI expose a way to override it — every PX-driven iProX Aspera transfer is capped at 100M regardless of user intent. Given Aspera is specifically pitched for "very large bulk transfers," bandwidth control is a reasonable expectation.
♻️ Sketch of the plumbing needed
     def download_from_accession_or_url(
         self,
         px_id_or_url: str,
         output_folder: str,
         skip_if_downloaded_already: bool = True,
         flatten: bool = True,
         parallel_files: int = 1,
         download_threads: int = 1,
         protocol: str = "ftp",
+        aspera_maximum_bandwidth: str = "100M",
         iprox_user: Optional[str] = None,
         iprox_password: Optional[str] = None,
     ) -> None:
...
             IproxProvider.aspera_download(
                 urls=iprox_urls,
                 output_folder=output_folder,
                 relative_paths=dest_rels,
                 user=iprox_user,
                 password=iprox_password,
+                maximum_bandwidth=aspera_maximum_bandwidth,
                 skip_if_downloaded_already=skip_if_downloaded_already,
                 parallel_files=parallel_files,
             )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pridepy/download/proteomexchange.py` around lines 199 - 238, The iProX Aspera
download path currently hardcodes the default bandwidth limit because
`download_from_accession_or_url` and the PX download plumbing never pass a
bandwidth value through to `IproxProvider.aspera_download`. Add a
`maximum_bandwidth` option to the PX download entry points (including
`Client.download_px_raw_files` and the `download-px-raw-files` CLI), thread it
through `download_from_accession_or_url`, and pass it into
`IproxProvider.aspera_download` so users can override the default 100M cap; keep
the existing default unchanged for backward compatibility.
pridepy/tests/test_download_resilience.py (1)
250-282: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Good coverage of the intended clamp path; consider adding a case where workers alone exceeds the cap.

This test only exercises parallel_files=32 with 3 URLs (workers=3), which stays well under MAX_TOTAL_HTTP_CONNECTIONS. It doesn't cover the scenario where len(http_urls) and parallel_files are both large enough that workers itself exceeds the cap (see the corresponding comment on transport.py lines 852-865) — in that case download_threads floors to 1 but total connections can still exceed 64. Adding a test with e.g. 100 URLs and parallel_files=100 would catch that gap once fixed.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pridepy/tests/test_download_resilience.py` around lines 250 - 282, The
current test only verifies the clamp when `parallel_files` is high but the
effective `workers` count stays below the connection cap. Add a second assertion
in `test_download_http_urls_clamps_combined_connection_cap` that uses many URLs
and a large `parallel_files` value so `workers` itself exceeds
`transport.MAX_TOTAL_HTTP_CONNECTIONS`, and verify `download_http_urls` clamps
total connections appropriately (not just per-thread `download_threads`) while
still emitting the warning.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pridepy/download/iprox.py`:
- Around line 82-97: The skip logic in `download`/`cls.download` treats any
non-empty `dest` as complete, which can incorrectly skip a failed partial Aspera
transfer. Update the `subprocess.run(argv, check=True, env=env)` flow so `ascp`
writes to a temporary/partial target and only promotes it to `dest` after
success, or add a remote-size/complete-file validation before the
`os.path.isfile(dest)` skip check. Keep the fix localized around the
`skip_if_downloaded_already` block and the Aspera command construction.
- Around line 74-75: Validate the input in IproxProvider.aspera_download()
before constructing the Aspera source path: ensure the parsed URL has the
expected iProX scheme/host, and reject any path containing traversal segments
like “..”. Update the logic around urlparse(url).path and the source string so
only normalized paths under the intended /data/iprox/<accession> subtree are
accepted before calling ascp.
- Around line 134-136: The worker count used before the ThreadPoolExecutor/ascp
launch is only bounded by len(urls), so user-controlled parallel_files can still
create too many subprocesses. Add a small provider-level maximum in the worker
calculation around the ThreadPoolExecutor block in pridepy/download/iprox.py,
clamp workers to that cap, and emit a log when the requested parallel_files
value is reduced.

In `@pridepy/download/transport.py`:
- Around line 852-865: The connection cap logic in Client.download_http_urls
only clamps download_threads, so callers can still exceed
MAX_TOTAL_HTTP_CONNECTIONS when workers is very large. Update the worker
calculation near workers and clamped_threads to bound workers as well (or
otherwise clamp the workers x download_threads product from both sides) before
starting downloads, keeping the final total HTTP connections within
MAX_TOTAL_HTTP_CONNECTIONS.

In `@pridepy/pridepy.py`:
- Around line 404-421: The password prompt path in download_px_raw_files is
running even when iprox_user was never provided. Update the aspera branch in
download_px_raw_files to validate iprox_user before calling click.prompt, and
fail fast with the existing credential error path if it is missing. Keep the fix
localized to the download_px_raw_files flow and its handoff to
IproxProvider.aspera_download so the prompt only appears when a user can
actually authenticate.

In `@pridepy/tests/test_iprox_aspera.py`:
- Line 143: Clean up the Ruff warnings in the new tests by fixing the unused
tuple unpack from run.call_args in the test that inspects the mock invocation:
assign the unused value to a dummy name (for example in the assertion around
run.call_args). Also update the pytest.raises(..., match=...) pattern in the
affected test(s) to escape the literal dot so the regex matches the exact error
text instead of treating the dot as a wildcard.
- Around line 279-281: The test in test_iprox_aspera should assert the order of
relative_paths instead of converting to a set, since its mapping to urls is
positional and a swapped pairing would be missed. Update the assertion near
asp.call_args.kwargs["relative_paths"] to compare the flattened, de-duped
basenames in sequence, preserving the existing no-subdirectories expectation
while checking the exact order.

---

Nitpick comments:
In `@docs/usage.md`:
- Around line 253-257: Update the “Fast downloads” section in usage.md to
mention the internal concurrency limit enforced by transport.download_http_urls.
Note that download_threads is clamped when parallel_files × download_threads
exceeds MAX_TOTAL_HTTP_CONNECTIONS, so actual concurrency may stop scaling
linearly; keep the guidance brief and tie it to the existing parallel_files ×
threads explanation.
- Line 242: Clarify the `download-px-raw-files` protocol documentation so it
does not imply real `globus` or `s3` support. Update the protocol table entry in
`docs/usage.md` to reflect that the command only meaningfully supports `ftp` and
`aspera`, and note that the non-aspera path in `base.Provider.download_files`
does not implement a true fallback sequence for other protocols.

In `@pridepy/download/proteomexchange.py`:
- Around line 199-238: The iProX Aspera download path currently hardcodes the
default bandwidth limit because `download_from_accession_or_url` and the PX
download plumbing never pass a bandwidth value through to
`IproxProvider.aspera_download`. Add a `maximum_bandwidth` option to the PX
download entry points (including `Client.download_px_raw_files` and the
`download-px-raw-files` CLI), thread it through
`download_from_accession_or_url`, and pass it into
`IproxProvider.aspera_download` so users can override the default 100M cap; keep
the existing default unchanged for backward compatibility.

In `@pridepy/pridepy.py`:
- Around line 66-74: The same `--parallel-files/-w` option is duplicated across
multiple commands, so update `pridepy.py` to define a shared `click.option`
decorator (for example a reusable `parallel_files_option`) with the common
default, `IntRange`, and help text, then apply it to each command that currently
declares the option. Keep the existing `parallel_files` parameter name and
ensure all affected command functions use the shared decorator so future changes
only need one edit.

In `@pridepy/tests/test_download_resilience.py`:
- Around line 250-282: The current test only verifies the clamp when
`parallel_files` is high but the effective `workers` count stays below the
connection cap. Add a second assertion in
`test_download_http_urls_clamps_combined_connection_cap` that uses many URLs and
a large `parallel_files` value so `workers` itself exceeds
`transport.MAX_TOTAL_HTTP_CONNECTIONS`, and verify `download_http_urls` clamps
total connections appropriately (not just per-thread `download_threads`) while
still emitting the warning.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6f054396-3341-4f94-889a-8434aeef2736

📥 Commits

Reviewing files that changed from the base of the PR and between 8546049 and 6dc27af.

📒 Files selected for processing (9)

docs/usage.md
pridepy/download/client.py
pridepy/download/iprox.py
pridepy/download/proteomexchange.py
pridepy/download/transport.py
pridepy/pridepy.py
pridepy/tests/test_cli_flatten.py
pridepy/tests/test_download_resilience.py
pridepy/tests/test_iprox_aspera.py

coderabbitai · 2026-07-01T18:51:21Z

+        path = urlparse(url).path.lstrip("/")  # e.g. IPX.../.../a.raw
+        source = f"{user}@{cls.ASPERA_HOST}:{cls.ASPERA_ROOT}/{path}"


🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect the target file and nearby callers git ls-files pridepy/download/iprox.py pridepy | sed -n '1,120p' wc -l pridepy/download/iprox.py cat -n pridepy/download/iprox.py | sed -n '1,220p' # Find all references to aspera_download and ASPERA_HOST/ASPERA_ROOT rg -n "aspera_download|ASPERA_HOST|ASPERA_ROOT|urlparse\\(" pridepy # Look for any upstream URL validation in the PX router or related download entrypoints rg -n "iprox|PX router|parse.*url|url.*host|hostname|scheme|..|safe path|aspera" pridepy

Repository: PRIDE-Archive/pridepy

Length of output: 50378

🏁 Script executed:

#!/bin/bash set -euo pipefail # Show the target file with line numbers wc -l pridepy/download/iprox.py cat -n pridepy/download/iprox.py | sed -n '1,220p' # Show only the call sites and references around iProX/Aspera handling rg -n -A4 -B4 "aspera_download|ASPERA_HOST|ASPERA_ROOT|urlparse\\(|iprox" pridepy | sed -n '1,260p' # Narrowly inspect the downloader/router path if present rg -n -A20 -B10 "class .*IPROX|def .*aspera|def .*download|router|PX" pridepy/download pridepy | sed -n '1,260p'

Repository: PRIDE-Archive/pridepy

Length of output: 44343

Validate the iProX URL before building the Aspera source path. IproxProvider.aspera_download() is public, but it ignores the parsed host/scheme and feeds urlparse(url).path straight into ascp (pridepy/download/iprox.py:74-75). A crafted URL with .. segments can escape the intended /data/iprox/<accession> subtree, so this needs host/scheme checks and path sanitization here, not only upstream.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pridepy/download/iprox.py` around lines 74 - 75, Validate the input in IproxProvider.aspera_download() before constructing the Aspera source path: ensure the parsed URL has the expected iProX scheme/host, and reject any path containing traversal segments like “..”. Update the logic around urlparse(url).path and the source string so only normalized paths under the intended /data/iprox/<accession> subtree are accepted before calling ascp.

Source: Linters/SAST tools

coderabbitai · 2026-07-01T18:51:21Z

+        if (
+            skip_if_downloaded_already
+            and os.path.isfile(dest)
+            and os.path.getsize(dest) > 0
+        ):
+            logging.info(f"Skipping download as file already exists: {dest}")
+            return None
+        argv = [
+            ascp, "-QT", "-P", cls.ASPERA_PORT, "-l", maximum_bandwidth,
+            "-k", "2", source, dest,
+        ]
+        logging.info(
+            "Aspera: %s -> %s", source.replace(password, "***"), dest
+        )
+        try:
+            subprocess.run(argv, check=True, env=env)


🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Avoid marking partial Aspera outputs as complete.

If ascp writes a non-zero partial file and fails, the next --skip-if-downloaded-already run returns before -k 2 can resume it. Download to a partial target and atomically promote it only after success, or validate the remote size before skipping dest.

💾 Suggested partial-file flow

if ( skip_if_downloaded_already and os.path.isfile(dest) and os.path.getsize(dest) > 0 ): logging.info(f"Skipping download as file already exists: {dest}") return None + transfer_dest = f"{dest}.part" argv = [ ascp, "-QT", "-P", cls.ASPERA_PORT, "-l", maximum_bandwidth, - "-k", "2", source, dest, + "-k", "2", source, transfer_dest, ] logging.info( "Aspera: %s -> %s", source.replace(password, "***"), dest ) try: subprocess.run(argv, check=True, env=env) + os.replace(transfer_dest, dest) return None

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if (

skip_if_downloaded_already

and os.path.isfile(dest)

and os.path.getsize(dest) > 0

):

logging.info(f"Skipping download as file already exists: {dest}")

return None

argv = [

ascp, "-QT", "-P", cls.ASPERA_PORT, "-l", maximum_bandwidth,

"-k", "2", source, dest,

]

logging.info(

"Aspera: %s -> %s", source.replace(password, "***"), dest

)

try:

subprocess.run(argv, check=True, env=env)

if (

skip_if_downloaded_already

and os.path.isfile(dest)

and os.path.getsize(dest) > 0

):

logging.info(f"Skipping download as file already exists: {dest}")

return None

transfer_dest = f"{dest}.part"

argv = [

ascp, "-QT", "-P", cls.ASPERA_PORT, "-l", maximum_bandwidth,

"-k", "2", source, transfer_dest,

]

logging.info(

"Aspera: %s -> %s", source.replace(password, "***"), dest

)

try:

subprocess.run(argv, check=True, env=env)

os.replace(transfer_dest, dest)

return None

🧰 Tools

🪛 ast-grep (0.44.0)

[error] 96-96: Use of unsanitized data to create processes
Context: subprocess.run(argv, check=True, env=env)
Note: [CWE-78] Improper Neutralization of Special Elements used in an OS Command ('OS Command Injection').

(os-system-unsanitized-data)

[error] 96-96: Command coming from incoming request
Context: subprocess.run(argv, check=True, env=env)
Note: [CWE-78] Improper Neutralization of Special Elements used in an OS Command ('OS Command Injection').

(subprocess-from-request)

🪛 Ruff (0.15.20)

[error] 97-97: subprocess call: check for execution of untrusted input

(S603)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pridepy/download/iprox.py` around lines 82 - 97, The skip logic in `download`/`cls.download` treats any non-empty `dest` as complete, which can incorrectly skip a failed partial Aspera transfer. Update the `subprocess.run(argv, check=True, env=env)` flow so `ascp` writes to a temporary/partial target and only promotes it to `dest` after success, or add a remote-size/complete-file validation before the `os.path.isfile(dest)` skip check. Keep the fix localized around the `skip_if_downloaded_already` block and the Aspera command construction.

coderabbitai · 2026-07-01T18:51:22Z

+        workers = max(1, min(parallel_files, len(urls)))
+        if workers > 1:
+            with ThreadPoolExecutor(max_workers=workers) as executor:


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Cap Aspera subprocess concurrency too.

parallel_files is user-controlled and each worker starts an ascp process. Large values can exhaust local processes/file descriptors and hammer the remote service; clamp it to a small provider-level maximum and log when reduced.

🚦 Suggested worker cap

+ MAX_ASPERA_PARALLEL_FILES: ClassVar[int] = 8 + - workers = max(1, min(parallel_files, len(urls))) + requested_workers = max(1, min(parallel_files, len(urls))) + workers = min(requested_workers, cls.MAX_ASPERA_PARALLEL_FILES) + if workers < requested_workers: + logging.warning( + "Reducing iProX Aspera parallel_files from %d to %d", + requested_workers, + workers, + ) if workers > 1:

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

workers = max(1, min(parallel_files, len(urls)))

if workers > 1:

with ThreadPoolExecutor(max_workers=workers) as executor:

requested_workers = max(1, min(parallel_files, len(urls)))

workers = min(requested_workers, cls.MAX_ASPERA_PARALLEL_FILES)

if workers < requested_workers:

logging.warning(

"Reducing iProX Aspera parallel_files from %d to %d",

requested_workers,

workers,

)

if workers > 1:

with ThreadPoolExecutor(max_workers=workers) as executor:

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pridepy/download/iprox.py` around lines 134 - 136, The worker count used before the ThreadPoolExecutor/ascp launch is only bounded by len(urls), so user-controlled parallel_files can still create too many subprocesses. Add a small provider-level maximum in the worker calculation around the ThreadPoolExecutor block in pridepy/download/iprox.py, clamp workers to that cap, and emit a log when the requested parallel_files value is reduced.

coderabbitai · 2026-07-01T18:51:22Z

    workers = max(1, min(parallel_files, len(http_urls)))
+    if workers * download_threads > MAX_TOTAL_HTTP_CONNECTIONS:
+        clamped_threads = max(1, MAX_TOTAL_HTTP_CONNECTIONS // workers)
+        logging.warning(
+            "parallel_files (%d) x download_threads (%d) = %d exceeds the "
+            "combined connection cap of %d; reducing download_threads to %d "
+            "to keep peak HTTP connections bounded.",
+            workers,
+            download_threads,
+            workers * download_threads,
+            MAX_TOTAL_HTTP_CONNECTIONS,
+            clamped_threads,
+        )
+        download_threads = clamped_threads


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "== Files of interest ==" git ls-files pridepy/download/transport.py pridepy.py pridepy | sed -n '1,200p' echo echo "== transport.py around target lines ==" nl -ba pridepy/download/transport.py | sed -n '820,900p' echo echo "== Search for cap-related symbols and entry points ==" rg -n "MAX_TOTAL_HTTP_CONNECTIONS|parallel_files|download_threads|workers|download_http_urls|argparse|click|typer|fire|--parallel|--threads|-w|-t" pridepy . -g '!**/.git/**'

Repository: PRIDE-Archive/pridepy

Length of output: 3029

🏁 Script executed:

#!/bin/bash set -euo pipefail python3 - <<'PY' from pathlib import Path path = Path("pridepy/download/transport.py") lines = path.read_text().splitlines() for start, end in [(1,120), (820,900)]: print(f"== {path} lines {start}-{end} ==") for i in range(start, min(end, len(lines)) + 1): print(f"{i:4d}: {lines[i-1]}") print() PY echo "== Search CLI entry points and validations ==" python3 - <<'PY' from pathlib import Path import re targets = [ Path("pridepy/pridepy.py"), Path("pridepy/download/client.py"), Path("pridepy/download/base.py"), Path("pridepy/download/by_url.py"), Path("pridepy/tests/test_cli_flatten.py"), ] for p in targets: if p.exists(): print(f"\n## {p}") text = p.read_text().splitlines() for i, line in enumerate(text, 1): if re.search(r'parallel_files|download_threads|MAX_TOTAL_HTTP_CONNECTIONS|--parallel|--threads|-w\b|-t\b|argparse|click|typer|fire', line): print(f"{i:4d}: {line}") PY echo echo "== Broader symbol search ==" rg -n "parallel_files|download_threads|MAX_TOTAL_HTTP_CONNECTIONS|download_http_urls|--parallel|--threads|-w\\b|-t\\b|argparse|click|typer|fire" pridepy -g '!**/.git/**'

Repository: PRIDE-Archive/pridepy

Length of output: 46710

Clamp workers as well as download_threads

click.IntRange(1, 32) protects the CLI, but direct Client.download_http_urls callers can still pass parallel_files > 64. In that case clamped_threads drops to 1 and the total HTTP connections can still exceed MAX_TOTAL_HTTP_CONNECTIONS. Bound workers too, or clamp the product from both sides.

🔧 Proposed fix

- workers = max(1, min(parallel_files, len(http_urls))) + workers = max(1, min(parallel_files, len(http_urls), MAX_TOTAL_HTTP_CONNECTIONS))

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

workers = max(1, min(parallel_files, len(http_urls)))

if workers * download_threads > MAX_TOTAL_HTTP_CONNECTIONS:

clamped_threads = max(1, MAX_TOTAL_HTTP_CONNECTIONS // workers)

logging.warning(

"parallel_files (%d) x download_threads (%d) = %d exceeds the "

"combined connection cap of %d; reducing download_threads to %d "

"to keep peak HTTP connections bounded.",

workers,

download_threads,

workers * download_threads,

MAX_TOTAL_HTTP_CONNECTIONS,

clamped_threads,

)

download_threads = clamped_threads

workers = max(1, min(parallel_files, len(http_urls), MAX_TOTAL_HTTP_CONNECTIONS))

if workers * download_threads > MAX_TOTAL_HTTP_CONNECTIONS:

clamped_threads = max(1, MAX_TOTAL_HTTP_CONNECTIONS // workers)

logging.warning(

"parallel_files (%d) x download_threads (%d) = %d exceeds the "

"combined connection cap of %d; reducing download_threads to %d "

"to keep peak HTTP connections bounded.",

workers,

download_threads,

workers * download_threads,

MAX_TOTAL_HTTP_CONNECTIONS,

clamped_threads,

)

download_threads = clamped_threads

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pridepy/download/transport.py` around lines 852 - 865, The connection cap logic in Client.download_http_urls only clamps download_threads, so callers can still exceed MAX_TOTAL_HTTP_CONNECTIONS when workers is very large. Update the worker calculation near workers and clamped_threads to bound workers as well (or otherwise clamp the workers x download_threads product from both sides) before starting downloads, keeping the final total HTTP connections within MAX_TOTAL_HTTP_CONNECTIONS.

coderabbitai · 2026-07-01T18:51:22Z

 def download_px_raw_files(
    accession: str,
    output_folder: str,
    skip_if_downloaded_already: bool,
+    protocol: str = "ftp",
+    download_threads: int = 1,
+    parallel_files: int = 1,
    preserve_structure: bool = False,
+    iprox_user: Optional[str] = None,
 ):
    """CLI wrapper to download raw files via ProteomeXchange XML."""
    files = Files()
    logging.info(f"PX accession/URL: {accession}")
+
+    password = os.environ.get("IPROX_ASPERA_PASSWORD")
+    if protocol.lower() == "aspera" and not password:
+        password = click.prompt("iProX Aspera password", hide_input=True)
+


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Password prompt fires even when --iprox-user was never provided.

If protocol == "aspera" and IPROX_ASPERA_PASSWORD isn't set, click.prompt runs unconditionally — even if iprox_user is None. The user is prompted for a password only to hit IproxProvider.aspera_download's credential-validation error afterward. Fail fast before prompting.

🛡️ Proposed fix

password = os.environ.get("IPROX_ASPERA_PASSWORD") - if protocol.lower() == "aspera" and not password: + if protocol.lower() == "aspera" and not iprox_user: + raise click.UsageError("--iprox-user is required with --protocol aspera.") + if protocol.lower() == "aspera" and not password: password = click.prompt("iProX Aspera password", hide_input=True)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def download_px_raw_files(

accession: str,

output_folder: str,

skip_if_downloaded_already: bool,

protocol: str = "ftp",

download_threads: int = 1,

parallel_files: int = 1,

preserve_structure: bool = False,

iprox_user: Optional[str] = None,

):

"""CLI wrapper to download raw files via ProteomeXchange XML."""

files = Files()

logging.info(f"PX accession/URL: {accession}")

password = os.environ.get("IPROX_ASPERA_PASSWORD")

if protocol.lower() == "aspera" and not password:

password = click.prompt("iProX Aspera password", hide_input=True)

def download_px_raw_files(

accession: str,

output_folder: str,

skip_if_downloaded_already: bool,

protocol: str = "ftp",

download_threads: int = 1,

parallel_files: int = 1,

preserve_structure: bool = False,

iprox_user: Optional[str] = None,

):

"""CLI wrapper to download raw files via ProteomeXchange XML."""

files = Files()

logging.info(f"PX accession/URL: {accession}")

password = os.environ.get("IPROX_ASPERA_PASSWORD")

if protocol.lower() == "aspera" and not iprox_user:

raise click.UsageError("--iprox-user is required with --protocol aspera.")

if protocol.lower() == "aspera" and not password:

password = click.prompt("iProX Aspera password", hide_input=True)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pridepy/pridepy.py` around lines 404 - 421, The password prompt path in download_px_raw_files is running even when iprox_user was never provided. Update the aspera branch in download_px_raw_files to validate iprox_user before calling click.prompt, and fail fast with the existing credential error path if it is missing. Keep the fix localized to the download_px_raw_files flow and its handoff to IproxProvider.aspera_download so the prompt only appears when a user can actually authenticate.

coderabbitai · 2026-07-01T18:51:22Z

+                password="secret",
+                maximum_bandwidth="100M",
+            )
+        args, kwargs = run.call_args


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Clean up the Ruff warnings in the new tests.

Use a dummy name for the unused unpacked value and escape the literal dot in the pytest.raises(..., match=...) pattern.

🧹 Suggested lint fixes

- args, kwargs = run.call_args + args, _kwargs = run.call_args argv = args[0]

- with pytest.raises(RuntimeError, match="b.raw"): + with pytest.raises(RuntimeError, match=re.escape("b.raw")):

Also applies to: 166-166

🧰 Tools

🪛 Ruff (0.15.20)

[warning] 143-143: Unpacked variable kwargs is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pridepy/tests/test_iprox_aspera.py` at line 143, Clean up the Ruff warnings in the new tests by fixing the unused tuple unpack from run.call_args in the test that inspects the mock invocation: assign the unused value to a dummy name (for example in the assertion around run.call_args). Also update the pytest.raises(..., match=...) pattern in the affected test(s) to escape the literal dot so the regex matches the exact error text instead of treating the dot as a wildcard.

Source: Linters/SAST tools

coderabbitai · 2026-07-01T18:51:22Z

+        rels = asp.call_args.kwargs["relative_paths"]
+        # Flattened + de-duped basenames, no subdirectories preserved.
+        assert set(rels) == {"a.raw", "a_1.raw"}


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Assert flattened paths in order.

relative_paths is positional with urls; using a set would miss a swapped mapping.

✅ Suggested assertion

rels = asp.call_args.kwargs["relative_paths"] # Flattened + de-duped basenames, no subdirectories preserved. - assert set(rels) == {"a.raw", "a_1.raw"} + assert rels == ["a.raw", "a_1.raw"]

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

rels = asp.call_args.kwargs["relative_paths"]

# Flattened + de-duped basenames, no subdirectories preserved.

assert set(rels) == {"a.raw", "a_1.raw"}

rels = asp.call_args.kwargs["relative_paths"]

# Flattened + de-duped basenames, no subdirectories preserved.

assert rels == ["a.raw", "a_1.raw"]

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pridepy/tests/test_iprox_aspera.py` around lines 279 - 281, The test in test_iprox_aspera should assert the order of relative_paths instead of converting to a set, since its mapping to urls is positional and a swapped pairing would be missed. Update the assertion near asp.call_args.kwargs["relative_paths"] to compare the flattened, de-duped basenames in sequence, preserving the existing no-subdirectories expectation while checking the exact order.

qodo-code-review · 2026-07-01T18:57:09Z

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📜 Skill insights (0)

1. Aspera prompt can hang 🐞 Bug ☼ Reliability

Description

download_px_raw_files prompts for an iProX Aspera password as soon as --protocol aspera is set
and IPROX_ASPERA_PASSWORD is missing, before verifying --iprox-user is present or that the
dataset contains iProX-hosted files. This can block non-interactive executions and unnecessarily
asks for a password even when the code will later error out.

Code

pridepy/pridepy.py[R418-420]

+    password = os.environ.get("IPROX_ASPERA_PASSWORD")
+    if protocol.lower() == "aspera" and not password:
+        password = click.prompt("iProX Aspera password", hide_input=True)

Evidence

The CLI unconditionally prompts when --protocol aspera is set and no env password is present, but
the actual Aspera path later rejects missing credentials and can also reject datasets with no
iProX-hosted files—both after the prompt would already have occurred.

pridepy/pridepy.py[418-432]
pridepy/download/iprox.py[123-128]
pridepy/download/proteomexchange.py[199-211]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The CLI prompts for the iProX Aspera password too early (immediately on `--protocol aspera`), which can hang in non-interactive contexts and can prompt even if the dataset has no iProX-hosted files or `--iprox-user` is missing.

## Issue Context
- Aspera mode is only valid if (a) the dataset has iProX-hosted files and (b) credentials are provided.
- Current flow prompts before either condition is validated.

## Fix Focus Areas
- pridepy/pridepy.py[418-431]
- pridepy/download/proteomexchange.py[199-238]

### Suggested fix
1) In the CLI handler, if `protocol.lower() == "aspera"` and `iprox_user` is missing, fail fast with a `click.UsageError` (no prompt).
2) Avoid prompting until you know Aspera will be used:
  - Option A (preferred): do a lightweight preflight listing (PX XML fetch) to confirm at least one file is hosted on `download.iprox.org`; if none, error without prompting.
  - Option B: call the downloader with `iprox_password=None` and, on the specific "missing credentials" error, prompt only if stdin is a TTY; otherwise raise a clear error instructing to set `IPROX_ASPERA_PASSWORD`.
3) If prompting is needed, only prompt when stdin is interactive; otherwise require `IPROX_ASPERA_PASSWORD`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

2. iProX creds text ambiguous 🐞 Bug ⚙ Maintainability

Description

The iProX repository description still says pridepy uses the public HTTP endpoint so no iProX
credentials are required, without clarifying that credentials are required when opting into the new
--protocol aspera mode. This can confuse users about when --iprox-user/IPROX_ASPERA_PASSWORD
are needed.

Code

docs/usage.md[323]

+- **iProX** fetches the dataset's ProteomeXchange XML from `http://download.iprox.org/<accession>/PX_<accession>.xml`, then downloads each referenced file from the same host over anonymous HTTP (with `Range` support for resume). iProX also exposes Aspera with username/password for very large bulk transfers; `pridepy` uses the public HTTP endpoint so no iProX credentials are required.

Evidence

The docs explicitly document credentialed iProX Aspera usage, but the iProX bullet later states
(without qualification) that pridepy uses HTTP and no credentials are required, which becomes
ambiguous after adding Aspera support.

docs/usage.md[279-294]
docs/usage.md[323-323]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The iProX docs section implies no iProX credentials are required, but the same document also introduces optional credentialed Aspera mode. The repo description should distinguish default HTTP behavior from opt-in Aspera behavior.

## Issue Context
Users reading the "How each repository is enumerated" section may miss the earlier Aspera section and assume credentials are never needed.

## Fix Focus Areas
- docs/usage.md[279-294]
- docs/usage.md[323-323]

### Suggested fix
Rewrite the iProX bullet to something like:
"... downloads over anonymous HTTP by default (no credentials required). iProX also supports opt-in Aspera transfers (`--protocol aspera`), which require an iProX account (`--iprox-user` and `IPROX_ASPERA_PASSWORD`)."

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

qodo-code-review · 2026-07-01T18:57:10Z

+    password = os.environ.get("IPROX_ASPERA_PASSWORD")
+    if protocol.lower() == "aspera" and not password:
+        password = click.prompt("iProX Aspera password", hide_input=True)


1. Aspera prompt can hang 🐞 Bug ☼ Reliability

download_px_raw_files prompts for an iProX Aspera password as soon as --protocol aspera is set and IPROX_ASPERA_PASSWORD is missing, before verifying --iprox-user is present or that the dataset contains iProX-hosted files. This can block non-interactive executions and unnecessarily asks for a password even when the code will later error out.

Agent Prompt

## Issue description The CLI prompts for the iProX Aspera password too early (immediately on `--protocol aspera`), which can hang in non-interactive contexts and can prompt even if the dataset has no iProX-hosted files or `--iprox-user` is missing. ## Issue Context - Aspera mode is only valid if (a) the dataset has iProX-hosted files and (b) credentials are provided. - Current flow prompts before either condition is validated. ## Fix Focus Areas - pridepy/pridepy.py[418-431] - pridepy/download/proteomexchange.py[199-238] ### Suggested fix 1) In the CLI handler, if `protocol.lower() == "aspera"` and `iprox_user` is missing, fail fast with a `click.UsageError` (no prompt). 2) Avoid prompting until you know Aspera will be used: - Option A (preferred): do a lightweight preflight listing (PX XML fetch) to confirm at least one file is hosted on `download.iprox.org`; if none, error without prompting. - Option B: call the downloader with `iprox_password=None` and, on the specific "missing credentials" error, prompt only if stdin is a TTY; otherwise raise a clear error instructing to set `IPROX_ASPERA_PASSWORD`. 3) If prompting is needed, only prompt when stdin is interactive; otherwise require `IPROX_ASPERA_PASSWORD`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

qodo-code-review · 2026-07-01T18:57:10Z

 - **MassIVE** walks the FTPS tree at `massive-ftp.ucsd.edu` (the server requires TLS). MassIVE distributes datasets across versioned root directories (`/v01`–`/vNN`); `pridepy` discovers the correct root automatically. If FTP/FTPS is blocked by the network, `pridepy` falls back to HTTPS: it lists the dataset from the GNPS2 file index (`datasetcache.gnps2.org`) and downloads each file from the ProteoSAFe endpoint at `massive.ucsd.edu` (byte-identical to the FTPS copy).
 - **JPOST** lists files through the JSON PROXI endpoint at `https://repository.jpostdb.org/proxi/datasets/<JPSTxxxxxx>` and downloads from `ftp.jpostdb.org` over plain FTP. The PROXI listing avoids the source-IP connection limit JPOST enforces on FTP.
- **iProX** fetches the dataset's ProteomeXchange XML from `http://download.iprox.org/<accession>/PX_<accession>.xml`, then downloads each referenced file from the same host over anonymous HTTP (with `Range` support for resume). iProX also exposes Aspera (`faspe://`) with username/password for very large bulk transfers; `pridepy` uses the public HTTP endpoint so no iProX credentials are required.
+- **iProX** fetches the dataset's ProteomeXchange XML from `http://download.iprox.org/<accession>/PX_<accession>.xml`, then downloads each referenced file from the same host over anonymous HTTP (with `Range` support for resume). iProX also exposes Aspera with username/password for very large bulk transfers; `pridepy` uses the public HTTP endpoint so no iProX credentials are required.


2. Iprox creds text ambiguous 🐞 Bug ⚙ Maintainability

The iProX repository description still says pridepy uses the public HTTP endpoint so no iProX credentials are required, without clarifying that credentials are required when opting into the new --protocol aspera mode. This can confuse users about when --iprox-user/IPROX_ASPERA_PASSWORD are needed.

Agent Prompt

## Issue description The iProX docs section implies no iProX credentials are required, but the same document also introduces optional credentialed Aspera mode. The repo description should distinguish default HTTP behavior from opt-in Aspera behavior. ## Issue Context Users reading the "How each repository is enumerated" section may miss the earlier Aspera section and assume credentials are never needed. ## Fix Focus Areas - docs/usage.md[279-294] - docs/usage.md[323-323] ### Suggested fix Rewrite the iProX bullet to something like: "... downloads over anonymous HTTP by default (no credentials required). iProX also supports opt-in Aspera transfers (`--protocol aspera`), which require an iProX account (`--iprox-user` and `IPROX_ASPERA_PASSWORD`)."

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

ypriverol added 5 commits July 1, 2026 17:16

feat(cli): expose -w/--parallel-files across-file concurrency; wire d…

1414449

…ownload-px-raw-files (protocol/threads/parallel)

docs: document -w/--parallel-files and iProX Aspera for PX downloads

b91613e

fix(download): harden iProX Aspera (path traversal, skip logic, exact…

14b00f9

… host, honor -w) + concurrency cap + hidden password prompt

Merge pull request #115 from PRIDE-Archive/feat/iprox-fast-download

6dc27af

Faster iProX/ProteomeXchange downloads: expose -w parallel files, wire download-px-raw-files, opt-in iProX Aspera

ypriverol requested a review from Copilot July 1, 2026 18:41

Copilot started reviewing on behalf of ypriverol July 1, 2026 18:41 View session

Copilot AI reviewed Jul 1, 2026

View reviewed changes

coderabbitai Bot reviewed Jul 1, 2026

View reviewed changes

qodo-code-review Bot reviewed Jul 1, 2026

View reviewed changes

		path = urlparse(url).path.lstrip("/") # e.g. IPX.../.../a.raw
		source = f"{user}@{cls.ASPERA_HOST}:{cls.ASPERA_ROOT}/{path}"

Uh oh!

Conversation

ypriverol commented Jul 1, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

❌ Failed checks (1 inconclusive)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

qodo-code-review Bot commented Jul 1, 2026

PR Summary by Qodo

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

qodo-code-review Bot commented Jul 1, 2026

Code Review by Qodo

Uh oh!

qodo-code-review Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

qodo-code-review Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ypriverol commented Jul 1, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jul 1, 2026 •

edited

Loading