Skip to content

Update for iprox downloads. #116

Open
ypriverol wants to merge 5 commits into
masterfrom
dev
Open

Update for iprox downloads. #116
ypriverol wants to merge 5 commits into
masterfrom
dev

Conversation

@ypriverol

@ypriverol ypriverol commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

This pull request introduces significant improvements to the pridepy tool, focusing on enhanced download performance and expanded protocol support for bulk dataset retrieval, especially from iProX. The main changes add parallel file downloads, per-file HTTP range threading, and robust Aspera (fast transfer) support for iProX datasets, along with corresponding updates to CLI options and documentation.

Major feature additions and enhancements:

Parallel and segmented downloads:

  • Added support for downloading multiple files in parallel (--parallel-files/-w) and for using multiple threads per file (--threads/-t), allowing faster and more efficient bulk downloads. The total number of concurrent HTTP connections is capped for stability. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12]

Aspera protocol integration for iProX:

  • Implemented Aspera (--protocol aspera) as a fast transfer option for iProX datasets, including secure password handling and support for parallel Aspera transfers. The CLI and download logic now route iProX downloads via Aspera when requested and credentials are provided. [1] [2] [3] [4]

User interface and documentation updates:

CLI and usage documentation:

  • Updated CLI options to expose new parallelism and protocol flags, and expanded the documentation (docs/usage.md) with detailed instructions and examples for fast downloads using parallelism and Aspera. Also clarified iProX password handling for Aspera. [1] [2] [3] [4]
  • Updated repository enumeration documentation to clarify Aspera support and iProX credential requirements.

Internal API and codebase improvements:

Refactoring and extensibility:

  • Refactored internal APIs to propagate new parallelism and protocol parameters through client, provider, and transport layers, ensuring consistent handling and future extensibility. [1] [2] [3] [4] [5]
  • Added utility functions and improved error handling for parallel transfers and protocol selection. [1] [2]

These changes collectively make pridepy much more capable for power users and large-scale dataset retrieval, while keeping the tool user-friendly and robust.

Summary by CodeRabbit

  • New Features

    • Added faster downloads with parallel file support and per-file segment threads.
    • Added iProX Aspera download support, including credential prompts and environment-variable handling.
    • Added a new download protocol option and support for preserving or flattening file paths.
  • Bug Fixes

    • Improved handling of mixed-source ProteomeXchange downloads by skipping unsupported files when needed.
    • Added safeguards to keep downloads inside the selected output folder and avoid duplicate or partial-file skips.
  • Documentation

    • Expanded usage docs with examples and guidance for the new download options.

ypriverol added 5 commits July 1, 2026 17:16
…ownload-px-raw-files (protocol/threads/parallel)
…-files

Adds IproxProvider.aspera_download (ascp on port 33001, credentials via
ASPERA_SCP_PASS env only) and routes ProteomeXchangeProvider through it
when --protocol aspera is selected, gated on iProX-hosted URLs. Plumbs
--iprox-user/--iprox-password (env: IPROX_USER/IPROX_ASPERA_PASSWORD)
through Client.download_px_raw_files and the CLI. HTTP-parallel remains
the default transport.
… host, honor -w) + concurrency cap + hidden password prompt
Faster iProX/ProteomeXchange downloads: expose -w parallel files, wire download-px-raw-files, opt-in iProX Aspera
@ypriverol ypriverol requested a review from Copilot July 1, 2026 18:41
@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

This PR adds Aspera-based fast downloads for iProX datasets, introduces parallel_files, download_threads, and protocol parameters across the CLI, Client, and ProteomeXchangeProvider, clamps combined HTTP connection concurrency, and updates documentation and tests accordingly.

Changes

iProX Aspera and parallel download plumbing

Layer / File(s) Summary
Aspera transfer implementation
pridepy/download/iprox.py
Adds Aspera connection constants, _ascp_binary(), _aspera_download_one(), and aspera_download() for single and concurrent multi-file transfers via ascp, with credential validation, skip-if-exists logic, and failure aggregation.
ProteomeXchange routing and parameter propagation
pridepy/download/proteomexchange.py
download_from_accession_or_url gains parallel_files, download_threads, protocol, and iProX credential parameters; routes to IproxProvider.aspera_download for iProX-hosted URLs when protocol="aspera", otherwise forwards concurrency params to download_files.
Client API forwarding
pridepy/download/client.py
Client.download_px_raw_files signature extended and delegates new parameters to ProteomeXchangeProvider.
HTTP connection concurrency cap
pridepy/download/transport.py, pridepy/tests/test_download_resilience.py
Adds MAX_TOTAL_HTTP_CONNECTIONS and clamps download_threads when combined with parallel_files would exceed it, with a test validating the clamp and warning.
CLI options and wiring
pridepy/pridepy.py, pridepy/tests/test_cli_flatten.py
Adds --parallel-files/-w across multiple download commands, refactors download-px-raw-files with protocol, download_threads, --iprox-user, and password env/prompt handling; tests updated to assert forwarded kwargs.
Aspera unit tests
pridepy/tests/test_iprox_aspera.py
New test suite covering ascp command/env construction, credential errors, failure handling, skip logic, traversal safety, and ProteomeXchange aspera routing/flattening.
Documentation
docs/usage.md
Documents new fast-download options, parallel/segmented download examples, and iProX Aspera usage with password handling guidance.

Estimated code review effort: 4 (Complex) | ~60 minutes

Possibly related PRs

  • PRIDE-Archive/pridepy#106: Builds directly on the download architecture for MassIVE/JPOST/iProX/ProteomeXchange that this PR extends with Aspera transfers and concurrency plumbing.
  • PRIDE-Archive/pridepy#93: Both propagate parallel_files through the CLI to Files.* download entrypoints on the same integration surface.
  • PRIDE-Archive/pridepy#110: Both modify HTTP threading in transport.py, extending/propagating download_threads through download_http_urls.

Suggested reviewers: chakrabandla

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Title check ❓ Inconclusive The title is related to the change, but it is too vague to convey the main update in the PR. Use a more specific title like "Add parallel and Aspera support for iProX downloads".
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch dev

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR upgrades pridepy’s bulk-download capabilities by adding across-file parallelism (--parallel-files) and integrating iProX Aspera transfers for faster ProteomeXchange (PX) downloads, with corresponding CLI, transport-layer, and documentation updates.

Changes:

  • Add --parallel-files/-w plumbing through CLI → client → provider → transport, plus a combined HTTP connection cap.
  • Add iProX Aspera support (ascp command construction, credential handling, and PX routing when --protocol aspera is selected).
  • Expand/adjust tests and usage documentation for parallelism and Aspera.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pridepy/tests/test_iprox_aspera.py New tests for iProX Aspera command/env construction, routing, flattening, and safety checks.
pridepy/tests/test_download_resilience.py Adds a regression test ensuring the combined connection cap clamps per-file threads.
pridepy/tests/test_cli_flatten.py Updates CLI tests to assert parallel_files default/flag behavior across commands.
pridepy/pridepy.py Adds CLI flags for --parallel-files, PX --protocol, and iProX credential handling/prompting.
pridepy/download/transport.py Introduces MAX_TOTAL_HTTP_CONNECTIONS and clamps HTTP concurrency.
pridepy/download/proteomexchange.py Routes PX downloads to iProX Aspera when requested; supports flattening in that branch.
pridepy/download/iprox.py Implements iProX Aspera transfer execution (including parallel subprocess runs) and safe destination handling.
pridepy/download/client.py Propagates new PX download parameters through the client API.
docs/usage.md Documents new PX CLI options, parallelism guidance, and iProX Aspera password handling.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 851 to 866
failed: List[str] = []
workers = max(1, min(parallel_files, len(http_urls)))
if workers * download_threads > MAX_TOTAL_HTTP_CONNECTIONS:
clamped_threads = max(1, MAX_TOTAL_HTTP_CONNECTIONS // workers)
logging.warning(
"parallel_files (%d) x download_threads (%d) = %d exceeds the "
"combined connection cap of %d; reducing download_threads to %d "
"to keep peak HTTP connections bounded.",
workers,
download_threads,
workers * download_threads,
MAX_TOTAL_HTTP_CONNECTIONS,
clamped_threads,
)
download_threads = clamped_threads
if workers > 1:
Comment thread pridepy/pridepy.py
Comment on lines +418 to +421
password = os.environ.get("IPROX_ASPERA_PASSWORD")
if protocol.lower() == "aspera" and not password:
password = click.prompt("iProX Aspera password", hide_input=True)

@qodo-code-review

Copy link
Copy Markdown
Contributor

PR Summary by Qodo

Speed up PX/iProX downloads with parallelism and opt-in Aspera support

✨ Enhancement 🐞 Bug fix 🧪 Tests 📝 Documentation 🕐 40+ Minutes

Grey Divider

AI Description

• Add across-file parallelism (-w) and per-file HTTP Range threading (-t) for bulk downloads.
• Support iProX Aspera transfers for PX datasets when --protocol aspera and credentials are
 provided.
• Clamp peak HTTP connections and harden Aspera path/host handling; document and test behavior.
Diagram

graph TD
  A["CLI (pridepy.py)"] --> B["Client.download_px_raw_files"] --> C["ProteomeXchangeProvider"]
  C --> D["HTTP/FTP transport"] --> F{{"download.iprox.org (HTTP)"}}
  C --> E["IproxProvider.aspera_download"] --> G{{"ascp subprocess (Aspera)"}}
  D --> H["Concurrency cap (MAX_TOTAL_HTTP_CONNECTIONS)"]
Loading
High-Level Assessment

The following are alternative approaches to this PR:

1. Async IO for HTTP downloads (aiohttp)
  • ➕ Better scalability than threads for many small files
  • ➕ Easier global concurrency limiting and cancellation
  • ➖ Higher implementation complexity and refactor cost across transport/provider layers
  • ➖ Multipart Range download and retry logic must be reimplemented carefully
2. Delegate HTTP downloads to aria2c
  • ➕ Mature segmented + parallel downloads with robust retry/resume
  • ➕ Built-in connection limiting and bandwidth controls
  • ➖ Adds an external runtime dependency similar to ascp
  • ➖ Harder to integrate with current progress/reporting and per-provider path logic
3. Introduce a formal Transport strategy interface
  • ➕ Clear separation between protocol selection and provider listing logic
  • ➕ Easier to add future protocols consistently (aspera/globus/s3)
  • ➖ More up-front refactor with little immediate user-visible benefit
  • ➖ Risk of churn across multiple providers and tests

Recommendation: The PR’s approach is appropriate for pridepy’s current architecture: it reuses existing HTTP Range logic, adds a simple global connection cap for safety, and uses the canonical ascp subprocess for Aspera while keeping credentials out of argv. Consider a transport strategy interface later if more protocol-specific behaviors continue to accumulate in providers.

Files changed (9) +805 / -10

Enhancement (4) +306 / -4
client.pyPlumb PX download protocol and concurrency options through Client facade +14/-1

Plumb PX download protocol and concurrency options through Client facade

• Extends Client.download_px_raw_files signature to accept protocol, parallel_files, and per-file download_threads plus optional iProX credentials. Forwards these parameters to ProteomeXchangeProvider for end-to-end handling.

pridepy/download/client.py

iprox.pyAdd iProX Aspera downloader with safe paths, skipping logic, and parallel transfers +134/-0

Add iProX Aspera downloader with safe paths, skipping logic, and parallel transfers

• Introduces an ascp-based Aspera download path for iProX URLs using ASPERA_SCP_PASS (env) and optional parallel_files subprocess concurrency. Uses _safe_join to prevent path traversal and improves skip-if-downloaded checks to avoid directory/zero-byte false skips.

pridepy/download/iprox.py

proteomexchange.pyRoute PX downloads to iProX Aspera when requested and validate host/paths +60/-3

Route PX downloads to iProX Aspera when requested and validate host/paths

• Extends download_from_accession_or_url to accept protocol, concurrency, and iProX credentials. When protocol=aspera, filters exact download.iprox.org hosts, warns on mixed-host datasets, honors flattening via flatten_relative_paths, and delegates to IproxProvider.aspera_download.

pridepy/download/proteomexchange.py

pridepy.pyExpose -w parallel files and PX protocol/credential options in CLI +98/-0

Expose -w parallel files and PX protocol/credential options in CLI

• Adds -w/--parallel-files to multiple download commands and introduces PX-specific options for --protocol, -t/--threads, and --iprox-user. Implements secure password prompting for iProX Aspera when IPROX_ASPERA_PASSWORD is not set.

pridepy/pridepy.py

Bug fix (1) +18 / -0
transport.pyClamp total HTTP connections for parallel files × per-file threads +18/-0

Clamp total HTTP connections for parallel files × per-file threads

• Adds MAX_TOTAL_HTTP_CONNECTIONS and clamps download_threads when parallel_files × download_threads would exceed the cap. Emits a warning explaining the reduction to keep peak HTTP connections bounded.

pridepy/download/transport.py

Tests (3) +428 / -5
test_cli_flatten.pyAdd CLI tests for -w parallel_files and PX protocol/threads plumbing +95/-5

Add CLI tests for -w parallel_files and PX protocol/threads plumbing

• Updates existing CLI flatten tests to assert parallel_files default propagation and adds new cases to validate -w behavior across commands. Adds a PX CLI test asserting protocol, threads, and parallel_files are forwarded correctly.

pridepy/tests/test_cli_flatten.py

test_download_resilience.pyTest connection-cap clamping for parallel HTTP Range downloads +33/-0

Test connection-cap clamping for parallel HTTP Range downloads

• Adds a regression test asserting parallel_files × download_threads is clamped to MAX_TOTAL_HTTP_CONNECTIONS and that a warning is logged. Verifies clamped per-file thread count is consistently applied.

pridepy/tests/test_download_resilience.py

test_iprox_aspera.pyAdd unit tests for iProX Aspera command/env behavior and PX routing +300/-0

Add unit tests for iProX Aspera command/env behavior and PX routing

• Introduces comprehensive tests for Aspera argv construction, env-only password handling, skip logic edge cases, traversal protection, parallel failure aggregation, and ProteomeXchangeProvider routing with exact-host validation and flattening behavior.

pridepy/tests/test_iprox_aspera.py

Documentation (1) +53 / -1
usage.mdDocument PX parallelism flags and iProX Aspera credential handling +53/-1

Document PX parallelism flags and iProX Aspera credential handling

• Adds CLI option documentation for --protocol, -w/--parallel-files, -t/--threads, and --preserve-structure. Includes usage examples for faster downloads and clarifies iProX Aspera password handling via env var or secure prompt.

docs/usage.md

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Nitpick comments (5)
docs/usage.md (2)

253-257: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Consider noting the internal concurrency cap.

transport.download_http_urls clamps download_threads when parallel_files × download_threads exceeds an internal cap (see MAX_TOTAL_HTTP_CONNECTIONS), so actual concurrency won't scale linearly past that point. A brief caveat here would prevent user confusion when e.g. -w 32 -t 32 doesn't yield 1024 real connections.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/usage.md` around lines 253 - 257, Update the “Fast downloads” section in
usage.md to mention the internal concurrency limit enforced by
transport.download_http_urls. Note that download_threads is clamped when
parallel_files × download_threads exceeds MAX_TOTAL_HTTP_CONNECTIONS, so actual
concurrency may stop scaling linearly; keep the guidance brief and tie it to the
existing parallel_files × threads explanation.

242-242: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Protocol table overstates globus/s3 support for download-px-raw-files.

For this command, only ftp and aspera are actually implemented — base.Provider.download_files (used for the non-aspera path) just warns and falls through to whatever URL scheme is in the PX XML if globus/s3 is requested; there's no genuine multi-protocol fallback here (unlike PRIDE's _protocol_sequence). Consider clarifying that globus/s3 aren't meaningfully supported for PX/iProX downloads to avoid confusing users.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/usage.md` at line 242, Clarify the `download-px-raw-files` protocol
documentation so it does not imply real `globus` or `s3` support. Update the
protocol table entry in `docs/usage.md` to reflect that the command only
meaningfully supports `ftp` and `aspera`, and note that the non-aspera path in
`base.Provider.download_files` does not implement a true fallback sequence for
other protocols.
pridepy/pridepy.py (1)

66-74: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Duplicate --parallel-files/-w option definition across 5 commands.

Same default, range, and help text repeated verbatim. Consider extracting a shared decorator (e.g. parallel_files_option = click.option("-w", "--parallel-files", "parallel_files", default=1, type=click.IntRange(1, 32), help=...)) and applying it via @parallel_files_option at each call site, so future tweaks to bounds/wording only need one edit.

Also applies to: 179-187, 377-385, 681-689, 783-791

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pridepy/pridepy.py` around lines 66 - 74, The same `--parallel-files/-w`
option is duplicated across multiple commands, so update `pridepy.py` to define
a shared `click.option` decorator (for example a reusable
`parallel_files_option`) with the common default, `IntRange`, and help text,
then apply it to each command that currently declares the option. Keep the
existing `parallel_files` parameter name and ensure all affected command
functions use the shared decorator so future changes only need one edit.
pridepy/download/proteomexchange.py (1)

199-238: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

iProX Aspera path has no bandwidth-throttling control.

IproxProvider.aspera_download accepts maximum_bandwidth (default "100M"), but neither download_from_accession_or_url nor Client.download_px_raw_files nor the download-px-raw-files CLI expose a way to override it — every PX-driven iProX Aspera transfer is capped at 100M regardless of user intent. Given Aspera is specifically pitched for "very large bulk transfers," bandwidth control is a reasonable expectation.

♻️ Sketch of the plumbing needed
     def download_from_accession_or_url(
         self,
         px_id_or_url: str,
         output_folder: str,
         skip_if_downloaded_already: bool = True,
         flatten: bool = True,
         parallel_files: int = 1,
         download_threads: int = 1,
         protocol: str = "ftp",
+        aspera_maximum_bandwidth: str = "100M",
         iprox_user: Optional[str] = None,
         iprox_password: Optional[str] = None,
     ) -> None:
...
             IproxProvider.aspera_download(
                 urls=iprox_urls,
                 output_folder=output_folder,
                 relative_paths=dest_rels,
                 user=iprox_user,
                 password=iprox_password,
+                maximum_bandwidth=aspera_maximum_bandwidth,
                 skip_if_downloaded_already=skip_if_downloaded_already,
                 parallel_files=parallel_files,
             )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pridepy/download/proteomexchange.py` around lines 199 - 238, The iProX Aspera
download path currently hardcodes the default bandwidth limit because
`download_from_accession_or_url` and the PX download plumbing never pass a
bandwidth value through to `IproxProvider.aspera_download`. Add a
`maximum_bandwidth` option to the PX download entry points (including
`Client.download_px_raw_files` and the `download-px-raw-files` CLI), thread it
through `download_from_accession_or_url`, and pass it into
`IproxProvider.aspera_download` so users can override the default 100M cap; keep
the existing default unchanged for backward compatibility.
pridepy/tests/test_download_resilience.py (1)

250-282: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Good coverage of the intended clamp path; consider adding a case where workers alone exceeds the cap.

This test only exercises parallel_files=32 with 3 URLs (workers=3), which stays well under MAX_TOTAL_HTTP_CONNECTIONS. It doesn't cover the scenario where len(http_urls) and parallel_files are both large enough that workers itself exceeds the cap (see the corresponding comment on transport.py lines 852-865) — in that case download_threads floors to 1 but total connections can still exceed 64. Adding a test with e.g. 100 URLs and parallel_files=100 would catch that gap once fixed.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pridepy/tests/test_download_resilience.py` around lines 250 - 282, The
current test only verifies the clamp when `parallel_files` is high but the
effective `workers` count stays below the connection cap. Add a second assertion
in `test_download_http_urls_clamps_combined_connection_cap` that uses many URLs
and a large `parallel_files` value so `workers` itself exceeds
`transport.MAX_TOTAL_HTTP_CONNECTIONS`, and verify `download_http_urls` clamps
total connections appropriately (not just per-thread `download_threads`) while
still emitting the warning.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pridepy/download/iprox.py`:
- Around line 82-97: The skip logic in `download`/`cls.download` treats any
non-empty `dest` as complete, which can incorrectly skip a failed partial Aspera
transfer. Update the `subprocess.run(argv, check=True, env=env)` flow so `ascp`
writes to a temporary/partial target and only promotes it to `dest` after
success, or add a remote-size/complete-file validation before the
`os.path.isfile(dest)` skip check. Keep the fix localized around the
`skip_if_downloaded_already` block and the Aspera command construction.
- Around line 74-75: Validate the input in IproxProvider.aspera_download()
before constructing the Aspera source path: ensure the parsed URL has the
expected iProX scheme/host, and reject any path containing traversal segments
like “..”. Update the logic around urlparse(url).path and the source string so
only normalized paths under the intended /data/iprox/<accession> subtree are
accepted before calling ascp.
- Around line 134-136: The worker count used before the ThreadPoolExecutor/ascp
launch is only bounded by len(urls), so user-controlled parallel_files can still
create too many subprocesses. Add a small provider-level maximum in the worker
calculation around the ThreadPoolExecutor block in pridepy/download/iprox.py,
clamp workers to that cap, and emit a log when the requested parallel_files
value is reduced.

In `@pridepy/download/transport.py`:
- Around line 852-865: The connection cap logic in Client.download_http_urls
only clamps download_threads, so callers can still exceed
MAX_TOTAL_HTTP_CONNECTIONS when workers is very large. Update the worker
calculation near workers and clamped_threads to bound workers as well (or
otherwise clamp the workers x download_threads product from both sides) before
starting downloads, keeping the final total HTTP connections within
MAX_TOTAL_HTTP_CONNECTIONS.

In `@pridepy/pridepy.py`:
- Around line 404-421: The password prompt path in download_px_raw_files is
running even when iprox_user was never provided. Update the aspera branch in
download_px_raw_files to validate iprox_user before calling click.prompt, and
fail fast with the existing credential error path if it is missing. Keep the fix
localized to the download_px_raw_files flow and its handoff to
IproxProvider.aspera_download so the prompt only appears when a user can
actually authenticate.

In `@pridepy/tests/test_iprox_aspera.py`:
- Line 143: Clean up the Ruff warnings in the new tests by fixing the unused
tuple unpack from run.call_args in the test that inspects the mock invocation:
assign the unused value to a dummy name (for example in the assertion around
run.call_args). Also update the pytest.raises(..., match=...) pattern in the
affected test(s) to escape the literal dot so the regex matches the exact error
text instead of treating the dot as a wildcard.
- Around line 279-281: The test in test_iprox_aspera should assert the order of
relative_paths instead of converting to a set, since its mapping to urls is
positional and a swapped pairing would be missed. Update the assertion near
asp.call_args.kwargs["relative_paths"] to compare the flattened, de-duped
basenames in sequence, preserving the existing no-subdirectories expectation
while checking the exact order.

---

Nitpick comments:
In `@docs/usage.md`:
- Around line 253-257: Update the “Fast downloads” section in usage.md to
mention the internal concurrency limit enforced by transport.download_http_urls.
Note that download_threads is clamped when parallel_files × download_threads
exceeds MAX_TOTAL_HTTP_CONNECTIONS, so actual concurrency may stop scaling
linearly; keep the guidance brief and tie it to the existing parallel_files ×
threads explanation.
- Line 242: Clarify the `download-px-raw-files` protocol documentation so it
does not imply real `globus` or `s3` support. Update the protocol table entry in
`docs/usage.md` to reflect that the command only meaningfully supports `ftp` and
`aspera`, and note that the non-aspera path in `base.Provider.download_files`
does not implement a true fallback sequence for other protocols.

In `@pridepy/download/proteomexchange.py`:
- Around line 199-238: The iProX Aspera download path currently hardcodes the
default bandwidth limit because `download_from_accession_or_url` and the PX
download plumbing never pass a bandwidth value through to
`IproxProvider.aspera_download`. Add a `maximum_bandwidth` option to the PX
download entry points (including `Client.download_px_raw_files` and the
`download-px-raw-files` CLI), thread it through
`download_from_accession_or_url`, and pass it into
`IproxProvider.aspera_download` so users can override the default 100M cap; keep
the existing default unchanged for backward compatibility.

In `@pridepy/pridepy.py`:
- Around line 66-74: The same `--parallel-files/-w` option is duplicated across
multiple commands, so update `pridepy.py` to define a shared `click.option`
decorator (for example a reusable `parallel_files_option`) with the common
default, `IntRange`, and help text, then apply it to each command that currently
declares the option. Keep the existing `parallel_files` parameter name and
ensure all affected command functions use the shared decorator so future changes
only need one edit.

In `@pridepy/tests/test_download_resilience.py`:
- Around line 250-282: The current test only verifies the clamp when
`parallel_files` is high but the effective `workers` count stays below the
connection cap. Add a second assertion in
`test_download_http_urls_clamps_combined_connection_cap` that uses many URLs and
a large `parallel_files` value so `workers` itself exceeds
`transport.MAX_TOTAL_HTTP_CONNECTIONS`, and verify `download_http_urls` clamps
total connections appropriately (not just per-thread `download_threads`) while
still emitting the warning.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6f054396-3341-4f94-889a-8434aeef2736

📥 Commits

Reviewing files that changed from the base of the PR and between 8546049 and 6dc27af.

📒 Files selected for processing (9)
  • docs/usage.md
  • pridepy/download/client.py
  • pridepy/download/iprox.py
  • pridepy/download/proteomexchange.py
  • pridepy/download/transport.py
  • pridepy/pridepy.py
  • pridepy/tests/test_cli_flatten.py
  • pridepy/tests/test_download_resilience.py
  • pridepy/tests/test_iprox_aspera.py

Comment thread pridepy/download/iprox.py
Comment on lines +74 to +75
path = urlparse(url).path.lstrip("/") # e.g. IPX.../.../a.raw
source = f"{user}@{cls.ASPERA_HOST}:{cls.ASPERA_ROOT}/{path}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the target file and nearby callers
git ls-files pridepy/download/iprox.py pridepy | sed -n '1,120p'
wc -l pridepy/download/iprox.py
cat -n pridepy/download/iprox.py | sed -n '1,220p'

# Find all references to aspera_download and ASPERA_HOST/ASPERA_ROOT
rg -n "aspera_download|ASPERA_HOST|ASPERA_ROOT|urlparse\\(" pridepy

# Look for any upstream URL validation in the PX router or related download entrypoints
rg -n "iprox|PX router|parse.*url|url.*host|hostname|scheme|..|safe path|aspera" pridepy

Repository: PRIDE-Archive/pridepy

Length of output: 50378


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show the target file with line numbers
wc -l pridepy/download/iprox.py
cat -n pridepy/download/iprox.py | sed -n '1,220p'

# Show only the call sites and references around iProX/Aspera handling
rg -n -A4 -B4 "aspera_download|ASPERA_HOST|ASPERA_ROOT|urlparse\\(|iprox" pridepy | sed -n '1,260p'

# Narrowly inspect the downloader/router path if present
rg -n -A20 -B10 "class .*IPROX|def .*aspera|def .*download|router|PX" pridepy/download pridepy | sed -n '1,260p'

Repository: PRIDE-Archive/pridepy

Length of output: 44343


Validate the iProX URL before building the Aspera source path. IproxProvider.aspera_download() is public, but it ignores the parsed host/scheme and feeds urlparse(url).path straight into ascp (pridepy/download/iprox.py:74-75). A crafted URL with .. segments can escape the intended /data/iprox/<accession> subtree, so this needs host/scheme checks and path sanitization here, not only upstream.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pridepy/download/iprox.py` around lines 74 - 75, Validate the input in
IproxProvider.aspera_download() before constructing the Aspera source path:
ensure the parsed URL has the expected iProX scheme/host, and reject any path
containing traversal segments like “..”. Update the logic around
urlparse(url).path and the source string so only normalized paths under the
intended /data/iprox/<accession> subtree are accepted before calling ascp.

Source: Linters/SAST tools

Comment thread pridepy/download/iprox.py
Comment on lines +82 to +97
if (
skip_if_downloaded_already
and os.path.isfile(dest)
and os.path.getsize(dest) > 0
):
logging.info(f"Skipping download as file already exists: {dest}")
return None
argv = [
ascp, "-QT", "-P", cls.ASPERA_PORT, "-l", maximum_bandwidth,
"-k", "2", source, dest,
]
logging.info(
"Aspera: %s -> %s", source.replace(password, "***"), dest
)
try:
subprocess.run(argv, check=True, env=env)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Avoid marking partial Aspera outputs as complete.

If ascp writes a non-zero partial file and fails, the next --skip-if-downloaded-already run returns before -k 2 can resume it. Download to a partial target and atomically promote it only after success, or validate the remote size before skipping dest.

💾 Suggested partial-file flow
         if (
             skip_if_downloaded_already
             and os.path.isfile(dest)
             and os.path.getsize(dest) > 0
         ):
             logging.info(f"Skipping download as file already exists: {dest}")
             return None
+        transfer_dest = f"{dest}.part"
         argv = [
             ascp, "-QT", "-P", cls.ASPERA_PORT, "-l", maximum_bandwidth,
-            "-k", "2", source, dest,
+            "-k", "2", source, transfer_dest,
         ]
         logging.info(
             "Aspera: %s -> %s", source.replace(password, "***"), dest
         )
         try:
             subprocess.run(argv, check=True, env=env)
+            os.replace(transfer_dest, dest)
             return None
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if (
skip_if_downloaded_already
and os.path.isfile(dest)
and os.path.getsize(dest) > 0
):
logging.info(f"Skipping download as file already exists: {dest}")
return None
argv = [
ascp, "-QT", "-P", cls.ASPERA_PORT, "-l", maximum_bandwidth,
"-k", "2", source, dest,
]
logging.info(
"Aspera: %s -> %s", source.replace(password, "***"), dest
)
try:
subprocess.run(argv, check=True, env=env)
if (
skip_if_downloaded_already
and os.path.isfile(dest)
and os.path.getsize(dest) > 0
):
logging.info(f"Skipping download as file already exists: {dest}")
return None
transfer_dest = f"{dest}.part"
argv = [
ascp, "-QT", "-P", cls.ASPERA_PORT, "-l", maximum_bandwidth,
"-k", "2", source, transfer_dest,
]
logging.info(
"Aspera: %s -> %s", source.replace(password, "***"), dest
)
try:
subprocess.run(argv, check=True, env=env)
os.replace(transfer_dest, dest)
return None
🧰 Tools
🪛 ast-grep (0.44.0)

[error] 96-96: Use of unsanitized data to create processes
Context: subprocess.run(argv, check=True, env=env)
Note: [CWE-78] Improper Neutralization of Special Elements used in an OS Command ('OS Command Injection').

(os-system-unsanitized-data)


[error] 96-96: Command coming from incoming request
Context: subprocess.run(argv, check=True, env=env)
Note: [CWE-78] Improper Neutralization of Special Elements used in an OS Command ('OS Command Injection').

(subprocess-from-request)

🪛 Ruff (0.15.20)

[error] 97-97: subprocess call: check for execution of untrusted input

(S603)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pridepy/download/iprox.py` around lines 82 - 97, The skip logic in
`download`/`cls.download` treats any non-empty `dest` as complete, which can
incorrectly skip a failed partial Aspera transfer. Update the
`subprocess.run(argv, check=True, env=env)` flow so `ascp` writes to a
temporary/partial target and only promotes it to `dest` after success, or add a
remote-size/complete-file validation before the `os.path.isfile(dest)` skip
check. Keep the fix localized around the `skip_if_downloaded_already` block and
the Aspera command construction.

Comment thread pridepy/download/iprox.py
Comment on lines +134 to +136
workers = max(1, min(parallel_files, len(urls)))
if workers > 1:
with ThreadPoolExecutor(max_workers=workers) as executor:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Cap Aspera subprocess concurrency too.

parallel_files is user-controlled and each worker starts an ascp process. Large values can exhaust local processes/file descriptors and hammer the remote service; clamp it to a small provider-level maximum and log when reduced.

🚦 Suggested worker cap
+    MAX_ASPERA_PARALLEL_FILES: ClassVar[int] = 8
+
-        workers = max(1, min(parallel_files, len(urls)))
+        requested_workers = max(1, min(parallel_files, len(urls)))
+        workers = min(requested_workers, cls.MAX_ASPERA_PARALLEL_FILES)
+        if workers < requested_workers:
+            logging.warning(
+                "Reducing iProX Aspera parallel_files from %d to %d",
+                requested_workers,
+                workers,
+            )
         if workers > 1:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
workers = max(1, min(parallel_files, len(urls)))
if workers > 1:
with ThreadPoolExecutor(max_workers=workers) as executor:
requested_workers = max(1, min(parallel_files, len(urls)))
workers = min(requested_workers, cls.MAX_ASPERA_PARALLEL_FILES)
if workers < requested_workers:
logging.warning(
"Reducing iProX Aspera parallel_files from %d to %d",
requested_workers,
workers,
)
if workers > 1:
with ThreadPoolExecutor(max_workers=workers) as executor:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pridepy/download/iprox.py` around lines 134 - 136, The worker count used
before the ThreadPoolExecutor/ascp launch is only bounded by len(urls), so
user-controlled parallel_files can still create too many subprocesses. Add a
small provider-level maximum in the worker calculation around the
ThreadPoolExecutor block in pridepy/download/iprox.py, clamp workers to that
cap, and emit a log when the requested parallel_files value is reduced.

Comment on lines 852 to +865
workers = max(1, min(parallel_files, len(http_urls)))
if workers * download_threads > MAX_TOTAL_HTTP_CONNECTIONS:
clamped_threads = max(1, MAX_TOTAL_HTTP_CONNECTIONS // workers)
logging.warning(
"parallel_files (%d) x download_threads (%d) = %d exceeds the "
"combined connection cap of %d; reducing download_threads to %d "
"to keep peak HTTP connections bounded.",
workers,
download_threads,
workers * download_threads,
MAX_TOTAL_HTTP_CONNECTIONS,
clamped_threads,
)
download_threads = clamped_threads

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Files of interest =="
git ls-files pridepy/download/transport.py pridepy.py pridepy | sed -n '1,200p'

echo
echo "== transport.py around target lines =="
nl -ba pridepy/download/transport.py | sed -n '820,900p'

echo
echo "== Search for cap-related symbols and entry points =="
rg -n "MAX_TOTAL_HTTP_CONNECTIONS|parallel_files|download_threads|workers|download_http_urls|argparse|click|typer|fire|--parallel|--threads|-w|-t" pridepy . -g '!**/.git/**'

Repository: PRIDE-Archive/pridepy

Length of output: 3029


🏁 Script executed:

#!/bin/bash
set -euo pipefail

python3 - <<'PY'
from pathlib import Path

path = Path("pridepy/download/transport.py")
lines = path.read_text().splitlines()
for start, end in [(1,120), (820,900)]:
    print(f"== {path} lines {start}-{end} ==")
    for i in range(start, min(end, len(lines)) + 1):
        print(f"{i:4d}: {lines[i-1]}")
    print()
PY

echo "== Search CLI entry points and validations =="
python3 - <<'PY'
from pathlib import Path
import re

targets = [
    Path("pridepy/pridepy.py"),
    Path("pridepy/download/client.py"),
    Path("pridepy/download/base.py"),
    Path("pridepy/download/by_url.py"),
    Path("pridepy/tests/test_cli_flatten.py"),
]
for p in targets:
    if p.exists():
        print(f"\n## {p}")
        text = p.read_text().splitlines()
        for i, line in enumerate(text, 1):
            if re.search(r'parallel_files|download_threads|MAX_TOTAL_HTTP_CONNECTIONS|--parallel|--threads|-w\b|-t\b|argparse|click|typer|fire', line):
                print(f"{i:4d}: {line}")
PY

echo
echo "== Broader symbol search =="
rg -n "parallel_files|download_threads|MAX_TOTAL_HTTP_CONNECTIONS|download_http_urls|--parallel|--threads|-w\\b|-t\\b|argparse|click|typer|fire" pridepy -g '!**/.git/**'

Repository: PRIDE-Archive/pridepy

Length of output: 46710


Clamp workers as well as download_threads

click.IntRange(1, 32) protects the CLI, but direct Client.download_http_urls callers can still pass parallel_files > 64. In that case clamped_threads drops to 1 and the total HTTP connections can still exceed MAX_TOTAL_HTTP_CONNECTIONS. Bound workers too, or clamp the product from both sides.

🔧 Proposed fix
-    workers = max(1, min(parallel_files, len(http_urls)))
+    workers = max(1, min(parallel_files, len(http_urls), MAX_TOTAL_HTTP_CONNECTIONS))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
workers = max(1, min(parallel_files, len(http_urls)))
if workers * download_threads > MAX_TOTAL_HTTP_CONNECTIONS:
clamped_threads = max(1, MAX_TOTAL_HTTP_CONNECTIONS // workers)
logging.warning(
"parallel_files (%d) x download_threads (%d) = %d exceeds the "
"combined connection cap of %d; reducing download_threads to %d "
"to keep peak HTTP connections bounded.",
workers,
download_threads,
workers * download_threads,
MAX_TOTAL_HTTP_CONNECTIONS,
clamped_threads,
)
download_threads = clamped_threads
workers = max(1, min(parallel_files, len(http_urls), MAX_TOTAL_HTTP_CONNECTIONS))
if workers * download_threads > MAX_TOTAL_HTTP_CONNECTIONS:
clamped_threads = max(1, MAX_TOTAL_HTTP_CONNECTIONS // workers)
logging.warning(
"parallel_files (%d) x download_threads (%d) = %d exceeds the "
"combined connection cap of %d; reducing download_threads to %d "
"to keep peak HTTP connections bounded.",
workers,
download_threads,
workers * download_threads,
MAX_TOTAL_HTTP_CONNECTIONS,
clamped_threads,
)
download_threads = clamped_threads
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pridepy/download/transport.py` around lines 852 - 865, The connection cap
logic in Client.download_http_urls only clamps download_threads, so callers can
still exceed MAX_TOTAL_HTTP_CONNECTIONS when workers is very large. Update the
worker calculation near workers and clamped_threads to bound workers as well (or
otherwise clamp the workers x download_threads product from both sides) before
starting downloads, keeping the final total HTTP connections within
MAX_TOTAL_HTTP_CONNECTIONS.

Comment thread pridepy/pridepy.py
Comment on lines 404 to +421
def download_px_raw_files(
accession: str,
output_folder: str,
skip_if_downloaded_already: bool,
protocol: str = "ftp",
download_threads: int = 1,
parallel_files: int = 1,
preserve_structure: bool = False,
iprox_user: Optional[str] = None,
):
"""CLI wrapper to download raw files via ProteomeXchange XML."""
files = Files()
logging.info(f"PX accession/URL: {accession}")

password = os.environ.get("IPROX_ASPERA_PASSWORD")
if protocol.lower() == "aspera" and not password:
password = click.prompt("iProX Aspera password", hide_input=True)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Password prompt fires even when --iprox-user was never provided.

If protocol == "aspera" and IPROX_ASPERA_PASSWORD isn't set, click.prompt runs unconditionally — even if iprox_user is None. The user is prompted for a password only to hit IproxProvider.aspera_download's credential-validation error afterward. Fail fast before prompting.

🛡️ Proposed fix
     password = os.environ.get("IPROX_ASPERA_PASSWORD")
-    if protocol.lower() == "aspera" and not password:
+    if protocol.lower() == "aspera" and not iprox_user:
+        raise click.UsageError("--iprox-user is required with --protocol aspera.")
+    if protocol.lower() == "aspera" and not password:
         password = click.prompt("iProX Aspera password", hide_input=True)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def download_px_raw_files(
accession: str,
output_folder: str,
skip_if_downloaded_already: bool,
protocol: str = "ftp",
download_threads: int = 1,
parallel_files: int = 1,
preserve_structure: bool = False,
iprox_user: Optional[str] = None,
):
"""CLI wrapper to download raw files via ProteomeXchange XML."""
files = Files()
logging.info(f"PX accession/URL: {accession}")
password = os.environ.get("IPROX_ASPERA_PASSWORD")
if protocol.lower() == "aspera" and not password:
password = click.prompt("iProX Aspera password", hide_input=True)
def download_px_raw_files(
accession: str,
output_folder: str,
skip_if_downloaded_already: bool,
protocol: str = "ftp",
download_threads: int = 1,
parallel_files: int = 1,
preserve_structure: bool = False,
iprox_user: Optional[str] = None,
):
"""CLI wrapper to download raw files via ProteomeXchange XML."""
files = Files()
logging.info(f"PX accession/URL: {accession}")
password = os.environ.get("IPROX_ASPERA_PASSWORD")
if protocol.lower() == "aspera" and not iprox_user:
raise click.UsageError("--iprox-user is required with --protocol aspera.")
if protocol.lower() == "aspera" and not password:
password = click.prompt("iProX Aspera password", hide_input=True)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pridepy/pridepy.py` around lines 404 - 421, The password prompt path in
download_px_raw_files is running even when iprox_user was never provided. Update
the aspera branch in download_px_raw_files to validate iprox_user before calling
click.prompt, and fail fast with the existing credential error path if it is
missing. Keep the fix localized to the download_px_raw_files flow and its
handoff to IproxProvider.aspera_download so the prompt only appears when a user
can actually authenticate.

password="secret",
maximum_bandwidth="100M",
)
args, kwargs = run.call_args

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Clean up the Ruff warnings in the new tests.

Use a dummy name for the unused unpacked value and escape the literal dot in the pytest.raises(..., match=...) pattern.

🧹 Suggested lint fixes
-        args, kwargs = run.call_args
+        args, _kwargs = run.call_args
         argv = args[0]
-            with pytest.raises(RuntimeError, match="b.raw"):
+            with pytest.raises(RuntimeError, match=re.escape("b.raw")):

Also applies to: 166-166

🧰 Tools
🪛 Ruff (0.15.20)

[warning] 143-143: Unpacked variable kwargs is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pridepy/tests/test_iprox_aspera.py` at line 143, Clean up the Ruff warnings
in the new tests by fixing the unused tuple unpack from run.call_args in the
test that inspects the mock invocation: assign the unused value to a dummy name
(for example in the assertion around run.call_args). Also update the
pytest.raises(..., match=...) pattern in the affected test(s) to escape the
literal dot so the regex matches the exact error text instead of treating the
dot as a wildcard.

Source: Linters/SAST tools

Comment on lines +279 to +281
rels = asp.call_args.kwargs["relative_paths"]
# Flattened + de-duped basenames, no subdirectories preserved.
assert set(rels) == {"a.raw", "a_1.raw"}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Assert flattened paths in order.

relative_paths is positional with urls; using a set would miss a swapped mapping.

✅ Suggested assertion
         rels = asp.call_args.kwargs["relative_paths"]
         # Flattened + de-duped basenames, no subdirectories preserved.
-        assert set(rels) == {"a.raw", "a_1.raw"}
+        assert rels == ["a.raw", "a_1.raw"]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
rels = asp.call_args.kwargs["relative_paths"]
# Flattened + de-duped basenames, no subdirectories preserved.
assert set(rels) == {"a.raw", "a_1.raw"}
rels = asp.call_args.kwargs["relative_paths"]
# Flattened + de-duped basenames, no subdirectories preserved.
assert rels == ["a.raw", "a_1.raw"]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pridepy/tests/test_iprox_aspera.py` around lines 279 - 281, The test in
test_iprox_aspera should assert the order of relative_paths instead of
converting to a set, since its mapping to urls is positional and a swapped
pairing would be missed. Update the assertion near
asp.call_args.kwargs["relative_paths"] to compare the flattened, de-duped
basenames in sequence, preserving the existing no-subdirectories expectation
while checking the exact order.

@qodo-code-review

Copy link
Copy Markdown
Contributor

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📜 Skill insights (0)

Grey Divider


Remediation recommended

1. Aspera prompt can hang 🐞 Bug ☼ Reliability
Description
download_px_raw_files prompts for an iProX Aspera password as soon as --protocol aspera is set
and IPROX_ASPERA_PASSWORD is missing, before verifying --iprox-user is present or that the
dataset contains iProX-hosted files. This can block non-interactive executions and unnecessarily
asks for a password even when the code will later error out.
Code

pridepy/pridepy.py[R418-420]

+    password = os.environ.get("IPROX_ASPERA_PASSWORD")
+    if protocol.lower() == "aspera" and not password:
+        password = click.prompt("iProX Aspera password", hide_input=True)
Evidence
The CLI unconditionally prompts when --protocol aspera is set and no env password is present, but
the actual Aspera path later rejects missing credentials and can also reject datasets with no
iProX-hosted files—both after the prompt would already have occurred.

pridepy/pridepy.py[418-432]
pridepy/download/iprox.py[123-128]
pridepy/download/proteomexchange.py[199-211]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The CLI prompts for the iProX Aspera password too early (immediately on `--protocol aspera`), which can hang in non-interactive contexts and can prompt even if the dataset has no iProX-hosted files or `--iprox-user` is missing.

## Issue Context
- Aspera mode is only valid if (a) the dataset has iProX-hosted files and (b) credentials are provided.
- Current flow prompts before either condition is validated.

## Fix Focus Areas
- pridepy/pridepy.py[418-431]
- pridepy/download/proteomexchange.py[199-238]

### Suggested fix
1) In the CLI handler, if `protocol.lower() == "aspera"` and `iprox_user` is missing, fail fast with a `click.UsageError` (no prompt).
2) Avoid prompting until you know Aspera will be used:
  - Option A (preferred): do a lightweight preflight listing (PX XML fetch) to confirm at least one file is hosted on `download.iprox.org`; if none, error without prompting.
  - Option B: call the downloader with `iprox_password=None` and, on the specific "missing credentials" error, prompt only if stdin is a TTY; otherwise raise a clear error instructing to set `IPROX_ASPERA_PASSWORD`.
3) If prompting is needed, only prompt when stdin is interactive; otherwise require `IPROX_ASPERA_PASSWORD`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Informational

2. iProX creds text ambiguous 🐞 Bug ⚙ Maintainability
Description
The iProX repository description still says pridepy uses the public HTTP endpoint so no iProX
credentials are required, without clarifying that credentials are required when opting into the new
--protocol aspera mode. This can confuse users about when --iprox-user/IPROX_ASPERA_PASSWORD
are needed.
Code

docs/usage.md[323]

+- **iProX** fetches the dataset's ProteomeXchange XML from `http://download.iprox.org/<accession>/PX_<accession>.xml`, then downloads each referenced file from the same host over anonymous HTTP (with `Range` support for resume). iProX also exposes Aspera with username/password for very large bulk transfers; `pridepy` uses the public HTTP endpoint so no iProX credentials are required.
Evidence
The docs explicitly document credentialed iProX Aspera usage, but the iProX bullet later states
(without qualification) that pridepy uses HTTP and no credentials are required, which becomes
ambiguous after adding Aspera support.

docs/usage.md[279-294]
docs/usage.md[323-323]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The iProX docs section implies no iProX credentials are required, but the same document also introduces optional credentialed Aspera mode. The repo description should distinguish default HTTP behavior from opt-in Aspera behavior.

## Issue Context
Users reading the "How each repository is enumerated" section may miss the earlier Aspera section and assume credentials are never needed.

## Fix Focus Areas
- docs/usage.md[279-294]
- docs/usage.md[323-323]

### Suggested fix
Rewrite the iProX bullet to something like:
"... downloads over anonymous HTTP by default (no credentials required). iProX also supports opt-in Aspera transfers (`--protocol aspera`), which require an iProX account (`--iprox-user` and `IPROX_ASPERA_PASSWORD`)."

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

Comment thread pridepy/pridepy.py
Comment on lines +418 to +420
password = os.environ.get("IPROX_ASPERA_PASSWORD")
if protocol.lower() == "aspera" and not password:
password = click.prompt("iProX Aspera password", hide_input=True)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remediation recommended

1. Aspera prompt can hang 🐞 Bug ☼ Reliability

download_px_raw_files prompts for an iProX Aspera password as soon as --protocol aspera is set
and IPROX_ASPERA_PASSWORD is missing, before verifying --iprox-user is present or that the
dataset contains iProX-hosted files. This can block non-interactive executions and unnecessarily
asks for a password even when the code will later error out.
Agent Prompt
## Issue description
The CLI prompts for the iProX Aspera password too early (immediately on `--protocol aspera`), which can hang in non-interactive contexts and can prompt even if the dataset has no iProX-hosted files or `--iprox-user` is missing.

## Issue Context
- Aspera mode is only valid if (a) the dataset has iProX-hosted files and (b) credentials are provided.
- Current flow prompts before either condition is validated.

## Fix Focus Areas
- pridepy/pridepy.py[418-431]
- pridepy/download/proteomexchange.py[199-238]

### Suggested fix
1) In the CLI handler, if `protocol.lower() == "aspera"` and `iprox_user` is missing, fail fast with a `click.UsageError` (no prompt).
2) Avoid prompting until you know Aspera will be used:
   - Option A (preferred): do a lightweight preflight listing (PX XML fetch) to confirm at least one file is hosted on `download.iprox.org`; if none, error without prompting.
   - Option B: call the downloader with `iprox_password=None` and, on the specific "missing credentials" error, prompt only if stdin is a TTY; otherwise raise a clear error instructing to set `IPROX_ASPERA_PASSWORD`.
3) If prompting is needed, only prompt when stdin is interactive; otherwise require `IPROX_ASPERA_PASSWORD`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Comment thread docs/usage.md
- **MassIVE** walks the FTPS tree at `massive-ftp.ucsd.edu` (the server requires TLS). MassIVE distributes datasets across versioned root directories (`/v01`–`/vNN`); `pridepy` discovers the correct root automatically. If FTP/FTPS is blocked by the network, `pridepy` falls back to HTTPS: it lists the dataset from the GNPS2 file index (`datasetcache.gnps2.org`) and downloads each file from the ProteoSAFe endpoint at `massive.ucsd.edu` (byte-identical to the FTPS copy).
- **JPOST** lists files through the JSON PROXI endpoint at `https://repository.jpostdb.org/proxi/datasets/<JPSTxxxxxx>` and downloads from `ftp.jpostdb.org` over plain FTP. The PROXI listing avoids the source-IP connection limit JPOST enforces on FTP.
- **iProX** fetches the dataset's ProteomeXchange XML from `http://download.iprox.org/<accession>/PX_<accession>.xml`, then downloads each referenced file from the same host over anonymous HTTP (with `Range` support for resume). iProX also exposes Aspera (`faspe://`) with username/password for very large bulk transfers; `pridepy` uses the public HTTP endpoint so no iProX credentials are required.
- **iProX** fetches the dataset's ProteomeXchange XML from `http://download.iprox.org/<accession>/PX_<accession>.xml`, then downloads each referenced file from the same host over anonymous HTTP (with `Range` support for resume). iProX also exposes Aspera with username/password for very large bulk transfers; `pridepy` uses the public HTTP endpoint so no iProX credentials are required.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Informational

2. Iprox creds text ambiguous 🐞 Bug ⚙ Maintainability

The iProX repository description still says pridepy uses the public HTTP endpoint so no iProX
credentials are required, without clarifying that credentials are required when opting into the new
--protocol aspera mode. This can confuse users about when --iprox-user/IPROX_ASPERA_PASSWORD
are needed.
Agent Prompt
## Issue description
The iProX docs section implies no iProX credentials are required, but the same document also introduces optional credentialed Aspera mode. The repo description should distinguish default HTTP behavior from opt-in Aspera behavior.

## Issue Context
Users reading the "How each repository is enumerated" section may miss the earlier Aspera section and assume credentials are never needed.

## Fix Focus Areas
- docs/usage.md[279-294]
- docs/usage.md[323-323]

### Suggested fix
Rewrite the iProX bullet to something like:
"... downloads over anonymous HTTP by default (no credentials required). iProX also supports opt-in Aspera transfers (`--protocol aspera`), which require an iProX account (`--iprox-user` and `IPROX_ASPERA_PASSWORD`)."

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants