Conversation
…ownload-px-raw-files (protocol/threads/parallel)
…-files Adds IproxProvider.aspera_download (ascp on port 33001, credentials via ASPERA_SCP_PASS env only) and routes ProteomeXchangeProvider through it when --protocol aspera is selected, gated on iProX-hosted URLs. Plumbs --iprox-user/--iprox-password (env: IPROX_USER/IPROX_ASPERA_PASSWORD) through Client.download_px_raw_files and the CLI. HTTP-parallel remains the default transport.
… host, honor -w) + concurrency cap + hidden password prompt
Faster iProX/ProteomeXchange downloads: expose -w parallel files, wire download-px-raw-files, opt-in iProX Aspera
📝 WalkthroughWalkthroughThis PR adds Aspera-based fast downloads for iProX datasets, introduces ChangesiProX Aspera and parallel download plumbing
Estimated code review effort: 4 (Complex) | ~60 minutes Possibly related PRs
Suggested reviewers: 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
This PR upgrades pridepy’s bulk-download capabilities by adding across-file parallelism (--parallel-files) and integrating iProX Aspera transfers for faster ProteomeXchange (PX) downloads, with corresponding CLI, transport-layer, and documentation updates.
Changes:
- Add
--parallel-files/-wplumbing through CLI → client → provider → transport, plus a combined HTTP connection cap. - Add iProX Aspera support (ascp command construction, credential handling, and PX routing when
--protocol asperais selected). - Expand/adjust tests and usage documentation for parallelism and Aspera.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| pridepy/tests/test_iprox_aspera.py | New tests for iProX Aspera command/env construction, routing, flattening, and safety checks. |
| pridepy/tests/test_download_resilience.py | Adds a regression test ensuring the combined connection cap clamps per-file threads. |
| pridepy/tests/test_cli_flatten.py | Updates CLI tests to assert parallel_files default/flag behavior across commands. |
| pridepy/pridepy.py | Adds CLI flags for --parallel-files, PX --protocol, and iProX credential handling/prompting. |
| pridepy/download/transport.py | Introduces MAX_TOTAL_HTTP_CONNECTIONS and clamps HTTP concurrency. |
| pridepy/download/proteomexchange.py | Routes PX downloads to iProX Aspera when requested; supports flattening in that branch. |
| pridepy/download/iprox.py | Implements iProX Aspera transfer execution (including parallel subprocess runs) and safe destination handling. |
| pridepy/download/client.py | Propagates new PX download parameters through the client API. |
| docs/usage.md | Documents new PX CLI options, parallelism guidance, and iProX Aspera password handling. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| failed: List[str] = [] | ||
| workers = max(1, min(parallel_files, len(http_urls))) | ||
| if workers * download_threads > MAX_TOTAL_HTTP_CONNECTIONS: | ||
| clamped_threads = max(1, MAX_TOTAL_HTTP_CONNECTIONS // workers) | ||
| logging.warning( | ||
| "parallel_files (%d) x download_threads (%d) = %d exceeds the " | ||
| "combined connection cap of %d; reducing download_threads to %d " | ||
| "to keep peak HTTP connections bounded.", | ||
| workers, | ||
| download_threads, | ||
| workers * download_threads, | ||
| MAX_TOTAL_HTTP_CONNECTIONS, | ||
| clamped_threads, | ||
| ) | ||
| download_threads = clamped_threads | ||
| if workers > 1: |
| password = os.environ.get("IPROX_ASPERA_PASSWORD") | ||
| if protocol.lower() == "aspera" and not password: | ||
| password = click.prompt("iProX Aspera password", hide_input=True) | ||
|
|
PR Summary by QodoSpeed up PX/iProX downloads with parallelism and opt-in Aspera support
AI Description
Diagram
High-Level Assessment
Files changed (9)
|
There was a problem hiding this comment.
Actionable comments posted: 7
🧹 Nitpick comments (5)
docs/usage.md (2)
253-257: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winConsider noting the internal concurrency cap.
transport.download_http_urlsclampsdownload_threadswhenparallel_files × download_threadsexceeds an internal cap (seeMAX_TOTAL_HTTP_CONNECTIONS), so actual concurrency won't scale linearly past that point. A brief caveat here would prevent user confusion when e.g.-w 32 -t 32doesn't yield 1024 real connections.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/usage.md` around lines 253 - 257, Update the “Fast downloads” section in usage.md to mention the internal concurrency limit enforced by transport.download_http_urls. Note that download_threads is clamped when parallel_files × download_threads exceeds MAX_TOTAL_HTTP_CONNECTIONS, so actual concurrency may stop scaling linearly; keep the guidance brief and tie it to the existing parallel_files × threads explanation.
242-242: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winProtocol table overstates globus/s3 support for
download-px-raw-files.For this command, only
ftpandasperaare actually implemented —base.Provider.download_files(used for the non-aspera path) just warns and falls through to whatever URL scheme is in the PX XML ifglobus/s3is requested; there's no genuine multi-protocol fallback here (unlike PRIDE's_protocol_sequence). Consider clarifying thatglobus/s3aren't meaningfully supported for PX/iProX downloads to avoid confusing users.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/usage.md` at line 242, Clarify the `download-px-raw-files` protocol documentation so it does not imply real `globus` or `s3` support. Update the protocol table entry in `docs/usage.md` to reflect that the command only meaningfully supports `ftp` and `aspera`, and note that the non-aspera path in `base.Provider.download_files` does not implement a true fallback sequence for other protocols.pridepy/pridepy.py (1)
66-74: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winDuplicate
--parallel-files/-woption definition across 5 commands.Same default, range, and help text repeated verbatim. Consider extracting a shared decorator (e.g.
parallel_files_option = click.option("-w", "--parallel-files", "parallel_files", default=1, type=click.IntRange(1, 32), help=...)) and applying it via@parallel_files_optionat each call site, so future tweaks to bounds/wording only need one edit.Also applies to: 179-187, 377-385, 681-689, 783-791
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pridepy/pridepy.py` around lines 66 - 74, The same `--parallel-files/-w` option is duplicated across multiple commands, so update `pridepy.py` to define a shared `click.option` decorator (for example a reusable `parallel_files_option`) with the common default, `IntRange`, and help text, then apply it to each command that currently declares the option. Keep the existing `parallel_files` parameter name and ensure all affected command functions use the shared decorator so future changes only need one edit.pridepy/download/proteomexchange.py (1)
199-238: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winiProX Aspera path has no bandwidth-throttling control.
IproxProvider.aspera_downloadacceptsmaximum_bandwidth(default"100M"), but neitherdownload_from_accession_or_urlnorClient.download_px_raw_filesnor thedownload-px-raw-filesCLI expose a way to override it — every PX-driven iProX Aspera transfer is capped at 100M regardless of user intent. Given Aspera is specifically pitched for "very large bulk transfers," bandwidth control is a reasonable expectation.♻️ Sketch of the plumbing needed
def download_from_accession_or_url( self, px_id_or_url: str, output_folder: str, skip_if_downloaded_already: bool = True, flatten: bool = True, parallel_files: int = 1, download_threads: int = 1, protocol: str = "ftp", + aspera_maximum_bandwidth: str = "100M", iprox_user: Optional[str] = None, iprox_password: Optional[str] = None, ) -> None: ... IproxProvider.aspera_download( urls=iprox_urls, output_folder=output_folder, relative_paths=dest_rels, user=iprox_user, password=iprox_password, + maximum_bandwidth=aspera_maximum_bandwidth, skip_if_downloaded_already=skip_if_downloaded_already, parallel_files=parallel_files, )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pridepy/download/proteomexchange.py` around lines 199 - 238, The iProX Aspera download path currently hardcodes the default bandwidth limit because `download_from_accession_or_url` and the PX download plumbing never pass a bandwidth value through to `IproxProvider.aspera_download`. Add a `maximum_bandwidth` option to the PX download entry points (including `Client.download_px_raw_files` and the `download-px-raw-files` CLI), thread it through `download_from_accession_or_url`, and pass it into `IproxProvider.aspera_download` so users can override the default 100M cap; keep the existing default unchanged for backward compatibility.pridepy/tests/test_download_resilience.py (1)
250-282: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winGood coverage of the intended clamp path; consider adding a case where
workersalone exceeds the cap.This test only exercises
parallel_files=32with 3 URLs (workers=3), which stays well underMAX_TOTAL_HTTP_CONNECTIONS. It doesn't cover the scenario wherelen(http_urls)andparallel_filesare both large enough thatworkersitself exceeds the cap (see the corresponding comment ontransport.pylines 852-865) — in that casedownload_threadsfloors to 1 but total connections can still exceed 64. Adding a test with e.g. 100 URLs andparallel_files=100would catch that gap once fixed.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pridepy/tests/test_download_resilience.py` around lines 250 - 282, The current test only verifies the clamp when `parallel_files` is high but the effective `workers` count stays below the connection cap. Add a second assertion in `test_download_http_urls_clamps_combined_connection_cap` that uses many URLs and a large `parallel_files` value so `workers` itself exceeds `transport.MAX_TOTAL_HTTP_CONNECTIONS`, and verify `download_http_urls` clamps total connections appropriately (not just per-thread `download_threads`) while still emitting the warning.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pridepy/download/iprox.py`:
- Around line 82-97: The skip logic in `download`/`cls.download` treats any
non-empty `dest` as complete, which can incorrectly skip a failed partial Aspera
transfer. Update the `subprocess.run(argv, check=True, env=env)` flow so `ascp`
writes to a temporary/partial target and only promotes it to `dest` after
success, or add a remote-size/complete-file validation before the
`os.path.isfile(dest)` skip check. Keep the fix localized around the
`skip_if_downloaded_already` block and the Aspera command construction.
- Around line 74-75: Validate the input in IproxProvider.aspera_download()
before constructing the Aspera source path: ensure the parsed URL has the
expected iProX scheme/host, and reject any path containing traversal segments
like “..”. Update the logic around urlparse(url).path and the source string so
only normalized paths under the intended /data/iprox/<accession> subtree are
accepted before calling ascp.
- Around line 134-136: The worker count used before the ThreadPoolExecutor/ascp
launch is only bounded by len(urls), so user-controlled parallel_files can still
create too many subprocesses. Add a small provider-level maximum in the worker
calculation around the ThreadPoolExecutor block in pridepy/download/iprox.py,
clamp workers to that cap, and emit a log when the requested parallel_files
value is reduced.
In `@pridepy/download/transport.py`:
- Around line 852-865: The connection cap logic in Client.download_http_urls
only clamps download_threads, so callers can still exceed
MAX_TOTAL_HTTP_CONNECTIONS when workers is very large. Update the worker
calculation near workers and clamped_threads to bound workers as well (or
otherwise clamp the workers x download_threads product from both sides) before
starting downloads, keeping the final total HTTP connections within
MAX_TOTAL_HTTP_CONNECTIONS.
In `@pridepy/pridepy.py`:
- Around line 404-421: The password prompt path in download_px_raw_files is
running even when iprox_user was never provided. Update the aspera branch in
download_px_raw_files to validate iprox_user before calling click.prompt, and
fail fast with the existing credential error path if it is missing. Keep the fix
localized to the download_px_raw_files flow and its handoff to
IproxProvider.aspera_download so the prompt only appears when a user can
actually authenticate.
In `@pridepy/tests/test_iprox_aspera.py`:
- Line 143: Clean up the Ruff warnings in the new tests by fixing the unused
tuple unpack from run.call_args in the test that inspects the mock invocation:
assign the unused value to a dummy name (for example in the assertion around
run.call_args). Also update the pytest.raises(..., match=...) pattern in the
affected test(s) to escape the literal dot so the regex matches the exact error
text instead of treating the dot as a wildcard.
- Around line 279-281: The test in test_iprox_aspera should assert the order of
relative_paths instead of converting to a set, since its mapping to urls is
positional and a swapped pairing would be missed. Update the assertion near
asp.call_args.kwargs["relative_paths"] to compare the flattened, de-duped
basenames in sequence, preserving the existing no-subdirectories expectation
while checking the exact order.
---
Nitpick comments:
In `@docs/usage.md`:
- Around line 253-257: Update the “Fast downloads” section in usage.md to
mention the internal concurrency limit enforced by transport.download_http_urls.
Note that download_threads is clamped when parallel_files × download_threads
exceeds MAX_TOTAL_HTTP_CONNECTIONS, so actual concurrency may stop scaling
linearly; keep the guidance brief and tie it to the existing parallel_files ×
threads explanation.
- Line 242: Clarify the `download-px-raw-files` protocol documentation so it
does not imply real `globus` or `s3` support. Update the protocol table entry in
`docs/usage.md` to reflect that the command only meaningfully supports `ftp` and
`aspera`, and note that the non-aspera path in `base.Provider.download_files`
does not implement a true fallback sequence for other protocols.
In `@pridepy/download/proteomexchange.py`:
- Around line 199-238: The iProX Aspera download path currently hardcodes the
default bandwidth limit because `download_from_accession_or_url` and the PX
download plumbing never pass a bandwidth value through to
`IproxProvider.aspera_download`. Add a `maximum_bandwidth` option to the PX
download entry points (including `Client.download_px_raw_files` and the
`download-px-raw-files` CLI), thread it through
`download_from_accession_or_url`, and pass it into
`IproxProvider.aspera_download` so users can override the default 100M cap; keep
the existing default unchanged for backward compatibility.
In `@pridepy/pridepy.py`:
- Around line 66-74: The same `--parallel-files/-w` option is duplicated across
multiple commands, so update `pridepy.py` to define a shared `click.option`
decorator (for example a reusable `parallel_files_option`) with the common
default, `IntRange`, and help text, then apply it to each command that currently
declares the option. Keep the existing `parallel_files` parameter name and
ensure all affected command functions use the shared decorator so future changes
only need one edit.
In `@pridepy/tests/test_download_resilience.py`:
- Around line 250-282: The current test only verifies the clamp when
`parallel_files` is high but the effective `workers` count stays below the
connection cap. Add a second assertion in
`test_download_http_urls_clamps_combined_connection_cap` that uses many URLs and
a large `parallel_files` value so `workers` itself exceeds
`transport.MAX_TOTAL_HTTP_CONNECTIONS`, and verify `download_http_urls` clamps
total connections appropriately (not just per-thread `download_threads`) while
still emitting the warning.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6f054396-3341-4f94-889a-8434aeef2736
📒 Files selected for processing (9)
docs/usage.mdpridepy/download/client.pypridepy/download/iprox.pypridepy/download/proteomexchange.pypridepy/download/transport.pypridepy/pridepy.pypridepy/tests/test_cli_flatten.pypridepy/tests/test_download_resilience.pypridepy/tests/test_iprox_aspera.py
| path = urlparse(url).path.lstrip("/") # e.g. IPX.../.../a.raw | ||
| source = f"{user}@{cls.ASPERA_HOST}:{cls.ASPERA_ROOT}/{path}" |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the target file and nearby callers
git ls-files pridepy/download/iprox.py pridepy | sed -n '1,120p'
wc -l pridepy/download/iprox.py
cat -n pridepy/download/iprox.py | sed -n '1,220p'
# Find all references to aspera_download and ASPERA_HOST/ASPERA_ROOT
rg -n "aspera_download|ASPERA_HOST|ASPERA_ROOT|urlparse\\(" pridepy
# Look for any upstream URL validation in the PX router or related download entrypoints
rg -n "iprox|PX router|parse.*url|url.*host|hostname|scheme|..|safe path|aspera" pridepyRepository: PRIDE-Archive/pridepy
Length of output: 50378
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show the target file with line numbers
wc -l pridepy/download/iprox.py
cat -n pridepy/download/iprox.py | sed -n '1,220p'
# Show only the call sites and references around iProX/Aspera handling
rg -n -A4 -B4 "aspera_download|ASPERA_HOST|ASPERA_ROOT|urlparse\\(|iprox" pridepy | sed -n '1,260p'
# Narrowly inspect the downloader/router path if present
rg -n -A20 -B10 "class .*IPROX|def .*aspera|def .*download|router|PX" pridepy/download pridepy | sed -n '1,260p'Repository: PRIDE-Archive/pridepy
Length of output: 44343
Validate the iProX URL before building the Aspera source path. IproxProvider.aspera_download() is public, but it ignores the parsed host/scheme and feeds urlparse(url).path straight into ascp (pridepy/download/iprox.py:74-75). A crafted URL with .. segments can escape the intended /data/iprox/<accession> subtree, so this needs host/scheme checks and path sanitization here, not only upstream.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pridepy/download/iprox.py` around lines 74 - 75, Validate the input in
IproxProvider.aspera_download() before constructing the Aspera source path:
ensure the parsed URL has the expected iProX scheme/host, and reject any path
containing traversal segments like “..”. Update the logic around
urlparse(url).path and the source string so only normalized paths under the
intended /data/iprox/<accession> subtree are accepted before calling ascp.
Source: Linters/SAST tools
| if ( | ||
| skip_if_downloaded_already | ||
| and os.path.isfile(dest) | ||
| and os.path.getsize(dest) > 0 | ||
| ): | ||
| logging.info(f"Skipping download as file already exists: {dest}") | ||
| return None | ||
| argv = [ | ||
| ascp, "-QT", "-P", cls.ASPERA_PORT, "-l", maximum_bandwidth, | ||
| "-k", "2", source, dest, | ||
| ] | ||
| logging.info( | ||
| "Aspera: %s -> %s", source.replace(password, "***"), dest | ||
| ) | ||
| try: | ||
| subprocess.run(argv, check=True, env=env) |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Avoid marking partial Aspera outputs as complete.
If ascp writes a non-zero partial file and fails, the next --skip-if-downloaded-already run returns before -k 2 can resume it. Download to a partial target and atomically promote it only after success, or validate the remote size before skipping dest.
💾 Suggested partial-file flow
if (
skip_if_downloaded_already
and os.path.isfile(dest)
and os.path.getsize(dest) > 0
):
logging.info(f"Skipping download as file already exists: {dest}")
return None
+ transfer_dest = f"{dest}.part"
argv = [
ascp, "-QT", "-P", cls.ASPERA_PORT, "-l", maximum_bandwidth,
- "-k", "2", source, dest,
+ "-k", "2", source, transfer_dest,
]
logging.info(
"Aspera: %s -> %s", source.replace(password, "***"), dest
)
try:
subprocess.run(argv, check=True, env=env)
+ os.replace(transfer_dest, dest)
return None📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if ( | |
| skip_if_downloaded_already | |
| and os.path.isfile(dest) | |
| and os.path.getsize(dest) > 0 | |
| ): | |
| logging.info(f"Skipping download as file already exists: {dest}") | |
| return None | |
| argv = [ | |
| ascp, "-QT", "-P", cls.ASPERA_PORT, "-l", maximum_bandwidth, | |
| "-k", "2", source, dest, | |
| ] | |
| logging.info( | |
| "Aspera: %s -> %s", source.replace(password, "***"), dest | |
| ) | |
| try: | |
| subprocess.run(argv, check=True, env=env) | |
| if ( | |
| skip_if_downloaded_already | |
| and os.path.isfile(dest) | |
| and os.path.getsize(dest) > 0 | |
| ): | |
| logging.info(f"Skipping download as file already exists: {dest}") | |
| return None | |
| transfer_dest = f"{dest}.part" | |
| argv = [ | |
| ascp, "-QT", "-P", cls.ASPERA_PORT, "-l", maximum_bandwidth, | |
| "-k", "2", source, transfer_dest, | |
| ] | |
| logging.info( | |
| "Aspera: %s -> %s", source.replace(password, "***"), dest | |
| ) | |
| try: | |
| subprocess.run(argv, check=True, env=env) | |
| os.replace(transfer_dest, dest) | |
| return None |
🧰 Tools
🪛 ast-grep (0.44.0)
[error] 96-96: Use of unsanitized data to create processes
Context: subprocess.run(argv, check=True, env=env)
Note: [CWE-78] Improper Neutralization of Special Elements used in an OS Command ('OS Command Injection').
(os-system-unsanitized-data)
[error] 96-96: Command coming from incoming request
Context: subprocess.run(argv, check=True, env=env)
Note: [CWE-78] Improper Neutralization of Special Elements used in an OS Command ('OS Command Injection').
(subprocess-from-request)
🪛 Ruff (0.15.20)
[error] 97-97: subprocess call: check for execution of untrusted input
(S603)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pridepy/download/iprox.py` around lines 82 - 97, The skip logic in
`download`/`cls.download` treats any non-empty `dest` as complete, which can
incorrectly skip a failed partial Aspera transfer. Update the
`subprocess.run(argv, check=True, env=env)` flow so `ascp` writes to a
temporary/partial target and only promotes it to `dest` after success, or add a
remote-size/complete-file validation before the `os.path.isfile(dest)` skip
check. Keep the fix localized around the `skip_if_downloaded_already` block and
the Aspera command construction.
| workers = max(1, min(parallel_files, len(urls))) | ||
| if workers > 1: | ||
| with ThreadPoolExecutor(max_workers=workers) as executor: |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Cap Aspera subprocess concurrency too.
parallel_files is user-controlled and each worker starts an ascp process. Large values can exhaust local processes/file descriptors and hammer the remote service; clamp it to a small provider-level maximum and log when reduced.
🚦 Suggested worker cap
+ MAX_ASPERA_PARALLEL_FILES: ClassVar[int] = 8
+
- workers = max(1, min(parallel_files, len(urls)))
+ requested_workers = max(1, min(parallel_files, len(urls)))
+ workers = min(requested_workers, cls.MAX_ASPERA_PARALLEL_FILES)
+ if workers < requested_workers:
+ logging.warning(
+ "Reducing iProX Aspera parallel_files from %d to %d",
+ requested_workers,
+ workers,
+ )
if workers > 1:📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| workers = max(1, min(parallel_files, len(urls))) | |
| if workers > 1: | |
| with ThreadPoolExecutor(max_workers=workers) as executor: | |
| requested_workers = max(1, min(parallel_files, len(urls))) | |
| workers = min(requested_workers, cls.MAX_ASPERA_PARALLEL_FILES) | |
| if workers < requested_workers: | |
| logging.warning( | |
| "Reducing iProX Aspera parallel_files from %d to %d", | |
| requested_workers, | |
| workers, | |
| ) | |
| if workers > 1: | |
| with ThreadPoolExecutor(max_workers=workers) as executor: |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pridepy/download/iprox.py` around lines 134 - 136, The worker count used
before the ThreadPoolExecutor/ascp launch is only bounded by len(urls), so
user-controlled parallel_files can still create too many subprocesses. Add a
small provider-level maximum in the worker calculation around the
ThreadPoolExecutor block in pridepy/download/iprox.py, clamp workers to that
cap, and emit a log when the requested parallel_files value is reduced.
| workers = max(1, min(parallel_files, len(http_urls))) | ||
| if workers * download_threads > MAX_TOTAL_HTTP_CONNECTIONS: | ||
| clamped_threads = max(1, MAX_TOTAL_HTTP_CONNECTIONS // workers) | ||
| logging.warning( | ||
| "parallel_files (%d) x download_threads (%d) = %d exceeds the " | ||
| "combined connection cap of %d; reducing download_threads to %d " | ||
| "to keep peak HTTP connections bounded.", | ||
| workers, | ||
| download_threads, | ||
| workers * download_threads, | ||
| MAX_TOTAL_HTTP_CONNECTIONS, | ||
| clamped_threads, | ||
| ) | ||
| download_threads = clamped_threads |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Files of interest =="
git ls-files pridepy/download/transport.py pridepy.py pridepy | sed -n '1,200p'
echo
echo "== transport.py around target lines =="
nl -ba pridepy/download/transport.py | sed -n '820,900p'
echo
echo "== Search for cap-related symbols and entry points =="
rg -n "MAX_TOTAL_HTTP_CONNECTIONS|parallel_files|download_threads|workers|download_http_urls|argparse|click|typer|fire|--parallel|--threads|-w|-t" pridepy . -g '!**/.git/**'Repository: PRIDE-Archive/pridepy
Length of output: 3029
🏁 Script executed:
#!/bin/bash
set -euo pipefail
python3 - <<'PY'
from pathlib import Path
path = Path("pridepy/download/transport.py")
lines = path.read_text().splitlines()
for start, end in [(1,120), (820,900)]:
print(f"== {path} lines {start}-{end} ==")
for i in range(start, min(end, len(lines)) + 1):
print(f"{i:4d}: {lines[i-1]}")
print()
PY
echo "== Search CLI entry points and validations =="
python3 - <<'PY'
from pathlib import Path
import re
targets = [
Path("pridepy/pridepy.py"),
Path("pridepy/download/client.py"),
Path("pridepy/download/base.py"),
Path("pridepy/download/by_url.py"),
Path("pridepy/tests/test_cli_flatten.py"),
]
for p in targets:
if p.exists():
print(f"\n## {p}")
text = p.read_text().splitlines()
for i, line in enumerate(text, 1):
if re.search(r'parallel_files|download_threads|MAX_TOTAL_HTTP_CONNECTIONS|--parallel|--threads|-w\b|-t\b|argparse|click|typer|fire', line):
print(f"{i:4d}: {line}")
PY
echo
echo "== Broader symbol search =="
rg -n "parallel_files|download_threads|MAX_TOTAL_HTTP_CONNECTIONS|download_http_urls|--parallel|--threads|-w\\b|-t\\b|argparse|click|typer|fire" pridepy -g '!**/.git/**'Repository: PRIDE-Archive/pridepy
Length of output: 46710
Clamp workers as well as download_threads
click.IntRange(1, 32) protects the CLI, but direct Client.download_http_urls callers can still pass parallel_files > 64. In that case clamped_threads drops to 1 and the total HTTP connections can still exceed MAX_TOTAL_HTTP_CONNECTIONS. Bound workers too, or clamp the product from both sides.
🔧 Proposed fix
- workers = max(1, min(parallel_files, len(http_urls)))
+ workers = max(1, min(parallel_files, len(http_urls), MAX_TOTAL_HTTP_CONNECTIONS))📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| workers = max(1, min(parallel_files, len(http_urls))) | |
| if workers * download_threads > MAX_TOTAL_HTTP_CONNECTIONS: | |
| clamped_threads = max(1, MAX_TOTAL_HTTP_CONNECTIONS // workers) | |
| logging.warning( | |
| "parallel_files (%d) x download_threads (%d) = %d exceeds the " | |
| "combined connection cap of %d; reducing download_threads to %d " | |
| "to keep peak HTTP connections bounded.", | |
| workers, | |
| download_threads, | |
| workers * download_threads, | |
| MAX_TOTAL_HTTP_CONNECTIONS, | |
| clamped_threads, | |
| ) | |
| download_threads = clamped_threads | |
| workers = max(1, min(parallel_files, len(http_urls), MAX_TOTAL_HTTP_CONNECTIONS)) | |
| if workers * download_threads > MAX_TOTAL_HTTP_CONNECTIONS: | |
| clamped_threads = max(1, MAX_TOTAL_HTTP_CONNECTIONS // workers) | |
| logging.warning( | |
| "parallel_files (%d) x download_threads (%d) = %d exceeds the " | |
| "combined connection cap of %d; reducing download_threads to %d " | |
| "to keep peak HTTP connections bounded.", | |
| workers, | |
| download_threads, | |
| workers * download_threads, | |
| MAX_TOTAL_HTTP_CONNECTIONS, | |
| clamped_threads, | |
| ) | |
| download_threads = clamped_threads |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pridepy/download/transport.py` around lines 852 - 865, The connection cap
logic in Client.download_http_urls only clamps download_threads, so callers can
still exceed MAX_TOTAL_HTTP_CONNECTIONS when workers is very large. Update the
worker calculation near workers and clamped_threads to bound workers as well (or
otherwise clamp the workers x download_threads product from both sides) before
starting downloads, keeping the final total HTTP connections within
MAX_TOTAL_HTTP_CONNECTIONS.
| def download_px_raw_files( | ||
| accession: str, | ||
| output_folder: str, | ||
| skip_if_downloaded_already: bool, | ||
| protocol: str = "ftp", | ||
| download_threads: int = 1, | ||
| parallel_files: int = 1, | ||
| preserve_structure: bool = False, | ||
| iprox_user: Optional[str] = None, | ||
| ): | ||
| """CLI wrapper to download raw files via ProteomeXchange XML.""" | ||
| files = Files() | ||
| logging.info(f"PX accession/URL: {accession}") | ||
|
|
||
| password = os.environ.get("IPROX_ASPERA_PASSWORD") | ||
| if protocol.lower() == "aspera" and not password: | ||
| password = click.prompt("iProX Aspera password", hide_input=True) | ||
|
|
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Password prompt fires even when --iprox-user was never provided.
If protocol == "aspera" and IPROX_ASPERA_PASSWORD isn't set, click.prompt runs unconditionally — even if iprox_user is None. The user is prompted for a password only to hit IproxProvider.aspera_download's credential-validation error afterward. Fail fast before prompting.
🛡️ Proposed fix
password = os.environ.get("IPROX_ASPERA_PASSWORD")
- if protocol.lower() == "aspera" and not password:
+ if protocol.lower() == "aspera" and not iprox_user:
+ raise click.UsageError("--iprox-user is required with --protocol aspera.")
+ if protocol.lower() == "aspera" and not password:
password = click.prompt("iProX Aspera password", hide_input=True)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def download_px_raw_files( | |
| accession: str, | |
| output_folder: str, | |
| skip_if_downloaded_already: bool, | |
| protocol: str = "ftp", | |
| download_threads: int = 1, | |
| parallel_files: int = 1, | |
| preserve_structure: bool = False, | |
| iprox_user: Optional[str] = None, | |
| ): | |
| """CLI wrapper to download raw files via ProteomeXchange XML.""" | |
| files = Files() | |
| logging.info(f"PX accession/URL: {accession}") | |
| password = os.environ.get("IPROX_ASPERA_PASSWORD") | |
| if protocol.lower() == "aspera" and not password: | |
| password = click.prompt("iProX Aspera password", hide_input=True) | |
| def download_px_raw_files( | |
| accession: str, | |
| output_folder: str, | |
| skip_if_downloaded_already: bool, | |
| protocol: str = "ftp", | |
| download_threads: int = 1, | |
| parallel_files: int = 1, | |
| preserve_structure: bool = False, | |
| iprox_user: Optional[str] = None, | |
| ): | |
| """CLI wrapper to download raw files via ProteomeXchange XML.""" | |
| files = Files() | |
| logging.info(f"PX accession/URL: {accession}") | |
| password = os.environ.get("IPROX_ASPERA_PASSWORD") | |
| if protocol.lower() == "aspera" and not iprox_user: | |
| raise click.UsageError("--iprox-user is required with --protocol aspera.") | |
| if protocol.lower() == "aspera" and not password: | |
| password = click.prompt("iProX Aspera password", hide_input=True) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pridepy/pridepy.py` around lines 404 - 421, The password prompt path in
download_px_raw_files is running even when iprox_user was never provided. Update
the aspera branch in download_px_raw_files to validate iprox_user before calling
click.prompt, and fail fast with the existing credential error path if it is
missing. Keep the fix localized to the download_px_raw_files flow and its
handoff to IproxProvider.aspera_download so the prompt only appears when a user
can actually authenticate.
| password="secret", | ||
| maximum_bandwidth="100M", | ||
| ) | ||
| args, kwargs = run.call_args |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Clean up the Ruff warnings in the new tests.
Use a dummy name for the unused unpacked value and escape the literal dot in the pytest.raises(..., match=...) pattern.
🧹 Suggested lint fixes
- args, kwargs = run.call_args
+ args, _kwargs = run.call_args
argv = args[0]- with pytest.raises(RuntimeError, match="b.raw"):
+ with pytest.raises(RuntimeError, match=re.escape("b.raw")):Also applies to: 166-166
🧰 Tools
🪛 Ruff (0.15.20)
[warning] 143-143: Unpacked variable kwargs is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pridepy/tests/test_iprox_aspera.py` at line 143, Clean up the Ruff warnings
in the new tests by fixing the unused tuple unpack from run.call_args in the
test that inspects the mock invocation: assign the unused value to a dummy name
(for example in the assertion around run.call_args). Also update the
pytest.raises(..., match=...) pattern in the affected test(s) to escape the
literal dot so the regex matches the exact error text instead of treating the
dot as a wildcard.
Source: Linters/SAST tools
| rels = asp.call_args.kwargs["relative_paths"] | ||
| # Flattened + de-duped basenames, no subdirectories preserved. | ||
| assert set(rels) == {"a.raw", "a_1.raw"} |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Assert flattened paths in order.
relative_paths is positional with urls; using a set would miss a swapped mapping.
✅ Suggested assertion
rels = asp.call_args.kwargs["relative_paths"]
# Flattened + de-duped basenames, no subdirectories preserved.
- assert set(rels) == {"a.raw", "a_1.raw"}
+ assert rels == ["a.raw", "a_1.raw"]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| rels = asp.call_args.kwargs["relative_paths"] | |
| # Flattened + de-duped basenames, no subdirectories preserved. | |
| assert set(rels) == {"a.raw", "a_1.raw"} | |
| rels = asp.call_args.kwargs["relative_paths"] | |
| # Flattened + de-duped basenames, no subdirectories preserved. | |
| assert rels == ["a.raw", "a_1.raw"] |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pridepy/tests/test_iprox_aspera.py` around lines 279 - 281, The test in
test_iprox_aspera should assert the order of relative_paths instead of
converting to a set, since its mapping to urls is positional and a swapped
pairing would be missed. Update the assertion near
asp.call_args.kwargs["relative_paths"] to compare the flattened, de-duped
basenames in sequence, preserving the existing no-subdirectories expectation
while checking the exact order.
Code Review by Qodo
1. Aspera prompt can hang
|
| password = os.environ.get("IPROX_ASPERA_PASSWORD") | ||
| if protocol.lower() == "aspera" and not password: | ||
| password = click.prompt("iProX Aspera password", hide_input=True) |
There was a problem hiding this comment.
1. Aspera prompt can hang 🐞 Bug ☼ Reliability
download_px_raw_files prompts for an iProX Aspera password as soon as --protocol aspera is set and IPROX_ASPERA_PASSWORD is missing, before verifying --iprox-user is present or that the dataset contains iProX-hosted files. This can block non-interactive executions and unnecessarily asks for a password even when the code will later error out.
Agent Prompt
## Issue description
The CLI prompts for the iProX Aspera password too early (immediately on `--protocol aspera`), which can hang in non-interactive contexts and can prompt even if the dataset has no iProX-hosted files or `--iprox-user` is missing.
## Issue Context
- Aspera mode is only valid if (a) the dataset has iProX-hosted files and (b) credentials are provided.
- Current flow prompts before either condition is validated.
## Fix Focus Areas
- pridepy/pridepy.py[418-431]
- pridepy/download/proteomexchange.py[199-238]
### Suggested fix
1) In the CLI handler, if `protocol.lower() == "aspera"` and `iprox_user` is missing, fail fast with a `click.UsageError` (no prompt).
2) Avoid prompting until you know Aspera will be used:
- Option A (preferred): do a lightweight preflight listing (PX XML fetch) to confirm at least one file is hosted on `download.iprox.org`; if none, error without prompting.
- Option B: call the downloader with `iprox_password=None` and, on the specific "missing credentials" error, prompt only if stdin is a TTY; otherwise raise a clear error instructing to set `IPROX_ASPERA_PASSWORD`.
3) If prompting is needed, only prompt when stdin is interactive; otherwise require `IPROX_ASPERA_PASSWORD`.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
| - **MassIVE** walks the FTPS tree at `massive-ftp.ucsd.edu` (the server requires TLS). MassIVE distributes datasets across versioned root directories (`/v01`–`/vNN`); `pridepy` discovers the correct root automatically. If FTP/FTPS is blocked by the network, `pridepy` falls back to HTTPS: it lists the dataset from the GNPS2 file index (`datasetcache.gnps2.org`) and downloads each file from the ProteoSAFe endpoint at `massive.ucsd.edu` (byte-identical to the FTPS copy). | ||
| - **JPOST** lists files through the JSON PROXI endpoint at `https://repository.jpostdb.org/proxi/datasets/<JPSTxxxxxx>` and downloads from `ftp.jpostdb.org` over plain FTP. The PROXI listing avoids the source-IP connection limit JPOST enforces on FTP. | ||
| - **iProX** fetches the dataset's ProteomeXchange XML from `http://download.iprox.org/<accession>/PX_<accession>.xml`, then downloads each referenced file from the same host over anonymous HTTP (with `Range` support for resume). iProX also exposes Aspera (`faspe://`) with username/password for very large bulk transfers; `pridepy` uses the public HTTP endpoint so no iProX credentials are required. | ||
| - **iProX** fetches the dataset's ProteomeXchange XML from `http://download.iprox.org/<accession>/PX_<accession>.xml`, then downloads each referenced file from the same host over anonymous HTTP (with `Range` support for resume). iProX also exposes Aspera with username/password for very large bulk transfers; `pridepy` uses the public HTTP endpoint so no iProX credentials are required. |
There was a problem hiding this comment.
2. Iprox creds text ambiguous 🐞 Bug ⚙ Maintainability
The iProX repository description still says pridepy uses the public HTTP endpoint so no iProX credentials are required, without clarifying that credentials are required when opting into the new --protocol aspera mode. This can confuse users about when --iprox-user/IPROX_ASPERA_PASSWORD are needed.
Agent Prompt
## Issue description
The iProX docs section implies no iProX credentials are required, but the same document also introduces optional credentialed Aspera mode. The repo description should distinguish default HTTP behavior from opt-in Aspera behavior.
## Issue Context
Users reading the "How each repository is enumerated" section may miss the earlier Aspera section and assume credentials are never needed.
## Fix Focus Areas
- docs/usage.md[279-294]
- docs/usage.md[323-323]
### Suggested fix
Rewrite the iProX bullet to something like:
"... downloads over anonymous HTTP by default (no credentials required). iProX also supports opt-in Aspera transfers (`--protocol aspera`), which require an iProX account (`--iprox-user` and `IPROX_ASPERA_PASSWORD`)."
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
This pull request introduces significant improvements to the
pridepytool, focusing on enhanced download performance and expanded protocol support for bulk dataset retrieval, especially from iProX. The main changes add parallel file downloads, per-file HTTP range threading, and robust Aspera (fast transfer) support for iProX datasets, along with corresponding updates to CLI options and documentation.Major feature additions and enhancements:
Parallel and segmented downloads:
--parallel-files/-w) and for using multiple threads per file (--threads/-t), allowing faster and more efficient bulk downloads. The total number of concurrent HTTP connections is capped for stability. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12]Aspera protocol integration for iProX:
--protocol aspera) as a fast transfer option for iProX datasets, including secure password handling and support for parallel Aspera transfers. The CLI and download logic now route iProX downloads via Aspera when requested and credentials are provided. [1] [2] [3] [4]User interface and documentation updates:
CLI and usage documentation:
docs/usage.md) with detailed instructions and examples for fast downloads using parallelism and Aspera. Also clarified iProX password handling for Aspera. [1] [2] [3] [4]Internal API and codebase improvements:
Refactoring and extensibility:
These changes collectively make
pridepymuch more capable for power users and large-scale dataset retrieval, while keeping the tool user-friendly and robust.Summary by CodeRabbit
New Features
Bug Fixes
Documentation