fix(parser): time-box pdftable table extraction (un-hang parse)#29
Conversation
A 1.27MB 3M 10-K sat in `parsing` for 20+ minutes during a bench run:
pdftable's geometry-based table finder has a pathological case on dense
financial pages (a balance sheet's hundreds of ruling segments blow up
the intersection-grid construction). Parse is pure-geometry / pre-LLM,
so the ingest per-LLM-call timeout (the earlier fix) doesn't cover it,
and there was no other bound — the document hung indefinitely.
Two bounds, both in pkg/parser/pdf.go:
- safeExtractTables now runs page.ExtractTables on a goroutine and
abandons it after tableExtractPageTimeout (15s). The slow page is
skipped (text-only); a single bad page can't stall the parse.
- extractPDFTables tracks a doc-wide tableExtractDocBudget (90s);
once spent, remaining pages are skipped and ingest proceeds.
A document that loses its table sections still indexes fine — the page
text (including the table's text) lives in the prose sections, which is
exactly what the page-based retrieval strategy reads. The proper fix is
upstream in pdftable's finder; this keeps ingest robust meanwhile.
go build/vet/test green.
Reviewer's GuideAdds strict time-bounding to PDF table extraction so pathological pdftable pages and documents can't hang parsing, using per-page timeouts and a document-wide budget while preserving ingest via text-only fallback. File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThis PR adds time-based limits to PDF table extraction in a single file. It introduces per-page timeouts and a document-wide budget that halt further extraction when exceeded. The previous panic-recovery wrapper is replaced with a goroutine-based timeout guard that logs warnings and returns nil when extraction stalls. ChangesPDF Table Extraction Timeout System
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Poem
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- In
safeExtractTables, consider replacingtime.After(tableExtractPageTimeout)with atime.NewTimeranddefer timer.Stop()so the timer can be stopped when thedonecase wins, avoiding the extra timer goroutine/allocation for every page. - The doc-level budget check in
extractPDFTablesusestime.Now()inside the loop; if you expect very large documents, you might want to precompute the deadline once (as you do) and compare against a monotonic clock (time.Since(start)) to avoid potential issues if the system clock is adjusted.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `safeExtractTables`, consider replacing `time.After(tableExtractPageTimeout)` with a `time.NewTimer` and `defer timer.Stop()` so the timer can be stopped when the `done` case wins, avoiding the extra timer goroutine/allocation for every page.
- The doc-level budget check in `extractPDFTables` uses `time.Now()` inside the loop; if you expect very large documents, you might want to precompute the deadline once (as you do) and compare against a monotonic clock (`time.Since(start)`) to avoid potential issues if the system clock is adjusted.
## Individual Comments
### Comment 1
<location path="pkg/parser/pdf.go" line_range="1341-1350" />
<code_context>
+ tables []*pdftable.Table
+ err error
+ }
+ done := make(chan result, 1)
+ go func() {
+ defer func() {
+ if r := recover(); r != nil {
+ slog.Warn("pdf: table extraction panicked",
+ "page", pageNum, "panic", fmt.Sprintf("%v", r))
+ done <- result{}
+ }
+ }()
+ t, err := page.ExtractTables(settings)
+ done <- result{tables: t, err: err}
}()
- tables, err := page.ExtractTables(settings)
- if err != nil {
- slog.Warn("pdf: table extraction failed",
- "page", pageNum,
- "err", err)
+
+ select {
+ case <-time.After(tableExtractPageTimeout):
+ // The goroutine is abandoned (it will finish on its own and the
</code_context>
<issue_to_address>
**issue (performance):** Current timeout approach abandons the goroutine instead of canceling it, which could tie up resources longer than necessary.
Since the goroutine keeps running after the select times out, `page.ExtractTables` can still consume CPU and memory for its full worst-case duration, and many such timeouts could lead to a buildup of in-flight goroutines. If pdftable supports contexts, consider wrapping the call in a `context.WithTimeout` and passing it through so the work is actually canceled. If not, consider limiting concurrent extractions (e.g., worker pool or semaphore) so callers can’t create unbounded abandoned goroutines on problematic documents.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| done := make(chan result, 1) | ||
| go func() { | ||
| defer func() { | ||
| if r := recover(); r != nil { | ||
| slog.Warn("pdf: table extraction panicked", | ||
| "page", pageNum, "panic", fmt.Sprintf("%v", r)) | ||
| done <- result{} | ||
| } | ||
| }() | ||
| t, err := page.ExtractTables(settings) |
There was a problem hiding this comment.
issue (performance): Current timeout approach abandons the goroutine instead of canceling it, which could tie up resources longer than necessary.
Since the goroutine keeps running after the select times out, page.ExtractTables can still consume CPU and memory for its full worst-case duration, and many such timeouts could lead to a buildup of in-flight goroutines. If pdftable supports contexts, consider wrapping the call in a context.WithTimeout and passing it through so the work is actually canceled. If not, consider limiting concurrent extractions (e.g., worker pool or semaphore) so callers can’t create unbounded abandoned goroutines on problematic documents.
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds strict time bounds to PDF table extraction to prevent pathological PDFs from stalling ingest.
Changes:
- Introduces a document-wide table extraction budget to skip remaining pages once exceeded.
- Wraps per-page table extraction in a goroutine with a per-page timeout and panic recovery.
- Adds constants for page timeout and document budget.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| select { | ||
| case <-time.After(tableExtractPageTimeout): | ||
| // The goroutine is abandoned (it will finish on its own and the | ||
| // buffered channel lets it send without blocking, then get GC'd). | ||
| // A single slow page is not worth stalling the whole ingest. | ||
| slog.Warn("pdf: table extraction timed out; skipping page", | ||
| "page", pageNum, "timeout", tableExtractPageTimeout) | ||
| return nil | ||
| case res := <-done: | ||
| if res.err != nil { | ||
| slog.Warn("pdf: table extraction failed", | ||
| "page", pageNum, "err", res.err) | ||
| return nil | ||
| } | ||
| return res.tables | ||
| } |
| done := make(chan result, 1) | ||
| go func() { | ||
| defer func() { | ||
| if r := recover(); r != nil { | ||
| slog.Warn("pdf: table extraction panicked", | ||
| "page", pageNum, "panic", fmt.Sprintf("%v", r)) | ||
| done <- result{} | ||
| } | ||
| }() | ||
| t, err := page.ExtractTables(settings) | ||
| done <- result{tables: t, err: err} | ||
| }() |
A 1.27MB 3M 10-K sat in
parsing20+ min during a bench run — pdftable's geometry finder has a pathological case on dense financial pages. Parse is pre-LLM so the ingest LLM-call timeout doesn't cover it. Adds a 15s/page timeout (goroutine + select) insafeExtractTablesand a 90s/document budget inextractPDFTables; a slow page falls back to text-only (the page text, incl. the table, still lives in prose sections — what the page-based strategy reads). go build/vet/test green.Summary by Sourcery
Time-box PDF table extraction to prevent parser hangs on pathological pages and allow ingest to fall back to text-only when limits are exceeded.
Bug Fixes:
Enhancements:
Summary by CodeRabbit
Bug Fixes