fix(parser): time-box pdftable table extraction (un-hang parse) by hallelx2 · Pull Request #29 · hallelx2/vectorless-engine

hallelx2 · 2026-05-28T16:41:25Z

A 1.27MB 3M 10-K sat in parsing 20+ min during a bench run — pdftable's geometry finder has a pathological case on dense financial pages. Parse is pre-LLM so the ingest LLM-call timeout doesn't cover it. Adds a 15s/page timeout (goroutine + select) in safeExtractTables and a 90s/document budget in extractPDFTables; a slow page falls back to text-only (the page text, incl. the table, still lives in prose sections — what the page-based strategy reads). go build/vet/test green.

Summary by Sourcery

Time-box PDF table extraction to prevent parser hangs on pathological pages and allow ingest to fall back to text-only when limits are exceeded.

Bug Fixes:

Prevent PDF parsing from hanging indefinitely on dense financial pages by enforcing per-page and per-document time limits around pdftable extraction.

Enhancements:

Log table extraction timeouts and failures at warn level while allowing documents to continue ingesting without table sections.

Summary by CodeRabbit

Bug Fixes

Improved PDF table extraction reliability with time-based processing limits for both individual pages and entire documents.
Enhanced error handling for slow or problematic PDF extraction operations to ensure consistent performance.

A 1.27MB 3M 10-K sat in `parsing` for 20+ minutes during a bench run: pdftable's geometry-based table finder has a pathological case on dense financial pages (a balance sheet's hundreds of ruling segments blow up the intersection-grid construction). Parse is pure-geometry / pre-LLM, so the ingest per-LLM-call timeout (the earlier fix) doesn't cover it, and there was no other bound — the document hung indefinitely. Two bounds, both in pkg/parser/pdf.go: - safeExtractTables now runs page.ExtractTables on a goroutine and abandons it after tableExtractPageTimeout (15s). The slow page is skipped (text-only); a single bad page can't stall the parse. - extractPDFTables tracks a doc-wide tableExtractDocBudget (90s); once spent, remaining pages are skipped and ingest proceeds. A document that loses its table sections still indexes fine — the page text (including the table's text) lives in the prose sections, which is exactly what the page-based retrieval strategy reads. The proper fix is upstream in pdftable's finder; this keeps ingest robust meanwhile. go build/vet/test green.

sourcery-ai · 2026-05-28T16:41:32Z

Reviewer's Guide

Adds strict time-bounding to PDF table extraction so pathological pdftable pages and documents can't hang parsing, using per-page timeouts and a document-wide budget while preserving ingest via text-only fallback.

File-Level Changes

Change	Details	Files
Enforce a document-wide time budget for PDF table extraction and stop table detection once it is exhausted.	Introduce a doc-level deadline using tableExtractDocBudget and time.Now().Add in extractPDFTables. Before processing each page, check the deadline; if exceeded, log a warning with pages processed and break out of the loop, allowing ingest to continue without further table extraction.	`pkg/parser/pdf.go`
Introduce bounded, panic-safe per-page table extraction with a goroutine + timeout wrapper.	Replace the previous recover-based safeExtractTables with a goroutine that calls page.ExtractTables and returns through a buffered channel. Add a select between the goroutine result and time.After(tableExtractPageTimeout) to enforce a per-page timeout; on timeout, log a warning and return nil tables. Keep panic protection inside the goroutine, logging panics and sending an empty result to the channel; log non-nil errors from ExtractTables and suppress tables on failures. Change the safeExtractTables signature to return tables directly (no named return) and route all exit paths through the select logic.	`pkg/parser/pdf.go`
Define constants and imports required to support time-bounded table extraction.	Import the time package for deadlines and timeouts. Add tableExtractPageTimeout and tableExtractDocBudget constants with explanatory comments describing the motivation and behavior of time-bounded extraction.	`pkg/parser/pdf.go`

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with @sourcery-ai issue to create an issue from it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai title on the pull request to (re-)generate the title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment @sourcery-ai summary on the pull request to
(re-)generate the summary at any time.
Generate reviewer's guide: Comment @sourcery-ai guide on the pull
request to (re-)generate the reviewer's guide at any time.
Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore.
Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

coderabbitai · 2026-05-28T16:41:37Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e8ad446a-4e4e-46ce-8a2b-58139c143585

📥 Commits

Reviewing files that changed from the base of the PR and between 678bcf0 and 856b0f4.

📒 Files selected for processing (1)

pkg/parser/pdf.go

📝 Walkthrough

Walkthrough

This PR adds time-based limits to PDF table extraction in a single file. It introduces per-page timeouts and a document-wide budget that halt further extraction when exceeded. The previous panic-recovery wrapper is replaced with a goroutine-based timeout guard that logs warnings and returns nil when extraction stalls.

Changes

PDF Table Extraction Timeout System

Layer / File(s)	Summary
Time bounds constants and import `pkg/parser/pdf.go`	`time` package imported; `tableExtractPageTimeout` (per-page limit) and `tableExtractDocBudget` (document-wide limit) constants defined.
Per-page timeout mechanism `pkg/parser/pdf.go`	New `safeExtractTables` implementation uses goroutine + channel + select to enforce per-page timeout; panics and errors are logged and return no tables, while timeouts produce nil.
Document-wide budget enforcement `pkg/parser/pdf.go`	`extractPDFTables` computes a deadline before the page loop and breaks with warning logging once the budget is exhausted.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A rabbit's tale of patient extraction,
Pages timed with precise action.
No goroutine shall linger too long,
Deadlines keep the PDFs strong!
Hop, extract, then flee the scene—
Fastest parsing ever seen! 🎯

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/table-extraction-timeout

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

sourcery-ai

Hey - I've found 1 issue, and left some high level feedback:

In safeExtractTables, consider replacing time.After(tableExtractPageTimeout) with a time.NewTimer and defer timer.Stop() so the timer can be stopped when the done case wins, avoiding the extra timer goroutine/allocation for every page.
The doc-level budget check in extractPDFTables uses time.Now() inside the loop; if you expect very large documents, you might want to precompute the deadline once (as you do) and compare against a monotonic clock (time.Since(start)) to avoid potential issues if the system clock is adjusted.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- In `safeExtractTables`, consider replacing `time.After(tableExtractPageTimeout)` with a `time.NewTimer` and `defer timer.Stop()` so the timer can be stopped when the `done` case wins, avoiding the extra timer goroutine/allocation for every page.
- The doc-level budget check in `extractPDFTables` uses `time.Now()` inside the loop; if you expect very large documents, you might want to precompute the deadline once (as you do) and compare against a monotonic clock (`time.Since(start)`) to avoid potential issues if the system clock is adjusted.

## Individual Comments

### Comment 1
<location path="pkg/parser/pdf.go" line_range="1341-1350" />
<code_context>
+		tables []*pdftable.Table
+		err    error
+	}
+	done := make(chan result, 1)
+	go func() {
+		defer func() {
+			if r := recover(); r != nil {
+				slog.Warn("pdf: table extraction panicked",
+					"page", pageNum, "panic", fmt.Sprintf("%v", r))
+				done <- result{}
+			}
+		}()
+		t, err := page.ExtractTables(settings)
+		done <- result{tables: t, err: err}
 	}()
-	tables, err := page.ExtractTables(settings)
-	if err != nil {
-		slog.Warn("pdf: table extraction failed",
-			"page", pageNum,
-			"err", err)
+
+	select {
+	case <-time.After(tableExtractPageTimeout):
+		// The goroutine is abandoned (it will finish on its own and the
</code_context>
<issue_to_address>
**issue (performance):** Current timeout approach abandons the goroutine instead of canceling it, which could tie up resources longer than necessary.

Since the goroutine keeps running after the select times out, `page.ExtractTables` can still consume CPU and memory for its full worst-case duration, and many such timeouts could lead to a buildup of in-flight goroutines. If pdftable supports contexts, consider wrapping the call in a `context.WithTimeout` and passing it through so the work is actually canceled. If not, consider limiting concurrent extractions (e.g., worker pool or semaphore) so callers can’t create unbounded abandoned goroutines on problematic documents.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

sourcery-ai · 2026-05-28T16:42:48Z

+	done := make(chan result, 1)
+	go func() {
+		defer func() {
+			if r := recover(); r != nil {
+				slog.Warn("pdf: table extraction panicked",
+					"page", pageNum, "panic", fmt.Sprintf("%v", r))
+				done <- result{}
+			}
+		}()
+		t, err := page.ExtractTables(settings)


issue (performance): Current timeout approach abandons the goroutine instead of canceling it, which could tie up resources longer than necessary.

Since the goroutine keeps running after the select times out, page.ExtractTables can still consume CPU and memory for its full worst-case duration, and many such timeouts could lead to a buildup of in-flight goroutines. If pdftable supports contexts, consider wrapping the call in a context.WithTimeout and passing it through so the work is actually canceled. If not, consider limiting concurrent extractions (e.g., worker pool or semaphore) so callers can’t create unbounded abandoned goroutines on problematic documents.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds strict time bounds to PDF table extraction to prevent pathological PDFs from stalling ingest.

Changes:

Introduces a document-wide table extraction budget to skip remaining pages once exceeded.
Wraps per-page table extraction in a goroutine with a per-page timeout and panic recovery.
Adds constants for page timeout and document budget.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+	select {
+	case <-time.After(tableExtractPageTimeout):
+		// The goroutine is abandoned (it will finish on its own and the
+		// buffered channel lets it send without blocking, then get GC'd).
+		// A single slow page is not worth stalling the whole ingest.
+		slog.Warn("pdf: table extraction timed out; skipping page",
+			"page", pageNum, "timeout", tableExtractPageTimeout)
 		return nil
+	case res := <-done:
+		if res.err != nil {
+			slog.Warn("pdf: table extraction failed",
+				"page", pageNum, "err", res.err)
+			return nil
+		}
+		return res.tables
 	}


+	done := make(chan result, 1)
+	go func() {
+		defer func() {
+			if r := recover(); r != nil {
+				slog.Warn("pdf: table extraction panicked",
+					"page", pageNum, "panic", fmt.Sprintf("%v", r))
+				done <- result{}
+			}
+		}()
+		t, err := page.ExtractTables(settings)
+		done <- result{tables: t, err: err}
 	}()


Copilot AI review requested due to automatic review settings May 28, 2026 16:41

hallelx2 merged commit 8d292cb into main May 28, 2026

hallelx2 deleted the fix/table-extraction-timeout branch May 28, 2026 16:41

sourcery-ai Bot reviewed May 28, 2026

View reviewed changes

Copilot AI reviewed May 28, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(parser): time-box pdftable table extraction (un-hang parse)#29

fix(parser): time-box pdftable table extraction (un-hang parse)#29
hallelx2 merged 1 commit into
mainfrom
fix/table-extraction-timeout

hallelx2 commented May 28, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

sourcery-ai Bot commented May 28, 2026 •

edited

Loading

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

coderabbitai Bot commented May 28, 2026 •

edited

Loading

Review failed

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

sourcery-ai Bot left a comment

Uh oh!

sourcery-ai Bot May 28, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

hallelx2 commented May 28, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by Sourcery

Summary by CodeRabbit

Bug Fixes

Uh oh!

sourcery-ai Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer's Guide

File-Level Changes

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

coderabbitai Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

sourcery-ai Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

hallelx2 commented May 28, 2026 •

edited by coderabbitai Bot

Loading

sourcery-ai Bot commented May 28, 2026 •

edited

Loading

coderabbitai Bot commented May 28, 2026 •

edited

Loading