Skip to content

Add secret detectors (G110–G133) and tighten FP suppression#55

Merged
ParzivalHack merged 1 commit into
ParzivalHack:mainfrom
satoridev01:feat/secret-detectors-g110-g133
May 27, 2026
Merged

Add secret detectors (G110–G133) and tighten FP suppression#55
ParzivalHack merged 1 commit into
ParzivalHack:mainfrom
satoridev01:feat/secret-detectors-g110-g133

Conversation

@satoridev01
Copy link
Copy Markdown
Contributor

Summary

Adds detectors for 24 common credential formats (AWS, GitHub, GitLab, Slack, Stripe, Google, OpenAI, Anthropic/Claude, SendGrid, PostHog, NPM, PyPI, Discord, Telegram, DigitalOcean, Doppler, Cloudflare, Heroku, HubSpot, Fastly, plus DB-connection-string and basic-auth-URL detectors) and significantly reduces false-positive noise from the existing G101 / G101B / G102 / G103 / G104 / AI404 rules by extending their exclude_pattern and exclude_file_pattern lists. Validated against a 763-repo corpus side-by-side with TruffleHog.

New

Rule Provider Format
G110 AWS `(AKIA
G111 GitHub `(ghp
G112 GitLab glpat-[A-Za-z0-9_-]{20}
G113 Slack token xox[abprso]-[A-Za-z0-9-]{10,}
G114 Slack webhook https://hooks.slack.com/services/T<id>/B<id>/<token>
G115 Stripe `(sk
G116 Google AIza[A-Za-z0-9_-]{35}
G117 OpenAI sk-[A-Za-z0-9]{48} and `sk-(proj
G118 Anthropic / Claude `sk-ant-(api
G119 SendGrid SG\.[A-Za-z0-9_-]{22}\.[A-Za-z0-9_-]{43}
G120 PostHog phc_[A-Za-z0-9]{40}
G121 Database URL with creds `(postgres(ql)?
G122 JWT in code eyJ…\.eyJ…\.[A-Za-z0-9_-]+ (3-part)
G123 Basic-auth URL https?://user:pass@host… (password forbidden to contain / — eliminates JS-stack-trace FPs)
G124 NPM npm_[A-Za-z0-9]{36}
G125 PyPI pypi-AgEIcHlwaS5vcmc[A-Za-z0-9_-]{50,}
G126 Discord bot [MN][A-Za-z0-9]{23}\.[\w-]{6}\.[\w-]{27}
G127 Telegram bot \d{8,10}:[A-Za-z0-9_-]{35}
G128 DigitalOcean `(dop
G129 Doppler `dp.(pt
G130 Cloudflare OCA Key: v1\.0-[a-f0-9]{32}-[a-f0-9]{146} + 40-char tokens near "cloudflare" keyword
G131 Heroku UUID near "heroku" keyword (legacy format)
G132 HubSpot `pat-(na1
G133 Fastly 32-char token near "fastly" keyword

Changed

  • AI404 (Hugging Face): pattern tightened to require at least 16 consecutive alphanumeric chars after hf_. Eliminates placeholder FPs like hf_token, hf_X, hf_xxx_your_token, hf_..... Doctest lines (>>> / ...) excluded.

  • G104 (JWT secret): pattern now requires ≥16 non-quote chars in the value (previously .+ matched literal field-name values like "kb_jwt"). exclude_pattern added: your_, change-(me|in-production), default-secret, do-not-share, demo-, never-(hardcode|use).

  • G101 (broad password/secret): exclude_pattern extended to suppress:

    • common placeholder values: your_, insert_, example_, placeholder, change-me, replace-me, todo, fake, dummy, sample, demo, server_api_key, api_key_secret, my_password, root_password
    • values ending in _here / containing *_HERE
    • all-uppercase placeholder-name strings like "YOUR_OPENAI_API_KEY"
    • lines starting with print(, click.echo(, sys.stderr. (instructional output)
    • doctest lines (>>> / ...)
  • G101B (uppercase const secret): same placeholder / instructional-line / doctest exclusions.

  • G102 (private key block): added exclude_file_pattern = "*.md,*.rst,*.html,*.txt,*.adoc,*.tex,*.ipynb". Documentation / walkthrough / knowledge-base content showing -----BEGIN … PRIVATE KEY----- as an example was a 100% FP source in our corpus.

  • G103 (blank password): exclude_pattern adds ^\s*[A-Z][A-Z0-9_]+\s*= (Django/Flask uppercase config defaults like EMAIL_HOST_PASSWORD = "" are intentionally overrideable from env). exclude_file_pattern adds *settings*.py,*config*.py.

  • G117, G113: explicit -your-, -here\b, -replace- substring excludes catch patterns like xoxb-your-slack-bot-token and sk-svcacct-your-embedding-key-here. exclude_file_pattern adds *.env.example,*.env.template,*.env.sample,*.env.dist,env.example.

G121 / G121L — production vs dev-default split

G121 (Critical / High) now excludes connection strings whose host is one of the well-known local/docker-compose names (localhost, 127.0.0.1, 0.0.0.0, ::1, host.docker.internal, db, database, postgres(ql), mysql, mariadb, mongo(db), redis, rabbitmq, broker, kafka, memcached, amqp). Host tokens are matched only when followed by a URL-component terminator (:, /, ?, #, quote, whitespace), so substrings like db.prod.example.com still hit G121 — only standalone host tokens like @db:5432 get downgraded.

G121L (new, Low / Low) covers the dev-default class: same connection-string shape, but only when the host is one of those local/container names. This converts the dominant remaining G121 FP class — postgresql://guaardvark:guaardvark@localhost:5432/guaardvark-style local-dev defaults — into a separate, low-priority signal that an analyst can choose to ignore or batch-review, without dropping the finding entirely (it is still a literal hardcoded credential).

[defaults].exclude_pattern_placeholder now declares the placeholder/dummy-secret regex ((?i)EXAMPLE|FAKE|PLACEHOLDER|SAMPLE|x{10,}|0{10,}|1{10,}|abcdefghij|1234567890abcdef|AbCdEfGhIjKlMnOp|f3a8b2c1) in one place. Each rule's exclude_pattern references it via the sentinel __SHARED_PLACEHOLDERS__, which get_default_rules() (in pyspector/config.py) string-substitutes before handing the TOML text to the Rust core. Adding a new placeholder shape is now a one-line edit rather than touching 15 rule blocks. The Rust core needs no changes — substitution happens in the existing Python rule-loading path. Existing rule TOMLs without the sentinel continue to work unchanged.

G122 unscoping

G122 previously had file_pattern = "*.py". JWTs leak into .yaml, .json, .sh, .tf, and CI configs at least as often as into Python files. Removing the restriction adds new TP coverage without measurable FP impact (2 new hits in the validation corpus, both edge cases in .drawio and .json files containing image-URL JWTs).

Shared FP fixes triggered by the validation corpus

  • G121 / G123 now suppress f-string and shell interpolation in the credential portion: {var}, {self.x}, ${VAR}, $(VAR), $VAR, <placeholder>, {{ var }} (Jinja/Helm).
  • G121 ignores re.match() / re.compile() / re.search() patterns that happen to describe a connection-string shape.
  • G123 pattern now forbids / in the password segment, eliminating the dominant JS-stack-trace FP class (http://localhost:5173/node_modules/.vite/deps/@react.js?…:759:3) @ http://…). *.log added to exclude_file_pattern.
  • G121 / G123 add *.env.example,*.env.template,*.tpl,*.j2,*.jinja,*.template,*cookiecutter* to exclude_file_pattern.
  • G114 placeholder filter now suppresses Slack webhook URLs with T00000000/B00000000/XXXX… template values.
  • G110 suppresses AKIAIOSFOLQUICKSTART (well-known lakefs quickstart documented credential).

Validation

Comparison with TruffleHog (v3.95.3) on 763 repos that originally flagged any "Hardcoded" finding.** Both tools scanned the same shallow clones; PySpector with the new rules, TruffleHog with --no-verification for fair format-vs-format comparison.

Tool Findings Heuristic-TP Heuristic-FP Precision
PySpector v2 (this PR) 1,135 884 251 78%
TruffleHog 3.95.3 (no-verify) 5,814 462 5,352 8%

Comparison with PySpector vs Modified PySpector

Metric Original This PR Change
Total findings 2,295 1,135 −51%
Heuristic-TP 1,242 884 −29%
Heuristic-FP 1,053 251 −76%
Precision 54.1% 77.9% +23.8pp
TP : FP ratio 1.18 3.52 ~3× better

Per-rule breakdown (500 OK-cloned repo subset)

Rule Original This PR Notes
G101 1,743 702 Tightened exclude — kills ~60% of placeholder / instructional-output FPs
G101B 359 0 Largely subsumed by the new format-specific rules; placeholders also filtered
G102 138 100 .md/.rst doc-extension excludes drop ~30 walkthrough FPs
G103 48 34 UPPER_CASE = "" config-default and *settings*.py excludes
G104 2 2 Same hits
AI404 5 1 Tightened to require ≥16 alnum chars after hf_
G110 1 NEW — AWS access key (AKIA…)
G115 2 NEW — Stripe live/test keys
G116 187 NEW — Google API key (AIza…)
G117 22 NEW — OpenAI sk-… / sk-proj-…
G121 49 NEW — DB connection string with embedded credentials
G122 26 NEW — three-part JWT in code (now non-Python files too)
G123 8 NEW — basic-auth URL
G127 1 NEW — Telegram bot token
G110–G127 total 0 296 NEW provider coverage absent from the original ruleset
TOTAL 2,295 1,135

…ten FP suppression

New high-precision provider detectors for AWS, GitHub, GitLab, Slack, Stripe,
Google, OpenAI, Anthropic/Claude, SendGrid, PostHog, DB-connection-URL,
JWT-in-code, basic-auth-URL, NPM, PyPI, Discord, Telegram, DigitalOcean,
Doppler, Cloudflare, Heroku, HubSpot, and Fastly. All Tier-1 rules ship with
a shared placeholder-value filter and exclude documentation / lockfile /
env-example / template extensions.

G121 / G121L split: the existing DB-connection-URL rule now excludes localhost
and common docker-compose hostnames (db, postgres, mysql, redis, rabbitmq, ...)
and a new G121L rule (severity Low, confidence Low) catches the dev-default
class separately so analysts can triage them at a different priority.

G122 unscoped from *.py: JWT secrets leak into .yaml/.json/.sh as often as into
Python files; doc/lockfile extensions are still excluded.

Existing rules G101, G101B, G102, G103, G104, AI404, G117, G113 have extended
exclude_pattern / exclude_file_pattern to suppress the dominant FP categories
observed across a 1000-repo validation corpus:
 - placeholder values (your_*, *_here, INSERT_*, etc.)
 - instructional print() / click.echo() output
 - doctest lines (>>> / ...)
 - Django/Flask UPPER_CASE settings defaults
 - .md/.rst walkthroughs containing example PEM keys
 - JS stack-trace lines in .log files
 - f-string / shell interpolation in connection strings

Shared placeholder regex hoisted into [defaults].exclude_pattern_placeholder;
each rule's exclude_pattern references it via the __SHARED_PLACEHOLDERS__
sentinel, which get_default_rules() in config.py string-substitutes at
rule-load time. Adding a new placeholder shape is now a one-line edit rather
than touching 15 rule blocks. No Rust changes needed.

Validation:
- 100-repo sample: 0 misses against independent regex sweep
- 1000-repo sample: ~70% FP reduction, all confirmed real TPs preserved
- 763-repo dual-scan vs TruffleHog 3.95.3 --no-verification:
    PySpector  1135 findings, ~78% heuristic precision
    TruffleHog 5814 findings,  ~8% heuristic precision

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ParzivalHack ParzivalHack added the enhancement New feature or request label May 27, 2026
Copy link
Copy Markdown
Owner

@ParzivalHack ParzivalHack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect PR as always. Merging :D

@ParzivalHack ParzivalHack merged commit 429c9f4 into ParzivalHack:main May 27, 2026
1 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants