Add secret detectors (G110–G133) and tighten FP suppression #55
Merged
ParzivalHack merged 1 commit on May 27, 2026
New high-precision provider detectors for AWS, GitHub, GitLab, Slack, Stripe,
Google, OpenAI, Anthropic/Claude, SendGrid, PostHog, DB-connection-URL,
JWT-in-code, basic-auth-URL, NPM, PyPI, Discord, Telegram, DigitalOcean,
Doppler, Cloudflare, Heroku, HubSpot, and Fastly. All Tier-1 rules ship with
a shared placeholder-value filter and exclude documentation / lockfile /
env-example / template extensions.
G121 / G121L split: the existing DB-connection-URL rule now excludes localhost
and common docker-compose hostnames (db, postgres, mysql, redis, rabbitmq, ...)
and a new G121L rule (severity Low, confidence Low) catches the dev-default
class separately so analysts can triage them at a different priority.
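
The split can be pictured with a minimal Python sketch. The regexes here are simplified stand-ins, not the actual G121 / G121L rule patterns, and `classify` is a hypothetical helper. The host list and the terminator idea (a local host name only counts when followed by `:`, `/`, `?`, `#`, a quote, or whitespace) come from the PR description.

```python
import re

# Hypothetical, simplified stand-ins for the G121 / G121L split. The real
# rule regexes live in PySpector's rule TOML and may differ.
# (::1 from the PR's host list is omitted to keep the sketch short.)
LOCAL_HOSTS = (
    r"localhost|127\.0\.0\.1|0\.0\.0\.0|host\.docker\.internal|"
    r"db|database|postgres(?:ql)?|mysql|mariadb|mongo(?:db)?|"
    r"redis|rabbitmq|broker|kafka|memcached|amqp"
)

# A local host name only counts when a URL-component terminator follows it,
# so "db.prod.example.com" is not treated as the bare host "db".
LOCAL_HOST_RE = re.compile(
    rf"(?:{LOCAL_HOSTS})(?=[:/?#'\"\s]|$)", re.IGNORECASE
)

# scheme://user:password@host... with a literal (non-empty) password.
CONN_URL_RE = re.compile(
    r"[a-z][a-z0-9+.-]*://[^\s:/@]+:[^\s/@]+@(?P<host_and_rest>\S+)",
    re.IGNORECASE,
)

def classify(line):
    """Return 'G121L' for dev-default hosts, 'G121' otherwise, None if no URL."""
    match = CONN_URL_RE.search(line)
    if match is None:
        return None
    if LOCAL_HOST_RE.match(match.group("host_and_rest")):
        return "G121L"
    return "G121"
```

Under these assumptions, `postgres://app:s3cr3t@db:5432/app` lands in the low-priority bucket, while `postgres://app:s3cr3t@db.prod.example.com:5432/app` keeps its G121 classification.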
G122 unscoped from *.py: JWT secrets leak into .yaml/.json/.sh as often as into
Python files; doc/lockfile extensions are still excluded.
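
A rough Python illustration of what unscoping means in practice: the same JWT shape is checked in every file except those matching an exclude list. The regex is a simplified approximation of a 3-part JWT, not the rule's exact pattern, and `EXCLUDED_FILES` and `find_jwts` are hypothetical names used only for this sketch.

```python
import re
from fnmatch import fnmatch

# Simplified approximation of a 3-part JWT shape; not the exact G122 regex.
JWT_RE = re.compile(r"eyJ[A-Za-z0-9_-]+\.eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+")

# Hypothetical exclude list standing in for the doc/lockfile extensions.
EXCLUDED_FILES = ["*.md", "*.rst", "*.lock"]

def find_jwts(path, text):
    """Return JWT-shaped strings in text unless the file is excluded."""
    if any(fnmatch(path, pattern) for pattern in EXCLUDED_FILES):
        return []
    return JWT_RE.findall(text)
```

With no `*.py` restriction, a token embedded in `ci/deploy.yaml` or `scripts/run.sh` is reported, while the same token in `docs/guide.md` is skipped.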
Existing rules G101, G101B, G102, G103, G104, AI404, G117, G113 have extended
exclude_pattern / exclude_file_pattern to suppress the dominant FP categories
observed across a 1000-repo validation corpus:
- placeholder values (your_*, *_here, INSERT_*, etc.)
- instructional print() / click.echo() output
- doctest lines (>>> / ...)
- Django/Flask UPPER_CASE settings defaults
- .md/.rst walkthroughs containing example PEM keys
- JS stack-trace lines in .log files
- f-string / shell interpolation in connection strings
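
To make the exclusion mechanism concrete, here is a small Python sketch of a detection pattern paired with a line-level exclude pattern. Both regexes are illustrative assumptions that cover only three of the categories above (placeholder values, instructional output, doctest lines). They are not the shipped `exclude_pattern` values, and `is_reported` is a hypothetical helper.

```python
import re

# Hypothetical detection pattern: name = "literal value".
FINDING_RE = re.compile(
    r"""\b(?:password|secret|api_key)\s*=\s*["'][^"']+["']""",
    re.IGNORECASE,
)

# Hypothetical exclude pattern covering three of the FP categories.
EXCLUDE_RE = re.compile(
    r"""
      your_\w* | \w+_here\b | insert_\w*   # placeholder values
    | \bprint\( | \bclick\.echo\(          # instructional output
    | ^\s*(?:>>>|\.\.\.)                   # doctest lines
    """,
    re.IGNORECASE | re.VERBOSE,
)

def is_reported(line):
    """A line is reported only if it matches the finding and no exclusion."""
    return bool(FINDING_RE.search(line)) and not EXCLUDE_RE.search(line)
```

The exclusion is evaluated against the whole line, so a real-looking value inside a `print(` call is suppressed along with obvious placeholders.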
Shared placeholder regex hoisted into [defaults].exclude_pattern_placeholder;
each rule's exclude_pattern references it via the __SHARED_PLACEHOLDERS__
sentinel, which get_default_rules() in config.py string-substitutes at
rule-load time. Adding a new placeholder shape is now a one-line edit rather
than touching 15 rule blocks. No Rust changes needed.
Validation:
- 100-repo sample: 0 misses against independent regex sweep
- 1000-repo sample: ~70% FP reduction, all confirmed real TPs preserved
- 763-repo dual-scan vs TruffleHog 3.95.3 --no-verification:
    PySpector:  1135 findings, ~78% heuristic precision
    TruffleHog: 5814 findings, ~8% heuristic precision
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ParzivalHack approved these changes on May 27, 2026
ParzivalHack (Owner) left a comment:
Perfect PR as always. Merging :D
Summary
Adds detectors for 24 common credential formats (AWS, GitHub, GitLab, Slack, Stripe, Google, OpenAI, Anthropic/Claude, SendGrid, PostHog, NPM, PyPI, Discord, Telegram, DigitalOcean, Doppler, Cloudflare, Heroku, HubSpot, Fastly, plus DB-connection-string and basic-auth-URL detectors) and significantly reduces false-positive noise from the existing G101/G101B/G102/G103/G104/AI404 rules by extending their exclude_pattern and exclude_file_pattern lists. Validated against a 763-repo corpus side-by-side with TruffleHog.
New
- glpat-[A-Za-z0-9_-]{20}
- xox[abprso]-[A-Za-z0-9-]{10,}
- https://hooks.slack.com/services/T<id>/B<id>/<token>
- AIza[A-Za-z0-9_-]{35}
- sk-[A-Za-z0-9]{48} and `sk-(proj
- SG\.[A-Za-z0-9_-]{22}\.[A-Za-z0-9_-]{43}
- phc_[A-Za-z0-9]{40}
- eyJ…\.eyJ…\.[A-Za-z0-9_-]+ (3-part)
- https?://user:pass@host… (password forbidden to contain /; eliminates JS-stack-trace FPs)
- npm_[A-Za-z0-9]{36}
- pypi-AgEIcHlwaS5vcmc[A-Za-z0-9_-]{50,}
- [MN][A-Za-z0-9]{23}\.[\w-]{6}\.[\w-]{27}
- \d{8,10}:[A-Za-z0-9_-]{35}
- v1\.0-[a-f0-9]{32}-[a-f0-9]{146} + 40-char tokens near "cloudflare" keyword
Changed
- AI404 (Hugging Face): pattern tightened to require at least 16 consecutive alphanumeric chars after hf_. Eliminates placeholder FPs like hf_token, hf_X, hf_xxx_your_token, hf_..... Doctest lines (>>> / ...) excluded.
- G104 (JWT secret): pattern now requires ≥16 non-quote chars in the value (previously .+ matched literal field-name values like "kb_jwt"). exclude_pattern added: your_, change-(me|in-production), default-secret, do-not-share, demo-, never-(hardcode|use).
- G101 (broad password/secret): exclude_pattern extended to suppress:
  - your_, insert_, example_, placeholder, change-me, replace-me, todo, fake, dummy, sample, demo, server_api_key, api_key_secret, my_password, root_password_here / containing *_HERE
  - "YOUR_OPENAI_API_KEY"
  - print(, click.echo(, sys.stderr. (instructional output)
  - doctest lines (>>> / ...)
- G101B (uppercase const secret): same placeholder / instructional-line / doctest exclusions.
- G102 (private key block): added exclude_file_pattern = "*.md,*.rst,*.html,*.txt,*.adoc,*.tex,*.ipynb". Documentation / walkthrough / knowledge-base content showing -----BEGIN … PRIVATE KEY----- as an example was a 100% FP source in our corpus.
- G103 (blank password): exclude_pattern adds ^\s*[A-Z][A-Z0-9_]+\s*= (Django/Flask uppercase config defaults like EMAIL_HOST_PASSWORD = "" are intentionally overridable from env). exclude_file_pattern adds *settings*.py, *config*.py.
- G117, G113: explicit -your-, -here\b, -replace- substring excludes catch patterns like xoxb-your-slack-bot-token and sk-svcacct-your-embedding-key-here. exclude_file_pattern adds *.env.example, *.env.template, *.env.sample, *.env.dist, env.example.
G121 / G121L: production vs dev-default split
- G121 (Critical / High) now excludes connection strings whose host is one of the well-known local/docker-compose names (localhost, 127.0.0.1, 0.0.0.0, ::1, host.docker.internal, db, database, postgres(ql), mysql, mariadb, mongo(db), redis, rabbitmq, broker, kafka, memcached, amqp). Host tokens are matched only when followed by a URL-component terminator (:, /, ?, #, quote, whitespace), so substrings like db.prod.example.com still hit G121; only standalone host tokens like @db:5432 get downgraded.
- G121L (new, Low / Low) covers the dev-default class: same connection-string shape, but only when the host is one of those local/container names. This converts the dominant remaining G121 FP class (postgresql://guaardvark:guaardvark@localhost:5432/guaardvark-style local-dev defaults) into a separate, low-priority signal that an analyst can choose to ignore or batch-review, without dropping the finding entirely (it is still a literal hardcoded credential).
[defaults].exclude_pattern_placeholder now declares the placeholder/dummy-secret regex ((?i)EXAMPLE|FAKE|PLACEHOLDER|SAMPLE|x{10,}|0{10,}|1{10,}|abcdefghij|1234567890abcdef|AbCdEfGhIjKlMnOp|f3a8b2c1) in one place. Each rule's exclude_pattern references it via the sentinel __SHARED_PLACEHOLDERS__, which get_default_rules() (in pyspector/config.py) string-substitutes before handing the TOML text to the Rust core. Adding a new placeholder shape is now a one-line edit rather than touching 15 rule blocks. The Rust core needs no changes: substitution happens in the existing Python rule-loading path. Existing rule TOMLs without the sentinel continue to work unchanged.
G122 unscoping
G122 previously had file_pattern = "*.py". JWTs leak into .yaml, .json, .sh, .tf, and CI configs at least as often as into Python files. Removing the restriction adds new TP coverage without measurable FP impact (2 new hits in the validation corpus, both edge cases in .drawio and .json files containing image-URL JWTs).
Shared FP fixes triggered by the validation corpus
- G121/G123 now suppress f-string and shell interpolation in the credential portion: {var}, {self.x}, ${VAR}, $(VAR), $VAR, <placeholder>, {{ var }} (Jinja/Helm).
- G121 ignores re.match() / re.compile() / re.search() patterns that happen to describe a connection-string shape.
- G123 pattern now forbids / in the password segment, eliminating the dominant JS-stack-trace FP class (http://localhost:5173/node_modules/.vite/deps/@react.js?…:759:3) @ http://…). *.log added to exclude_file_pattern.
- G121/G123 add *.env.example, *.env.template, *.tpl, *.j2, *.jinja, *.template, *cookiecutter* to exclude_file_pattern.
- G114 placeholder filter now suppresses Slack webhook URLs with T00000000/B00000000/XXXX… template values.
- G110 suppresses AKIAIOSFOLQUICKSTART (well-known lakefs quickstart documented credential).
Validation
Comparison with TruffleHog (v3.95.3) on 763 repos that originally flagged any "Hardcoded" finding. Both tools scanned the same shallow clones: PySpector with the new rules, TruffleHog with --no-verification for a fair format-vs-format comparison.
Comparison of PySpector vs Modified PySpector
Per-rule breakdown (500 OK-cloned repo subset)
- .md/.rst doc-extension excludes drop ~30 walkthrough FPs
- UPPER_CASE = "" config-default and *settings*.py excludes
- hf_
- AKIA…
- AIza…
- sk-… / sk-proj-…