fix(github): create domain accounts for non-committer authors (#8886)#8894
fix(github): create domain accounts for non-committer authors (#8886)#8894JAORMX wants to merge 2 commits into
Conversation
| github:GithubAccount:1:7496278,i@andypan.me,Andy Pan,panjf2000,https://avatars.githubusercontent.com/u/7496278?v=4,,"{""ConnectionId"":1,""Name"":""panjf2000/ants""}",_raw_github_api_accounts,5, | ||
| github:GithubAccount:1:8518239,badger@gitter.im,The Gitter Badger,gitter-badger,https://avatars.githubusercontent.com/u/8518239?v=4,,"{""ConnectionId"":1,""Name"":""panjf2000/ants""}",_raw_github_api_accounts,9, | ||
| github:GithubAccount:1:964542,sarath.sp06@gmail.com,Sarath Sadasivan Pillai,sarathsp06,https://avatars.githubusercontent.com/u/964542?v=4,"exotel,leadmrktr,shellagilehub,odysseyhack,boodltech","{""ConnectionId"":1,""Name"":""panjf2000/ants""}",_raw_github_api_accounts,1, | ||
| github:GithubAccount:1:1052632,runner.mei@,runner,runner-mei,https://avatars.githubusercontent.com/u/1052632?v=4,,,,0, |
There was a problem hiding this comment.
_raw_data fields are missing, please fix it. Thanks.
There was a problem hiding this comment.
Oops, will do when I'm back on my computer
There was a problem hiding this comment.
Done in b67aad7! The converter now carries the _raw_data fields over: it prefers the enriched _tool_github_accounts row when we collected a profile, and falls back to the _tool_github_repo_accounts row for the non-committer case (that's why milichev shows _raw_github_api_issues provenance). The e2e fixture was missing the _raw_data columns too, so I added those and regenerated the snapshot... the enriched rows now carry exactly the same provenance as before this PR.
While at it, I also guarded pr_convertor.go against generating account ids for a zero AuthorId/MergedById. An unmerged PR would otherwise get merged_by_id pointing at github:GithubAccount:1:0, which is the same orphan shape this PR is fixing (issue_convertor already had that guard).
Verified the full github e2e suite on both MySQL and PostgreSQL.
…#8886) ConvertAccounts sourced the domain `accounts` table FROM _tool_github_accounts, which is only populated for users we collected full profiles for (effectively, committers). Issue and PR authors who never committed were written into _tool_github_repo_accounts but never converted, so issues.creator_id and pull_requests.author_id pointed at accounts rows that didn't exist. Source ConvertAccounts FROM _tool_github_repo_accounts LEFT JOIN _tool_github_accounts instead, so every user the repo references gets a domain account, enriched with profile detail when we have it and login-only otherwise. The domain id uses the same generator the issue/PR convertors use, so the FKs line up. Also emit a repo_account for a PR's merged_by user so pull_requests.merged_by_id resolves too. The query stays MySQL/PostgreSQL-agnostic (COALESCE, no backtick quoting, parameterized via the dal) and mirrors the join already in account_org_collector.go. Adds the non-committer orphan case to the e2e fixture plus a referential- integrity assertion in TestAccountDataFlow. Verified on both MySQL and PostgreSQL. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review feedback on apache#8894: the rewritten ConvertAccounts dropped the _raw_data fields from converted accounts. Select them with COALESCE, preferring the enriched _tool_github_accounts row and falling back to the _tool_github_repo_accounts row for non-committers. The e2e fixture now carries the _raw_data columns like every other tool fixture, and the regenerated snapshot pins the pre-existing provenance for enriched accounts plus the issue-extractor provenance for the non-committer case. Also guard PR author and merged-by id generation against zero ids: an unmerged PR or a deleted user otherwise yields the domain id github:GithubAccount:<conn>:0, the same orphan-FK shape this PR fixes. issue_convertor already guards its AuthorId the same way. Verified with the full github e2e suite on both MySQL and PostgreSQL. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
b5805d5 to
b67aad7
Compare
Problem
issues.creator_idandpull_requests.author_idare written for every author, but a domainaccountsrow is only created for users we collected a full profile for (effectively, committers). Authors who only filed an issue, or opened a PR without committing, become orphan foreign keys:creator_idis set andcreator_nameismilichev, buta.user_nameis NULL. Bot filters keying onaccounts.user_name LIKE '%[bot]'miss those rows for the same reason.Fixes #8886.
Root cause
The issue and PR extractors already record every author in
_tool_github_repo_accounts. ButConvertAccountssourced the domainaccountstableFROM _tool_github_accounts, which is only populated for users we fetched full profiles for. So the convertors generatecreator_id/author_idfor everyone, whileaccountsonly ever gets the committers.Change
ConvertAccountsnow readsFROM _tool_github_repo_accounts LEFT JOIN _tool_github_accounts, so every user the repo references becomes a domain account: enriched with name/email/avatar when we have the detail, login-only otherwise. The domain id uses the samedidgengenerator the issue/PR convertors use, so the foreign keys line up.pr_extractoralso emits a repo_account for themerged_byuser, which fixespull_requests.merged_by_id(that user wasn't recorded anywhere before).account_org_collector.go.Testing
milichev, referenced by the repo with no detail row. It now gets a login-onlyaccountsrow.TestAccountDataFlow: every account the repo references must resolve to a domainaccountsrow, generated with the same id generator the convertors use. Verified it fails against the old converter and passes with the fix.plugins/github/e2esuite passes on both MySQL and PostgreSQL.Out of scope
Plural issue/PR assignees and PR requested reviewers aren't seeded into
_tool_github_repo_accounts, so those FKs can still be unresolved. Thegithub_graphqlplugin likely shares the root cause. Happy to follow up on these separately.🤖 Generated with Claude Code