Lab
An LLM-augmented code review system that surfaces architectural concerns, security patterns, and test coverage gaps — integrated into a GitHub Actions workflow.
Code review is a high-value activity that scales poorly. As teams grow, review throughput becomes a bottleneck: either reviews are rushed and miss real issues, or they pile up into a queue that blocks delivery. Senior engineers spend a disproportionate amount of time reviewing code that does not require their judgment, which leaves less time for the reviews that do.
The hypothesis: an LLM-augmented review step running before human review could handle the mechanical concerns — obvious security patterns, missing error handling, test coverage gaps — and let human reviewers focus on architecture and design intent.
The system is a GitHub Actions workflow that triggers when a pull request is opened and runs a structured analysis across three dimensions:
Security patterns — scans the diff for common vulnerability classes: injection risks in SQL or shell construction, secrets that may have been accidentally included, authentication boundary violations, and dependency additions without lockfile updates.
Architectural concerns — flags structural issues relative to the existing codebase: components growing beyond a single responsibility, cross-layer dependencies that violate module boundaries, and divergence from established patterns in adjacent files.
Test coverage signals — identifies changed code paths that have no corresponding test changes, new exported functions without test files, and edge cases in the diff that are likely to fail without explicit test coverage.
The output is a structured comment on the pull request with findings grouped by severity, each linked to the specific diff line it references.
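A minimal sketch of what that structured output could look like, assuming a flat finding record. The `Finding` fields, severity labels, and `render_comment` helper are illustrative, not the pipeline's exact schema:

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Finding:
    severity: str      # "high" | "medium" | "low"
    dimension: str     # "security" | "architecture" | "tests"
    message: str       # one-line description of the concern
    path: str          # file the finding refers to
    line: int          # diff line the finding is anchored to
    confidence: float  # model-reported confidence, 0.0 to 1.0

def render_comment(findings: list[Finding]) -> str:
    """Group findings by severity and emit a markdown PR comment,
    linking each finding to the diff line it references."""
    by_severity: dict[str, list[Finding]] = defaultdict(list)
    for f in findings:
        by_severity[f.severity].append(f)

    sections = []
    for severity in ("high", "medium", "low"):
        group = by_severity.get(severity)
        if not group:
            continue
        lines = [f"### {severity.title()} severity"]
        for f in group:
            lines.append(f"- **{f.dimension}** `{f.path}:{f.line}`: {f.message}")
        sections.append("\n".join(lines))
    return "\n\n".join(sections) or "No findings above the reporting threshold."
```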
The analysis runs against the diff, not the full codebase. This was a deliberate constraint — full-codebase analysis would require a large context window and significant latency, and most review concerns are local to the changed code. Architectural concerns that require full-codebase context are flagged as "needs human review" rather than analyzed automatically.
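A sketch of how the diff-only constraint can work in practice: GitHub's REST API returns the raw unified diff for a pull request when asked for the `application/vnd.github.diff` media type, so the workflow never has to serialize the full codebase. The `fetch_pr_diff` helper is hypothetical:

```python
import os
import requests

def fetch_pr_diff(repo: str, pr_number: int) -> str:
    """Fetch only the unified diff for a pull request; the full
    codebase is never sent to the model."""
    resp = requests.get(
        f"https://api.github.com/repos/{repo}/pulls/{pr_number}",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            # Requesting the diff media type returns the raw diff
            # instead of the JSON representation of the PR.
            "Accept": "application/vnd.github.diff",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.text
```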
The LLM step runs read-only, with no write access to the repository. The model produces the comment text, but posting is handled by the Actions runner using a scoped token. This keeps the AI system out of the write path.
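The posting step is a small piece of glue in the runner, sketched below. The `post_review_comment` helper is hypothetical; the token is the workflow's `GITHUB_TOKEN`, scoped to commenting, and pull requests accept comments through the standard issues endpoint:

```python
import os
import requests

def post_review_comment(repo: str, pr_number: int, body: str) -> None:
    """Post the rendered comment. The runner holds the credentials;
    the model only ever produces `body`."""
    resp = requests.post(
        # For commenting purposes, a pull request is an issue, so the
        # issues comments endpoint works with the PR number.
        f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"body": body},
        timeout=30,
    )
    resp.raise_for_status()
```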
A cost budget is enforced per run. If the diff exceeds the budget threshold (approximately 400 lines of changed code), the analysis is chunked and the results are merged. This caps the per-PR cost without sacrificing coverage on large diffs.
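A sketch of the chunking step, assuming file boundaries are acceptable split points; the 400-line budget matches the threshold above, and the helper name is illustrative:

```python
def chunk_diff(diff: str, budget_lines: int = 400) -> list[str]:
    """Split a unified diff into file-level chunks, packing files
    together until the changed-line budget is reached. Splitting at
    file boundaries keeps each chunk self-contained for the model."""
    files = ["diff --git" + part for part in diff.split("diff --git") if part]
    chunks: list[str] = []
    current: list[str] = []
    current_size = 0
    for file_diff in files:
        changed = sum(
            1 for line in file_diff.splitlines()
            if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
        )
        if current and current_size + changed > budget_lines:
            chunks.append("".join(current))
            current, current_size = [], 0
        # A single file larger than the budget becomes its own chunk.
        current.append(file_diff)
        current_size += changed
    if current:
        chunks.append("".join(current))
    return chunks
```

The per-chunk findings are then concatenated into a single list before rendering, so the merged comment reads the same as a single-pass run.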
False positives are the killer. The first version flagged too many things as concerns. Engineers stopped reading the comments after two days. The second version added a confidence filter and reduced output to findings above 0.85 confidence — which cut the volume by 60% and made the remaining findings credible.
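With the `Finding` shape from the earlier sketch, the filter itself is one line; the threshold is the 0.85 cut described above:

```python
CONFIDENCE_THRESHOLD = 0.85

def filter_findings(findings: list[Finding]) -> list[Finding]:
    # Drop anything the model is not confident about: below the
    # threshold, a finding is more likely noise than signal.
    return [f for f in findings if f.confidence > CONFIDENCE_THRESHOLD]
```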
LLMs are good at pattern matching, not reasoning about intent. The security pattern detection is reliable because it is pattern-based. The architectural concern detection is weaker because it requires understanding design intent, which is not available in the diff alone. Giving the LLM access to the PR description and linked ticket significantly improved architectural analysis quality.
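A sketch of what that extra context might look like when assembling the architectural pass; the function and section layout are illustrative, not the production prompt:

```python
def build_architecture_prompt(diff: str, pr_description: str, ticket_text: str) -> str:
    """Assemble context for the architectural pass. The diff alone
    carries no design intent; the PR description and linked ticket
    are the closest available proxies for it."""
    return "\n\n".join([
        "You are reviewing a pull request for architectural concerns.",
        f"## PR description\n{pr_description}",
        f"## Linked ticket\n{ticket_text}",
        f"## Diff\n{diff}",
        "Flag concerns that require full-codebase context as 'needs human review'.",
    ])
```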
The best outcome is changing reviewer behavior, not replacing reviewers. The pipeline is most valuable when it changes what human reviewers look at — directing attention to the high-judgment concerns and away from the mechanical ones. Framing it as a reviewer assistant, not a replacement, was important for adoption.
Prototype. The pipeline is running on this repository's pull requests. The security pattern detection is reliable; the architectural concern detection is still being tuned. The next iteration will make the PR description and linked issue a standing part of the context for architectural analysis.
Labs are proof-of-concept. If this pattern applies to a problem you have, let's talk about what a production version would look like.