Deploy an offline-first agent that scans software repositories for evidence of feature implementation, producing a tiered confidence report. The agent runs at the edge with zero internet dependency, with optional LLM disambiguation when connectivity is available.
ARGUS is an evidence scanner, not a compliance certifier. It identifies structural indicators that a feature is present and classifies the strength of that evidence. Human reviewers use ARGUS output to prioritize where to focus their assessment, not to replace it.
Users define required features in a YAML configuration file. Each feature specifies what to look for using a layered check architecture, progressing from surface-level indicators to deeper structural analysis:

- Tier 1: file patterns, keywords, and dependency checks.
- Tier 2: AST queries confirming that relevant functions are defined and invoked, not just imported.
- Tier 3: call graph verification that the relevant calls are reachable from the expected entry points.
- Tier 4: test suite checks for related test cases and coverage.
The agent walks the target software directory, executes checks at each tier, and reports the highest evidence tier reached per feature. Results are output as both machine-readable JSON and a human-readable Markdown report.
ARGUS scores features by the depth of evidence found, not by counting pattern matches. A single Tier 3 signal outweighs any number of Tier 1 signals.
Features that reach only Tier 1 are flagged as low-confidence and marked for human review. Features reaching Tier 3 or Tier 4 are reported as high confidence. This tiered model replaces flat weighted scoring, which conflates quantity of evidence with quality of evidence.
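To make the scoring model concrete, the following is a minimal Python sketch of highest-tier-wins scoring and report output. The function names, the "any check at a tier counts as evidence" policy, and the output fields are illustrative assumptions, not the actual ARGUS API:

```python
import json

# Minimal sketch of highest-tier-wins scoring; names and the "any check at a
# tier counts as evidence" policy are illustrative, not the actual ARGUS API.
def score_feature(name, repo_path, checks_by_tier):
    """checks_by_tier maps tier number (1-4) to a list of check callables."""
    highest_tier = 0
    for tier in sorted(checks_by_tier):
        if any(check(repo_path) for check in checks_by_tier[tier]):
            highest_tier = tier               # depth of evidence, not match count
    return {
        "feature": name,
        "evidence_tier": highest_tier,
        "needs_review": highest_tier <= 1,    # Tier-1-only -> human review
    }

def write_report(results, path="argus_report.json"):
    """Write the machine-readable half of the report; Markdown is derived from it."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"results": results}, f, indent=2)
```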
Evidence tier reflects structural depth of analysis (how deep into the codebase ARGUS can verify), while confidence reflects the likelihood that the detected structure corresponds to a valid implementation. A Tier 3 result with low confidence (e.g., due to ambiguous call graph resolution) is reported differently from a Tier 3 result with high confidence. This separation prevents conflating depth of evidence with certainty of evidence.
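A compact illustration of keeping the two dimensions separate when classifying a result for the report; the thresholds and category labels here are assumptions, not the ARGUS schema:

```python
# Illustrative only: combine evidence tier and confidence without conflating
# them. Thresholds and labels are assumptions, not the ARGUS schema.
def report_category(evidence_tier: int, confidence: float) -> str:
    if evidence_tier >= 3 and confidence >= 0.8:
        return "HIGH_CONFIDENCE"
    if evidence_tier >= 3:
        return "DEEP_EVIDENCE_UNCERTAIN"   # e.g. ambiguous call graph resolution
    if evidence_tier <= 1:
        return "LOW_CONFIDENCE_REVIEW"
    return "PARTIAL"
```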
ARGUS performance is evaluated using precision and recall metrics across labeled datasets. Precision measures the proportion of features reported as present that truly are present (penalizing false positives, i.e., features incorrectly identified as present), while recall measures the proportion of truly present features that ARGUS detects (penalizing false negatives, i.e., features incorrectly reported as absent). F1 score provides a combined measure. Tier-specific accuracy is tracked to assess reliability at each evidence level. Evaluation datasets are composed of known implementations across supported languages, with cross-language performance comparisons to identify systematic gaps in language pack coverage.
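For reference, the metrics reduce to the standard definitions over evaluation counts, and tier-specific accuracy follows by restricting the counts to results at a single tier:

```python
# Standard precision/recall/F1 from labeled evaluation counts; tier-specific
# figures come from filtering the counts to results at a single evidence tier.
def prf1(true_pos: int, false_pos: int, false_neg: int):
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1
```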
Convert the user's feature requirements into the YAML configuration schema. Each requirement becomes a feature entry with checks defined at each applicable tier. For example, a requirement like "must support encrypted data at rest" translates to:
- T1: file patterns for encryption modules, dependency checks for cryptography libraries.
- T2: AST queries confirming encrypt/decrypt functions are called, not just imported.
- T3: call graph verification that encryption calls are reachable from data write paths.
- T4: test suite checks for encryption-related test cases and coverage.

Each feature must include a structured definition specifying scope, expected system boundaries, and acceptable implementation patterns to reduce variability in interpretation. Required fields include feature name, scope definition, expected entry points, and acceptable implementation patterns. Well-defined features produce consistent results across reviewers; poorly defined features introduce measurement noise that compounds across tiers.
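As a concrete illustration, a feature entry carrying the required fields might look like the following; the key names are hypothetical rather than the published schema, and the snippet simply parses the entry with PyYAML:

```python
import yaml  # PyYAML, already an ARGUS dependency

# Hypothetical feature entry; key names are illustrative, not the published schema.
FEATURE_YAML = """
features:
  - name: encrypted-data-at-rest
    scope: storage layer and persistence adapters
    entry_points: [save_record, flush_to_disk]
    acceptable_patterns: [aes-gcm, fernet, sqlcipher]
    checks:
      tier1: {file_patterns: ["*crypto*"], dependencies: [cryptography]}
      tier2: {ast_queries: [encrypt_call, decrypt_call]}
      tier3: {reachable_from: [save_record]}
      tier4: {test_patterns: ["test_*encrypt*"]}
"""

config = yaml.safe_load(FEATURE_YAML)
print(config["features"][0]["name"])  # -> encrypted-data-at-rest
```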
Select pre-built AST query patterns for the languages present in the target software. ARGUS uses tree-sitter for cross-language parsing, shipping language packs for Python, JavaScript/TypeScript, Go, Java, C#, Rust, and C/C++. Each pack includes common check patterns for standard features (authentication, encryption, logging, error handling). Users extend or override these patterns for domain-specific requirements.
This replaces the previous approach of requiring users to author raw regex patterns per language. Regex-based code matching is limited to Tier 1 checks only, as it cannot distinguish between code that exists and code that executes.
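A rough sketch of what a Tier 2 check looks like in practice, using py-tree-sitter to find actual call sites rather than keyword occurrences. The binding's interface has changed across releases; this follows the older (pre-0.22) API, and the grammar bundle path and language name are illustrative:

```python
from tree_sitter import Language, Parser

# Assumes a grammar bundle built ahead of time and shipped with the agent
# (path and language name are illustrative; pre-0.22 py-tree-sitter API).
PY_LANG = Language("langpacks/languages.so", "python")
parser = Parser()
parser.set_language(PY_LANG)

# Tier 2: match actual call sites, which a regex over "encrypt" cannot
# distinguish from an import statement or a comment.
CALL_QUERY = PY_LANG.query("(call function: (identifier) @fn)")

def tier2_encrypt_called(source: bytes) -> bool:
    tree = parser.parse(source)
    for node, _capture in CALL_QUERY.captures(tree.root_node):
        if b"encrypt" in source[node.start_byte:node.end_byte]:
            return True
    return False
```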
Run the agent against known software samples where expected outcomes are already understood. Calibration now operates at two levels.
Calibration samples must be diverse. Tuning against a narrow set of codebases produces an agent that only recognizes implementations structured like the calibration samples. Calibration datasets must include at least five structurally distinct implementations per feature, spanning multiple frameworks and architectural patterns, to reduce overfitting.
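A calibration harness can stay very small; the sketch below assumes a `scan` callable that returns the reported evidence tier for one feature in one sample repository (both names are illustrative):

```python
# Calibration harness sketch (illustrative): compare reported tiers against
# known expected outcomes across structurally distinct sample codebases.
def calibrate(samples, scan):
    """samples: list of (repo_path, feature_name, expected_tier);
    scan: callable returning the reported evidence tier for that pair."""
    mismatches = []
    for repo_path, feature, expected_tier in samples:
        reported = scan(repo_path, feature)
        if reported != expected_tier:
            mismatches.append((repo_path, feature, expected_tier, reported))
    agreement = (1 - len(mismatches) / len(samples)) if samples else 0.0
    return agreement, mismatches
```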
Select the runtime appropriate for the target environment: Python for rapid iteration or Go for a single compiled binary with zero runtime dependencies. Tree-sitter language grammars are bundled as shared libraries alongside the agent. Package the agent with its configuration file and language packs for distribution to edge nodes.
When internet access is available, the agent can forward ambiguous findings to any OpenAI-compatible LLM API (Ollama, LM Studio, or cloud providers). The LLM is used strictly for disambiguation, not scoring:
Only PARTIAL-confidence results are sent to the LLM. This ensures that offline and online runs produce identical results for unambiguous cases; ambiguous cases are flagged as "needs review" in offline mode rather than silently scored differently.
LLM responses are recorded as advisory signals with associated confidence scores (0–1). Ambiguous results may remain classified as UNCERTAIN rather than forced into binary classification. LLM decisions are logged separately and do not overwrite original evidence tier assignments.
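A sketch of the disambiguation call against an OpenAI-compatible chat-completions endpoint; the URL, model name, and prompt wording are assumptions, and the verdict is returned as an advisory record rather than a replacement tier:

```python
import json
import urllib.request

# Endpoint, model, and prompt are illustrative; any OpenAI-compatible server
# (Ollama, LM Studio, or a cloud provider) exposes this chat-completions route.
LLM_URL = "http://localhost:11434/v1/chat/completions"

def disambiguate(snippet: str, feature: str) -> dict:
    payload = {
        "model": "llama3",
        "messages": [{
            "role": "user",
            "content": (
                f"Does this snippet implement '{feature}'? "
                f"Answer PRESENT, ABSENT, or UNCERTAIN.\n\n{snippet}"
            ),
        }],
    }
    req = urllib.request.Request(
        LLM_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        answer = json.load(resp)["choices"][0]["message"]["content"]
    # Advisory only: logged next to the finding, never overwriting the tier.
    return {"llm_verdict": answer.strip(), "advisory": True}
```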
Ground truth labels are established via dual review by domain experts (senior engineers or subject matter specialists), with disagreements resolved through adjudication. Each feature is labeled as Present, Absent, or Ambiguous based on predefined criteria. Expected outcomes are derived from human-labeled datasets, reference implementations, and gold-standard repositories. Labeling criteria define what counts as "feature present" at each evidence tier.
All findings include traceable evidence artifacts, enabling reviewers to independently verify each classification. Each result includes the exact file paths where evidence was found, AST match snippets showing the relevant code structure, and call graph traces demonstrating reachability. This audit trail ensures that no ARGUS result requires trust in a black-box process.
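Serialized, one finding's audit trail might look like the following (field names are illustrative, not the published report schema):

```python
# Illustrative shape of one finding's audit trail in the JSON report
# (field names are not the published schema).
finding = {
    "feature": "encrypted-data-at-rest",
    "evidence_tier": 3,
    "confidence": 0.87,
    "evidence": {
        "files": ["src/storage/crypto.py"],
        "ast_matches": ["call: encrypt_blob(payload) @ crypto.py:42"],
        "call_graph_trace": ["save_record -> serialize -> encrypt_blob"],
    },
}
```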
ARGUS is a static evidence scanner. The following limitations are inherent to this approach and cannot be fully resolved without runtime analysis:
The agent performs static analysis only. It does not execute, compile, or runtime-test the target software. Evidence quality depends on the depth of analysis possible for each language (AST and call graph support varies).
Language support requires tree-sitter grammars and pre-built AST query patterns. Languages with full Tier 1–4 capability are Python, JavaScript/TypeScript, Go, Java, C#, Rust, and C/C++. Languages without a tree-sitter grammar are limited to Tier 1 evidence (file patterns, keywords, and dependency checks only); for these, ARGUS outputs a warning flag in the report indicating degraded analysis depth, and expected degradation behavior is documented per tier for each unsupported language category.
All processing is local. No data leaves the machine unless LLM disambiguation mode is explicitly enabled by the operator. When LLM mode is enabled, only ambiguous code snippets are transmitted, never the full codebase.
All analyses are deterministic given identical inputs, configuration, and language pack versions, enabling reproducible results across environments. Configuration files, AST query patterns, and language pack versions are tracked to ensure that any result can be independently reproduced.
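One simple way to make that tracking concrete is to fingerprint the inputs that determine a run; the sketch below hashes the configuration file, language pack versions, and agent version (function and field names are assumptions):

```python
import hashlib
import json

# Sketch: fingerprint the inputs that determine a run so any result can be
# tied to the exact configuration and language pack versions that produced it.
def run_fingerprint(config_path: str, langpack_versions: dict, argus_version: str) -> str:
    h = hashlib.sha256()
    with open(config_path, "rb") as f:
        h.update(f.read())
    h.update(json.dumps(langpack_versions, sort_keys=True).encode())
    h.update(argus_version.encode())
    return h.hexdigest()
```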
Future validation will benchmark ARGUS against manually reviewed codebases to quantify detection accuracy and identify systematic bias. Benchmarking will measure precision, recall, and F1 across each evidence tier and each supported language. Results will be published alongside the tool to provide users with empirically grounded expectations for detection reliability. The benchmarking dataset will be versioned and made available for independent reproduction.
ARGUS comprises a Python review engine with tree-sitter bindings, a lightweight HTTP server, and a self-contained HTML dashboard. All components run at the edge with zero internet dependency.
However, the current implementation has external dependencies: PyYAML for configuration parsing and tree-sitter language grammars for AST analysis. Installing these requires pip, which requires internet access. This creates a gap in the offline deployment story: a machine that has never been connected to the internet cannot run ARGUS out of the box.
This section evaluates four options for eliminating that gap, each with different tradeoffs around complexity, user experience, and maintainability.
Copy the pure-Python PyYAML source files and pre-compiled tree-sitter grammar binaries directly into the ARGUS project folder. Python's import system finds them locally, eliminating the need for pip entirely. The only prerequisite is a Python 3.x installation on the target machine.
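The mechanism behind Option 1 is ordinary `sys.path` manipulation; a minimal sketch, assuming the vendored packages live in a `vendor/` directory inside the project:

```python
# argus/__main__.py (illustrative): prefer the vendored copies shipped in the
# project folder so a machine with only a bare Python 3.x install can run ARGUS.
import os
import sys

VENDOR_DIR = os.path.join(os.path.dirname(__file__), "vendor")
sys.path.insert(0, VENDOR_DIR)

import yaml  # resolved from argus/vendor/yaml/, no pip needed
```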
Use PyInstaller or cx_Freeze to bundle the Python runtime, all dependencies, tree-sitter grammar, and the HTML dashboard into a single executable. The end user double-clicks to run. No Python installation required.
Remove YAML support entirely and switch the configuration format to JSON, which Python handles natively. This eliminates the PyYAML dependency but does not address tree-sitter. It must be combined with Option 1 or 2 for full offline AST capability.
Default to JSON for universal compatibility but also accept YAML configuration files if PyYAML is available on the host machine. The system auto-detects the format based on file extension and gracefully degrades to JSON-only when YAML support is absent.
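A minimal sketch of the Option 4 loader: JSON always works, and YAML is accepted only when PyYAML happens to be importable on the host:

```python
import json

def load_config(path: str) -> dict:
    """Load JSON natively; fall back to YAML only if PyYAML is available."""
    if path.endswith((".yaml", ".yml")):
        try:
            import yaml
        except ImportError:
            raise RuntimeError(
                "YAML config requires PyYAML; convert this file to JSON for offline hosts"
            )
        with open(path, "r", encoding="utf-8") as f:
            return yaml.safe_load(f)
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```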
Option 1 (Vendor Dependencies) is the recommended path. It preserves the best user experience: human-readable YAML configuration, full AST analysis capability, no install steps, no format migration. The only requirement is that the target machine has Python installed. The overhead is manageable: vendored Python files plus pre-compiled tree-sitter grammar for each target platform.
For environments where even Python cannot be guaranteed, Option 2 (standalone executable) provides a strong fallback, though it requires a one-time build on a connected machine.
Summary of the four offline-deployment options across the dimensions that matter for an air-gapped or intermittently connected target environment.
| Option | Recommendation | YAML Support | Full AST Analysis | End-User Setup | Build Complexity | Python Required on Target | Best Fit |
|---|---|---|---|---|---|---|---|
| 1. Vendor Dependencies | Recommended | Yes (full) | Yes (Tiers 1–4) | Run script | Low | Yes (3.x) | Most edge environments with Python pre-installed |
| 2. Standalone Executable | Strong | Yes (full) | Yes (Tiers 1–4) | Double-click | High (one-time) | No | Targets without guaranteed Python runtime |
| 3. JSON-Only Config | Viable | No (migrate to JSON) | Only if combined with 1 or 2 | Run script | Low | Yes (3.x) | Strict offline + tolerant of JSON-only config |
| 4. Hybrid JSON + YAML | Conditional | If PyYAML present | If language packs present | Run script | Medium | Yes (3.x) | Mixed-environment fleets with variable connectivity |
Option 1 is the recommended path because it delivers full Tier 1–4 capability with the lowest end-user setup cost in any environment that already has Python. Option 2 is the right fallback when Python on the target cannot be guaranteed. Options 3 and 4 are documented for environments that explicitly require JSON-only configuration or that need to run with or without YAML support depending on the host.
This white paper is also available as a PDF for offline reading and citation. Cite as: Anna R. Dudley, "ARGUS: Automated Review & Grading Utility for Software," annardudley.com, April 2026.