CI Caching Troubleshooting Guide
Common Issues
Cache Not Working (Jobs Always Running)
Symptoms:
- Jobs run on every commit despite no relevant changes
- should-run output is always true
Diagnosis:
    # Check the composite action logs in the GitHub Actions UI
    # Look for "Cache hit" vs "No cache found" messages
Common causes:
- Check name mismatch
  - Logs show: “Check name validation failed”
  - Solution: Verify that the check-name parameter matches GitHub’s format
  - Template: job-{hash} or job-{hash} (param1, param2)
- Path filters too narrow
  - Logs show: “No relevant changes detected”
  - Solution: Review the path-filters regex and ensure it captures all dependencies
  - Example: Missing transitive dependencies
- GitHub Checks API failure
  - Logs show: “API query failed after 3 attempts”
  - Solution: Wait and retry; check the GitHub status page
  - Temporary issue: Jobs run as a failsafe
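To rule out a check name mismatch quickly, the template above can be turned into a small pattern test. This is a minimal sketch, not the action's own validation: `is_valid_check_name` is a hypothetical helper, and it assumes `{hash}` is 8 lowercase hex characters, as in the example `nix-a1b2c3d4 (packages, x86_64-linux)` used elsewhere in this guide.

```shell
# Hypothetical helper (not part of the composite action).
# Succeeds when NAME looks like:  job-{hash}  or  job-{hash} (param1, param2)
# Assumption: {hash} is exactly 8 lowercase hex characters.
is_valid_check_name() {
  printf '%s\n' "$1" |
    grep -Eq '^[a-z0-9][a-z0-9_-]*-[0-9a-f]{8}( \([^()]+\))?$'
}
```

A possible use is `is_valid_check_name "$name" || echo "unexpected check name: $name"`, with `$name` copied from the GitHub UI.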
Resolution steps:
    # 1. Check recent workflow runs
    gh run list --workflow=ci.yaml --limit=5
    # 2. View specific run logs
    gh run view <run-id> --log
    # 3. Search for cache decision logs
    gh run view <run-id> --log | grep "should-run"
    # 4. Force a run to test
    gh workflow run ci.yaml -f force_run=true
Jobs Skipped When They Should Run
Symptoms:
- Changes made but job skipped
- Test failures not caught until later
Diagnosis:
    # Check whether a check run exists from a previous commit
    gh api repos/$OWNER/$REPO/commits/$SHA/check-runs \
      --jq '.check_runs[] | select(.name | contains("job-name"))'
Common causes:
- Stale check from force-push
  - Old check run still present after force-push
  - Solution: Wait 24 hours (automatic expiration)
  - Workaround: Use the force_run=true parameter
- Path filters too broad
  - Filters don’t detect relevant changes
  - Solution: Make filters more precise
  - Example: \.nix$ misses .nix.example files
- Workflow definition changed
  - Config hash changed, so the check name is different
  - Solution: This should auto-invalidate (check the implementation)
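The `\.nix$` example can be reproduced without touching CI by feeding file names through the same regex. The sketch below assumes the path filter is an extended regular expression applied to each changed path. `filter_matches` is a hypothetical helper, and the widened regex is only one possible fix.

```shell
# Hypothetical helper: read file paths on stdin and print the ones
# that the given extended regex (the path filter) matches.
# "|| true" keeps "no matches" from counting as a failure.
filter_matches() {
  grep -E -- "$1" || true
}

# With '\.nix$' only flake.nix is printed; config.nix.example is missed.
printf '%s\n' flake.nix config.nix.example | filter_matches '\.nix$'

# A widened filter (an illustrative assumption) prints both files.
printf '%s\n' flake.nix config.nix.example | filter_matches '\.nix(\.example)?$'
```

Against a real commit, the input could come from `git diff --name-only HEAD~1` instead of `printf`.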
Resolution steps:
    # 1. Force run a specific job
    gh workflow run ci.yaml -f job=job-name -f force_run=true
    # 2. Check that path filters match changed files
    git diff --name-only HEAD~1 | grep -E 'your-filter-regex'
    # 3. Verify check name format
    # Look in GitHub UI: Actions → Workflow Run → Check names
Rate Limit Errors
Symptoms:
- Logs show: “GitHub API rate limit exceeded”
- Multiple workflows failing simultaneously
Diagnosis:
    # Check rate limit status
    gh api rate_limit
Solution:
- Wait for reset:
      # Check reset time (shown in error logs)
      # Rate limit resets: [timestamp]
- Reduce workflow frequency:
  - Combine multiple small commits
  - Use draft PRs for work-in-progress
  - Disable workflow on WIP branches
- Temporary workaround:
      # Use workflow_dispatch with selective jobs
      gh workflow run ci.yaml -f job=specific-job
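For the "wait for reset" option, it can help to convert the reset timestamp into a wait time. The sketch below is a hypothetical helper, `seconds_until_reset`, which assumes the reset time is a Unix epoch value, as the GitHub rate limit endpoint reports it.

```shell
# Hypothetical helper: print how many seconds remain until RESET_EPOCH.
# Usage: seconds_until_reset RESET_EPOCH [NOW_EPOCH]
# NOW_EPOCH defaults to the current time; results never go below 0.
seconds_until_reset() {
  reset=$1
  now=${2:-$(date +%s)}
  remaining=$((reset - now))
  if [ "$remaining" -lt 0 ]; then
    remaining=0
  fi
  echo "$remaining"
}
```

One way to feed it is `seconds_until_reset "$(gh api rate_limit --jq '.resources.core.reset')"`, which reads the reset time of the core REST quota.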
Production Deployment with Stale Results
Symptoms:
- Production release succeeded but builds were cached
- Tests didn’t actually run
Expected behavior:
- Main branch: All jobs forced to run fresh
- Production releases: Always use force_run=true
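The expected behavior can be read as a small decision rule. The sketch below is an illustration under assumptions, not the composite action's actual decide step: `decide_should_run` and its three arguments are hypothetical names, and "cache hit" stands for "a matching earlier result exists".

```shell
# Hypothetical sketch of the decision rule described above.
# Usage: decide_should_run BRANCH FORCE_RUN CACHE_HIT
# Prints "true" when the job should run, "false" when it may be skipped.
decide_should_run() {
  branch=$1
  force_run=$2
  cache_hit=$3
  if [ "$branch" = "main" ] || [ "$force_run" = "true" ]; then
    echo "true"    # main and forced runs never reuse earlier results
  elif [ "$cache_hit" = "true" ]; then
    echo "false"   # an earlier result exists and nothing forces a rerun
  else
    echo "true"    # no earlier result: run as usual
  fi
}
```

Under this rule a skipped job on main, or in a release triggered with force_run=true, indicates that the forcing logic was bypassed somewhere.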
Diagnosis:
    # Check if a production job dependency was skipped
    gh run view <run-id> --log | grep "typescript.*skipped"
    gh run view <run-id> --log | grep "nix.*skipped"
This should never happen after Agent 2 implementation:
- Main branch forces fresh builds
- Production requires success (not skipped)
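"Requires success (not skipped)" can be expressed as an explicit gate on each dependency's result. The following is a minimal sketch with a hypothetical helper, `require_success`. It assumes the result strings GitHub Actions reports for a job (`success`, `failure`, `cancelled`, `skipped`) are passed in by the caller.

```shell
# Hypothetical gate: fail unless the dependency finished with "success".
# Usage: require_success JOB_NAME RESULT
# A "skipped" result is rejected the same way as a failure.
require_success() {
  job=$1
  result=$2
  if [ "$result" != "success" ]; then
    echo "blocking release: $job finished with result '$result'" >&2
    return 1
  fi
}
```

In a workflow step, RESULT would typically come from an expression such as `needs.<job>.result`, where the job name depends on your workflow.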
If it happens:
    # 1. Check the force-run parameter
    grep "force-run" .github/workflows/ci.yaml
    # 2. Verify job conditions
    grep -A5 "production-release-packages" .github/workflows/ci.yaml
    # 3. Immediately force a fresh build
    gh workflow run ci.yaml -f force_run=true
Debugging Techniques
View Cache Decision Process
    # Get workflow run logs
    gh run view <run-id> --log > run.log
    # Search for cache decision steps
    grep -A10 "=== Execution Decision ===" run.log
    # Search for validation steps
    grep -A10 "=== Check Name Validation ===" run.log
    # Search for API calls
    grep -A5 "Querying execution history" run.log
Verify Check Name Format
    # List all checks for a commit
    gh api repos/$OWNER/$REPO/commits/$SHA/check-runs \
      --jq '.check_runs[].name' | sort
    # Compare with expected format
    # Expected: job-{8-hex-chars} (params)
    # Example: nix-a1b2c3d4 (packages, x86_64-linux)
Test Composite Action Locally
    # Run test workflow
    gh workflow run test-composite-actions.yaml
    # View results
    gh run list --workflow=test-composite-actions.yaml --limit=1
    gh run view <run-id> --log
Check Configuration Hash
    # Hash should change when workflow/action changes
    git log --oneline -1 -- .github/workflows/ci.yaml
    git log --oneline -1 -- .github/actions/cached-ci-job/action.yaml
    # Check name should include hash
    gh run view <run-id> --log | grep "Resolved check name"
Emergency Procedures
Disable Caching Globally
If caching is causing critical issues:
    # Edit composite action
    git checkout -b disable-cache
    # In .github/actions/cached-ci-job/action.yaml
    # Change decide step to always return:
    echo "should-run=true" >> "$GITHUB_OUTPUT"
    # Commit and push
    git commit -am "temp: disable CI caching"
    git push -u origin disable-cache
    # Create emergency PR
    gh pr create --title "EMERGENCY: Disable CI caching" --body "Debugging cache issue"
Force All Jobs to Run
    # For specific PR
    gh workflow run ci.yaml -f force_run=true
    # For main branch (production)
    git checkout main
    git commit --allow-empty -m "force: trigger fresh builds"
    git push
Prevention
Before Changing Workflows
- Test in PR first:
      # Make changes in feature branch
      # Push and observe cache behavior
      # Verify check names in GitHub UI
- Use test workflow:
      gh workflow run test-composite-actions.yaml --ref feature-branch
- Monitor first few runs:
  - Check for validation failures
  - Verify cache hits/misses as expected
  - Review new check name formats
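Monitoring the first few runs is easier with a one-line summary per log. The sketch below counts the messages this guide mentions ("Cache hit", "No cache found", "Check name validation failed") in a saved log file. `summarize_cache_log` is a hypothetical helper, and it assumes those strings appear verbatim in your logs.

```shell
# Hypothetical helper: summarize cache decisions in a saved run log.
# Usage: summarize_cache_log LOGFILE
# Counts lines containing the messages named in this guide.
summarize_cache_log() {
  log=$1
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
  hits=$(grep -c "Cache hit" "$log" || true)
  misses=$(grep -c "No cache found" "$log" || true)
  failures=$(grep -c "Check name validation failed" "$log" || true)
  echo "hits=$hits misses=$misses validation_failures=$failures"
}
```

A log for this purpose can be saved with `gh run view <run-id> --log > run.log` and then passed as `summarize_cache_log run.log`.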
Regular Audits
    # Monthly review of cache effectiveness
    # (duration is the difference between updatedAt and startedAt)
    gh run list --workflow=ci.yaml --limit=50 --json conclusion,name,startedAt,updatedAt
    # Check for anomalies
    # - Duration spikes (cache not working)
    # - All jobs completing too fast (over-caching)
Getting Help
If troubleshooting doesn’t resolve the issue:
- Check documentation: Review ADR-0016 for architecture details
- Review recent changes:
      git log --oneline .github/actions/cached-ci-job/
- File an issue: Include the workflow run URL and relevant logs
- Emergency contact: Disable caching as interim solution