Secrets Sentinel — v5.0.0
Context-aware AI secret detection for CI/CD pipelines, pre-commit hooks, and code scanners.
Fine-tuned DeBERTa-v3-base · F1 = 0.9999 · Precision = 1.0 · <500ms inference · Runs fully on-prem.
What's New in v5.0.0
v5 is a full fine-tune of all 184M parameters on data_v10 (1.14M training lines, 195 negative patterns), fixing the remaining false negatives from v4 and further reducing false positives:
| Fix | Example |
|---|---|
| CloudFormation Default: passwords | Default: MyS3cr3tPass! now correctly flagged |
| Redis --requirepass in docker-compose | command: redis-server --requirepass s3cr3t now correctly flagged |
Combined with all v4 false-positive fixes:
| Pattern (safe, not flagged) | Example |
|---|---|
| GitHub Actions pinned SHAs | uses: docker/login-action@5e57cd118...ef # v3.6.0 |
| .env.example placeholders | REDIS_PASSWORD=null, DB_PASSWORD=YOURPASSWORD |
| PHP config null values | 'smtp_password' => null, |
| PHP hashids alphabets | 'alphabet' => 'XKyIAR7mgt8jD2vbqPrOSVenNG...' |
| Laravel route uses key | 'uses' => 'Auth\SamlController@login' |
| Test email addresses | enableAdminCC('cc@example.com') |
| Git commit SHAs in test strings | "foo [link user@be6a8cc1c1ec...]" |
| Spring annotation values | value = "LDAPInjectionVulnerability" |
| JWT expired test payloads | {"iat": 1508639612, "exp": 9999999999} |
| DAV/iCal date fields | 'BDAY' => '20251106' |
| Public installer curls | curl -fsSL https://getcomposer.org/installer |
| Localhost service URLs | curl http://localhost:8080/health |
| PHP redirect assertions | $response->assertRedirect('/login') |
| Dockerfile ENV var refs | ENV DB_PASSWORD="${DB_PASSWORD}" |
The Problem
Secrets pushed to repositories create a critical, expensive security incident:
- Credential rotation is mandatory the moment a key hits git history — even for 1 second
- Regex scanners miss generic secrets (password = "s3cr3t!" has no known pattern)
- LLMs are too slow for pre-receive hooks with 5-second time limits
- False positives kill adoption — developers disable scanners that cry wolf
Why This Model?
| Approach | Speed | Generic Secrets | FP Rate | Cost |
|---|---|---|---|---|
| Regex (gitleaks, trufflehog) | ⚡ Fast | ✗ Pattern-only | High | Free |
| Large LLMs (GPT-4, Claude) | 🐢 >30s | ✓ Excellent | Low | High |
| Secrets Sentinel (this model) | ⚡ <500ms | ✓ Excellent | Very Low | Free |
Key advantage: Context-aware inference. password = os.environ.get('DB_PASS') is safe; password = "hunter2" is not. Regex tools can't tell the difference.
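To make the contrast concrete, here is a minimal sketch that applies a deliberately naive assignment rule to the two lines above. The pattern `NAIVE_RULE` is hypothetical and written for illustration; it is not taken from gitleaks or trufflehog. It flags both lines because it only sees that something is assigned to a password-like name, not what that something is.

```python
import re

# Hypothetical, deliberately naive rule: "anything assigned to a password-like name".
NAIVE_RULE = re.compile(r"(?i)\b(password|passwd|secret|api_key)\b\s*=\s*\S+")

lines = [
    "password = os.environ.get('DB_PASS')",  # safe: value is read at runtime
    'password = "hunter2"',                   # unsafe: literal credential
]

# The rule cannot separate the two cases: both assignments match.
flags = [bool(NAIVE_RULE.search(line)) for line in lines]
for line, flagged in zip(lines, flags):
    print(f"{'FLAGGED' if flagged else 'ok     '}  {line}")
```

Tightening the pattern to exclude `os.environ` would help for this one case, but every language and config format has its own way of referencing a variable, which is the gap a context-aware classifier is meant to close.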
Model Details
| Property | Value |
|---|---|
| Architecture | DeBERTa-v3-base (Microsoft) |
| Parameters | 184M total (86M backbone + 98M embeddings) |
| Task | Binary sequence classification |
| Input | Single code line or diff line (max 128 tokens) |
| Output | LABEL_0 = safe, LABEL_1 = secret detected |
| Max line length | 128 tokens (~500 characters) |
| Inference (GPU) | ~100–200ms/line, ~5ms/line batched |
| Inference (CPU) | ~500ms/line, ~50ms/line batched |
| Model size | ~750 MB (safetensors), ~244 MB (ONNX INT8) |
| GPU memory | ~800 MB (inference) |
| License | MIT |
Training Data (v5)
| Source | Lines | Label |
|---|---|---|
| Synthetic generator (data_v10, 162 positive + 195 negative patterns) | ~1,106,000 | Mixed |
| Real-world labeled examples (OWASP apps, scanner fixtures, vuln repos) | ~37,900 | Labeled |
| Total | ~1,144,000 | — |
Evaluation History
| Version | F1 | Precision | Recall | Accuracy | eval_loss |
|---|---|---|---|---|---|
| v1.0.0 | 0.9910 | 0.9905 | 0.9915 | 0.9920 | — |
| v2.0.0 | 0.9976 | 0.9975 | 0.9977 | 0.9977 | 0.0112 |
| v3.0.0 | 0.9994 | 0.9994 | 0.9994 | 0.9995 | 0.0051 |
| v4.0.0 | 0.9992 | 0.9983 | 1.0000 | 0.9992 | ~0.001 |
| v5.0.0 | 0.9999 | 1.0000 | 0.9998 | 0.9999 | ~0.000 |
v5 achieves Precision = 1.0 (zero false positives on the held-out eval set) with near-perfect recall, trained as a full fine-tune on data_v10 which incorporates all confirmed real-world FP and FN patterns at scale.
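The F1 column is the harmonic mean of precision and recall, so it can be re-derived from the other two columns. The sketch below does this for two rows of the table; the helper name `f1_score` is an illustrative choice, and because the table values are rounded to four decimals, a recomputed F1 may differ in the last digit for some rows.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as listed in the table above.
history = {"v3.0.0": (0.9994, 0.9994), "v5.0.0": (1.0000, 0.9998)}

f1_by_version = {v: round(f1_score(p, r), 4) for v, (p, r) in history.items()}
print(f1_by_version)
```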
Quick Start
Simplest Usage
from transformers import pipeline

detector = pipeline("text-classification", model="hypn05/secrets-sentinel")

lines = [
    "AWS_SECRET_ACCESS_KEY = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'",
    "password = os.environ.get('DB_PASSWORD')",
    "api_key = 'sk-proj-abc123def456ghi789'",
    "uses: docker/login-action@5e57cd118135c172c3672efd75eb46360885c0ef",
    "DB_PASSWORD=null",
]

for line, result in zip(lines, detector(lines)):
    label = "SECRET" if result["label"] == "LABEL_1" else "safe "
    print(f"[{label}] {result['score']:.1%} {line[:70]}")
Output:
[SECRET] 100.0% AWS_SECRET_ACCESS_KEY = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
[safe ] 99.8% password = os.environ.get('DB_PASSWORD')
[SECRET] 99.9% api_key = 'sk-proj-abc123def456ghi789'
[safe ] 99.9% uses: docker/login-action@5e57cd118135c172c3672efd75eb46360885c0ef
[safe ] 99.7% DB_PASSWORD=null
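Note that the percentage printed above is the confidence in the predicted label, not the probability of a secret: a line reported as safe at 99.8% has a secret probability of about 0.2%. A minimal sketch of the conversion follows, assuming the two label scores sum to 1 (a softmax over two classes); `secret_probability` is a hypothetical helper name, and the sample dictionaries only mimic the shape of the pipeline output.

```python
def secret_probability(result: dict) -> float:
    """Map a pipeline result ({'label', 'score'}) to P(secret)."""
    score = float(result["score"])
    # LABEL_1 means "secret"; for LABEL_0 the score is P(safe), so invert it.
    return score if result["label"] == "LABEL_1" else 1.0 - score

# Results shaped like the pipeline output shown above (values copied from it).
sample = [
    {"label": "LABEL_1", "score": 0.999},
    {"label": "LABEL_0", "score": 0.998},
]

probs = [secret_probability(r) for r in sample]
flagged = [p >= 0.85 for p in probs]
print(flagged)
```

Working with a single secret probability keeps thresholds comparable between the pipeline API and code that reads the logits directly.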
Production Class (Batched, GPU-Optimised)
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from typing import List, Tuple

class SecretsScanner:
    """
    Production-ready secret scanner with batched inference.
    Designed for pre-receive hooks (5-second budget) and CI/CD pipelines.
    """

    def __init__(self, threshold: float = 0.85, device: str = None, batch_size: int = 256):
        self.threshold = threshold
        self.batch_size = batch_size
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        model_id = "hypn05/secrets-sentinel"
        self.tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
        self.mdl = AutoModelForSequenceClassification.from_pretrained(
            model_id,
            torch_dtype=torch.float16 if self.device == "cuda" else torch.float32,
        ).to(self.device).eval()

    def scan(self, lines: List[str]) -> List[Tuple[str, float, bool]]:
        """
        Returns list of (line, confidence, is_secret) tuples.
        Only lines with confidence >= threshold are flagged.
        """
        results = []
        for i in range(0, len(lines), self.batch_size):
            batch = lines[i : i + self.batch_size]
            enc = self.tok(batch, padding=True, truncation=True,
                           max_length=128, return_tensors="pt").to(self.device)
            with torch.inference_mode():
                probs = torch.softmax(self.mdl(**enc).logits, dim=1)[:, 1]
            for line, prob in zip(batch, probs.cpu().tolist()):
                results.append((line, prob, prob >= self.threshold))
        return results

    def scan_diff(self, diff: str) -> List[Tuple[str, float]]:
        """Scan a git diff; only added lines (starting with +) are checked."""
        added = [l[1:].strip() for l in diff.splitlines()
                 if l.startswith("+") and not l.startswith("+++")]
        return [(line, conf) for line, conf, flag in self.scan(added) if flag]
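The added-line filter inside `scan_diff` can be exercised without loading the model. The sketch below isolates it as a standalone function (the name `added_lines` and the sample diff are illustrative) so the behaviour on a unified diff is easy to check.

```python
from typing import List

def added_lines(diff: str) -> List[str]:
    """Return the content of lines added by a unified diff, skipping '+++' file headers."""
    return [
        line[1:].strip()
        for line in diff.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]

sample_diff = """\
diff --git a/config.py b/config.py
--- a/config.py
+++ b/config.py
@@ -1,2 +1,3 @@
 import os
-password = os.environ.get("DB_PASSWORD")
+password = "hunter2"
+debug = False
"""

extracted = added_lines(sample_diff)
print(extracted)
```

One edge case worth knowing: an added line whose own content begins with `++` appears in the diff as `+++...` and is skipped by this filter along with the file headers.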
Integration Examples
1. Git Pre-Receive Hook
#!/usr/bin/env python3
# Save as: .git/hooks/pre-receive (chmod +x)
import sys, subprocess, torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

THRESHOLD = 0.85
MODEL_ID = "hypn05/secrets-sentinel"
ZERO_SHA = "0" * 40  # git sends all zeros as old/new when a ref is created/deleted
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # SHA-1 of git's empty tree

tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
mdl = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

def check_push(old, new, ref):
    if new == ZERO_SHA:  # ref deleted: nothing to scan
        return []
    if old == ZERO_SHA:  # new ref: no previous tip, so diff against the empty tree
        old = EMPTY_TREE
    diff = subprocess.check_output(
        ["git", "diff", old, new, "--unified=0"], text=True, errors="ignore"
    )
    added = [l[1:] for l in diff.splitlines() if l.startswith("+") and not l.startswith("+++")]
    if not added:
        return []
    enc = tok(added, padding=True, truncation=True, max_length=128, return_tensors="pt")
    with torch.inference_mode():  # scoring only, no gradients needed
        probs = torch.softmax(mdl(**enc).logits, dim=1)[:, 1].tolist()
    return [(line, p) for line, p in zip(added, probs) if p >= THRESHOLD]

secrets = []
for line in sys.stdin:
    old, new, ref = line.split()
    secrets.extend(check_push(old, new, ref))

if secrets:
    print(f"\n\033[31m✗ PUSH REJECTED - {len(secrets)} secret(s) detected:\033[0m")
    for line, conf in secrets[:5]:
        print(f" [{conf:.0%}] {line[:100]}")
    sys.exit(1)
print("\033[32m✓ No secrets detected.\033[0m")
2. GitHub Actions Workflow
# .github/workflows/secret-scan.yml
name: Secret Detection
on:
  pull_request:
  push:
    branches: [main, master, develop]
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install transformers torch --quiet
      - name: Scan for secrets
        env:
          # PR base commit on pull_request events, previous branch tip on push events.
          # Without this, a push to main would diff HEAD against itself and scan nothing.
          BASE_SHA: ${{ github.event.pull_request.base.sha || github.event.before }}
        run: |
          python3 - <<'EOF'
          import os, subprocess, sys, torch
          from transformers import pipeline

          detector = pipeline("text-classification", model="hypn05/secrets-sentinel",
                              device=0 if torch.cuda.is_available() else -1)

          base = os.environ.get("BASE_SHA", "")
          if not base or set(base) == {"0"}:
              # Newly created branch: fall back to the merge-base with origin/main
              base = subprocess.check_output(
                  ["git", "merge-base", "HEAD", "origin/main"], text=True
              ).strip()
          diff = subprocess.check_output(
              ["git", "diff", base, "HEAD", "--unified=0"], text=True, errors="ignore"
          )
          lines = [l[1:] for l in diff.splitlines()
                   if l.startswith("+") and not l.startswith("+++") and l[1:].strip()]
          if not lines:
              print("No changed lines to scan.")
              sys.exit(0)

          results = detector(lines, batch_size=64, truncation=True, max_length=128)
          findings = [(l, r["score"]) for l, r in zip(lines, results)
                      if r["label"] == "LABEL_1" and r["score"] >= 0.85]
          if findings:
              print(f"::error::Found {len(findings)} potential secret(s)!")
              for line, conf in findings:
                  print(f" [{conf:.0%}] {line[:120]}")
              sys.exit(1)
          print(f"✓ Scanned {len(lines)} lines, no secrets detected.")
          EOF
3. Pre-Commit Hook (local development)
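# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: secrets-sentinel
        name: Secrets Sentinel
        language: python
        additional_dependencies: [transformers, torch]
        # Literal block (|) keeps the newlines that the inline Python needs.
        entry: |
          python3 -c "
          import subprocess, sys
          from transformers import pipeline
          # pre-commit does not pipe the diff to stdin, so read the staged changes directly
          diff = subprocess.check_output(['git', 'diff', '--cached', '--unified=0'], text=True, errors='ignore')
          lines = [l[1:] for l in diff.splitlines()
                   if l.startswith('+') and not l.startswith('+++') and l[1:].strip()]
          if not lines:
              sys.exit(0)
          detector = pipeline('text-classification', model='hypn05/secrets-sentinel')
          results = detector(lines, batch_size=32, truncation=True, max_length=128)
          found = [(l, r) for l, r in zip(lines, results) if r['label'] == 'LABEL_1' and r['score'] >= 0.85]
          if found:
              print(f'BLOCKED: {len(found)} secret(s) detected')
              sys.exit(1)
          "
        pass_filenames: false
        always_run: true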
4. Python Scanner Script
#!/usr/bin/env python3
"""scan_repo.py: scan an entire directory for hardcoded secrets."""
import sys
from pathlib import Path
from transformers import pipeline

SKIP_SUFFIXES = (".lock", ".min.js", ".map", ".png", ".jpg", ".pdf")
SKIP_DIRS = {"node_modules", ".git", "vendor", "__pycache__", "dist", "build"}
THRESHOLD = 0.85

detector = pipeline("text-classification", model="hypn05/secrets-sentinel")
root = Path(sys.argv[1] if len(sys.argv) > 1 else ".")
findings = []

for path in sorted(root.rglob("*")):
    if not path.is_file():
        continue
    if any(d in path.parts for d in SKIP_DIRS):
        continue
    # Match on the full file name: path.suffix is ".js" for "app.min.js",
    # so a suffix-only check would never skip minified files.
    if path.name.lower().endswith(SKIP_SUFFIXES):
        continue
    try:
        text = path.read_text(errors="ignore")
    except Exception:
        continue
    # Keep each line's original 1-based number so reports stay accurate
    # after short and overlong lines are filtered out.
    numbered = [(n, l.rstrip()) for n, l in enumerate(text.splitlines(), 1)
                if 4 <= len(l.strip()) <= 500]
    if not numbered:
        continue
    results = detector([l for _, l in numbered], batch_size=64,
                       truncation=True, max_length=128)
    for (ln, line), result in zip(numbered, results):
        if result["label"] == "LABEL_1" and result["score"] >= THRESHOLD:
            findings.append((path, ln, line, result["score"]))

if findings:
    print(f"\nFound {len(findings)} potential secret(s):\n")
    for path, ln, line, conf in sorted(findings, key=lambda x: -x[3]):
        print(f" [{conf:.0%}] {path}:{ln}")
        print(f" {line[:100]}")
else:
    print("No secrets detected.")
5. CPU-Optimised ONNX Version
For CPU-only environments (CI agents, lightweight containers):
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer
import torch

# ~244 MB vs 750 MB, ~4× faster on CPU
model_id = "hypn05/secrets-sentinel-cpu"
tok = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForSequenceClassification.from_pretrained(model_id)

lines = ["password = 'hunter2'", "name = os.environ.get('USER')"]
enc = tok(lines, padding=True, truncation=True, max_length=128, return_tensors="pt")
probs = torch.softmax(model(**enc).logits, dim=1)[:, 1]

for line, prob in zip(lines, probs):
    print(f"[{'SECRET' if prob >= 0.85 else 'safe '}] {prob:.1%} {line}")
What It Detects
The model understands context, not just patterns. It catches generic secrets that regex tools miss:
True Positives (correctly flagged)
# Hardcoded passwords
password = "s3cr3tP@ssw0rd#2024"
'password' => 'Edward#2025'
MYSQL_ROOT_PASSWORD: hunter2
# API keys & tokens
OPENAI_API_KEY = "sk-proj-abc123..."
ghp_aBcDeFgHiJkLmNoPqRsTuVwXyZ123456 # GitHub PAT
STRIPE_SECRET_KEY = "sk_live_abc123..."
AWS_SECRET_ACCESS_KEY = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
# Database connection strings
DATABASE_URL = "postgresql://admin:secret@db.prod.internal/myapp"
MONGODB_URI = "mongodb://root:p@ssword@cluster.mongodb.net/prod"
# Private keys
private_key = "-----BEGIN RSA PRIVATE KEY-----"
ssh_key = "-----BEGIN OPENSSH PRIVATE KEY-----"
# Webhook secrets
WEBHOOK_SECRET = "whsec_abc123xyz789"
SLACK_SIGNING_SECRET = "abc123def456ghi789"
True Negatives (correctly ignored)
# Environment variable references (not hardcoded)
password = os.environ.get("DB_PASSWORD")
api_key = process.env.API_KEY
secret = System.getenv("JWT_SECRET")
# Template / example files
DB_PASSWORD=null
REDIS_PASSWORD=YOURPASSWORD
MAIL_PASSWORD=changeme
# Code comments and logging
# TODO: add password validation
logger.info("Checking password strength")
# GitHub Actions (pinned SHAs are NOT secrets)
uses: docker/login-action@5e57cd118135c172c3672efd75eb46360885c0ef
# Test fixtures with obvious placeholders
expect(user.email).toBe("test@example.com")
DB_HOST=localhost
Performance Benchmarks
| Scenario | Device | Latency | Throughput |
|---|---|---|---|
| Single line | CPU | ~500ms | — |
| Single line (cached) | CPU | ~50ms | — |
| Batch 64 lines | CPU | ~800ms | 80 lines/s |
| Single line | GPU (A100) | ~8ms | — |
| Batch 256 lines | GPU (A100) | ~120ms | 2,100 lines/s |
| Batch 256 lines | GPU (T4) | ~350ms | 730 lines/s |
| ONNX INT8, batch 64 | CPU | ~200ms | 320 lines/s |
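For pre-receive hooks: at the batched CPU throughput above (~80 lines/s), a typical 200-line diff scans in about 2.5 seconds; the ONNX INT8 build (~320 lines/s) brings that to under 1 second. Both fit a 5-second budget.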
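The throughput column can be turned into a rough capacity estimate for a time-limited hook. The sketch below is back-of-the-envelope arithmetic only: the figures are copied from the table, the dictionary keys and function names are illustrative, and model load time is ignored.

```python
# Throughput figures (lines/s) copied from the benchmark table above.
THROUGHPUT = {
    "cpu_batch64": 80,
    "onnx_int8_cpu": 320,
    "gpu_t4": 730,
    "gpu_a100": 2100,
}

def scan_seconds(n_lines: int, lines_per_second: float) -> float:
    """Estimated wall time for scoring n_lines, ignoring model load time."""
    return n_lines / lines_per_second

def max_lines_within(budget_seconds: float, lines_per_second: float) -> int:
    """Largest diff (in lines) that fits the budget at the given throughput."""
    return int(budget_seconds * lines_per_second)

# Time for a 200-line diff, and the largest diff that fits in 5 seconds.
estimates = {name: scan_seconds(200, tps) for name, tps in THROUGHPUT.items()}
limits = {name: max_lines_within(5.0, tps) for name, tps in THROUGHPUT.items()}

for name in THROUGHPUT:
    print(f"{name:>14}: 200 lines in {estimates[name]:.2f}s, 5s budget fits {limits[name]} lines")
```

In a real hook the model must already be loaded (for example in a long-running service) for these numbers to apply, since loading the weights can take longer than the scan itself.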
Limitations
- Single-line context only — analyzes one line at a time (128-token max). Multi-line PEM blocks need a dedicated regex check alongside this model.
- Context window — very long lines are truncated at 128 tokens.
- Custom internal formats — proprietary secret formats not seen during training may be missed. Fine-tune on your own labeled examples for best results.
- Redacted patterns: REDACTED, ***, and <secret> placeholders are intentionally ignored.
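For the multi-line PEM case in the first limitation, a small regex pass over the whole file can run alongside the model. The following is a minimal sketch, not part of the released tooling: the pattern, the function name `find_pem_blocks`, and the sample text are all illustrative assumptions.

```python
import re

# Matches a complete PEM private-key block spanning multiple lines.
# The END marker must name the same key kind as the BEGIN marker.
PEM_BLOCK = re.compile(
    r"-----BEGIN (?P<kind>[A-Z0-9 ]*PRIVATE KEY)-----"
    r".*?"
    r"-----END (?P=kind)-----",
    re.DOTALL,
)

def find_pem_blocks(text: str) -> list:
    """Return (start_line, key_kind) for each PEM private-key block in text."""
    hits = []
    for m in PEM_BLOCK.finditer(text):
        start_line = text.count("\n", 0, m.start()) + 1  # 1-based line number
        hits.append((start_line, m.group("kind")))
    return hits

sample = (
    "name: deploy\n"
    "key: |\n"
    "  -----BEGIN RSA PRIVATE KEY-----\n"
    "  MIIB...placeholder, not a real key...\n"
    "  -----END RSA PRIVATE KEY-----\n"
)

pem_hits = find_pem_blocks(sample)
print(pem_hits)
```

Because the check works on full file contents rather than single lines, it complements the per-line classifier instead of replacing it.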
Related Resources
- CPU/ONNX version: hypn05/secrets-sentinel-cpu (INT8-quantized, ~244 MB, about 3× smaller than the ~750 MB safetensors model)
- Base model: microsoft/deberta-v3-base
- Similar tools: gitleaks, trufflehog, detect-secrets
FAQ
Q: How does it compare to regex-based tools like gitleaks or trufflehog?
A: Those tools excel at known-format secrets (AWS AKIA keys, GitHub ghp_ tokens). This model catches generic secrets — any password = "..." or token = "..." regardless of format — at the cost of slightly higher false positive rates on edge cases. Best results come from using both: regex tools for known formats, this model for everything else.
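One way to combine the two approaches is to run cheap format rules first and send only the remaining lines to the model. The sketch below illustrates that flow under stated assumptions: `hybrid_scan` and `fake_score` are hypothetical names, the two patterns cover only the AWS access key ID and classic GitHub PAT formats, and the scorer is a stand-in that must be replaced with real model probabilities.

```python
import re
from typing import Callable, List, Tuple

# Two widely documented token formats; real rule sets are far larger.
KNOWN_FORMATS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
}

def hybrid_scan(
    lines: List[str],
    model_score: Callable[[List[str]], List[float]],
    threshold: float = 0.85,
) -> List[Tuple[str, str]]:
    """Return (line, reason) pairs: regex hits first, then model hits for the rest."""
    findings, remaining = [], []
    for line in lines:
        rule = next((n for n, rx in KNOWN_FORMATS.items() if rx.search(line)), None)
        if rule:
            findings.append((line, f"regex:{rule}"))
        else:
            remaining.append(line)
    if remaining:  # skip the model call when regex already covered everything
        for line, p in zip(remaining, model_score(remaining)):
            if p >= threshold:
                findings.append((line, f"model:{p:.2f}"))
    return findings

# Stand-in scorer for illustration only; replace with real model probabilities.
def fake_score(batch: List[str]) -> List[float]:
    return [0.97 if '"hunter2"' in s else 0.01 for s in batch]

demo = hybrid_scan(
    ["key = AKIAIOSFODNN7EXAMPLE", 'password = "hunter2"', "DB_HOST=localhost"],
    fake_score,
)
print(demo)
```

Passing the scorer in as a callable keeps the regex stage testable without downloading the model, and lets the same function wrap either the PyTorch or the ONNX build.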
Q: Can I fine-tune it on my organisation's patterns?
A: Yes — the base model is MIT-licensed. Label a few hundred examples of your internal secret formats and fine-tune. The architecture (DeBERTa-v3) is well-suited to incremental adaptation.
Q: Does it send my code anywhere?
A: No. Run the model locally or on your own infrastructure. Inference is fully on-premise — only model weights are downloaded from HuggingFace once.
Q: What confidence threshold should I use?
A: 0.85 for pre-receive hooks (low FP, may miss some edge cases). 0.60 for retrospective scans where you want higher recall and can accept some FP for manual review.
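The two thresholds can also be applied together as a simple triage: block at the strict threshold and queue anything between the two for manual review. This is a sketch of one possible policy rather than a recommendation from the model card; the function name `triage` and the score values are illustrative.

```python
BLOCK_AT = 0.85   # pre-receive / CI gate
REVIEW_AT = 0.60  # retrospective scans with manual review

def triage(p_secret: float) -> str:
    """Bucket a secret probability into block / review / pass."""
    if p_secret >= BLOCK_AT:
        return "block"
    if p_secret >= REVIEW_AT:
        return "review"
    return "pass"

scores = [0.99, 0.85, 0.72, 0.60, 0.10]  # illustrative values, not model output
buckets = [triage(p) for p in scores]
print(buckets)
```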
Q: How often is the model updated?
A: Continuously — each version is trained on newly identified FP and FN patterns from real-world scanning. v5 specifically addresses CloudFormation Default: password lines and Redis --requirepass docker-compose commands, while v4 addressed GitHub Actions workflow files, PHP config templates, and Laravel test fixtures.
Citation
@model{secrets_sentinel_2026,
title = {Secrets Sentinel: Context-Aware Secret Detection for CI/CD Pipelines},
author = {hypn05},
year = {2026},
url = {https://huggingface.co/hypn05/secrets-sentinel},
note = {Fine-tuned DeBERTa-v3-base for binary secret/safe line classification}
}
License
MIT — free to use, modify, and deploy in commercial systems. Attribution appreciated.
Built to stop the #1 cause of cloud credential theft: secrets accidentally committed to git.