# CoPE Streaming Probe
A lightweight linear probe for per-token streaming classification using CoPE-A-9B's hidden states. Enables real-time content scoring as text is generated, rather than waiting for the model's final verdict.
## Overview
CoPE (Content Policy Evaluator) classifies content against arbitrary policies by generating a binary answer token at the end of a structured prompt. This is accurate but inherently post-hoc: by the time the answer is produced, the full content has already been processed and potentially shown to the user.
This probe operates on CoPE's intermediate hidden states at each token position, producing a streaming probability score that indicates whether a policy violation is developing. It serves as an early warning system that can flag violations before the full content is generated.
## Architecture
- Type: Logistic regression (linear probe)
- Input: Hidden state at the final layer norm (`model.model.norm`), 3,584 dimensions
- Preprocessing: StandardScaler (zero mean, unit variance)
- Output: Sigmoid probability in [0, 1]
- Artifacts: 4 NumPy arrays totaling ~100 KB
## Quick Start
```python
import numpy as np
from huggingface_hub import hf_hub_download

repo = "zentropi-ai/cope-streaming-probe"
coef = np.load(hf_hub_download(repo, "coef.npy"))
intercept = np.load(hf_hub_download(repo, "intercept.npy"))
scaler_mean = np.load(hf_hub_download(repo, "scaler_mean.npy"))
scaler_scale = np.load(hf_hub_download(repo, "scaler_scale.npy"))

def probe_score(hidden_state: np.ndarray) -> float:
    """Score a single hidden state vector."""
    x = (hidden_state - scaler_mean) / scaler_scale
    logit = float(np.dot(coef, x) + intercept[0])
    return 1.0 / (1.0 + np.exp(-logit))
```
Hidden states are extracted via a forward hook on `model.model.norm` during CoPE's forward pass.
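The hook mechanics can be sketched in isolation. The snippet below uses a small `nn.LayerNorm` as a stand-in for CoPE's final layer norm (whose output has 3,584 dimensions in the real model); the module name and shapes here are illustrative, not the production setup:

```python
import torch
from torch import nn

# Stand-in for the module reached as `model.model.norm` on the loaded model.
norm = nn.LayerNorm(8)

captured = []

def capture_hidden(module, inputs, output):
    # output has shape (batch, seq_len, hidden_dim); keep a detached CPU copy
    captured.append(output.detach().float().cpu())

handle = norm.register_forward_hook(capture_hidden)
norm(torch.randn(1, 4, 8))      # stands in for CoPE's forward pass
handle.remove()

hidden_states = captured[0][0]  # (seq_len, hidden_dim): one vector per token
```

Each row of `hidden_states` can then be fed to `probe_score` to get a per-token probability.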
See the tutorial notebook for a complete working example.
## Streaming Usage
Raw per-token scores are noisy: a token might score 0.95 followed by one scoring 0.01. We recommend an exponential moving average (EMA) with a decay factor of ~0.3 to produce a usable streaming signal:
```python
ema = 0.0
decay = 0.3

for hidden_state in stream:
    score = probe_score(hidden_state)
    ema = decay * score + (1 - decay) * ema
    if ema > threshold:
        # Flag potential violation
        ...
```
The EMA responds quickly to bursts of high-scoring tokens and decays naturally when the probe stops firing. Thresholds should be calibrated per-policy on held-out data.
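One way to calibrate a per-policy threshold is to cap the false positive rate on held-out negatives. The helper names below are hypothetical, and a peak-EMA quantile is just one reasonable choice of statistic:

```python
import numpy as np

def ema_trace(scores, decay=0.3):
    """Exponential moving average over a sequence of per-token probe scores."""
    ema, out = 0.0, []
    for s in scores:
        ema = decay * s + (1 - decay) * ema
        out.append(ema)
    return out

def calibrate_threshold(negative_traces, fpr=0.05):
    """Smallest threshold whose peak-EMA false positive rate is <= fpr
    on held-out negative samples for a given policy."""
    peaks = np.array([max(ema_trace(t)) for t in negative_traces])
    return float(np.quantile(peaks, 1.0 - fpr))
```

Repeating this per policy yields a threshold table rather than a single global cutoff.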
## Training Methodology
### Span Labeling
Rather than labeling every token with the sample's final label (which causes the probe to fire from token 1 regardless of content), we use span labeling: for each positive example, we annotate where in the content the violation begins. Tokens before the onset are labeled 0; tokens from the onset onward are labeled 1. Negative examples have all tokens labeled 0.
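The labeling rule is simple to state in code. This is an illustrative helper (not the actual training pipeline), taking the annotated onset as a token index:

```python
from typing import List, Optional

def token_labels(num_tokens: int, onset: Optional[int]) -> List[int]:
    """Per-token labels under span labeling.

    For a positive example, `onset` is the token index where the violation
    begins; for a negative example it is None and every token is labeled 0.
    """
    if onset is None:
        return [0] * num_tokens
    return [0] * onset + [1] * (num_tokens - onset)
```

For example, a 6-token positive sample with the violation starting at token 4 would yield `[0, 0, 0, 0, 1, 1]`.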
This teaches the probe to activate only when it sees violating content, not merely because the policy is strict. Onset positions were annotated across ~10,000 positive examples with ~98.5% word-level matching accuracy.
### Contrastive Training Data
To prevent the probe from learning topic shortcuts (e.g., "any mention of self-harm is a violation"), we constructed a contrastive training set with two guarantees:
- Every piece of content appears with both a positive and negative label (under different policies)
- Every policy appears with both positive and negative examples
This forces the probe to learn policy-conditioned features. The final training set comprised 9,845 samples producing 223,843 token vectors.
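The two guarantees are mechanically checkable. A sketch of such a validation pass, assuming the training set is represented as `(content_id, policy_id, label)` rows (a hypothetical schema, not the actual pipeline's):

```python
from collections import defaultdict

def check_contrastive_guarantees(rows):
    """rows: iterable of (content_id, policy_id, label) with label in {0, 1}.

    Returns (ok_content, ok_policy): whether every content id and every
    policy id appears with both a positive and a negative label.
    """
    content_labels = defaultdict(set)
    policy_labels = defaultdict(set)
    for content_id, policy_id, label in rows:
        content_labels[content_id].add(label)
        policy_labels[policy_id].add(label)
    ok_content = all(labels == {0, 1} for labels in content_labels.values())
    ok_policy = all(labels == {0, 1} for labels in policy_labels.values())
    return ok_content, ok_policy
```

A set that fails either check leaves a shortcut open: a content item seen only as positive lets the probe key on topic, and a policy seen only one way lets it key on policy strictness.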
## Results
| Metric | Value |
|---|---|
| Answer token F1 | 0.859 |
| POS answer mean score | 0.893 |
| NEG answer mean score | 0.184 |
At the ANSWER token position, the probe essentially matches CoPE's own F1 on an internal benchmark dataset. The mean streaming token scores show strong separation between POS and NEG content samples, supporting the probe's use as an early warning system.
For more details on the training methodology and results, see our blog post.
## Requirements
- CoPE-A-9B (Gemma-2-9B + CoPE LoRA)
- google/gemma-2-9b as the base model
- `numpy` for probe inference
- `transformers`, `peft`, `torch` for hidden state extraction
## Limitations
- A streaming classifier is inherently less confident than a post-hoc one. It doesn't know what's coming next, and content can go from seemingly violating back to benign (e.g., quoted rather than direct speech). Use the probe as an early warning system, not a final verdict.
- Short content (under ~10 tokens): the probe may be underconfident because there isn't enough sequential context to build signal. Consider lowering the threshold for short inputs.
- Streaming thresholds require policy-specific calibration. A threshold of 0.5 may be appropriate for one policy but too aggressive or conservative for another.
## Citation
```bibtex
@misc{cope-streaming-probe-2026,
  title={CoPE Streaming Probe: Early Warning for Generative AI Content Classification},
  author={Zentropi},
  year={2026},
  url={https://huggingface.co/zentropi-ai/cope-a-9b-stream-probe}
}
```