
CoPE Streaming Probe

A lightweight linear probe for per-token streaming classification using CoPE-A-9B's hidden states. Enables real-time content scoring as text is generated, rather than waiting for the model's final verdict.

Overview

CoPE (Content Policy Evaluator) classifies content against arbitrary policies by generating a binary answer token at the end of a structured prompt. This is accurate but inherently post-hoc: by the time the answer is produced, the full content has already been processed and potentially shown to the user.

This probe operates on CoPE's intermediate hidden states at each token position, producing a streaming probability score that indicates whether a policy violation is developing. It serves as an early warning system that can flag violations before the full content is generated.

Architecture

  • Type: Logistic regression (linear probe)
  • Input: Hidden state at the final layer norm (model.model.norm), 3,584 dimensions
  • Preprocessing: StandardScaler (zero mean, unit variance)
  • Output: Sigmoid probability ∈ [0, 1]
  • Artifacts: 4 numpy arrays totaling ~100 KB

Quick Start

import numpy as np
from huggingface_hub import hf_hub_download

repo = "zentropi-ai/cope-streaming-probe"
coef         = np.load(hf_hub_download(repo, "coef.npy"))
intercept    = np.load(hf_hub_download(repo, "intercept.npy"))
scaler_mean  = np.load(hf_hub_download(repo, "scaler_mean.npy"))
scaler_scale = np.load(hf_hub_download(repo, "scaler_scale.npy"))

def probe_score(hidden_state: np.ndarray) -> float:
    """Score a single hidden state vector (shape: (3584,))."""
    x = (hidden_state - scaler_mean) / scaler_scale
    # Flatten coef so the dot product yields a scalar whether the weights
    # were saved as shape (3584,) or (1, 3584).
    logit = float(np.dot(coef.ravel(), x) + intercept[0])
    return 1.0 / (1.0 + np.exp(-logit))

Hidden states are extracted via a forward hook on model.model.norm during CoPE's forward pass.
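The hook pattern can be sketched with a toy module standing in for the real model (loading CoPE-A-9B itself is out of scope here; `ToyModel` and its sizes are illustrative assumptions, and in practice you would register the hook on `model.model.norm`):

```python
import torch
import torch.nn as nn

# Toy stand-in for the real model. With CoPE-A-9B you would instead hook
# `model.model.norm`, whose output has hidden size 3584.
class ToyModel(nn.Module):
    def __init__(self, hidden: int = 8):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, x):
        return self.norm(self.proj(x))

captured = []

def hook(module, inputs, output):
    # `output` has shape (batch, seq_len, hidden); detach so we keep the
    # tensor without holding onto the autograd graph.
    captured.append(output.detach())

toy = ToyModel()
handle = toy.norm.register_forward_hook(hook)

with torch.no_grad():
    toy(torch.randn(1, 5, 8))  # one sequence of 5 tokens

handle.remove()
per_token_states = captured[0][0]  # (seq_len, hidden) for sequence 0
```

Each row of `per_token_states` is one per-token hidden state, ready to pass to `probe_score`.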

See the tutorial notebook for a complete working example.

Streaming Usage

Raw per-token scores are noisy: a token might score 0.95 followed by one scoring 0.01. We recommend an exponential moving average (EMA) with a decay factor of ~0.3 to produce a usable streaming signal:

ema = 0.0
decay = 0.3
for hidden_state in stream:
    score = probe_score(hidden_state)
    ema = decay * score + (1 - decay) * ema
    if ema > threshold:
        # Flag potential violation
        ...

The EMA responds quickly to bursts of high-scoring tokens and decays naturally when the probe stops firing. Thresholds should be calibrated per-policy on held-out data.
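This behavior can be seen on a small synthetic trace (the score values below are made up for illustration, not probe outputs):

```python
# Hypothetical per-token scores: quiet, then a burst, then quiet again.
scores = [0.05, 0.02, 0.95, 0.90, 0.88, 0.10, 0.05, 0.03]

decay = 0.3
ema, trace = 0.0, []
for s in scores:
    ema = decay * s + (1 - decay) * ema
    trace.append(ema)

peak = max(trace)   # rises quickly during the burst of high scores
tail = trace[-1]    # decays back toward zero once the burst ends
```

A single high-scoring token barely moves the EMA, but three in a row push it well past a 0.5 threshold, after which it decays naturally.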

Training Methodology

Span Labeling

Rather than labeling every token with the sample's final label (which causes the probe to fire from token 1 regardless of content), we use span labeling: for each positive example, we annotate where in the content the violation begins. Tokens before the onset are labeled 0; tokens from the onset onward are labeled 1. Negative examples have all tokens labeled 0.

This teaches the probe to activate only when it sees violating content, not merely because the policy is strict. Onset positions were annotated across ~10,000 positive examples with ~98.5% word-level matching accuracy.

Contrastive Training Data

To prevent the probe from learning topic shortcuts (e.g., "any mention of self-harm is a violation"), we constructed a contrastive training set with two guarantees:

  1. Every piece of content appears with both a positive and negative label (under different policies)
  2. Every policy appears with both positive and negative examples

This forces the probe to learn policy-conditioned features. The final training set comprised 9,845 samples producing 223,843 token vectors.
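The two guarantees are easy to verify mechanically. A sketch, assuming samples are represented as hypothetical `(content_id, policy_id, label)` triples:

```python
from collections import defaultdict

def check_contrastive(samples) -> bool:
    """True iff every content id and every policy id appears with both a
    positive (1) and a negative (0) label somewhere in the set."""
    content_labels = defaultdict(set)
    policy_labels = defaultdict(set)
    for content, policy, label in samples:
        content_labels[content].add(label)
        policy_labels[policy].add(label)
    return (all(v == {0, 1} for v in content_labels.values())
            and all(v == {0, 1} for v in policy_labels.values()))
```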

Results

  Metric                  Value
  Answer token F1         0.859
  POS answer mean score   0.893
  NEG answer mean score   0.184

At the ANSWER token position, the probe essentially matches CoPE's own F1 across an internal benchmark dataset. The streaming mean token scores show excellent separation between POS and NEG content samples, confirming the utility of this probe as an early warning system.

For more details on the training methodology and full results, see our blog post.

Requirements

  • CoPE-A-9B (Gemma-2-9B + CoPE LoRA)
  • google/gemma-2-9b as the base model
  • numpy for probe inference
  • transformers, peft, torch for hidden state extraction

Limitations

  • A streaming classifier is inherently less confident than a post-hoc one. It doesn't know what's coming next, and content can go from seemingly violating back to benign (e.g., quoted rather than direct speech). Use the probe as an early warning system, not a final verdict.
  • Short content (under ~10 tokens): the probe may be underconfident because there isn't enough sequential context to build signal. Consider lowering the threshold for short inputs.
  • Streaming thresholds require policy-specific calibration. A threshold of 0.5 may be appropriate for one policy but too aggressive or conservative for another.
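One simple way to calibrate per policy is to pick the threshold from held-out benign (NEG) scores at a target false-positive rate. This is a hypothetical sketch, not the calibration procedure used in training:

```python
def calibrate_threshold(neg_scores, target_fpr: float = 0.05) -> float:
    """Choose a threshold so that at most `target_fpr` of held-out benign
    EMA scores would exceed it, i.e. the (1 - target_fpr) quantile."""
    neg = sorted(float(s) for s in neg_scores)
    # Index of the (1 - target_fpr) quantile (nearest-rank method).
    idx = min(len(neg) - 1, int((1.0 - target_fpr) * (len(neg) - 1)))
    return neg[idx]
```

The true-positive rate at the chosen threshold should then be checked on held-out violating samples for the same policy.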

Citation

@misc{cope-streaming-probe-2026,
  title={CoPE Streaming Probe: Early Warning for Generative AI Content Classification},
  author={Zentropi},
  year={2026},
  url={https://huggingface.co/zentropi-ai/cope-a-9b-stream-probe}
}