Gnosis — Qwen3-4B-Instruct-2507 (Self-Awareness Correctness Head)

Gnosis is a lightweight self-awareness mechanism introduced in the paper "Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits". It consists of a specialized head attached to a frozen LLM backbone (here, Qwen3-4B-Instruct-2507) that predicts a scalar correctness probability for a generated response.

The head reads the backbone's internal signals, namely hidden-state features (latent dynamics) and attention-map patterns, and decodes correctness cues directly from the generation process.

Paper: Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits
Repository: Amirhosein-gh98/Gnosis (GitHub)

Why it matters

Strong verifier signal without a large external reward model (no RM routing / no judge LLM calls).
~1000× smaller than 8B reward-model verifiers (5M params vs ~8B).
~100× faster than routing through an ~8B reward model.
Early error detection: can flag likely errors before generation finishes.
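As a rough illustration of the early error detection point, the sketch below scores growing prefixes of a generation and reports the first checkpoint at which the score falls below a threshold. Everything here is an assumption made for illustration: `first_flagged_prefix`, `toy_scorer`, the threshold, and the check interval are hypothetical names and values, and the stand-in scorer is not Gnosis. In practice the scorer would wrap a call to the Gnosis head on the partial response.

```python
from typing import Callable, Optional, Sequence


def first_flagged_prefix(
    tokens: Sequence[str],
    score_prefix: Callable[[Sequence[str]], float],
    threshold: float = 0.2,
    check_every: int = 4,
) -> Optional[int]:
    """Return the prefix length at which the score first drops below
    `threshold`, or None if no checked prefix is flagged.

    Prefixes are checked every `check_every` tokens; a trailing remainder
    shorter than `check_every` is not checked.
    """
    for end in range(check_every, len(tokens) + 1, check_every):
        if score_prefix(tokens[:end]) < threshold:
            return end
    return None


def toy_scorer(prefix: Sequence[str]) -> float:
    # Stand-in for a real correctness scorer (assumption, not Gnosis):
    # confidence collapses once a marker token appears in the prefix.
    return 0.1 if "oops" in prefix else 0.9


tokens = ["a", "b", "c", "d", "oops", "e", "f", "g", "h"]
flagged_at = first_flagged_prefix(tokens, toy_scorer)
print(flagged_at)
```

A caller could stop or restart generation once a prefix is flagged instead of decoding to the end. A suitable threshold and check interval depend on the task and would need to be tuned.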

Evaluated backbones & benchmarks (from the paper)

Backbones: Qwen3 family + OpenAI gpt-oss-20B.
Benchmarks: Math-Reasoning (AMC12 2022/2023, AIME 2024/2025, HMMT Feb 2025), Open-Domain QA (18k held-out TriviaQA), Academic Knowledge Reasoning (MMLU-Pro).

Training data

Mixed math + trivia training corpus:

Math: English portion of DAPO-Math-17k (~14k).
Trivia: 40k subsample from TriviaQA training set.

Usage (inference)

This repo requires the local Transformers fork with Gnosis integrated into the model architecture (see the GitHub repo instructions). After installing it, run:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from src.demo import build_chat_prompt, generate_with_hf, correctness_prob

GNOSIS_MODEL_ID = "AmirhoseinGH/Gnosis-Qwen3-4B-Instruct-2507"

tokenizer = AutoTokenizer.from_pretrained(GNOSIS_MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    GNOSIS_MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda().eval()

prompt = build_chat_prompt(
    tokenizer,
    question="How many r's are in strawberry?",
    system_prompt="Please reason step by step, and put your final answer within \\boxed{}.",
)

answer = generate_with_hf(model, tokenizer, prompt, torch.device("cuda"), max_new_tokens=2048)
p_correct = correctness_prob(model, tokenizer, prompt + answer, torch.device("cuda"))

print("Answer:\n", answer)
print("Gnosis correctness probability:", f"{p_correct:.4f}")
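One way the resulting probability might be used is to rerank several sampled answers and abstain when none looks reliable. The sketch below is a minimal illustration under stated assumptions: `pick_best`, `min_prob`, and the `fake_scores` table are hypothetical and not part of the Gnosis repository. In a real pipeline the `score` callable would wrap `correctness_prob(model, tokenizer, prompt + answer, device)` from the snippet above.

```python
from typing import Callable, List, Optional, Tuple


def pick_best(
    candidates: List[str],
    score: Callable[[str], float],
    min_prob: float = 0.5,
) -> Tuple[Optional[str], float]:
    """Return (best_candidate, its_score).

    If even the best candidate scores below `min_prob`, return
    (None, best_score) so the caller can abstain or resample.
    """
    if not candidates:
        raise ValueError("candidates must be non-empty")
    scored = [(score(c), c) for c in candidates]
    best_prob, best = max(scored, key=lambda pair: pair[0])
    if best_prob < min_prob:
        return None, best_prob
    return best, best_prob


# Made-up scores standing in for correctness_prob(...) outputs (assumption).
fake_scores = {"3": 0.91, "2": 0.12, "4": 0.30}
best, prob = pick_best(list(fake_scores), lambda c: fake_scores[c])
print(best, prob)
```

The 0.5 cutoff is arbitrary. A threshold calibrated on held-out data for the target task would be a more defensible choice.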

Citation

@misc{ghasemabadi2025llmspredictfailures,
      title={Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits}, 
      author={Amirhosein Ghasemabadi and Di Niu},
      year={2025},
      eprint={2512.20578},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.20578}
}

Model size: 4B params (Safetensors)
Tensor type: BF16
