Gnosis — Qwen3-4B-Thinking-2507 (Self-Awareness Correctness Head)

Gnosis is a lightweight self-awareness head that attaches to a frozen LLM and predicts a scalar correctness probability for a generated response. It reads the backbone’s internal signals—hidden-state features (latent dynamics) and attention-map patterns—to learn reliable hallucination / error cues directly from the model.
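To make the idea concrete, here is a minimal sketch of what a correctness head of this kind can look like: it pools hidden-state features and attention statistics into a joint representation and maps it to a scalar probability. The layer choices, pooling, and sizes below are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class CorrectnessHead(nn.Module):
    """Illustrative sketch only: pools backbone signals into a scalar
    correctness probability. Layer choices and feature sizes are
    assumptions, not the paper's exact architecture."""

    def __init__(self, hidden_size: int, num_heads: int, feat_dim: int = 128):
        super().__init__()
        self.state_proj = nn.Linear(hidden_size, feat_dim)  # hidden-state branch
        self.attn_proj = nn.Linear(num_heads, feat_dim)     # attention-stats branch
        self.scorer = nn.Sequential(nn.GELU(), nn.Linear(feat_dim, 1))

    def forward(self, hidden: torch.Tensor, attn: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, hidden_size); attn: (batch, heads, seq, seq)
        h = self.state_proj(hidden.mean(dim=1))             # pool over tokens
        # Per-head attention entropy, averaged over query positions.
        ent = -(attn * (attn + 1e-9).log()).sum(-1).mean(-1)
        a = self.attn_proj(ent)
        return torch.sigmoid(self.scorer(h + a)).squeeze(-1)  # p(correct)

# Smoke test with random tensors; 2560 / 32 are Qwen3-4B-like guesses.
head = CorrectnessHead(hidden_size=2560, num_heads=32)
hidden = torch.randn(1, 16, 2560)
attn = torch.softmax(torch.randn(1, 32, 16, 16), dim=-1)
print(head(hidden, attn))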

Paper: Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits
Project code & instructions: https://github.com/Amirhosein-gh98/Gnosis

Why it matters

  • Strong verifier signal without a large external reward model (no RM routing / no judge LLM calls).
  • ~1000× smaller than 8B reward-model verifiers (**5M params vs ~8B**).
  • ~100× faster than routing through an ~8B reward model.
  • Early error detection: can flag likely errors before generation finishes (see the sketch after this list).
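The last bullet follows from the head scoring internal signals at any prefix of the generation, not just the final answer. Below is a minimal sketch of streaming early-error detection, assuming `model`, `tokenizer`, and `prompt` are set up as in the Usage section; the chunk size and the 0.2 threshold are illustrative assumptions.

import torch
from src.demo import generate_with_hf, correctness_prob

device = torch.device("cuda")
partial, threshold = "", 0.2  # threshold is an illustrative assumption
for _ in range(8):  # up to 8 chunks of 256 new tokens each
    partial += generate_with_hf(model, tokenizer, prompt + partial,
                                device, max_new_tokens=256)
    p = correctness_prob(model, tokenizer, prompt + partial, device)
    if p < threshold:  # trajectory already looks wrong: stop or resample
        print(f"flagging likely error early, p = {p:.3f}")
        break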

Evaluated backbones & benchmarks (from the paper)

  • Backbones: Qwen3 family + OpenAI gpt-oss-20B.
  • Benchmarks: math reasoning (AMC12 2022/2023, AIME 2024/2025, HMMT Feb 2025), open-domain QA (an 18k-question held-out TriviaQA split), and academic knowledge reasoning (MMLU-Pro).
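For a scalar verifier like this, separation between correct and incorrect responses is commonly summarized with AUROC over held-out (score, label) pairs. A minimal sketch of that computation follows; the metric choice here is an assumption, not necessarily the paper's exact protocol.

from sklearn.metrics import roc_auc_score

scores = [0.91, 0.12, 0.78, 0.33]  # Gnosis p_correct per response (toy values)
labels = [1, 0, 1, 0]              # ground-truth correctness of each response
print("AUROC:", roc_auc_score(labels, scores))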

Training data

Gnosis is trained on a mixed math + trivia corpus; dataset links and preparation details are in the GitHub repo.
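A correctness head needs (prompt, response, label) supervision: responses are sampled from the backbone and labeled against reference answers. The record format and binary cross-entropy objective below are illustrative assumptions, not the paper's exact training recipe.

import torch
import torch.nn.functional as F

# One hypothetical supervision record: field names are assumptions.
example = {
    "prompt": "What is 7 * 8?",
    "response": "7 * 8 = 56, so the answer is \\boxed{56}.",
    "correct": 1.0,  # 1.0 if the response matches the reference answer
}

p_correct = torch.tensor(0.85)  # the head's prediction for this example
loss = F.binary_cross_entropy(p_correct, torch.tensor(example["correct"]))
print(f"BCE loss: {loss.item():.4f}")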

Usage (inference)

This model requires the local Transformers fork with Gnosis integrated (see the GitHub repo for installation instructions). After installing it, run:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from src.demo import build_chat_prompt, generate_with_hf, correctness_prob

GNOSIS_MODEL_ID = "AmirhoseinGH/Gnosis-Qwen3-4B-Thinking-2507"

tokenizer = AutoTokenizer.from_pretrained(GNOSIS_MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    GNOSIS_MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda().eval()

prompt = build_chat_prompt(
    tokenizer,
    question="How many r's are in strawberry?",
    system_prompt="Please reason step by step, and put your final answer within \\boxed{}.",
)

answer = generate_with_hf(model, tokenizer, prompt, torch.device("cuda"), max_new_tokens=2048)
p_correct = correctness_prob(model, tokenizer, prompt + answer, torch.device("cuda"))

print("Answer:
", answer)
print("Gnosis correctness probability:", f"{p_correct:.4f}")

Citation

@misc{ghasemabadi2025llmspredictfailures,
      title={Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits},
      author={Amirhosein Ghasemabadi and Di Niu},
      year={2025},
      eprint={2512.20578},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.20578},
}