Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits
Gnosis is a lightweight self-awareness mechanism introduced in the paper "Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits". It consists of a specialized head attached to a frozen LLM backbone (for this checkpoint, Qwen3-4B-Instruct-2507) that predicts a scalar correctness probability for a generated response.
The Gnosis head reads the backbone's internal signals, namely hidden-state features (latent dynamics) and attention-map patterns, to decode reliable correctness cues directly from the generation process.
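To make the idea of "internal signals in, one probability out" concrete, here is a toy sketch in plain Python. It is not the Gnosis architecture: the feature choices (mean-pooled hidden states, average attention entropy), the function names, and the weights are all hypothetical and chosen only for illustration. The actual head is described in the paper and implemented in the Transformers fork.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def mean_pool(hidden_states: List[Vector]) -> Vector:
    """Average per-token hidden vectors into one fixed-size vector."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[d] for h in hidden_states) / n for d in range(dim)]


def mean_attention_entropy(attn: Matrix) -> float:
    """Average Shannon entropy (in nats) over the rows of one attention map."""
    def row_entropy(row: Vector) -> float:
        return -sum(p * math.log(p) for p in row if p > 0.0)

    return sum(row_entropy(row) for row in attn) / len(attn)


def toy_correctness_head(
    hidden_states: List[Vector],
    attn: Matrix,
    weights: Vector,
    bias: float,
) -> float:
    """Hypothetical linear probe: features -> logit -> sigmoid probability."""
    features = mean_pool(hidden_states) + [mean_attention_entropy(attn)]
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Two tokens with 2-dimensional hidden states and a 2x2 attention map.
hidden = [[1.0, 0.0], [0.0, 1.0]]
attention = [[1.0, 0.0], [0.5, 0.5]]
p = toy_correctness_head(hidden, attention, weights=[0.5, -0.5, -1.0], bias=0.0)
print(f"toy correctness probability: {p:.4f}")
```

In this sketch the backbone outputs are only read, never modified, which mirrors the frozen-backbone setup described above; only the small head on top would carry trainable parameters.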
Mixed math + trivia training corpus:
This repo requires the local Transformers fork with Gnosis integrated into the model architecture (see the GitHub repo instructions). After installing it, run:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Demo helpers from the Gnosis GitHub repo (src.demo)
from src.demo import build_chat_prompt, generate_with_hf, correctness_prob

GNOSIS_MODEL_ID = "AmirhoseinGH/Gnosis-Qwen3-4B-Instruct-2507"

# Load the tokenizer and the Gnosis-augmented model in bfloat16 on the GPU
tokenizer = AutoTokenizer.from_pretrained(GNOSIS_MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    GNOSIS_MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda().eval()

# Build the chat-formatted prompt
prompt = build_chat_prompt(
    tokenizer,
    question="How many r's are in strawberry?",
    system_prompt="Please reason step by step, and put your final answer within \\boxed{}.",
)

# Generate an answer, then score the full prompt + answer text
answer = generate_with_hf(model, tokenizer, prompt, torch.device("cuda"), max_new_tokens=2048)
p_correct = correctness_prob(model, tokenizer, prompt + answer, torch.device("cuda"))

print("Answer:\n", answer)
print("Gnosis correctness probability:", f"{p_correct:.4f}")
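One possible way to act on the returned probability is to sample several answers, keep the one with the highest score, and abstain when even the best score is low. The sketch below assumes you have already collected (answer, probability) pairs, for example by calling generate_with_hf and correctness_prob in a loop. The helper name select_answer and the 0.5 threshold are hypothetical and do not come from the paper or the repo.

```python
from typing import List, Optional, Tuple


def select_answer(
    candidates: List[Tuple[str, float]],
    threshold: float = 0.5,  # assumed cutoff; tune on held-out data
) -> Optional[str]:
    """Return the highest-scoring answer, or None to signal abstention."""
    if not candidates:
        return None
    best_answer, best_p = max(candidates, key=lambda pair: pair[1])
    return best_answer if best_p >= threshold else None


# Example with made-up scores for two sampled answers.
scored = [("\\boxed{2}", 0.12), ("\\boxed{3}", 0.91)]
print(select_answer(scored))
```

Returning None rather than a low-confidence answer lets the caller decide what to do next, such as resampling or deferring to another system.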
@misc{ghasemabadi2025llmspredictfailures,
title={Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits},
author={Amirhosein Ghasemabadi and Di Niu},
year={2025},
eprint={2512.20578},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2512.20578}
}
Base model: Qwen/Qwen3-4B-Instruct-2507