---
language:
- en
license: cc-by-nc-2.0
library_name: transformers
tags:
- citation-verification
- retrieval-augmented-generation
- rag
- cross-lingual
- deberta
- cross-encoder
- nli
- attribution
pipeline_tag: text-classification
datasets:
- fever
- din0s/asqa
- miracl/hagrid
metrics:
- f1
- precision
- recall
- accuracy
- roc_auc
base_model: microsoft/deberta-v3-base
model-index:
- name: dualtrack-alignment-module
  results:
  - task:
      type: text-classification
      name: Citation Verification
    metrics:
    - type: f1
      value: 0.89
      name: F1 Score
    - type: accuracy
      value: 0.87
      name: Accuracy
    - type: roc_auc
      value: 0.94
      name: ROC-AUC
---

# DualTrack Alignment Module

> **Anonymous submission to ACL 2026**

A cross-encoder model for detecting **citation drift** in Retrieval-Augmented Generation (RAG) systems. Given a user-facing claim, an evidence representation, and a source passage, the model predicts whether the citation is valid, that is, whether the source supports the claim.

## Model Description

This model addresses a critical reliability problem in RAG systems: **citation drift**, where generated text diverges from source documents in ways that break attribution. The problem is particularly severe in cross-lingual settings, where the answer language differs from the source document language.

### Architecture

```
Input: "[CLS] User claim: {claim} [SEP] Evidence: {evidence} [SEP] Source passage: {context} [SEP]"
    ↓
DeBERTa-v3-base (184M parameters)
    ↓
[CLS] embedding (768-dim)
    ↓
Linear(768, 2) → Softmax
    ↓
Output: P(valid citation)
```
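
The last step of the diagram can be illustrated without loading the model. The sketch below is a minimal standard-library illustration, not the released code, of how two logits become P(valid citation). The logit values are made up, and treating index 1 as the "valid" class is an assumption taken from the usage examples in this card.

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one example, ordered [invalid, valid].
logits = [-1.2, 2.0]
p_valid = softmax(logits)[1]

# With two classes, softmax reduces to a sigmoid of the logit difference.
p_sigmoid = 1.0 / (1.0 + math.exp(-(logits[1] - logits[0])))

print(f"P(valid) = {p_valid:.3f}")
```

Because only the logit difference matters in the two-class case, moving the decision threshold is equivalent to shifting that difference by a constant.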

### Why Cross-Encoder?

Unlike embedding-based approaches that encode texts separately, the cross-encoder sees all three components **together**, enabling:
- Token-level attention across claim, evidence, and source
- Detection of subtle semantic mismatches
- Better handling of paraphrases vs. factual errors

## Intended Use

### Primary Use Cases

1. **Post-hoc citation verification**: Validate citations in RAG outputs before serving to users
2. **Citation drift detection**: Identify claims that have semantically drifted from their sources
3. **Training signal**: Provide rewards for citation-aware generation
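
As an illustration of the first use case, post-hoc verification can be a thin filter between generation and serving. The sketch below assumes a hypothetical `score_fn` callable that returns P(valid) for a (claim, evidence, source) triple, for example a wrapper around the model. The `Citation` dataclass and the toy word-overlap scorer are stand-ins so the example runs without the model.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Citation:
    claim: str
    evidence: str
    source: str

def filter_citations(
    citations: list[Citation],
    score_fn: Callable[[str, str, str], float],
    threshold: float = 0.5,
) -> tuple[list[Citation], list[Citation]]:
    """Split citations into (kept, flagged) by P(valid) >= threshold."""
    kept, flagged = [], []
    for c in citations:
        prob = score_fn(c.claim, c.evidence, c.source)
        (kept if prob >= threshold else flagged).append(c)
    return kept, flagged

def toy_score(claim: str, evidence: str, source: str) -> float:
    """Stand-in scorer (NOT the model): fraction of claim words present in the source."""
    words = claim.lower().rstrip(".").split()
    if not words:
        return 0.0
    source_words = set(source.lower().rstrip(".").split())
    return sum(w in source_words for w in words) / len(words)

source = "Python is a programming language created by Guido van Rossum in 1991."
citations = [
    Citation("Python was created by Guido van Rossum.", "", source),
    Citation("Java was designed at Sun Microsystems.", "", source),
]
kept, flagged = filter_citations(citations, toy_score)
print(len(kept), len(flagged))
```

In a real pipeline the flagged list would typically be dropped, regenerated, or routed to human review, depending on how costly a wrong citation is.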

### Out of Scope

- General NLI/entailment (model is specialized for RAG citation patterns)
- Fact-checking against world knowledge (requires source passage)
- Non-English source documents (trained on English sources only)

## How to Use

### Installation

```bash
pip install transformers torch
```

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model
model_name = "anonymous-acl2026/dualtrack-alignment"  # Replace with actual path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def check_citation(user_claim: str, evidence: str, source: str, threshold: float = 0.5) -> tuple[bool, float]:
    """
    Check if a citation is valid.

    Args:
        user_claim: The claim shown to the user
        evidence: Evidence track representation (can be same as user_claim)
        source: The source passage being cited
        threshold: Classification threshold (default 0.5, the F1-optimal value on the validation set)

    Returns:
        (is_valid, probability)
    """
    # Format input
    text = f"User claim: {user_claim}\n\nEvidence: {evidence}\n\nSource passage: {source}"

    # Tokenize (inputs longer than 512 tokens are truncated from the end)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

    # Predict; index 1 is the "valid citation" class
    with torch.no_grad():
        outputs = model(**inputs)
        prob = torch.softmax(outputs.logits, dim=-1)[0, 1].item()

    return prob >= threshold, prob

# Example: Valid citation
is_valid, prob = check_citation(
    user_claim="Python was created by Guido van Rossum.",
    evidence="Python was created by Guido van Rossum.",
    source="Python is a programming language created by Guido van Rossum in 1991."
)
print(f"Valid: {is_valid}, Probability: {prob:.3f}")
# Output (approx.): Valid: True, Probability: 0.950

# Example: Invalid citation (wrong date)
is_valid, prob = check_citation(
    user_claim="Python was created in 1989.",
    evidence="Python was created in 1989.",
    source="Python is a programming language created by Guido van Rossum in 1991."
)
print(f"Valid: {is_valid}, Probability: {prob:.3f}")
# Output (approx.): Valid: False, Probability: 0.120
```

### Batch Processing

```python
# Reuses `tokenizer`, `model`, and `torch` from the Basic Usage example.
def batch_check_citations(examples: list[dict], batch_size: int = 16) -> list[float]:
    """
    Check multiple citations efficiently.

    Args:
        examples: List of dicts with keys 'user', 'evidence', 'source'
        batch_size: Batch size for inference

    Returns:
        List of probabilities
    """
    all_probs = []

    for i in range(0, len(examples), batch_size):
        batch = examples[i:i + batch_size]

        texts = [
            f"User claim: {ex['user']}\n\nEvidence: {ex['evidence']}\n\nSource passage: {ex['source']}"
            for ex in batch
        ]

        # Pad to the longest example in the batch
        inputs = tokenizer(
            texts,
            return_tensors="pt",
            truncation=True,
            max_length=512,
            padding=True
        )

        with torch.no_grad():
            outputs = model(**inputs)
            probs = torch.softmax(outputs.logits, dim=-1)[:, 1].tolist()

        all_probs.extend(probs)

    return all_probs
```

### Integration with DualTrack

```python
import json
import os
from typing import Optional

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification


class DualTrackAlignmentModule:
    """
    Alignment module for the DualTrack RAG system.

    Detects citation drift between user track and source documents.
    """

    def __init__(self, model_path: str, threshold: Optional[float] = None, device: Optional[str] = None):
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_path)
        self.model.to(self.device)
        self.model.eval()

        # Threshold priority: explicit argument, then metadata.json next to a
        # local checkpoint, then 0.5. Checking `is not None` (rather than `or`)
        # keeps an explicit threshold of 0.0 from being silently replaced.
        self.threshold = 0.5
        metadata_path = os.path.join(model_path, "metadata.json")
        if threshold is not None:
            self.threshold = threshold
        elif os.path.exists(metadata_path):
            with open(metadata_path) as f:
                self.threshold = json.load(f).get("optimal_threshold", 0.5)

    def detect_drift(
        self,
        user_claims: list[str],
        evidence_claims: list[str],
        sources: list[str]
    ) -> list[dict]:
        """
        Detect citation drift for multiple claim-source pairs.

        Returns list of {is_valid, probability, drift_detected}.
        """
        results = []

        for user, evidence, source in zip(user_claims, evidence_claims, sources):
            text = f"User claim: {user}\n\nEvidence: {evidence}\n\nSource passage: {source}"

            inputs = self.tokenizer(
                text, return_tensors="pt", truncation=True, max_length=512
            ).to(self.device)

            with torch.no_grad():
                outputs = self.model(**inputs)
                prob = torch.softmax(outputs.logits, dim=-1)[0, 1].item()

            results.append({
                "is_valid": prob >= self.threshold,
                "probability": prob,
                "drift_detected": prob < self.threshold
            })

        return results
```

## Training Details

### Training Data

The model was trained on a curated dataset combining multiple sources:

| Source | Examples | Description |
|--------|----------|-------------|
| FEVER | ~8,000 | Fact verification with SUPPORTS/REFUTES labels |
| HAGRID | ~2,000 | Attributed QA with quote-based evidence |
| ASQA | ~3,000 | Ambiguous questions with long-form answers |

**Label Generation (V3 - LLM-Supervised)**:
- Training labels are verified by GPT-4o-mini ("Does context support claim?")
- Evaluation uses an independent NLI model (DeBERTa-MNLI)
- This breaks circularity: the model learns the LLM's judgments but is evaluated by the NLI model

**Data Augmentation**:
- **Negative perturbations**: date_change, number_change, entity_swap, false_detail, negation, topic_drift
- **Positive perturbations**: paraphrase, synonym_swap, formal_informal register changes
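
The card names the perturbation types but does not describe how they are implemented. As a rough illustration only, two of the negative perturbations could look like the sketch below. The function names mirror the labels above, but the regex rules and the two-year offset are assumptions, not the pipeline used to build the training set.

```python
import re

def date_change(claim: str, offset: int = 2) -> str:
    """Shift the first four-digit year in the claim back by `offset` years."""
    def shift(match: re.Match) -> str:
        return str(int(match.group()) - offset)
    return re.sub(r"\b(1[0-9]{3}|20[0-9]{2})\b", shift, claim, count=1)

def negation(claim: str) -> str:
    """Negate the first simple copula or auxiliary ('is', 'was', 'are', 'were')."""
    return re.sub(r"\b(is|was|are|were)\b", r"\1 not", claim, count=1)

source = "Python is a programming language created by Guido van Rossum in 1991."
claim = "Python was created in 1991."

# Each perturbed claim is paired with the unchanged source and labeled invalid (0).
negatives = [
    {"claim": date_change(claim), "source": source, "label": 0, "type": "date_change"},
    {"claim": negation(claim), "source": source, "label": 0, "type": "negation"},
]
for ex in negatives:
    print(ex["type"], "->", ex["claim"])
```

Rule-based edits like these are cheap but brittle; a production pipeline would likely need checks that the perturbed claim really contradicts the source.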

### Training Procedure

| Hyperparameter | Value |
|----------------|-------|
| Base model | `microsoft/deberta-v3-base` |
| Max sequence length | 512 |
| Batch size | 8 |
| Gradient accumulation | 2 |
| Effective batch size | 16 |
| Learning rate | 2e-5 |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Epochs | 5 |
| Early stopping patience | 3 |
| FP16 training | Yes |
| Optimizer | AdamW |
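
For orientation, the table implies roughly the following step counts. This is back-of-the-envelope arithmetic under stated assumptions: it uses the approximate example counts from the Training Data table, assumes the 15% validation split is taken from that whole pool, and assumes all 5 epochs run (early stopping could end training sooner).

```python
import math

# Approximate example counts from the Training Data table (FEVER + HAGRID + ASQA).
total_examples = 8_000 + 2_000 + 3_000
val_fraction = 0.15  # assumption: applied to the whole pool

train_examples = round(total_examples * (1 - val_fraction))

per_device_batch = 8
grad_accum = 2
effective_batch = per_device_batch * grad_accum

steps_per_epoch = math.ceil(train_examples / effective_batch)
max_steps = steps_per_epoch * 5            # upper bound if early stopping never triggers
warmup_steps = math.ceil(max_steps * 0.1)  # warmup ratio 0.1

print(effective_batch, train_examples, steps_per_epoch, max_steps, warmup_steps)
```

Under these assumptions the learning rate would warm up over roughly the first tenth of the optimizer steps; the exact count depends on the true dataset size.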

**Training Infrastructure**:
- Single GPU (NVIDIA T4/V100)
- Training time: ~2-3 hours
- Framework: HuggingFace Transformers + PyTorch

### Evaluation

**Validation Set Performance** (15% held-out, stratified):

| Metric | Score |
|--------|-------|
| Accuracy | 0.87 |
| Precision | 0.88 |
| Recall | 0.90 |
| F1 | 0.89 |
| ROC-AUC | 0.94 |

**Optimal Threshold**: 0.50 (determined via F1 maximization on validation set)
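
Threshold selection by F1 maximization can be sketched as a simple sweep. The snippet below is an illustrative standard-library version, not the authors' tuning script. The `probs` and `labels` lists are made-up validation scores, and the candidate grid of 0.05 steps is an assumption.

```python
def f1_at(probs: list[float], labels: list[int], threshold: float) -> float:
    """F1 for the positive (valid) class at a given threshold."""
    tp = sum(p >= threshold and y == 1 for p, y in zip(probs, labels))
    fp = sum(p >= threshold and y == 0 for p, y in zip(probs, labels))
    fn = sum(p < threshold and y == 1 for p, y in zip(probs, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(probs: list[float], labels: list[int]) -> tuple[float, float]:
    """Return (threshold, f1) maximizing F1 over a 0.05-step grid (lowest threshold wins ties)."""
    grid = [i / 20 for i in range(1, 20)]
    return max(((t, f1_at(probs, labels, t)) for t in grid), key=lambda pair: pair[1])

# Made-up validation scores: P(valid) and gold labels (1 = valid citation).
probs = [0.95, 0.90, 0.80, 0.72, 0.55, 0.48, 0.30, 0.12]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

threshold, f1 = best_threshold(probs, labels)
print(f"best threshold = {threshold:.2f}, F1 = {f1:.3f}")
```

The same sweep can target a different objective, such as the highest recall subject to a minimum precision, when an application weighs the two error types differently.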

**Performance by Perturbation Type**:

| Type | Accuracy | Notes |
|------|----------|-------|
| original | 0.91 | Clean examples |
| paraphrase | 0.88 | Meaning-preserving rewrites |
| entity_swap | 0.94 | Wrong person/place/org |
| date_change | 0.92 | Incorrect dates |
| negation | 0.89 | Reversed claims |
| topic_drift | 0.85 | Subtle semantic shifts |
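
A breakdown like the table above can be computed by grouping predictions on the perturbation label. The sketch below assumes each evaluation record carries a `type`, a gold `label`, and a predicted probability `prob`. These field names and the sample records are illustrative, not the released evaluation format.

```python
from collections import defaultdict

def accuracy_by_type(records: list[dict], threshold: float = 0.5) -> dict[str, float]:
    """Accuracy per perturbation type, thresholding P(valid) at `threshold`."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for r in records:
        pred = int(r["prob"] >= threshold)
        correct[r["type"]] += int(pred == r["label"])
        total[r["type"]] += 1
    return {t: correct[t] / total[t] for t in total}

# Made-up records for illustration only.
records = [
    {"type": "original", "label": 1, "prob": 0.93},
    {"type": "original", "label": 1, "prob": 0.41},
    {"type": "entity_swap", "label": 0, "prob": 0.08},
    {"type": "entity_swap", "label": 0, "prob": 0.22},
    {"type": "topic_drift", "label": 0, "prob": 0.61},
]
by_type = accuracy_by_type(records)
for ptype, acc in by_type.items():
    print(f"{ptype}: {acc:.2f}")
```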

## Limitations

1. **English only**: Trained on English source passages. Cross-lingual application requires translation or a multilingual encoder.

2. **RAG-specific**: Optimized for RAG citation patterns; may not generalize to arbitrary NLI tasks.

3. **Passage length**: The combined input (claim, evidence, and source) is truncated at 512 tokens. Long documents require chunking or summarization.

4. **Threshold sensitivity**: The default threshold (0.5) may need tuning for specific applications. High-precision applications, where accepting an invalid citation is costly, should use a higher threshold.

5. **Training data bias**: Performance may vary on domains not represented in FEVER/HAGRID/ASQA (e.g., legal, medical, code).
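
One way to work around the 512-token limit (item 3) is to split a long source into overlapping windows, score each window, and keep the best score, on the assumption that a claim is supported if any single window supports it. The sketch below splits on whitespace words rather than tokenizer tokens for simplicity. The `score_fn` callable, the window sizes, and the max aggregation are all illustrative choices, not part of the released model.

```python
from typing import Callable

def chunk_words(text: str, window: int = 200, stride: int = 150) -> list[str]:
    """Split text into overlapping windows of `window` words, advancing by `stride`."""
    words = text.split()
    if len(words) <= window:
        return [text]
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

def score_long_source(
    claim: str,
    source: str,
    score_fn: Callable[[str, str], float],
    window: int = 200,
    stride: int = 150,
) -> float:
    """Return the maximum P(valid) over all windows of the source."""
    return max(score_fn(claim, chunk) for chunk in chunk_words(source, window, stride))

def toy_score(claim: str, chunk: str) -> float:
    """Stand-in scorer (NOT the model): 1.0 if the chunk contains the claim's last word."""
    return float(claim.rstrip(".").split()[-1] in chunk.split())

long_source = " ".join(f"w{i}" for i in range(500))
chunks = chunk_words(long_source)
best = score_long_source("The answer is w499.", long_source, toy_score)
print(len(chunks), best)
```

Max aggregation favors recall of supporting evidence; a claim that needs facts from two distant windows would still be missed, which is where summarization may help instead.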

## Ethical Considerations

### Intended Benefits
- Improved reliability of AI-generated citations
- Reduced misinformation from RAG hallucinations
- Better transparency in AI-assisted research

### Potential Risks
- Over-reliance on automated verification (human review is still recommended for high-stakes applications)
- False negatives (valid citations scored as invalid) incorrectly flag valid citations
- False positives (invalid citations scored as valid) let genuine attribution errors through

### Recommendations
- Use as one signal among many, not as the sole arbiter
- Monitor performance on domain-specific data
- Combine with human review for critical applications

*This model is part of an anonymous submission to ACL 2026. Author information will be added upon acceptance.*