Phi-4-reasoning-plus · GGUF Q4_K_M

Quantized, converted, and evaluated by PBH Applied Systems, LLC · Applied AI/ML Consulting · LLM Optimization & Deployment · Quantized AI Infrastructure

🔬 This repository is part of a production-oriented evaluation series. Every model published under pbhappliedsystems has been independently evaluated using quant_eval v7.21, a proprietary behavioral evaluation harness developed by PBH Applied Systems. Scores measure real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning families, not perplexity or benchmark leaderboard proxies.

⚠️ This card documents significant evaluation findings. Phi-4-reasoning-plus Q4_K_M produces the lowest reasoning (0.365) and coherence (0.492) scores in the PBH Applied Systems evaluated series. The evaluation surfaces a systematic EOS token contamination pattern that causes complete failures in the planning and MCQ families and breaks final-answer extraction in tool dispatch. These findings are documented in full below, with raw output evidence, as a demonstration of what rigorous pre-deployment evaluation surfaces that casual testing does not.


Try This Model in the Live AI Agent Demo

Launch the PBH Applied Systems AI Agent Demo →

This model is part of the PBH Applied Systems live AI Agent Demo, where visitors can test evaluated quantized open-weight models across production-style agent workflows: reasoning and analysis, document intelligence, and code automation.

The demo uses quant_eval results to show how model selection changes by task. A model that performs well for long-context document analysis may not be the best choice for hard multi-step planning, strict tool-use workflows, or production code generation. Each deployed model is evaluated for practical agent behavior, including coherence, instruction following, reasoning, task completion, structured output reliability, tool-use behavior, and quantization impact.

For this repository, the Q4_K_M variant represents the deployment-focused model: smaller, faster, and more cost-efficient than the F16 baseline. The evaluation results below show where this quantized model preserves useful behavior, where it carries deployment risk (no F16 run was performed, so these risks cannot be attributed to quantization alone), and which guardrails are recommended before production deployment.

The purpose of the demo is simple: let prospects test the same kind of evaluated quantized models that PBH Applied Systems deploys for real agentic AI systems.


Model Description

This repository contains the 4-bit quantized (Q4_K_M) GGUF of microsoft/Phi-4-reasoning-plus, a 14-billion-parameter reasoning-tuned model from Microsoft. Phi-4-reasoning-plus is a chain-of-thought reasoning variant of the Phi-4 architecture, trained to perform extended internal deliberation before generating output.

Important evaluation scope note: This evaluation was conducted on the Q4_K_M variant only, using a custom runner (phi4_reasoning_plus_quant). The full-precision F16 GGUF was produced (29.3 GB, SHA256 documented below) but was not evaluated in this run. Consequently, no F16 vs. Q4_K_M delta comparison is available for this model. The results below reflect Q4_K_M performance in isolation. Whether an F16 baseline would perform substantially differently is not known from this evaluation; what is known is that this model at Q4_K_M precision has significant, measurable production deployment risks.

The full-precision F16 GGUF is published separately at pbhappliedsystems/phi-4-reasoning-plus-gguf-F16.

Key Characteristics

  • Parameters: 14B
  • Architecture: Reasoning (extended chain-of-thought)
  • Format: GGUF Q4_K_M
  • File size: 9.05 GB
  • SHA256: 2fe74424b03433d11ccf3f2ce8da404810fa7eb9a269135b1f14bf0d88566e4d
  • Minimum VRAM (GPU inference): ~12 GB (T4 class or better)
  • Recommended GPU tier: NVIDIA T4 (16 GB) · RTX 3080/4080 · A10G
  • Context window: 16,384 tokens (per base model specification)
  • Inference speed (eval hardware): avg 25.84 sec/case on RTX 4090
  • License: MIT

PBH Applied Systems Evaluation (quant_eval v7.21)

Evaluation conducted by PBH Applied Systems, LLC using quant_eval v7.21
Run ID: 20260222_170914 · Fixtures: golden_oracle_fixtures_v7_21 (SHA256: 6d71a0b9147c...) · Seed: 42
Hardware: NVIDIA RTX 4090 · Runner: phi4_reasoning_plus_quant (Q4_K_M only) · Total rows: 42

No F16 baseline: This run evaluated the Q4_K_M variant only, so no F16 comparison is possible. The scores reflect Q4_K_M performance on a standardized behavioral fixture set and are comparable across the PBH Applied Systems evaluated series.

Aggregate Scores (Q4_K_M)

Scores are normalized to [0.0–1.0]. Higher is better.

| Dimension | Score | Series Context |
|---|---|---|
| Task Completion | 0.5976 | Below series average |
| Reasoning | 0.3648 | Lowest in series |
| Coherence | 0.4921 | Lowest in series |
| Instruction Following | 0.8658 | Within normal range |
| Avg inference time | 25.84 sec/case | Consistent with reasoning architecture |

Per-Family Pass Rates (phi4_reasoning_plus_quant)

| Family | N | Pass Rate | Avg Secs | Bucket Score | Notes |
|---|---|---|---|---|---|
| json_multistep | 5 | 0.200 | 14.52 | 0.600 | 4/5 fail (EOS token output) |
| stateful_followup | 2 | 1.000 | 22.89 | 2.000 | Both turns exact match |
| toolcall_only | 2 | 0.000 | 19.95 | 0.000 | Prose output instead of JSON |
| mixed_brief_json | 2 | 1.000 | 17.46 | 2.000 | Both pass cleanly |
| toolcall | 2 | 1.000 | 13.98 | 0.000 | ⚠️ Stage-1 passes; final_mismatch on both (see Finding 3) |
| json | 4 | n/a | 42.38 | 10.000 | All pass |
| fuzz | 20 | n/a | 34.57 | 10.000 | All pass (7.71–93.68 s range) |
| mcq | 5 | n/a | 0.61 | 0.000 | ⚠️ All 5 fail (EOS token output) |

Critical Findings: EOS Token Contamination

The most significant finding from this evaluation is a systematic <|im_end|> token contamination pattern. Across multiple task families, the model emits its end-of-sequence token (<|im_end|>) as literal visible text in its response content, rather than as a functional stop signal. This manifests differently depending on the task format β€” sometimes producing complete failures, sometimes coexisting with correct output, and sometimes interfering with answer extraction even when the underlying answer is correct.

Finding 1: json_multistep (EOS-Only Responses on 4/5 Cases)

| Case | Difficulty | Result | Secs | Raw Output |
|---|---|---|---|---|
| ms_easy_01 | Easy | ❌ FAIL | 13.00 | `<\|im_end\|>` |
| ms_easy_02 | Easy | ❌ FAIL | 13.05 | `<\|im_end\|>` |
| ms_med_01 | Medium | ✅ PASS | 20.84 | Valid JSON plan |
| ms_med_02 | Medium | ❌ FAIL | 12.82 | `<\|im_end\|>` |
| ms_hard_01 | Hard | ❌ FAIL | 12.89 | `<\|im_end\|>` |

Four of five json_multistep cases produce <|im_end|> as their entire response. The model runs for 12–13 seconds, presumably spending that time on internal reasoning, then emits only the EOS token: no plan, no checks, no final state. Every gating signal fails simultaneously (schema_ok=0, checks_consistent_ok=0, stop_semantics_ok=0, oracle_equiv_ok=0).

Only ms_med_01 produces a valid response (20.84 seconds, valid JSON plan, all signals pass). The one working case takes longer, suggesting the model successfully completes its reasoning chain on that input and emits a real response. The failing cases suggest the model abandons generation and terminates early via EOS for those specific prompts.

This is not a planning capability failure in the conventional sense: the model is not producing wrong plans. It is producing no plan at all on 4 of 5 cases.
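Because the failure is "no content at all" rather than "wrong content", a caller can detect it cheaply before any schema validation runs. A minimal sketch, assuming the literal `<|im_end|>` string is the only contaminant (as observed in this run); `is_eos_only` is a hypothetical helper name, not part of quant_eval.

```python
import re

# Literal end-of-sequence marker observed as visible text in this run.
EOS_PATTERN = re.compile(r"<\|im_end\|>")


def is_eos_only(raw: str) -> bool:
    """Return True when nothing but EOS markers and whitespace was generated."""
    return EOS_PATTERN.sub("", raw).strip() == ""


print(is_eos_only("<|im_end|>"))                      # True: no plan at all
print(is_eos_only('{"plan": ["step 1"]}<|im_end|>'))  # False: content precedes EOS
```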

Finding 2: MCQ (All 5 Cases Fail with EOS Output)

Every MCQ case produces <|im_end|> as its raw output:

| Case | Secs | Detail | Raw |
|---|---|---|---|
| mcq_01 | 0.47 | `invalid_choice raw='<\|im_end\|>'` | `<\|im_end\|>` |
| mcq_02 | 0.62 | `invalid_choice raw='<\|im_end\|>'` | `<\|im_end\|>` |
| mcq_03 | 0.91 | `invalid_choice raw='<\|im_end\|>'` | `<\|im_end\|>` |
| mcq_04 | 0.16 | `invalid_choice raw='<\|im_end\|>'` | `<\|im_end\|>` |
| mcq_05 | 0.89 | `invalid_choice raw='<\|im_end\|>'` | `<\|im_end\|>` |

All five MCQ cases return in under one second with only an EOS token. The model produces no choice letter, no reasoning, and no response, only termination. This results in an average bucket_score of 0.000 across all five MCQ cases.
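A choice extractor that strips EOS markers first still returns nothing for these outputs, which is the point: the failure is upstream of extraction. The sketch below is a hypothetical extractor (not quant_eval's `choice_ok` logic) that accepts either a bare letter or an "Answer: X" line and returns `None` for an invalid choice; the accepted formats are assumptions.

```python
import re
from typing import Optional

EOS_PATTERN = re.compile(r"<\|im_end\|>")


def extract_choice(raw: str, choices: str = "ABCD") -> Optional[str]:
    """Return the selected letter, or None (an invalid choice)."""
    text = EOS_PATTERN.sub("", raw).strip()
    # Bare letter such as "B", "B)" or "B."
    bare = re.fullmatch(rf"([{choices}])[).]?", text)
    if bare:
        return bare.group(1)
    # Labelled answer such as "Answer: C" or "answer = (D)"; only "answer" is case-insensitive.
    labelled = re.search(rf"(?i:answer)\s*[:=]\s*\(?([{choices}])\b", text)
    return labelled.group(1) if labelled else None


print(extract_choice("<|im_end|>"))   # None: what all five MCQ cases yield
print(extract_choice("B<|im_end|>"))  # B
```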

Finding 3: toolcall (Correct Arithmetic, Failed Extraction)

toolcall passes at 1.000 (both stage-1 signals pass) but achieves bucket_score=0.000 on both cases due to final_mismatch. The raw outputs reveal what is happening:

| Case | Secs | Raw Output | Expected | Result |
|---|---|---|---|---|
| tool_01 | 12.27 | `{"tool_name": "add", "args": {"a": 2, "b": 3}}<\|im_end\|> 5<\|im_end\|>` | 5 | ❌ final_mismatch |
| tool_02 | 15.70 | `{"tool_name": "add", "args": {"a": 10, "b": -4}}<\|im_end\|> 6<\|im_end\|>` | 6 | ❌ final_mismatch |

The arithmetic is correct: add(2, 3) = 5 ✓ and add(10, -4) = 6 ✓. The model knows what to compute and computes it correctly. The failure is purely mechanical: the EOS token is embedded in the response string (5<|im_end|>), so the answer extractor captures the contaminated string rather than the clean numeric result.

The tool dispatch itself is valid: the stage-1 JSON parses correctly and validates against the schema. This is a stop-token handling issue, not an arithmetic or tool-calling capability failure.
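The mismatch is easy to reproduce with a plain string comparison. A minimal sketch, assuming the final answer is compared as a trimmed string; `final_matches` is a hypothetical helper, not the quant_eval comparator.

```python
import re

EOS_PATTERN = re.compile(r"<\|im_end\|>")


def final_matches(raw_final: str, expected: str) -> bool:
    """Compare a final answer to the expected value after removing EOS markers."""
    return EOS_PATTERN.sub("", raw_final).strip() == expected.strip()


raw_final = "5<|im_end|>"             # tool_01 final answer as emitted
print(raw_final == "5")               # False: naive comparison -> final_mismatch
print(final_matches(raw_final, "5"))  # True: same answer once the marker is removed
```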

Finding 4: toolcall_only (Reasoning Prose Instead of JSON)

Both toolcall_only cases produce natural language reasoning rather than the required JSON tool call:

  • toolonly_01: "In your answer, include the result and a brief explanation of how you arrived at..."
  • toolonly_02: "Thought: The user's instruction is to 'Use add tool to add 25 and 75.' Since this..."

Neither case produces a JSON object (detail=no_json_object). The model defaults to its natural reasoning-first format β€” generating explanatory prose β€” when asked for bare schema-only output. This is consistent with reasoning model architecture behavior observed across the series, but the failure is total here: not even the tool name is extracted.
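A lenient extractor that scans prose for an embedded JSON object helps distinguish "JSON wrapped in explanation" from "no JSON at all". The sketch below is a hypothetical helper built on the standard library's `json.JSONDecoder.raw_decode`; applied to the two observed outputs it would still return `None`, consistent with the no_json_object detail. It tries every `{` position, so it is quadratic in the worst case, which is fine for short responses.

```python
import json
from typing import Optional


def find_first_json_object(text: str) -> Optional[dict]:
    """Return the first decodable JSON object embedded in free-form text."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue
        try:
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue  # not valid JSON from this brace; try the next one
        if isinstance(obj, dict):
            return obj
    return None


prose = "Thought: The user's instruction is to 'Use add tool to add 25 and 75.' Since this"
print(find_first_json_object(prose))  # None
```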

What Passes and Why

The families that pass (json, fuzz, mixed_brief_json, stateful_followup) have output formats in which the EOS token coexists with valid content without blocking extraction:

  • json/fuzz: Each turn produces {"tool_name": "...", "args": {...}}<|im_end|>; the JSON block precedes the EOS token and is extracted cleanly before the termination
  • mixed_brief_json: Output format is ANSWER: 13 {"a": 4, "b": 9, "sum": 13}<|im_end|>; the answer and JSON precede the EOS token
  • stateful_followup: Multi-turn state JSON precedes EOS in each turn

The common thread: when the required content appears before the EOS token, extraction succeeds. When the EOS token is the only content (json_multistep, MCQ), extraction fails. When it appears after a number that should be matched (toolcall final answer), extraction captures the contaminated string.
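The three cases can be illustrated by splitting each raw string on the literal marker. This sketch treats the raw strings exactly as printed in the tables above; `split_segments` is a hypothetical helper, not the quant_eval extractor.

```python
import re
from typing import List

EOS_PATTERN = re.compile(r"<\|im_end\|>")


def split_segments(raw: str) -> List[str]:
    """Split raw output on EOS markers, dropping empty segments."""
    return [seg.strip() for seg in EOS_PATTERN.split(raw) if seg.strip()]


tool_01 = '{"tool_name": "add", "args": {"a": 2, "b": 3}}<|im_end|> 5<|im_end|>'
print(split_segments(tool_01))       # the stage-1 JSON segment, then '5'
print(split_segments("<|im_end|>"))  # [] -> nothing to extract (json_multistep, MCQ)
```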


Signal-Level Diagnostics

json_multistep

| Signal | Rate | Tier |
|---|---|---|
| schema_ok | 0.200 | Tier-1 (gating) |
| checks_consistent_ok | 0.200 | Tier-1 (gating) |
| stop_semantics_ok | 0.200 | Tier-1 (gating) |
| oracle_equiv_ok | 0.200 | Tier-1 (gating) |
| final_consistent_ok | 0.000 | Tier-2 (tracked, non-gating) |
| final_match_reported | 0.000 | Tier-2 (tracked, non-gating) |

All four gating signals have identical rates (0.200 = 1/5 pass). On the four failing cases, every signal fails simultaneously because the raw output is <|im_end|>: there is nothing to evaluate.
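The identical 0.200 rates are what an all-or-nothing gate produces when four cases fail every signal and one case passes every signal. The sketch below reconstructs such a gate from the signal names in this card; it is an assumption about how quant_eval combines Tier-1 signals, not its actual code.

```python
from typing import Dict, List

# Tier-1 (gating) signals for json_multistep, as listed in the diagnostics table.
GATING_SIGNALS = ["schema_ok", "checks_consistent_ok", "stop_semantics_ok", "oracle_equiv_ok"]


def case_passes(signals: Dict[str, int]) -> bool:
    """A case passes only if every gating signal is set."""
    return all(signals.get(name, 0) == 1 for name in GATING_SIGNALS)


def pass_rate(cases: List[Dict[str, int]]) -> float:
    return sum(case_passes(c) for c in cases) / len(cases)


all_pass = {name: 1 for name in GATING_SIGNALS}
all_fail = {name: 0 for name in GATING_SIGNALS}  # what an EOS-only response yields
cases = [all_fail, all_fail, all_pass, all_fail, all_fail]  # ms_med_01 is the pass
print(pass_rate(cases))  # 0.2
```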

stateful_followup

| Signal | Rate |
|---|---|
| turn1_parse_ok | 1.000 |
| turn2_parse_ok | 1.000 |
| turn1_exact_match | 1.000 |
| turn2_exact_match | 1.000 |

toolcall_only

| Signal | Rate |
|---|---|
| tool_name_ok | 0.000 |
| args_ok | 0.000 |

mixed_brief_json

| Signal | Rate |
|---|---|
| answer_line_ok | 1.000 |
| json_parse_ok | 1.000 |
| schema_ok | 1.000 |

Recommended Use Cases

✅ Deploy with Confidence (Q4_K_M)

  • Stateful multi-turn agents: both turns parse and match exactly (1.000, N=2). EOS markers still follow the state JSON but do not block extraction.
  • Hybrid brief + JSON outputs: mixed_brief_json passes at 1.000 (N=2). The ANSWER: X {json} format works cleanly.
  • Single-step structured JSON: json (N=4) and fuzz (N=20) both achieve bucket_score 10.000, with constraint-adherent placements on all cases.

⚠️ Use with Modified Output Handling (Q4_K_M)

  • Scaffolded tool-calling: toolcall stage-1 passes at 1.000 and the arithmetic is correct, but add an EOS token stripping step before final answer extraction. The capability is present; the stop token handling requires remediation.
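A minimal sketch of that scaffolded flow for the `add` tool seen in the fixtures: strip EOS markers from the stage-1 output, check the `{"tool_name", "args"}` shape, and execute the tool locally. The `TOOLS` registry and `dispatch_tool_call` name are hypothetical; a production dispatcher would also validate argument types against a real schema.

```python
import json
import re
from typing import Callable, Dict

EOS_PATTERN = re.compile(r"<\|im_end\|>")

# Hypothetical tool registry; only the `add` tool from the fixtures is shown.
TOOLS: Dict[str, Callable[..., float]] = {"add": lambda a, b: a + b}


def dispatch_tool_call(raw: str) -> float:
    """Strip EOS markers, validate a {"tool_name", "args"} call, and run it."""
    call = json.loads(EOS_PATTERN.sub("", raw).strip())
    if not isinstance(call, dict) or set(call) != {"tool_name", "args"} \
            or not isinstance(call["args"], dict):
        raise ValueError(f"unexpected tool-call shape: {call!r}")
    tool = TOOLS.get(call["tool_name"])
    if tool is None:
        raise ValueError(f"unknown tool: {call['tool_name']!r}")
    return tool(**call["args"])


print(dispatch_tool_call('{"tool_name": "add", "args": {"a": 10, "b": -4}}<|im_end|>'))  # 6
```

Computing the final answer locally from the validated stage-1 call sidesteps the contaminated final string entirely.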

❌ Not Recommended (Q4_K_M)

  • Multi-step planning: 4/5 cases produce no output. Do not deploy for planning workflows without validated prompt engineering that prevents EOS-only responses.
  • MCQ / single-choice extraction: all 5 cases fail with EOS-only output. This format was completely non-functional at Q4_K_M in this evaluation.
  • Bare tool-call dispatch (schema-only): toolcall_only produces prose reasoning instead of JSON on both cases. Not viable without substantial prompt engineering.
  • Any latency-sensitive application: at 25.84 sec/case on average, with fuzz cases peaking at 93.68 seconds, this model is not suitable for responsive workloads.
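Where planning must be attempted anyway, one possible guardrail is to treat an EOS-only response as a hard failure and retry a bounded number of times. This is a sketch under stated assumptions: `generate_with_retry` is a hypothetical wrapper, the `generate` callable stands in for whatever client call you use, and whether a retry actually recovers was not measured in this evaluation (with near-deterministic sampling a retry may simply reproduce the failure).

```python
import re
from typing import Callable

EOS_PATTERN = re.compile(r"<\|im_end\|>")


def generate_with_retry(generate: Callable[[], str], max_attempts: int = 3) -> str:
    """Call `generate` until it returns non-EOS content; raise if it never does."""
    for _ in range(max_attempts):
        cleaned = EOS_PATTERN.sub("", generate() or "").strip()
        if cleaned:
            return cleaned
    raise RuntimeError(f"EOS-only response on all {max_attempts} attempts")
```

With llama-cpp-python, `generate` could be something like `lambda: llm.create_chat_completion(messages=msgs, temperature=0.8)["choices"][0]["message"]["content"]`, where `msgs` is your prompt; surface the `RuntimeError` to the caller rather than passing an empty plan downstream.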

The Evaluation Report Pitch: In Data

The findings above are the practical argument for systematic pre-deployment evaluation. Consider what casual testing would show:

  • Run a few json shelf-placement queries → all pass, bucket=10 ✓
  • Run a stateful follow-up conversation → passes ✓
  • Ask it to add two numbers → produces the right answer ✓

A developer doing informal validation would likely conclude this model works well for structured output and tool use. They would not know:

  • That planning prompts produce silent EOS failures on 4/5 cases
  • That every MCQ query terminates with an EOS token and no answer
  • That the addition result is correct but structurally broken in a way that would fail any downstream string comparison

None of these failure modes are visible without running the model against a standardized behavioral test suite across all relevant task families. The quant_eval evaluation surfaces them in 42 rows of structured, verifiable, reproducible evidence.

This is what a Quantized Model Evaluation Report documents: not whether a model can answer a few test questions, but what its actual failure modes are across the task families that matter in production.


Hardware Requirements

| Configuration | VRAM Required | Recommended GPU |
|---|---|---|
| Q4_K_M (this repo) · GPU only | ~12 GB | T4 16 GB · RTX 3080/4080 · A10G |
| Q4_K_M · CPU offload fallback | 8 GB VRAM + 4 GB RAM | Any CUDA-capable GPU |
| F16 baseline (companion repo) | ~32 GB | A100 40 GB · 2× A10G |
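The context-dependent part of the VRAM budget can be sanity-checked with a back-of-envelope KV-cache estimate. This is a rough sketch, not how the figures above were derived: it assumes the upstream Phi-4 configuration (40 transformer blocks, 10 key/value heads, head dimension 128) and llama.cpp's default f16 KV cache. Confirm the values against `phi3.block_count` and `phi3.attention.head_count_kv` in the GGUF metadata before relying on them.

```python
def kv_cache_bytes(n_ctx: int, n_layers: int = 40, n_kv_heads: int = 10,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Approximate K+V cache size: 2 tensors x layers x KV heads x head_dim x tokens x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


GIB = 1024 ** 3
for n_ctx in (8192, 16384):
    print(f"n_ctx={n_ctx}: {kv_cache_bytes(n_ctx) / GIB:.2f} GiB")
```

Under these assumptions the cache is 1,677,721,600 bytes (about 1.56 GiB) at the `n_ctx=8192` used in the examples below, and twice that at the full 16,384-token window, on top of the 9.05 GB weight file and llama.cpp's compute buffers.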

Usage

Installation

pip install llama-cpp-python huggingface_hub

For GPU acceleration (CUDA):

CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

Python: llama-cpp-python

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="pbhappliedsystems/phi-4-reasoning-plus-gguf-Q4-K-M",
    filename="phi-4-reasoning-plus-gguf-Q4-K-M.gguf"
)

llm = Llama(
    model_path=model_path,
    n_ctx=8192,
    n_gpu_layers=-1,
    verbose=False,
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a precise assistant. Think through the problem carefully before responding."
        },
        {
            "role": "user",
            "content": "Analyze the following data and return a structured JSON summary with keys: findings, confidence, recommendation."
        }
    ],
    temperature=0.15,
    max_tokens=2048,
)

print(response["choices"][0]["message"]["content"])

For tasks where EOS token contamination is a known risk, add a cleanup step before downstream processing:

import re

def strip_eos_tokens(text: str) -> str:
    """
    Strip EOS token contamination from Phi-4-reasoning-plus Q4_K_M outputs.
    quant_eval v7.21 finding: <|im_end|> appears as literal text in raw outputs,
    causing final_mismatch in toolcall and blocking extraction in other families.
    """
    return re.sub(r'<\|im_end\|>', '', text).strip()

raw = response["choices"][0]["message"]["content"]
clean = strip_eos_tokens(raw)
print(clean)

For stateful multi-turn use (reliable at Q4_K_M):

# stateful_followup passed both quant_eval v7.21 cases (1.000), the most reliable family at Q4_K_M
conversation = [
    {"role": "system", "content": "You are a stateful assistant tracking structured data."},
    {"role": "user", "content": "Initialize a counter at 1. Return JSON: {\"counter\": N}"},
]

response1 = llm.create_chat_completion(
    messages=conversation,
    temperature=0.8,
    max_tokens=256,
)
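# Strip literal EOS markers before feeding the turn back in as conversation history
turn1 = strip_eos_tokens(response1["choices"][0]["message"]["content"])
conversation.append({"role": "assistant", "content": turn1})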
conversation.append({"role": "user", "content": "Increment the counter by 1."})

response2 = llm.create_chat_completion(
    messages=conversation,
    temperature=0.8,
    max_tokens=256,
)
print(strip_eos_tokens(response2["choices"][0]["message"]["content"]))
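Because the stateful family is scored on whether turn 2 follows correctly from turn 1, the same check can be applied at runtime. A minimal sketch, assuming each turn's content is exactly the `{"counter": N}` object requested in the prompt above (a reasoning preamble before the JSON would need a more lenient parser); `read_counter` and `check_increment` are hypothetical helpers.

```python
import json
import re

EOS_PATTERN = re.compile(r"<\|im_end\|>")


def read_counter(raw: str) -> int:
    """Parse a {"counter": N} state object from one turn's raw output."""
    state = json.loads(EOS_PATTERN.sub("", raw).strip())
    return int(state["counter"])


def check_increment(turn1_raw: str, turn2_raw: str) -> bool:
    """Turn 2 is correct only if it equals turn 1's counter plus one."""
    return read_counter(turn2_raw) == read_counter(turn1_raw) + 1


print(check_increment('{"counter": 1}<|im_end|>', '{"counter": 2}<|im_end|>'))  # True
```

In the example above this would be called as `check_increment(turn1, response2["choices"][0]["message"]["content"])`.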

CLI: llama-cli

llama-cli \
  --model phi-4-reasoning-plus-gguf-Q4-K-M.gguf \
  --chat-template phi4 \
  --system-prompt "You are a precise reasoning assistant." \
  --prompt "Analyze the following and return structured JSON output." \
  --n-predict 2048 \
  --ctx-size 8192 \
  --n-gpu-layers -1 \
  --temp 0.8

For server deployment:

llama-server \
  --model phi-4-reasoning-plus-gguf-Q4-K-M.gguf \
  --chat-template phi4 \
  --ctx-size 8192 \
  --n-gpu-layers -1 \
  --port 8080 \
  --host 0.0.0.0

Query via the OpenAI-compatible API:

import re

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-required")

response = client.chat.completions.create(
    model="phi-4-reasoning-plus-gguf-Q4-K-M",
    messages=[{"role": "user", "content": "Your prompt here"}],
    temperature=0.8,
    timeout=120,
)
# Strip EOS contamination before downstream use (content may be None, so fall back to "")
clean = re.sub(r'<\|im_end\|>', '', response.choices[0].message.content or "").strip()
print(clean)

Evaluation Artifacts

The full per-case evaluation CSV (comparison_results_v7_21_Phi_4_reasoning_plus_20260222_170914.csv) and rollup.json are published in this repository for independent verification. Every row in the CSV corresponds to a single inference run against a versioned test fixture, with the raw model output, all signal values, and the detail field documenting the failure reason.
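For independent verification, per-family pass rates can be recomputed from the per-case CSV. This sketch assumes hypothetical column names `family` and `pass` (a 0/1 value); check the actual header of the published CSV and adjust, and note that json, fuzz, and mcq are reported above with bucket scores rather than pass rates, so they may need a different column.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, TextIO


def pass_rates_by_family(handle: TextIO, family_col: str = "family",
                         pass_col: str = "pass") -> Dict[str, float]:
    """Average a 0/1 pass column per family from a per-case results CSV."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for row in csv.DictReader(handle):
        totals[row[family_col]] += 1
        passes[row[family_col]] += int(row[pass_col])
    return {fam: passes[fam] / totals[fam] for fam in totals}


sample = io.StringIO("family,pass\nmcq,0\nmcq,0\nstateful_followup,1\nstateful_followup,1\n")
print(pass_rates_by_family(sample))  # {'mcq': 0.0, 'stateful_followup': 1.0}
```

Against the real file: `with open("comparison_results_v7_21_Phi_4_reasoning_plus_20260222_170914.csv", newline="") as f: print(pass_rates_by_family(f))`, with the column names adjusted as needed.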


Artifact Provenance

| Artifact | Format | Size | SHA256 |
|---|---|---|---|
| phi-4-reasoning-plus-gguf-Q4-K-M.gguf | GGUF Q4_K_M | 9.05 GB | 2fe74424b03433d11ccf3f2ce8da404810fa7eb9a269135b1f14bf0d88566e4d |
| F16 (companion repo, not evaluated) | GGUF F16 | 29.3 GB | 6491352a2d3d756fdd4b1538f188bafafc8e940658f1771308ffdaeddd86a385 |

Both artifacts were produced from microsoft/Phi-4-reasoning-plus using a custom-built llama.cpp conversion and quantization pipeline developed by PBH Applied Systems.
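Before loading the model, the download can be checked against the SHA256 values listed above. A minimal sketch using only the standard library; the `sha256_of` and `verify` names are illustrative, and the default expected digest is the Q4_K_M value from this card.

```python
import hashlib
from pathlib import Path

EXPECTED_Q4_K_M = "2fe74424b03433d11ccf3f2ce8da404810fa7eb9a269135b1f14bf0d88566e4d"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so a multi-GB GGUF never has to fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected: str = EXPECTED_Q4_K_M) -> bool:
    """Return True if the file's SHA256 matches the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

For example, `verify(model_path)` right after the `hf_hub_download` call in the Usage section.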


Evaluation Methodology

quant_eval v7.21 is a proprietary behavioral evaluation harness developed by PBH Applied Systems. This run evaluated the Q4_K_M variant only using a dedicated runner (phi4_reasoning_plus_quant).

Fixture set: golden_oracle_fixtures_v7_21 (SHA256: 6d71a0b9147c079371b02a94f3c149eb78a6adc03dc16ff6833b964fbf4174f0)

| Family | Description | Pass Signals |
|---|---|---|
| fuzz | Property-based regression; structured placement correctness | schema_ok, constraints_ok |
| json | Single-step structured JSON with constraint rules | schema_ok, constraints_ok |
| json_multistep | Multi-step planning with self-check and oracle verification | schema_ok, checks_consistent_ok, stop_semantics_ok, oracle_equiv_ok |
| mcq | Multiple-choice extraction | choice_ok |
| stateful_followup | Two-turn state tracking; turn-2 correct given turn-1 | turn1/2_parse_ok, turn1/2_exact_match |
| mixed_brief_json | Hybrid: natural language answer + valid JSON block | answer_line_ok, json_parse_ok, schema_ok |
| toolcall | Tool call embedded in response; parse + schema validation | stage1_tool_parse_ok, stage1_tool_schema_ok |
| toolcall_only | Bare schema-only tool call; strict tool name + args check | tool_name_ok, args_ok |

Evaluation hardware: NVIDIA RTX 4090 (24 GB VRAM)
Evaluation date: February 22, 2026
quant_eval seed: 42


🔬 About quant_eval & This Evaluation Series

quant_eval is a proprietary behavioral evaluation harness developed by PBH Applied Systems, LLC. It measures real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning, not perplexity or leaderboard proxies. Every model published under pbhappliedsystems has been independently evaluated using quant_eval before being recommended for any production role.

See it in action: Live AI Agent Demo →
The demo runs production-style agent workflows powered by open-weight models selected through the quant_eval evaluation pipeline.

Need a deployment recommendation? Not sure which quantization level is right for your hardware, latency target, or agent type? → pbhappliedsystems.com


Evaluated and published by PBH Applied Systems, LLC · patrick@pbhappliedsystems.com


About PBH Applied Systems

PBH Applied Systems, LLC is an Oklahoma City–based applied machine learning and AI systems company specializing in production-grade model evaluation, quantization pipelines, agentic AI infrastructure, and scalable AI-driven application development. The organization operates with a strong emphasis on engineering rigor, reproducibility, and real-world deployment constraints, particularly in environments where performance, cost efficiency, and reliability must be balanced against available hardware and budget.

Founder β€” Patrick Hill, M.S.

PBH Applied Systems was founded by Patrick Hill, a Data Scientist and AI/ML Engineer with 10+ years of experience delivering advanced analytics, predictive modeling, and decision-support solutions across high-stakes operational environments. Patrick holds a Master of Science in Software Engineering with concentrations in Artificial Intelligence and Machine Learning (GPA: 4.0) and a B.S. in Business Finance.

Technical expertise spans:

  • Languages & Data: Python, SQL, Linux, Pandas, NumPy, scikit-learn
  • ML & Modeling: Supervised and unsupervised learning, neural networks, NLP, transformers, regression, classification, forecasting, and feature engineering
  • AI/ML Frameworks: PyTorch, TensorFlow/Keras, HuggingFace Transformers, GGUF, llama.cpp, BitsAndBytes, PEFT, QLoRA
  • Deployment & MLOps: Flask APIs, Docker, CI/CD pipelines, REST endpoints, streaming inference, version control
  • Data Platforms: Jupyter, Databricks, Power BI, Matplotlib
  • Quantization: GGUF conversion, Q4_K_M / Q5_K_M / Q8_0 strategies, adapter-per-model evaluation architecture

Published Author

Patrick is the author of Applied Machine Learning: Concepts, Tools, and Case Studies, a 1,200+ page practitioner-oriented textbook adopted as required reading for CSC 373 – Machine Learning at the University of Advancing Technology.

Core Service Areas

1. LLM Optimization & Deployment: End-to-end GGUF conversion and quantization with custom llama.cpp pipelines and adapter-per-model architecture.

2. AI Evaluation Frameworks: Proprietary behavioral evaluation via quant_eval: per-family pass rates, failure cluster diagnostics, raw output evidence, and deployment recommendations.

3. Agentic AI Infrastructure: LlamaIndex ReAct agents, Flask orchestration, serverless GPU inference, full pipeline from model selection to production serving.

4. Scalable AI Application Development: Multimodal applications (quantized LLMs + Whisper + BLIP), Dockerized Flask APIs, advanced time-series forecasting with custom attention mechanisms, Bayesian hyperparameter optimization, and FinBERT sentiment fusion.

5. ML Pipeline Design & Analytics: Feature engineering, forward-chaining cross-validation, KPI dashboards, analytical governance at scale.

6. Model & Agent Cataloging: Structured catalog publishing with reproducible artifacts and clear performance tradeoff documentation.


📞 Work With PBH Applied Systems

The findings documented in this card (EOS token contamination producing silent failures across planning and MCQ, correct arithmetic answers blocked by stop token handling, prose output where JSON was required) are precisely the kind of deployment risks that casual testing does not surface.

A developer running informal validation would see the json, fuzz, mixed, and stateful families pass. They would not see the 4/5 planning failures, the 5/5 MCQ failures, or the toolcall extraction bug. Those failures reach production silently without systematic evaluation.

This card is not an indictment of the Phi-4-reasoning-plus model. It is documentation that at Q4_K_M precision, with this build configuration, specific task categories fail in specific and reproducible ways. That is information a team needs before deployment, not after.

👉 Book a Scoping Call: Discuss your model selection, quantization evaluation needs, or deployment architecture directly with Patrick.

👉 Request an Evaluation Report: A full quant_eval behavioral audit for your target model(s): per-family pass rates, failure cluster diagnostics, raw output evidence, and a deployment recommendation. Engagements from $2,500.

Connect

🌐 Website pbhappliedsystems.com
📧 Email patrick@pbhappliedsystems.com
💼 LinkedIn PBH Applied Systems, LLC
▶️ YouTube @pbhappliedsystems
📸 Instagram @pbhappliedsystems
👍 Facebook pbhappliedsystems

License

This GGUF repository inherits the license of the base model: MIT (microsoft/Phi-4-reasoning-plus)

The quant_eval evaluation methodology, fixture set, and scoring framework are proprietary to PBH Applied Systems, LLC and are not included in this repository.


GGUF conversion, quantization, and behavioral evaluation performed by PBH Applied Systems, LLC · quant_eval v7.21 · Run ID: 20260222_170914
