Phi-4-reasoning-plus · GGUF Q4_K_M
Quantized, converted, and evaluated by PBH Applied Systems, LLC: Applied AI/ML Consulting · LLM Optimization & Deployment · Quantized AI Infrastructure
🔬 This repository is part of a production-oriented evaluation series. Every model published under pbhappliedsystems has been independently evaluated using quant_eval v7.21, a proprietary behavioral evaluation harness developed by PBH Applied Systems. Scores measure real agent-adjacent task performance across the structured output, tool dispatch, multi-turn state retention, and multi-step planning families, not perplexity or benchmark leaderboard proxies.
⚠️ This card documents significant evaluation findings. Phi-4-reasoning-plus Q4_K_M produces the lowest reasoning (0.365) and coherence (0.492) scores in the PBH Applied Systems evaluated series. The evaluation surfaces a systematic EOS token contamination pattern that causes complete failures across planning, MCQ, and tool dispatch families. These findings are documented in full below, with raw output evidence, as a demonstration of what rigorous pre-deployment evaluation surfaces that casual testing does not.
Try This Model in the Live AI Agent Demo
Launch the PBH Applied Systems AI Agent Demo →
This model is part of the PBH Applied Systems live AI Agent Demo, where visitors can test evaluated quantized open-weight models across production-style agent workflows: reasoning and analysis, document intelligence, and code automation.
The demo uses quant_eval results to show how model selection changes by task. A model that performs well for long-context document analysis may not be the best choice for hard multi-step planning, strict tool-use workflows, or production code generation. Each deployed model is evaluated for practical agent behavior, including coherence, instruction following, reasoning, task completion, structured output reliability, tool-use behavior, and quantization impact.
For this repository, the Q4_K_M variant represents the deployment-focused model: smaller, faster, and more cost-efficient than the F16 baseline. The evaluation results below explain where this quantized model preserves useful behavior, where quantization introduces risk, and what guardrails are recommended before production deployment.
The purpose of the demo is simple: let prospects test the same kind of evaluated quantized models that PBH Applied Systems deploys for real agentic AI systems.
Model Description
This repository contains the 4-bit quantized (Q4_K_M) GGUF of microsoft/Phi-4-reasoning-plus, a 14-billion parameter reasoning-tuned model from Microsoft. Phi-4-reasoning-plus is a chain-of-thought reasoning variant of the Phi-4 architecture, trained to perform extended internal deliberation before generating output.
Important evaluation scope note: This evaluation was conducted on the Q4_K_M variant only, using a custom runner (phi4_reasoning_plus_quant). The full-precision F16 GGUF was produced (29.3 GB, SHA256 documented below) but was not evaluated in this run. Consequently, no F16 vs. Q4_K_M delta comparison is available for this model. The results below reflect Q4_K_M performance in isolation. Whether an F16 baseline would perform substantially differently is not known from this evaluation. What is known is that this model at Q4_K_M precision has significant, measurable production deployment risks.
The full-precision F16 GGUF is published separately at pbhappliedsystems/phi-4-reasoning-plus-gguf-F16.
Key Characteristics
- Parameters: 14B
- Architecture: Reasoning (extended chain-of-thought)
- Format: GGUF Q4_K_M
- File size: 9.05 GB
- SHA256: 2fe74424b03433d11ccf3f2ce8da404810fa7eb9a269135b1f14bf0d88566e4d
- Minimum VRAM (GPU inference): ~12 GB (T4 class or better)
- Recommended GPU tier: NVIDIA T4 (16 GB) · RTX 3080/4080 · A10G
- Context window: 16,384 tokens (per base model specification)
- Inference speed (eval hardware): avg 25.84 sec/case on RTX 4090
- License: MIT
PBH Applied Systems Evaluation: quant_eval v7.21
Evaluation conducted by PBH Applied Systems, LLC using quant_eval v7.21
Run ID: 20260222_170914 · Fixtures: golden_oracle_fixtures_v7_21 (SHA256: 6d71a0b9147c...) · Seed: 42
Hardware: NVIDIA RTX 4090 · Runner: phi4_reasoning_plus_quant (Q4_K_M only) · Total rows: 42
No F16 baseline: This run evaluated the Q4_K_M variant only. Scores are not comparable to an F16 baseline because no F16 evaluation was performed. They reflect Q4_K_M performance on a standardized behavioral fixture set, comparable across the PBH Applied Systems evaluated series.
Aggregate Scores (Q4_K_M)
Scores are normalized to the range [0.0, 1.0]. Higher is better.
| Dimension | Score | Series Context |
|---|---|---|
| Task Completion | 0.5976 | Below series average |
| Reasoning | 0.3648 | Lowest in series |
| Coherence | 0.4921 | Lowest in series |
| Instruction Following | 0.8658 | Within normal range |
| Avg inference time | 25.84 sec/case | Consistent with reasoning architecture |
Per-Family Pass Rates (phi4_reasoning_plus_quant)
| Family | N | Pass Rate | Avg Secs | Bucket Score | Notes |
|---|---|---|---|---|---|
| json_multistep | 5 | 0.200 | 14.52 | 0.600 | 4/5 fail: EOS token output |
| stateful_followup | 2 | 1.000 | 22.89 | 2.000 | Both turns exact match |
| toolcall_only | 2 | 0.000 | 19.95 | 0.000 | Prose output instead of JSON |
| mixed_brief_json | 2 | 1.000 | 17.46 | 2.000 | Both pass cleanly |
| toolcall | 2 | 1.000 | 13.98 | 0.000 | ⚠️ Stage-1 passes; final_mismatch on both (see below) |
| json | 4 | n/a | 42.38 | 10.000 | All pass |
| fuzz | 20 | n/a | 34.57 | 10.000 | All pass (7.71 to 93.68 s range) |
| mcq | 5 | n/a | 0.61 | 0.000 | ⚠️ All 5 fail: EOS token output |
Critical Findings: EOS Token Contamination
The most significant finding from this evaluation is a systematic <|im_end|> token contamination pattern. Across multiple task families, the model emits its end-of-sequence token (<|im_end|>) as literal visible text in its response content, rather than as a functional stop signal. This manifests differently depending on the task format: sometimes it produces complete failures, sometimes it coexists with correct output, and sometimes it interferes with answer extraction even when the underlying answer is correct.
Finding 1: json_multistep, EOS-Only Responses on 4/5 Cases
| Case | Difficulty | Result | Secs | Raw Output |
|---|---|---|---|---|
| ms_easy_01 | Easy | ❌ FAIL | 13.00 | <\|im_end\|> |
| ms_easy_02 | Easy | ❌ FAIL | 13.05 | <\|im_end\|> |
| ms_med_01 | Medium | ✅ PASS | 20.84 | Valid JSON plan |
| ms_med_02 | Medium | ❌ FAIL | 12.82 | <\|im_end\|> |
| ms_hard_01 | Hard | ❌ FAIL | 12.89 | <\|im_end\|> |
Four of five json_multistep cases produce <|im_end|> as their entire response. The model generates internal reasoning for 12 to 13 seconds, then emits only the EOS token: no plan, no checks, no final state. Every gating signal fails simultaneously (schema_ok=0, checks_consistent_ok=0, stop_semantics_ok=0, oracle_equiv_ok=0).
Only ms_med_01 produces a valid response (20.84 seconds, valid JSON plan, all gating signals pass). The one working case takes longer, suggesting the model successfully completes its reasoning chain on that input and emits a real response. The failing cases suggest the model abandons generation and terminates early via EOS for those specific prompts.
This is not a planning capability failure in the conventional sense: the model is not producing wrong plans. It is producing no plan at all on 4 of 5 cases.
Finding 2: MCQ, All 5 Cases Fail with EOS Output
Every MCQ case produces <|im_end|> as its raw output:
| Case | Secs | Detail | Raw |
|---|---|---|---|
| mcq_01 | 0.47 | invalid_choice raw='<\|im_end\|>' | <\|im_end\|> |
| mcq_02 | 0.62 | invalid_choice raw='<\|im_end\|>' | <\|im_end\|> |
| mcq_03 | 0.91 | invalid_choice raw='<\|im_end\|>' | <\|im_end\|> |
| mcq_04 | 0.16 | invalid_choice raw='<\|im_end\|>' | <\|im_end\|> |
| mcq_05 | 0.89 | invalid_choice raw='<\|im_end\|>' | <\|im_end\|> |
All five MCQ cases are answered in under one second with an EOS token. The model produces no choice letter, no reasoning, and no response, just termination. This results in a bucket_score average of 0.000 for MCQ across all five cases.
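One way to keep EOS-only responses like these from reaching downstream code is to classify them explicitly before any answer extraction. The sketch below is not part of quant_eval; the helper names `is_eos_only` and `first_usable_response` are hypothetical, and whether re-sampling actually recovers a usable answer from this model is an untested assumption.

```python
EOS_TOKEN = "<|im_end|>"  # literal token text reported in this card


def is_eos_only(raw: str, eos: str = EOS_TOKEN) -> bool:
    """Return True when the response holds nothing but EOS tokens and whitespace."""
    return raw.replace(eos, "").strip() == ""


def first_usable_response(candidates, eos: str = EOS_TOKEN):
    """Return the first candidate with real content (EOS text removed), else None.

    `candidates` can be any iterable of raw strings, for example a generator
    that calls the model lazily so that retries only happen when needed.
    """
    for raw in candidates:
        if not is_eos_only(raw, eos):
            return raw.replace(eos, "").strip()
    return None
```

Returning None rather than an empty string makes the silent failure explicit, so the caller has to decide whether to retry, fall back to another model, or surface an error.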
Finding 3: toolcall, Correct Arithmetic, Failed Extraction
toolcall passes at 1.000 (both stage-1 signals pass) but achieves bucket_score=0.000 on both cases due to final_mismatch. The raw outputs reveal what is happening:
| Case | Secs | Raw Output | Expected | Result |
|---|---|---|---|---|
| tool_01 | 12.27 | {"tool_name": "add", "args": {"a": 2, "b": 3}}<\|im_end\|> 5<\|im_end\|> | 5 | ❌ final_mismatch |
| tool_02 | 15.70 | {"tool_name": "add", "args": {"a": 10, "b": -4}}<\|im_end\|> 6<\|im_end\|> | 6 | ❌ final_mismatch |
The arithmetic is correct: add(2, 3) = 5 ✓ and add(10, -4) = 6 ✓. The model knows what to compute and computes it correctly. The failure is purely mechanical: the EOS token is embedded within the response string (5<|im_end|>), causing the answer extractor to capture the contaminated string rather than the clean numeric result.
The tool dispatch itself is valid: the stage-1 JSON parses correctly and validates against schema. This is a stop-token handling issue, not an arithmetic or tool-calling capability failure.
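A minimal sketch of a more tolerant extraction step, assuming the raw response arrives as a single string in which the tool call and the final answer are separated by the literal <|im_end|> text, as in the table. `split_toolcall_output` is a hypothetical helper for illustration, not the extractor used by quant_eval.

```python
import json

EOS_TOKEN = "<|im_end|>"  # literal token text reported in this card


def split_toolcall_output(raw: str, eos: str = EOS_TOKEN):
    """Split a raw response into (tool_call_dict, final_answer_text).

    Returns (None, None) when there is no content or the first segment
    is not valid JSON.
    """
    # Drop the empty segments left behind by trailing or repeated EOS text.
    segments = [part.strip() for part in raw.split(eos) if part.strip()]
    if not segments:
        return None, None
    try:
        tool_call = json.loads(segments[0])
    except json.JSONDecodeError:
        return None, None
    final_answer = segments[-1] if len(segments) > 1 else None
    return tool_call, final_answer


raw = '{"tool_name": "add", "args": {"a": 2, "b": 3}}<|im_end|> 5<|im_end|>'
tool_call, final_answer = split_toolcall_output(raw)
print(tool_call["tool_name"], final_answer)
```

Splitting on the token instead of deleting it keeps the tool call and the final answer as separate values, which a plain string replacement would run together.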
Finding 4: toolcall_only, Reasoning Prose Instead of JSON
Both toolcall_only cases produce natural language reasoning rather than the required JSON tool call:
- toolonly_01: "In your answer, include the result and a brief explanation of how you arrived at..."
- toolonly_02: "Thought: The user's instruction is to 'Use add tool to add 25 and 75.' Since this..."
Neither case produces a JSON object (detail=no_json_object). The model defaults to its natural reasoning-first format, generating explanatory prose, when asked for bare schema-only output. This is consistent with reasoning model architecture behavior observed across the series, but the failure is total here: not even the tool name is extracted.
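For callers that want to detect this condition themselves, a check along the following lines can separate "prose only" from "JSON embedded in prose". It is a sketch under assumptions: the `tool_name` and `args` keys are taken from the raw outputs shown in this card, `find_tool_call` is a hypothetical helper, and the strict toolcall_only check in quant_eval may be more demanding than this.

```python
import json


def find_tool_call(text: str):
    """Return the first JSON object in `text` with 'tool_name' and 'args', else None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at index `start`.
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and "tool_name" in obj and "args" in obj:
            return obj
        start = text.find("{", start + 1)
    return None
```

On the toolonly_02 excerpt above this returns None, which corresponds to the no_json_object outcome; prose that happens to wrap a valid tool call would still be recovered.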
What Passes and Why
The families that pass (json, fuzz, mixed_brief_json, stateful_followup) have output formats where the EOS token coexists with valid content without blocking extraction:
- json/fuzz: Each turn produces {"tool_name": "...", "args": {...}}<|im_end|>. The JSON block precedes the EOS token and is extracted cleanly before the termination.
- mixed_brief_json: Output format is ANSWER: 13 {"a": 4, "b": 9, "sum": 13}<|im_end|>. The answer and JSON precede the EOS token.
- stateful_followup: Multi-turn state JSON precedes EOS in each turn.
The common thread: when the required content appears before the EOS token, extraction succeeds. When the EOS token is the only content (json_multistep, MCQ), extraction fails. When it appears after a number that should be matched (toolcall final answer), extraction captures the contaminated string.
Signal-Level Diagnostics
json_multistep
| Signal | Rate | Tier |
|---|---|---|
| schema_ok | 0.200 | Tier-1 (gating) |
| checks_consistent_ok | 0.200 | Tier-1 (gating) |
| stop_semantics_ok | 0.200 | Tier-1 (gating) |
| oracle_equiv_ok | 0.200 | Tier-1 (gating) |
| final_consistent_ok | 0.000 | Tier-2 (tracked, non-gating) |
| final_match_reported | 0.000 | Tier-2 (tracked, non-gating) |
All four gating signals have identical rates (0.200 = 1/5 pass). On the four failing cases, every signal fails simultaneously: because the raw output is <|im_end|>, there is nothing to evaluate.
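The 0.200 rate follows directly from the per-case results in Finding 1 if a case counts as passing only when every Tier-1 signal passes. The sketch below assumes that rule; the actual quant_eval scoring code is proprietary and may differ, and `case_passes` and `pass_rate` are illustrative names.

```python
GATING_SIGNALS = (
    "schema_ok",
    "checks_consistent_ok",
    "stop_semantics_ok",
    "oracle_equiv_ok",
)


def case_passes(signals: dict) -> bool:
    """A case passes only if every gating (Tier-1) signal equals 1."""
    return all(signals.get(name, 0) == 1 for name in GATING_SIGNALS)


def pass_rate(cases) -> float:
    """Fraction of cases that pass all gating signals."""
    cases = list(cases)
    return sum(case_passes(case) for case in cases) / len(cases)


# Four EOS-only cases (all signals 0) and one valid plan (all signals 1).
failing = dict.fromkeys(GATING_SIGNALS, 0)
passing = dict.fromkeys(GATING_SIGNALS, 1)
cases = [failing, failing, passing, failing, failing]
print(f"{pass_rate(cases):.3f}")  # prints 0.200
```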
stateful_followup
| Signal | Rate |
|---|---|
| turn1_parse_ok | 1.000 |
| turn2_parse_ok | 1.000 |
| turn1_exact_match | 1.000 |
| turn2_exact_match | 1.000 |
toolcall_only
| Signal | Rate |
|---|---|
| tool_name_ok | 0.000 |
| args_ok | 0.000 |
mixed_brief_json
| Signal | Rate |
|---|---|
| answer_line_ok | 1.000 |
| json_parse_ok | 1.000 |
| schema_ok | 1.000 |
Recommended Use Cases
✅ Deploy with Confidence (Q4_K_M)
- Stateful multi-turn agents: Both turns parse and match exactly (1.000). The state update format is unaffected by EOS contamination.
- Hybrid brief + JSON outputs: mixed_brief_json passes at 1.000. The ANSWER: X {json} format works cleanly.
- Single-step structured JSON: json and fuzz both achieve bucket_score 10.000. Constraint-adherent placements on all cases.
⚠️ Use with Modified Output Handling (Q4_K_M)
- Scaffolded tool-calling: toolcall stage-1 passes at 1.000 and arithmetic is correct, but add an EOS token stripping step before final answer extraction. The capability is present; the stop token handling requires remediation.
❌ Not Recommended (Q4_K_M)
- Multi-step planning: 4/5 cases produce no output. Do not deploy for planning workflows without validated prompt engineering that prevents EOS-only responses.
- MCQ / single-choice extraction: All 5 cases fail with EOS-only output. This format is completely non-functional at Q4_K_M.
- Bare tool-call dispatch (schema-only): toolcall_only produces prose reasoning instead of JSON on both cases. Not viable without substantial prompt engineering.
- Any latency-sensitive application: At 25.84 sec/case average with fuzz cases peaking at 93.68 seconds, this model is not suitable for responsive workloads.
The Evaluation Report Pitch: In Data
The findings above are the practical argument for systematic pre-deployment evaluation. Consider what casual testing would show:
- Run a few json shelf-placement queries → all pass, bucket=10 ✅
- Run a stateful follow-up conversation → passes ✅
- Ask it to add two numbers → produces the right answer ✅
A developer doing informal validation would likely conclude this model works well for structured output and tool use. They would not know:
- That planning prompts produce silent EOS failures on 4/5 cases
- That every MCQ query terminates with an EOS token and no answer
- That the addition result is correct but structurally broken in a way that would fail any downstream string comparison
None of these failure modes are visible without running the model against a standardized behavioral test suite across all relevant task families. The quant_eval evaluation surfaces them in 42 rows of structured, verifiable, reproducible evidence.
This is what a Quantized Model Evaluation Report documents: not whether a model can answer a few test questions, but what its actual failure modes are across the task families that matter in production.
Hardware Requirements
| Configuration | VRAM Required | Recommended GPU |
|---|---|---|
| Q4_K_M (this repo) · GPU only | ~12 GB | T4 16 GB · RTX 3080/4080 · A10G |
| Q4_K_M · CPU offload fallback | 8 GB VRAM + 4 GB RAM | Any CUDA-capable GPU |
| F16 baseline (companion repo) | ~32 GB | A100 40 GB · 2× A10G |
Usage
Installation
pip install llama-cpp-python huggingface_hub
For GPU acceleration (CUDA):
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
Python: llama-cpp-python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Download the GGUF file from the Hub (cached locally after the first call).
model_path = hf_hub_download(
    repo_id="pbhappliedsystems/phi-4-reasoning-plus-gguf-Q4-K-M",
    filename="phi-4-reasoning-plus-gguf-Q4-K-M.gguf",
)

# n_gpu_layers=-1 offloads every layer to the GPU.
llm = Llama(
    model_path=model_path,
    n_ctx=8192,
    n_gpu_layers=-1,
    verbose=False,
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a precise assistant. Think through the problem carefully before responding.",
        },
        {
            "role": "user",
            "content": "Analyze the following data and return a structured JSON summary with keys: findings, confidence, recommendation.",
        },
    ],
    temperature=0.8,  # same sampling temperature as the other examples in this card
    max_tokens=2048,
)
print(response["choices"][0]["message"]["content"])
For tasks where EOS token contamination is a known risk, add a cleanup step before downstream processing:
import re

def strip_eos_tokens(text: str) -> str:
    """
    Strip EOS token contamination from Phi-4-reasoning-plus Q4_K_M outputs.
    quant_eval v7.21 finding: <|im_end|> appears as literal text in raw outputs,
    causing final_mismatch in toolcall and blocking extraction in other families.
    """
    return re.sub(r'<\|im_end\|>', '', text).strip()

# `response` is the chat completion returned in the previous example.
raw = response["choices"][0]["message"]["content"]
clean = strip_eos_tokens(raw)
print(clean)
For stateful multi-turn use (reliable at Q4_K_M):
# Stateful follow-up passes at 1.000: safe to deploy.
# Reuses `llm` and `strip_eos_tokens` from the examples above.
conversation = [
    {"role": "system", "content": "You are a stateful assistant tracking structured data."},
    {"role": "user", "content": "Initialize a counter at 1. Return JSON: {\"counter\": N}"},
]
response1 = llm.create_chat_completion(
    messages=conversation,
    temperature=0.8,
    max_tokens=256,
)
turn1 = response1["choices"][0]["message"]["content"]

# Feed the first answer back so the second turn can see the current state.
conversation.append({"role": "assistant", "content": turn1})
conversation.append({"role": "user", "content": "Increment the counter by 1."})
response2 = llm.create_chat_completion(
    messages=conversation,
    temperature=0.8,
    max_tokens=256,
)
print(strip_eos_tokens(response2["choices"][0]["message"]["content"]))
CLI: llama-cli
llama-cli \
--model phi-4-reasoning-plus-gguf-Q4-K-M.gguf \
--chat-template phi3 \
--system-prompt "You are a precise reasoning assistant." \
--prompt "Analyze the following and return structured JSON output." \
--n-predict 2048 \
--ctx-size 8192 \
--n-gpu-layers -1 \
--temp 0.8
For server deployment:
llama-server \
--model phi-4-reasoning-plus-gguf-Q4-K-M.gguf \
--chat-template phi3 \
--ctx-size 8192 \
--n-gpu-layers -1 \
--port 8080 \
--host 0.0.0.0
Query via the OpenAI-compatible API:
import re
from openai import OpenAI

# The local llama-server does not check the API key, but the client requires a value.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-required")
response = client.chat.completions.create(
    model="phi-4-reasoning-plus-gguf-Q4-K-M",
    messages=[{"role": "user", "content": "Your prompt here"}],
    temperature=0.8,
    timeout=120,
)
# Strip EOS contamination before downstream use
clean = re.sub(r'<\|im_end\|>', '', response.choices[0].message.content).strip()
print(clean)
Evaluation Artifacts
The full per-case evaluation CSV (comparison_results_v7_21_Phi_4_reasoning_plus_20260222_170914.csv) and rollup.json are published in this repository for independent verification. Every row in the CSV corresponds to a single inference run against a versioned test fixture, with the raw model output, all signal values, and the detail field documenting the failure reason.
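As a rough illustration of how the per-case CSV could be re-aggregated for independent verification, the sketch below averages inference time per family using only the standard library. The column names `family` and `secs` are assumptions (the real header names are not documented in this card), and the in-memory sample reuses the MCQ and toolcall timings from the tables above rather than the published file.

```python
import csv
import io
from collections import defaultdict


def mean_secs_by_family(csv_file, family_col="family", secs_col="secs"):
    """Average the per-case seconds for each task family in an open CSV file."""
    totals = defaultdict(lambda: [0.0, 0])  # family -> [sum of seconds, case count]
    for row in csv.DictReader(csv_file):
        entry = totals[row[family_col]]
        entry[0] += float(row[secs_col])
        entry[1] += 1
    return {family: total / count for family, (total, count) in totals.items()}


# Illustrative rows only; column names are hypothetical.
sample = io.StringIO(
    "family,secs\n"
    "mcq,0.47\nmcq,0.62\nmcq,0.91\nmcq,0.16\nmcq,0.89\n"
    "toolcall,12.27\ntoolcall,15.70\n"
)
means = mean_secs_by_family(sample)
print({family: round(value, 2) for family, value in means.items()})
```

For the published file, the same function can be called on `open(path, newline="")` once the two column names are adjusted to match its header.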
Artifact Provenance
| Artifact | Format | Size | SHA256 |
|---|---|---|---|
| phi-4-reasoning-plus-gguf-Q4-K-M.gguf | GGUF Q4_K_M | 9.05 GB | 2fe74424b03433d11ccf3f2ce8da404810fa7eb9a269135b1f14bf0d88566e4d |
| F16 (companion repo, not evaluated) | GGUF F16 | 29.3 GB | 6491352a2d3d756fdd4b1538f188bafafc8e940658f1771308ffdaeddd86a385 |
Both artifacts were produced from microsoft/Phi-4-reasoning-plus using a custom-built llama.cpp conversion and quantization pipeline developed by PBH Applied Systems.
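A downloaded file can be checked against the published digest with the Python standard library. This is a generic sketch rather than a PBH-provided tool: `sha256_of_file` and `matches_published_hash` are hypothetical helper names, and the expected value is the Q4_K_M hash from the table above.

```python
import hashlib

# Q4_K_M digest as published in the Artifact Provenance table.
EXPECTED_Q4_K_M = "2fe74424b03433d11ccf3f2ce8da404810fa7eb9a269135b1f14bf0d88566e4d"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so a multi-GB GGUF never sits fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected: str = EXPECTED_Q4_K_M) -> bool:
    """Return True when the file's SHA256 equals the published digest."""
    return sha256_of_file(path) == expected.lower()
```

Calling `matches_published_hash(model_path)` on the path returned by `hf_hub_download` confirms that the local copy is the evaluated artifact.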
Evaluation Methodology
quant_eval v7.21 is a proprietary behavioral evaluation harness developed by PBH Applied Systems. This run evaluated the Q4_K_M variant only using a dedicated runner (phi4_reasoning_plus_quant).
Fixture set: golden_oracle_fixtures_v7_21 (SHA256: 6d71a0b9147c079371b02a94f3c149eb78a6adc03dc16ff6833b964fbf4174f0)
| Family | Description | Pass Signals |
|---|---|---|
| fuzz | Property-based regression; structured placement correctness | schema_ok, constraints_ok |
| json | Single-step structured JSON with constraint rules | schema_ok, constraints_ok |
| json_multistep | Multi-step planning with self-check and oracle verification | schema_ok, checks_consistent_ok, stop_semantics_ok, oracle_equiv_ok |
| mcq | Multiple-choice extraction | choice_ok |
| stateful_followup | Two-turn state tracking; turn-2 correct given turn-1 | turn1/2_parse_ok, turn1/2_exact_match |
| mixed_brief_json | Hybrid: natural language answer + valid JSON block | answer_line_ok, json_parse_ok, schema_ok |
| toolcall | Tool call embedded in response; parse + schema validation | stage1_tool_parse_ok, stage1_tool_schema_ok |
| toolcall_only | Bare schema-only tool call; strict tool name + args check | tool_name_ok, args_ok |
Evaluation hardware: NVIDIA RTX 4090 (24 GB VRAM) · Evaluation date: February 22, 2026 · quant_eval seed: 42
🔬 About quant_eval & This Evaluation Series
quant_eval is a proprietary behavioral evaluation harness developed by PBH Applied Systems, LLC. It measures real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning, not perplexity or leaderboard proxies. Every model published under pbhappliedsystems has been independently evaluated using quant_eval before being recommended for any production role.
See it in action: Live AI Agent Demo. The demo runs production-style agent workflows powered by open-weight models selected through the quant_eval evaluation pipeline.
Need a deployment recommendation? Not sure which quantization level is right for your hardware, latency target, or agent type? → pbhappliedsystems.com
Evaluated and published by PBH Applied Systems, LLC · patrick@pbhappliedsystems.com
About PBH Applied Systems
PBH Applied Systems, LLC is an Oklahoma City-based applied machine learning and AI systems company specializing in production-grade model evaluation, quantization pipelines, agentic AI infrastructure, and scalable AI-driven application development. The organization operates with a strong emphasis on engineering rigor, reproducibility, and real-world deployment constraints, particularly in environments where performance, cost efficiency, and reliability must be balanced against available hardware and budget.
Founder: Patrick Hill, M.S.
PBH Applied Systems was founded by Patrick Hill, a Data Scientist and AI/ML Engineer with 10+ years of experience delivering advanced analytics, predictive modeling, and decision-support solutions across high-stakes operational environments. Patrick holds a Master of Science in Software Engineering with concentrations in Artificial Intelligence and Machine Learning (GPA: 4.0) and a B.S. in Business Finance.
Technical expertise spans:
- Languages & Data: Python, SQL, Linux, Pandas, NumPy, scikit-learn
- ML & Modeling: Supervised and unsupervised learning, neural networks, NLP, transformers, regression, classification, forecasting, and feature engineering
- AI/ML Frameworks: PyTorch, TensorFlow/Keras, HuggingFace Transformers, GGUF, llama.cpp, BitsAndBytes, PEFT, QLoRA
- Deployment & MLOps: Flask APIs, Docker, CI/CD pipelines, REST endpoints, streaming inference, version control
- Data Platforms: Jupyter, Databricks, Power BI, Matplotlib
- Quantization: GGUF conversion, Q4_K_M / Q5_K_M / Q8_0 strategies, adapter-per-model evaluation architecture
Published Author
Patrick is the author of Applied Machine Learning: Concepts, Tools, and Case Studies, a 1,200+ page practitioner-oriented textbook adopted as required reading for CSC 373: Machine Learning at the University of Advancing Technology.
Core Service Areas
1. LLM Optimization & Deployment: End-to-end GGUF conversion and quantization with custom llama.cpp pipelines and adapter-per-model architecture.
2. AI Evaluation Frameworks: Proprietary behavioral evaluation via quant_eval: per-family pass rates, failure cluster diagnostics, raw output evidence, and deployment recommendations.
3. Agentic AI Infrastructure: LlamaIndex ReAct agents, Flask orchestration, serverless GPU inference, full pipeline from model selection to production serving.
4. Scalable AI Application Development: Multimodal applications (quantized LLMs + Whisper + BLIP), Dockerized Flask APIs, advanced time-series forecasting with custom attention mechanisms, Bayesian hyperparameter optimization, and FinBERT sentiment fusion.
5. ML Pipeline Design & Analytics: Feature engineering, forward-chaining cross-validation, KPI dashboards, analytical governance at scale.
6. Model & Agent Cataloging: Structured catalog publishing with reproducible artifacts and clear performance tradeoff documentation.
Work With PBH Applied Systems
The findings documented in this card (EOS token contamination producing silent failures across planning and MCQ, correct arithmetic answers blocked by stop token handling, prose output where JSON was required) are precisely the kind of deployment risks that casual testing does not surface.
A developer running informal validation would see the json, fuzz, mixed, and stateful families pass. They would not see the 4/5 planning failures, the 5/5 MCQ failures, or the toolcall extraction bug. Those failures reach production silently without systematic evaluation.
This card is not an indictment of the Phi-4-reasoning-plus model. It is documentation that at Q4_K_M precision, with this build configuration, specific task categories fail in specific and reproducible ways. That is information a team needs before deployment, not after.
Book a Scoping Call: Discuss your model selection, quantization evaluation needs, or deployment architecture directly with Patrick.
Request an Evaluation Report: A full quant_eval behavioral audit for your target model(s): per-family pass rates, failure cluster diagnostics, raw output evidence, and a deployment recommendation. Engagements from $2,500.
Connect
| Channel | Contact |
|---|---|
| Website | pbhappliedsystems.com |
| Email | patrick@pbhappliedsystems.com |
| LinkedIn | PBH Applied Systems, LLC |
| YouTube | @pbhappliedsystems |
| Instagram | @pbhappliedsystems |
| Facebook | pbhappliedsystems |
License
This GGUF repository inherits the license of the base model:
MIT (microsoft/Phi-4-reasoning-plus)
The quant_eval evaluation methodology, fixture set, and scoring framework are proprietary to PBH Applied Systems, LLC and are not included in this repository.
GGUF conversion, quantization, and behavioral evaluation performed by PBH Applied Systems, LLC · quant_eval v7.21 · Run ID: 20260222_170914