Phi-4-reasoning-plus · GGUF Q4_K_M

Quantized, converted, and evaluated by PBH Applied Systems, LLC — Applied AI/ML Consulting · LLM Optimization & Deployment · Quantized AI Infrastructure

🔬 This repository is part of a production-oriented evaluation series. Every model published under pbhappliedsystems has been independently evaluated using quant_eval v7.21 — a proprietary behavioral evaluation harness developed by PBH Applied Systems. Scores measure real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning families — not perplexity or benchmark leaderboard proxies.

⚠️ This card documents significant evaluation findings. Phi-4-reasoning-plus Q4_K_M produces the lowest reasoning (0.365) and coherence (0.492) scores in the PBH Applied Systems evaluated series. The evaluation surfaces a systematic EOS token contamination pattern that causes complete failures across planning, MCQ, and tool dispatch families. These findings are documented in full below — with raw output evidence — as a demonstration of what rigorous pre-deployment evaluation surfaces that casual testing does not.

Try This Model in the Live AI Agent Demo

Launch the PBH Applied Systems AI Agent Demo →

This model is part of the PBH Applied Systems live AI Agent Demo, where visitors can test evaluated quantized open-weight models across production-style agent workflows: reasoning and analysis, document intelligence, and code automation.

The demo uses quant_eval results to show how model selection changes by task. A model that performs well for long-context document analysis may not be the best choice for hard multi-step planning, strict tool-use workflows, or production code generation. Each deployed model is evaluated for practical agent behavior, including coherence, instruction following, reasoning, task completion, structured output reliability, tool-use behavior, and quantization impact.

For this repository, the Q4_K_M variant represents the deployment-focused model: smaller, faster, and more cost-efficient than the F16 baseline. The evaluation results below explain where this quantized model behaves reliably, where it fails, and what guardrails are recommended before production deployment. Because no F16 baseline was evaluated in this run, the failures documented here cannot be attributed to quantization itself.

The purpose of the demo is simple: let prospects test the same kind of evaluated quantized models that PBH Applied Systems deploys for real agentic AI systems.

Model Description

This repository contains the 4-bit quantized (Q4_K_M) GGUF of microsoft/Phi-4-reasoning-plus, a 14-billion parameter reasoning-tuned model from Microsoft. Phi-4-reasoning-plus is a chain-of-thought reasoning variant of the Phi-4 architecture, trained to perform extended internal deliberation before generating output.

Important evaluation scope note: This evaluation was conducted on the Q4_K_M variant only, using a custom runner (phi4_reasoning_plus_quant). The full-precision F16 GGUF was produced (29.3 GB, SHA256 documented below) but was not evaluated in this run. Consequently, no F16 vs. Q4_K_M delta comparison is available for this model. The results below reflect Q4_K_M performance in isolation. Whether an F16 baseline would perform substantially differently is not known from this evaluation — but what is known is that this model at Q4_K_M precision has significant, measurable production deployment risks.

The full-precision F16 GGUF is published separately at pbhappliedsystems/phi-4-reasoning-plus-gguf-F16.

Key Characteristics

Parameters: 14B
Architecture: Reasoning (extended chain-of-thought)
Format: GGUF Q4_K_M
File size: 9.05 GB
SHA256: 2fe74424b03433d11ccf3f2ce8da404810fa7eb9a269135b1f14bf0d88566e4d
Minimum VRAM (GPU inference): ~12 GB (T4 class or better)
Recommended GPU tier: NVIDIA T4 (16 GB) · RTX 3080/4080 · A10G
Context window: 16,384 tokens (per base model specification)
Inference speed (eval hardware): avg 25.84 sec/case on RTX 4090
License: MIT

PBH Applied Systems Evaluation — quant_eval v7.21

Evaluation conducted by PBH Applied Systems, LLC using quant_eval v7.21
Run ID: 20260222_170914 · Fixtures: golden_oracle_fixtures_v7_21 (SHA256: 6d71a0b9147c...) · Seed: 42
Hardware: NVIDIA RTX 4090 · Runner: phi4_reasoning_plus_quant (Q4_K_M only) · Total rows: 42

No F16 baseline: This run evaluated the Q4_K_M variant only. Scores are not comparable to an F16 baseline because no F16 evaluation was performed. They reflect Q4_K_M performance on a standardized behavioral fixture set, comparable across the PBH Applied Systems evaluated series.

Aggregate Scores (Q4_K_M)

Scores are normalized to [0.0 – 1.0]. Higher is better.

Dimension	Score	Series Context
Task Completion	0.5976	Below series average
Reasoning	0.3648	Lowest in series
Coherence	0.4921	Lowest in series
Instruction Following	0.8658	Within normal range
Avg inference time	25.84 sec/case	Consistent with reasoning architecture

Per-Family Pass Rates (`phi4_reasoning_plus_quant`)

Family	N	Pass Rate	Avg Secs	Bucket Score	Notes
json_multistep	5	0.200	14.52	0.600	4/5 fail — EOS token output
stateful_followup	2	1.000	22.89	2.000	Both turns exact match
toolcall_only	2	0.000	19.95	0.000	Prose output instead of JSON
mixed_brief_json	2	1.000	17.46	2.000	Both pass cleanly
toolcall	2	1.000	13.98	0.000	⚠️ Stage-1 passes; final_mismatch on both — see below
json	4	n/a	42.38	10.000	All pass
fuzz	20	n/a	34.57	10.000	All pass (7.71–93.68s range)
mcq	5	n/a	0.61	0.000	⚠️ All 5 fail — EOS token output
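
The aggregate latency figure can be cross-checked against the per-family table above. The sketch below recomputes a case-weighted mean of the Avg Secs column. It assumes the aggregate is a plain per-case average, since the harness's actual aggregation is not published, and the names `FAMILY_TIMINGS` and `weighted_avg_secs` are illustrative only.

```python
# (family, N, avg_secs) copied from the per-family table above.
FAMILY_TIMINGS = [
    ("json_multistep", 5, 14.52),
    ("stateful_followup", 2, 22.89),
    ("toolcall_only", 2, 19.95),
    ("mixed_brief_json", 2, 17.46),
    ("toolcall", 2, 13.98),
    ("json", 4, 42.38),
    ("fuzz", 20, 34.57),
    ("mcq", 5, 0.61),
]


def weighted_avg_secs(rows):
    """Return (total cases, case-weighted mean latency in seconds)."""
    total_cases = sum(n for _, n, _ in rows)
    total_secs = sum(n * avg for _, n, avg in rows)
    return total_cases, total_secs / total_cases


total_cases, avg_secs = weighted_avg_secs(FAMILY_TIMINGS)
print(total_cases, round(avg_secs, 2))  # prints: 42 25.84
```

Under that assumption the table reproduces both the 42-row total and the 25.84 sec/case average reported in the aggregate scores.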

Critical Findings — EOS Token Contamination

The most significant finding from this evaluation is a systematic <|im_end|> token contamination pattern. Across multiple task families, the model emits its end-of-sequence token (<|im_end|>) as literal visible text in its response content, rather than as a functional stop signal. This manifests differently depending on the task format — sometimes producing complete failures, sometimes coexisting with correct output, and sometimes interfering with answer extraction even when the underlying answer is correct.

Finding 1: json_multistep — EOS-Only Responses on 4/5 Cases

Case	Difficulty	Result	Secs	Raw Output
ms_easy_01	Easy	❌ FAIL	13.00	`<|im_end|>`
ms_easy_02	Easy	❌ FAIL	13.05	`<|im_end|>`
ms_med_01	Medium	✅ PASS	20.84	Valid JSON plan
ms_med_02	Medium	❌ FAIL	12.82	`<|im_end|>`
ms_hard_01	Hard	❌ FAIL	12.89	`<|im_end|>`

Four of five json_multistep cases produce <|im_end|> as their entire response. The model generates internal reasoning for 12–13 seconds, then emits only the EOS token — no plan, no checks, no final state. Every gating signal fails simultaneously (schema_ok=0, checks_consistent_ok=0, stop_semantics_ok=0, oracle_equiv_ok=0).

Only ms_med_01 produces a valid response (20.84 seconds, valid JSON plan, all signals pass). The one working case takes longer, suggesting the model successfully completes its reasoning chain on that input and emits a real response. The failing cases suggest the model abandons generation and terminates early via EOS for those specific prompts.

This is not a planning capability failure in the conventional sense — the model is not producing wrong plans. It is producing no plan at all on 4 of 5 cases.

Finding 2: MCQ — All 5 Cases Fail with EOS Output

Every MCQ case produces <|im_end|> as its raw output:

Case	Secs	Detail	Raw
mcq_01	0.47	`invalid_choice raw='<|im_end|>'`	`<|im_end|>`
mcq_02	0.62	`invalid_choice raw='<|im_end|>'`	`<|im_end|>`
mcq_03	0.91	`invalid_choice raw='<|im_end|>'`	`<|im_end|>`
mcq_04	0.16	`invalid_choice raw='<|im_end|>'`	`<|im_end|>`
mcq_05	0.89	`invalid_choice raw='<|im_end|>'`	`<|im_end|>`

All five MCQ cases terminate in under one second (0.16–0.91 s) with only an EOS token. The model produces no choice letter and no reasoning, so the MCQ bucket_score is 0.000.
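
The `invalid_choice` detail is what a strict choice validator would report for these outputs. A minimal sketch follows, assuming answers are single letters A through D. The fixture's real choice set and the harness's validator are not published, so `VALID_CHOICES` and `check_choice` are hypothetical.

```python
VALID_CHOICES = {"A", "B", "C", "D"}  # assumed choice set, not taken from the fixture


def check_choice(raw: str) -> tuple[bool, str]:
    """Return (choice_ok, detail) for a raw MCQ response."""
    candidate = raw.strip().upper()
    if candidate in VALID_CHOICES:
        return True, f"choice={candidate}"
    # Mirror the detail format shown in the MCQ table.
    return False, f"invalid_choice raw={raw!r}"


print(check_choice("<|im_end|>"))  # (False, "invalid_choice raw='<|im_end|>'")
print(check_choice(" b "))         # (True, 'choice=B')
```

A validator of this shape has nothing to normalize when the response is only the EOS token, which is why stripping the token afterwards does not rescue the MCQ family.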

Finding 3: toolcall — Correct Arithmetic, Failed Extraction

toolcall passes at 1.000 (both stage-1 signals pass) but achieves bucket_score=0.000 on both cases due to final_mismatch. The raw outputs reveal what is happening:

Case	Secs	Raw Output	Expected	Result
tool_01	12.27	`{"tool_name": "add", "args": {"a": 2, "b": 3}}<|im_end|> 5<|im_end|>`	`5`	❌ final_mismatch
tool_02	15.70	`{"tool_name": "add", "args": {"a": 10, "b": -4}}<|im_end|> 6<|im_end|>`	`6`	❌ final_mismatch

The arithmetic is correct. add(2, 3) = 5 ✓ and add(10, -4) = 6 ✓. The model knows what to compute and computes it correctly. The failure is purely mechanical: the EOS token is embedded within the response string (5<|im_end|>), causing the answer extractor to capture the contaminated string rather than the clean numeric result.

The tool dispatch itself is valid — the stage-1 JSON parses correctly and validates against schema. This is a stop-token handling issue, not an arithmetic or tool-calling capability failure.
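
One way to recover both parts of such a response is to split on the literal token before parsing. The sketch below uses the raw outputs from the table above. `split_toolcall_output` is a hypothetical helper, not part of quant_eval, and it assumes the tool-call JSON is always the first segment and the final answer is the last.

```python
import json

EOS = "<|im_end|>"


def split_toolcall_output(raw: str):
    """Split a contaminated response into (tool_call dict, final answer string)."""
    # Drop empty pieces left behind by leading or trailing EOS tokens.
    segments = [s.strip() for s in raw.split(EOS) if s.strip()]
    if not segments:
        return None, None  # EOS-only response: nothing to recover
    tool_call = json.loads(segments[0])
    final = segments[-1] if len(segments) > 1 else None
    return tool_call, final


raw = '{"tool_name": "add", "args": {"a": 2, "b": 3}}<|im_end|> 5<|im_end|>'
tool_call, final = split_toolcall_output(raw)
print(tool_call["tool_name"], tool_call["args"], final)  # add {'a': 2, 'b': 3} 5
```

With the token removed, the final answer compares equal to the expected string `5`, which is the comparison that fails on the contaminated output.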

Finding 4: toolcall_only — Reasoning Prose Instead of JSON

Both toolcall_only cases produce natural language reasoning rather than the required JSON tool call:

toolonly_01: "In your answer, include the result and a brief explanation of how you arrived at..."
toolonly_02: "Thought: The user's instruction is to 'Use add tool to add 25 and 75.' Since this..."

Neither case produces a JSON object (detail=no_json_object). The model defaults to its natural reasoning-first format — generating explanatory prose — when asked for bare schema-only output. This is consistent with reasoning model architecture behavior observed across the series, but the failure is total here: not even the tool name is extracted.
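
A `no_json_object` outcome can be reproduced with a small scanner that looks for the first parseable JSON object in a response. This is a sketch of the general technique, not the harness's extractor. `first_json_object` is a hypothetical name, and the prose string is the truncated excerpt quoted above, kept as-is.

```python
import json


def first_json_object(text: str):
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    idx = text.find("{")
    while idx != -1:
        try:
            obj, _end = decoder.raw_decode(text, idx)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON at this brace; try the next one
        idx = text.find("{", idx + 1)
    return None


prose = "Thought: The user's instruction is to 'Use add tool to add 25 and 75.' Since this..."
tool = '{"tool_name": "add", "args": {"a": 2, "b": 3}}<|im_end|> 5<|im_end|>'
print(first_json_object(prose))  # None
print(first_json_object(tool))   # {'tool_name': 'add', 'args': {'a': 2, 'b': 3}}
```

The scanner tolerates trailing EOS text after a JSON object but returns nothing for pure prose, which matches the split between the passing families and `toolcall_only`.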

What Passes and Why

The families that pass — json, fuzz, mixed_brief_json, stateful_followup — have output formats where the EOS token coexists with valid content without blocking extraction:

json/fuzz: Each turn produces {"tool_name": "...", "args": {...}}<|im_end|> — the JSON block precedes the EOS token and is extracted cleanly before the termination
mixed_brief_json: Output format is ANSWER: 13 {"a": 4, "b": 9, "sum": 13}<|im_end|> — the answer and JSON precede the EOS token
stateful_followup: Multi-turn state JSON precedes EOS in each turn

The common thread: when the required content appears before the EOS token, extraction succeeds. When the EOS token is the only content (json_multistep, MCQ), extraction fails. When it appears after a number that should be matched (toolcall final answer), extraction captures the contaminated string.
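
The three patterns can be told apart mechanically before any extraction is attempted. The sketch below is illustrative: the label names and `classify_output` are assumptions introduced here, and the sample strings are taken from the raw outputs quoted in this card.

```python
EOS = "<|im_end|>"


def classify_output(raw: str) -> str:
    """Label how the literal EOS token appears in a raw response."""
    if EOS not in raw:
        return "clean"
    stripped = raw.strip()
    segments = [s.strip() for s in stripped.split(EOS) if s.strip()]
    if not segments:
        return "eos_only"  # e.g. json_multistep and mcq failures
    if len(segments) == 1 and stripped.endswith(EOS) and not stripped.startswith(EOS):
        return "content_then_eos"  # e.g. json, fuzz, mixed_brief_json
    return "interleaved"  # e.g. toolcall: JSON, EOS, answer, EOS


samples = [
    "<|im_end|>",
    'ANSWER: 13 {"a": 4, "b": 9, "sum": 13}<|im_end|>',
    '{"tool_name": "add", "args": {"a": 2, "b": 3}}<|im_end|> 5<|im_end|>',
]
for sample in samples:
    print(classify_output(sample))
# prints: eos_only, content_then_eos, interleaved (one per line)
```

Logging a label like this per response would make the EOS-only failures visible at serving time instead of surfacing later as empty or unparseable results.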

Signal-Level Diagnostics

json_multistep

Signal	Rate	Tier
schema_ok	0.200	Tier-1 (gating)
checks_consistent_ok	0.200	Tier-1 (gating)
stop_semantics_ok	0.200	Tier-1 (gating)
oracle_equiv_ok	0.200	Tier-1 (gating)
final_consistent_ok	0.000	Tier-2 (tracked, non-gating)
final_match_reported	0.000	Tier-2 (tracked, non-gating)

All four gating signals have identical rates (0.200 = 1/5 pass). On the four failing cases, every signal fails simultaneously — because the raw output is <|im_end|>, there is nothing to evaluate.
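
The identical rates follow directly if a case passes only when every Tier-1 signal is set. The sketch below assumes that AND-style gating, which the card implies but the proprietary harness does not publish. The per-case values come from Finding 1, where the four failing cases have all gating signals at 0 and ms_med_01 has all of them passing.

```python
GATING = ("schema_ok", "checks_consistent_ok", "stop_semantics_ok", "oracle_equiv_ok")

# Per-case gating signals as described in Finding 1.
CASES = {
    "ms_easy_01": dict.fromkeys(GATING, 0),
    "ms_easy_02": dict.fromkeys(GATING, 0),
    "ms_med_01": dict.fromkeys(GATING, 1),
    "ms_med_02": dict.fromkeys(GATING, 0),
    "ms_hard_01": dict.fromkeys(GATING, 0),
}


def case_passes(signals: dict) -> bool:
    """Assumed gating rule: every Tier-1 signal must equal 1."""
    return all(signals[name] == 1 for name in GATING)


pass_rate = sum(case_passes(s) for s in CASES.values()) / len(CASES)
signal_rates = {name: sum(s[name] for s in CASES.values()) / len(CASES) for name in GATING}
print(pass_rate)     # 0.2
print(signal_rates)  # every signal at 0.2
```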

stateful_followup

Signal	Rate
turn1_parse_ok	1.000
turn2_parse_ok	1.000
turn1_exact_match	1.000
turn2_exact_match	1.000

toolcall_only

Signal	Rate
tool_name_ok	0.000
args_ok	0.000

mixed_brief_json

Signal	Rate
answer_line_ok	1.000
json_parse_ok	1.000
schema_ok	1.000

Recommended Use Cases

✅ Deploy with Confidence (Q4_K_M)

Stateful multi-turn agents — Both turns parse and match exactly (1.000). The state update format is unaffected by EOS contamination.
Hybrid brief + JSON outputs — mixed_brief_json passes at 1.000. The ANSWER: X {json} format works cleanly.
Single-step structured JSON — json and fuzz both achieve bucket_score 10.000. Constraint-adherent placements on all cases.

⚠️ Use with Modified Output Handling (Q4_K_M)

Scaffolded tool-calling — toolcall stage-1 passes at 1.000 and arithmetic is correct, but add an EOS token stripping step before final answer extraction. The capability is present; the stop token handling requires remediation.

❌ Not Recommended (Q4_K_M)

Multi-step planning — 4/5 cases produce no output. Do not deploy for planning workflows without validated prompt engineering that prevents EOS-only responses.
MCQ / single-choice extraction — All 5 cases fail with EOS-only output. This format is completely non-functional at Q4_K_M.
Bare tool-call dispatch (schema-only) — toolcall_only produces prose reasoning instead of JSON on both cases. Not viable without substantial prompt engineering.
Any latency-sensitive application — At 25.84 sec/case average with fuzz cases peaking at 93.68 seconds, this model is not suitable for responsive workloads.

The Evaluation Report Pitch — In Data

The findings above are the practical argument for systematic pre-deployment evaluation. Consider what casual testing would show:

Run a few json shelf-placement queries → all pass, bucket=10 ✓
Run a stateful follow-up conversation → passes ✓
Ask it to add two numbers → produces the right answer ✓

A developer doing informal validation would likely conclude this model works well for structured output and tool use. They would not know:

That planning prompts produce silent EOS failures on 4/5 cases
That every MCQ query terminates with an EOS token and no answer
That the addition result is correct but structurally broken in a way that would fail any downstream string comparison

None of these failure modes are visible without running the model against a standardized behavioral test suite across all relevant task families. The quant_eval evaluation surfaces them in 42 rows of structured, verifiable, reproducible evidence.

This is what a Quantized Model Evaluation Report documents — not whether a model can answer a few test questions, but what its actual failure modes are across the task families that matter in production.

Hardware Requirements

Configuration	VRAM Required	Recommended GPU
Q4_K_M (this repo) · GPU only	~12 GB	T4 16 GB · RTX 3080/4080 · A10G
Q4_K_M · CPU offload fallback	8 GB VRAM + 4 GB RAM	Any CUDA-capable GPU
F16 baseline (companion repo)	~32 GB	A100 40 GB · 2× A10G

Usage

Installation

pip install llama-cpp-python huggingface_hub

For GPU acceleration (CUDA):

CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

Python — llama-cpp-python

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="pbhappliedsystems/phi-4-reasoning-plus-gguf-Q4-K-M",
    filename="phi-4-reasoning-plus-gguf-Q4-K-M.gguf"
)

llm = Llama(
    model_path=model_path,
    n_ctx=8192,
    n_gpu_layers=-1,
    verbose=False,
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a precise assistant. Think through the problem carefully before responding."
        },
        {
            "role": "user",
            "content": "Analyze the following data and return a structured JSON summary with keys: findings, confidence, recommendation."
        }
    ],
    temperature=0.15,
    max_tokens=2048,
)

print(response["choices"][0]["message"]["content"])

For tasks where EOS token contamination is a known risk, add a cleanup step before downstream processing:

import re

def strip_eos_tokens(text: str) -> str:
    """
    Strip EOS token contamination from Phi-4-reasoning-plus Q4_K_M outputs.
    quant_eval v7.21 finding: <|im_end|> appears as literal text in raw outputs,
    causing final_mismatch in toolcall and blocking extraction in other families.
    """
    return re.sub(r'<\|im_end\|>', '', text).strip()

raw = response["choices"][0]["message"]["content"]
clean = strip_eos_tokens(raw)
print(clean)

For stateful multi-turn use (reliable at Q4_K_M):

# Stateful follow-up passes at 1.000 — safe to deploy
conversation = [
    {"role": "system", "content": "You are a stateful assistant tracking structured data."},
    {"role": "user", "content": "Initialize a counter at 1. Return JSON: {\"counter\": N}"},
]

response1 = llm.create_chat_completion(
    messages=conversation,
    temperature=0.8,
    max_tokens=256,
)
turn1 = response1["choices"][0]["message"]["content"]
conversation.append({"role": "assistant", "content": turn1})
conversation.append({"role": "user", "content": "Increment the counter by 1."})

response2 = llm.create_chat_completion(
    messages=conversation,
    temperature=0.8,
    max_tokens=256,
)
print(strip_eos_tokens(response2["choices"][0]["message"]["content"]))

CLI — llama-cli

llama-cli \
  --model phi-4-reasoning-plus-gguf-Q4-K-M.gguf \
  --chat-template phi3 \
  --system-prompt "You are a precise reasoning assistant." \
  --prompt "Analyze the following and return structured JSON output." \
  --n-predict 2048 \
  --ctx-size 8192 \
  --n-gpu-layers -1 \
  --temp 0.8

For server deployment:

llama-server \
  --model phi-4-reasoning-plus-gguf-Q4-K-M.gguf \
  --chat-template phi3 \
  --ctx-size 8192 \
  --n-gpu-layers -1 \
  --port 8080 \
  --host 0.0.0.0

Query via the OpenAI-compatible API:

import re

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-required")

response = client.chat.completions.create(
    model="phi-4-reasoning-plus-gguf-Q4-K-M",
    messages=[{"role": "user", "content": "Your prompt here"}],
    temperature=0.8,
    timeout=120,
)
# Strip EOS contamination before downstream use
clean = re.sub(r'<\|im_end\|>', '', response.choices[0].message.content).strip()
print(clean)

Evaluation Artifacts

The full per-case evaluation CSV (comparison_results_v7_21_Phi_4_reasoning_plus_20260222_170914.csv) and rollup.json are published in this repository for independent verification. Every row in the CSV corresponds to a single inference run against a versioned test fixture, with the raw model output, all signal values, and the detail field documenting the failure reason.

Artifact Provenance

Artifact	Format	Size	SHA256
`phi-4-reasoning-plus-gguf-Q4-K-M.gguf`	GGUF Q4_K_M	9.05 GB	`2fe74424b03433d11ccf3f2ce8da404810fa7eb9a269135b1f14bf0d88566e4d`
F16 (companion repo, not evaluated)	GGUF F16	29.3 GB	`6491352a2d3d756fdd4b1538f188bafafc8e940658f1771308ffdaeddd86a385`

Both artifacts were produced from microsoft/Phi-4-reasoning-plus using a custom-built llama.cpp conversion and quantization pipeline developed by PBH Applied Systems.
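
A downloaded file can be checked against the published digest before loading it. The following is a minimal sketch using only the standard library. The helper names are illustrative, and the expected value is the Q4_K_M SHA256 from the table above.

```python
import hashlib

# Published SHA256 for phi-4-reasoning-plus-gguf-Q4-K-M.gguf (from the provenance table).
EXPECTED_SHA256 = "2fe74424b03433d11ccf3f2ce8da404810fa7eb9a269135b1f14bf0d88566e4d"


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in 1 MiB chunks so a 9 GB model never sits in memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_gguf(path, expected=EXPECTED_SHA256):
    """Return True when the file's digest matches the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

For example, `verify_gguf(model_path)` could be called on the path returned by `hf_hub_download` in the Usage section, before constructing `Llama`.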

Evaluation Methodology

quant_eval v7.21 is a proprietary behavioral evaluation harness developed by PBH Applied Systems. This run evaluated the Q4_K_M variant only using a dedicated runner (phi4_reasoning_plus_quant).

Fixture set: golden_oracle_fixtures_v7_21 (SHA256: 6d71a0b9147c079371b02a94f3c149eb78a6adc03dc16ff6833b964fbf4174f0)

Family	Description	Pass Signals
`fuzz`	Property-based regression; structured placement correctness	schema_ok, constraints_ok
`json`	Single-step structured JSON with constraint rules	schema_ok, constraints_ok
`json_multistep`	Multi-step planning with self-check and oracle verification	schema_ok, checks_consistent_ok, stop_semantics_ok, oracle_equiv_ok
`mcq`	Multiple-choice extraction	choice_ok
`stateful_followup`	Two-turn state tracking; turn-2 correct given turn-1	turn1/2_parse_ok, turn1/2_exact_match
`mixed_brief_json`	Hybrid: natural language answer + valid JSON block	answer_line_ok, json_parse_ok, schema_ok
`toolcall`	Tool call embedded in response; parse + schema validation	stage1_tool_parse_ok, stage1_tool_schema_ok
`toolcall_only`	Bare schema-only tool call; strict tool name + args check	tool_name_ok, args_ok

Evaluation hardware: NVIDIA RTX 4090 (24 GB VRAM)
Evaluation date: February 22, 2026
quant_eval seed: 42

🔬 About quant_eval & This Evaluation Series

quant_eval is a proprietary behavioral evaluation harness developed by PBH Applied Systems, LLC. It measures real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning — not perplexity or leaderboard proxies. Every model published under pbhappliedsystems has been independently evaluated using quant_eval before being recommended for any production role.

See it in action: Live AI Agent Demo → The demo runs production-style agent workflows powered by open-weight models selected through the quant_eval evaluation pipeline.

Need a deployment recommendation? Not sure which quantization level is right for your hardware, latency target, or agent type? → pbhappliedsystems.com

Evaluated and published by PBH Applied Systems, LLC · patrick@pbhappliedsystems.com

About PBH Applied Systems

PBH Applied Systems, LLC is an Oklahoma City–based applied machine learning and AI systems company specializing in production-grade model evaluation, quantization pipelines, agentic AI infrastructure, and scalable AI-driven application development. The organization operates with a strong emphasis on engineering rigor, reproducibility, and real-world deployment constraints — particularly in environments where performance, cost efficiency, and reliability must be balanced against available hardware and budget.

Founder — Patrick Hill, M.S.

PBH Applied Systems was founded by Patrick Hill, a Data Scientist and AI/ML Engineer with 10+ years of experience delivering advanced analytics, predictive modeling, and decision-support solutions across high-stakes operational environments. Patrick holds a Master of Science in Software Engineering with concentrations in Artificial Intelligence and Machine Learning (GPA: 4.0) and a B.S. in Business Finance.

Technical expertise spans:

Languages & Data: Python, SQL, Linux, Pandas, NumPy, scikit-learn
ML & Modeling: Supervised and unsupervised learning, neural networks, NLP, transformers, regression, classification, forecasting, and feature engineering
AI/ML Frameworks: PyTorch, TensorFlow/Keras, HuggingFace Transformers, GGUF, llama.cpp, BitsAndBytes, PEFT, QLoRA
Deployment & MLOps: Flask APIs, Docker, CI/CD pipelines, REST endpoints, streaming inference, version control
Data Platforms: Jupyter, Databricks, Power BI, Matplotlib
Quantization: GGUF conversion, Q4_K_M / Q5_K_M / Q8_0 strategies, adapter-per-model evaluation architecture

Published Author

Patrick is the author of Applied Machine Learning: Concepts, Tools, and Case Studies — a 1,200+ page practitioner-oriented textbook adopted as required reading for CSC 373 – Machine Learning at the University of Advancing Technology.

Core Service Areas

1. LLM Optimization & Deployment — End-to-end GGUF conversion and quantization with custom llama.cpp pipelines and adapter-per-model architecture.

2. AI Evaluation Frameworks — Proprietary behavioral evaluation via quant_eval: per-family pass rates, failure cluster diagnostics, raw output evidence, and deployment recommendations.

3. Agentic AI Infrastructure — LlamaIndex ReAct agents, Flask orchestration, serverless GPU inference, full pipeline from model selection to production serving.

4. Scalable AI Application Development — Multimodal applications (quantized LLMs + Whisper + BLIP), Dockerized Flask APIs, advanced time-series forecasting with custom attention mechanisms, Bayesian hyperparameter optimization, and FinBERT sentiment fusion.

5. ML Pipeline Design & Analytics — Feature engineering, forward-chaining cross-validation, KPI dashboards, analytical governance at scale.

6. Model & Agent Cataloging — Structured catalog publishing with reproducible artifacts and clear performance tradeoff documentation.

📞 Work With PBH Applied Systems

The findings documented in this card — EOS token contamination producing silent failures across planning and MCQ, correct arithmetic answers blocked by stop token handling, prose output where JSON was required — are precisely the kind of deployment risks that casual testing does not surface.

A developer running informal validation would see the json, fuzz, mixed, and stateful families pass. They would not see the 4/5 planning failures, the 5/5 MCQ failures, or the toolcall extraction bug. Those failures reach production silently without systematic evaluation.

This card is not an indictment of the Phi-4-reasoning-plus model. It is documentation that at Q4_K_M precision, with this build configuration, specific task categories fail in specific and reproducible ways. That is information a team needs before deployment — not after.

👉 Book a Scoping Call — Discuss your model selection, quantization evaluation needs, or deployment architecture directly with Patrick.

👉 Request an Evaluation Report — A full quant_eval behavioral audit for your target model(s): per-family pass rates, failure cluster diagnostics, raw output evidence, and a deployment recommendation. Engagements from $2,500.

Connect


🌐 Website	pbhappliedsystems.com
📧 Email	patrick@pbhappliedsystems.com
💼 LinkedIn	PBH Applied Systems, LLC
▶️ YouTube	@pbhappliedsystems
📸 Instagram	@pbhappliedsystems
👍 Facebook	pbhappliedsystems

License

This GGUF repository inherits the license of the base model: MIT — microsoft/Phi-4-reasoning-plus

The quant_eval evaluation methodology, fixture set, and scoring framework are proprietary to PBH Applied Systems, LLC and are not included in this repository.

GGUF conversion, quantization, and behavioral evaluation performed by PBH Applied Systems, LLC · quant_eval v7.21 · Run ID: 20260222_170914
