If it were purely a repository problem, the failures wouldn’t have suddenly become more frequent recently…
Given that there have been several reports of failures related to “scale-from-zero wake-up,” a bug on the HF side is also a possibility.
My view is:
Most likely: a problem in the scale-from-zero wake-up path or the platform scheduler / orchestration layer.
Second most likely: a custom-container readiness or port-alignment problem.
Less likely: a dependency or image drift problem.
Least likely: a recent breaking spec change. (Hugging Face)
The background
When a Hugging Face Inference Endpoint starts, several things must succeed in order:
- the platform must allocate the instance,
- pull and start the container,
- mount the model at /repository,
- wait for the app to become healthy,
- then mark the endpoint as ready. (Hugging Face)
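If you want to see which of those stages a given bring-up stalls in, you can poll the endpoint’s reported status from the management API while it starts. A minimal sketch, assuming huggingface_hub’s Inference Endpoints client; the endpoint name is a placeholder:

```python
# Sketch: watch an endpoint move through its startup phases.
# Assumes huggingface_hub's Inference Endpoints client is current;
# "my-endpoint" is a placeholder for your own endpoint name.
import time

from huggingface_hub import get_inference_endpoint

endpoint = get_inference_endpoint("my-endpoint")  # uses your cached HF token

while endpoint.status not in ("running", "failed"):
    print(f"status={endpoint.status}")  # e.g. "pending", "initializing"
    time.sleep(10)
    endpoint = endpoint.fetch()  # refresh state from the API

print(f"final status: {endpoint.status}")
```

Logging the timestamps of each transition tells you whether a stuck bring-up dies before the container runs or after it starts but before readiness is accepted.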
For custom containers, Hugging Face says the platform probes /health every second, and that route should return 503 until the model is actually ready. Hugging Face also says that if the logs show the app is running but the endpoint still says Initializing, the usual cause is incorrect port mapping. (Hugging Face)
That matters because Initializing is not one bug. It is a phase. Failures can happen before the model server starts, while it starts, or after it starts but before readiness is accepted. (Hugging Face)
Why I think wake-from-zero is the top suspect
Hugging Face’s autoscaling guide says scale-to-zero introduces a cold start, that the proxy can return 503 while a new replica initializes, and that waking from 0 can take a few minutes, which is why request-driven wake-up is “typically not recommended” for applications that need responsiveness. They also provide X-Scale-Up-Timeout specifically for this path. (Hugging Face)
There are also public reports of this exact class of failure:
- a scaled-to-zero endpoint that stopped waking on HTTP request and did not return the documented 503, which Hugging Face staff said they would investigate, (Hugging Face Forums)
- and another case where scale-to-zero led to 500 Internal Server Error, the replica did not scale back up automatically, and users worked around it by sending a probe request and waiting before real traffic. (Hugging Face Forums)
So if your symptom is “works normally once it is up, but sometimes gets stuck during bring-up, especially after idling,” the best first explanation is resume-path instability, not “the model code itself is broken.” That is an inference from the documented cold-start behavior plus the similar public cases. (Hugging Face)
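In code, the workaround from those forum reports looks roughly like this: send a throwaway probe to wake the scaled-to-zero endpoint, then wait for a successful answer before real traffic. The URL, token, and payload shape are placeholders, and the 503 interpretation follows HF’s documented proxy behavior:

```python
# Rough sketch of the forum workaround: poke a scaled-to-zero endpoint
# awake, then wait until it answers before sending real traffic.
# ENDPOINT_URL, the token, and the payload shape are placeholders.
import time

import requests

ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"
HEADERS = {"Authorization": "Bearer hf_..."}  # your HF token


def wake_and_wait(max_wait_s: int = 600, poll_s: int = 15) -> bool:
    deadline = time.time() + max_wait_s
    while time.time() < deadline:
        r = requests.post(
            ENDPOINT_URL, headers=HEADERS, json={"inputs": "ping"}, timeout=60
        )
        if r.status_code == 200:
            return True  # replica is up; safe to send real traffic
        # 503 is the documented "still initializing" answer; other codes
        # (e.g. the 500s from the forum report) are worth retrying too.
        time.sleep(poll_s)
    return False
```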
Why platform scheduling or infra is also very plausible
Hugging Face forum history shows endpoint startup failures caused by hardware capacity and regional platform issues. In one public case, the error was Scheduling failure: not enough hardware capacity, and HF staff replied there had been a minor issue in eu-west-1. (Hugging Face Forums)
That matters because startup can fail before your application is really serving. When that happens, the endpoint can sit in initialization without much useful application-level evidence. That last sentence is an inference, but it follows from HF’s documented startup stages and from the fact that capacity failures are public, real, and external to user code. (Hugging Face)
As of April 1, 2026, HF’s public status page shows Inference Endpoints, Inference Endpoints UI, and Inference Endpoints API as Operational. That makes a broad, public, service-wide outage less likely at this moment. It does not rule out a region-specific, GPU-class-specific, or quota-specific problem. (Hugging Face Status)
Why custom-container config is the main user-side suspect
Hugging Face’s FAQ is very direct:
- if the logs show the app running but the endpoint is still Initializing, the usual cause is a port mapping mismatch, and
- if you get 500s at deployment start or during scaling, you should make sure the app has a health route that returns 200 only when it is truly ready. (Hugging Face)
That means there are two classic custom-container mistakes:
1. Port mismatch
The app listens on one port, but the endpoint config expects another. HF says the default expectation is port 80, unless you explicitly change it and keep the values aligned. (Hugging Face)
2. Readiness too early
The container process starts, so the platform thinks it is ready, but the model is still loading. HF’s custom-container docs explicitly show the intended pattern: /health should return 503 until the model and tokenizer are fully initialized. (Hugging Face)
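As an illustration of that second pattern (not the wrapper’s actual code), here is the documented contract sketched with FastAPI: load the model in the background and have /health return 503 until loading finishes. The framework choice and the loader stub are assumptions for the sketch:

```python
# Sketch of HF's documented readiness contract for custom containers:
# /health returns 503 while the model loads, then 200 once ready.
# FastAPI is an example framework; load_my_model() is a stand-in.
import threading
import time

from fastapi import FastAPI, Response

app = FastAPI()
model = None  # set once loading completes


def load_my_model():
    # Stand-in for real startup work (download weights, build tokenizer...).
    time.sleep(60)
    return object()


def _load_in_background():
    global model
    model = load_my_model()


threading.Thread(target=_load_in_background, daemon=True).start()


@app.get("/health")
def health(response: Response):
    if model is None:
        response.status_code = 503  # platform keeps the endpoint Initializing
        return {"status": "loading"}
    return {"status": "ready"}  # 200: platform may mark the endpoint ready
```

The key design point is that loading happens off the serving thread, so the server can answer 503 honestly instead of either blocking the probe or returning 200 too early.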
How this maps to hommayushi3/vllm-huggingface
The wrapper itself is simple:
- its Dockerfile pins FROM vllm/vllm-openai:v0.6.6.post1, (GitHub)
- the entrypoint uses MODEL_PATH=/repository, which matches HF’s documented mount point, (GitHub)
- and it launches vllm serve on host 0.0.0.0 and port 80. (GitHub)
So the wrapper is not obviously wrong on the basic HF contract. It serves from /repository, and it uses port 80, which matches HF’s default expectation. (GitHub)
But I still see three wrapper-related risks:
A. Old base image
It is pinned to v0.6.6.post1, which is an old vLLM image. Old does not mean broken, but it means you inherit older startup behavior and older bugs. (GitHub)
B. Readiness depends on vLLM’s startup behavior
This wrapper does not add its own richer readiness logic; it essentially just launches vllm serve. That can make the container more sensitive to timing around startup and health checks. This is an inference from the entrypoint design plus HF’s readiness rules. (Hugging Face)
C. One config variable looks weakly wired
The script sets VLLM_ATTENTION_BACKEND, but the snippet does not show the variable being exported (or passed inline) before vllm serve is executed, so the setting may never reach the launched process’s environment. This is a code-level inference from the entrypoint script, not a confirmed public bug report. (GitHub)
What I think is happening in plain English
The likely story is this:
- the endpoint goes idle,
- a new request arrives,
- HF tries to wake the deployment,
- sometimes that wake-up path stalls at the platform or readiness boundary,
- a retry or restart causes the whole sequence to be attempted again,
- and then it succeeds. (Hugging Face)
That fits better than “the model image is permanently broken,” because a permanently broken image usually fails the same way every time. The repeated-success-after-retry pattern points more toward intermittent orchestration / cold-start / scheduling behavior. This is an inference, but it is the one most consistent with the docs and similar public cases. (Hugging Face)
My ranking for your case
1. Scale-from-zero wake-up problem
Best fit. HF documents the cold-start path, and there are similar reports where the endpoint did not wake correctly from 0. (Hugging Face)
2. Platform scheduling / capacity problem
Also a strong fit. HF has public cases of startup failures caused by unavailable hardware or regional issues. (Hugging Face Forums)
3. Custom-container readiness or port problem
Real possibility, especially if the endpoint config and container config are not perfectly aligned. HF explicitly calls this out. (Hugging Face)
4. Dependency or image drift
Possible, but weaker. Nothing in the public docs or the wrapper repo strongly points to a new breaking change here. The wrapper image is pinned to an old base rather than obviously changing underneath you. (GitHub)
5. Breaking HF spec change
Least likely. Current HF docs still describe the same basic behavior and requirements. (Hugging Face)
The fixes I would try, in order
1. Turn off scale-to-zero temporarily
Set min replicas = 1 for a while. If the problem disappears, that is strong evidence that the bug is in the wake-from-zero path. HF’s docs say the endpoint stays available with the configured minimum replicas, and their FAQ recommends at least 2 replicas when high availability matters. (Hugging Face)
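You can change this in the endpoint UI, or via the management client. A minimal sketch, assuming huggingface_hub’s Inference Endpoints client; the endpoint name is a placeholder:

```python
# Sketch: temporarily disable scale-to-zero by keeping one replica warm.
# Assumes huggingface_hub's Inference Endpoints client; "my-endpoint"
# is a placeholder for your own endpoint name.
from huggingface_hub import get_inference_endpoint

endpoint = get_inference_endpoint("my-endpoint")
endpoint.update(min_replica=1, max_replica=1)  # keep exactly one replica up
```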
2. Use X-Scale-Up-Timeout
If you keep scale-to-zero, add X-Scale-Up-Timeout, for example 600, so the proxy holds the request while the replica wakes. HF documents this specifically for scale-up from zero. (Hugging Face)
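A minimal sketch of that header in use; the URL, token, and payload are placeholders, and 600 follows the example value above:

```python
# Sketch: ask the proxy to hold the request while a replica wakes
# from zero. URL, token, and payload shape are placeholders.
import requests

r = requests.post(
    "https://<your-endpoint>.endpoints.huggingface.cloud",
    headers={
        "Authorization": "Bearer hf_...",
        "X-Scale-Up-Timeout": "600",  # seconds the proxy may hold the request
    },
    json={"inputs": "ping"},
    timeout=660,  # client timeout must outlast the scale-up window
)
print(r.status_code)
```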
3. Verify port and health settings end to end
Check that:
- the endpoint config expects the same port the container serves,
- the container actually exposes that port,
- the configured health route is correct,
- and readiness only goes green when the model is truly loaded. (Hugging Face)
For this wrapper specifically, the server command uses port 80, so your HF endpoint config should match that unless you changed the image or command. (GitHub)
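A quick local check, assuming you can run the image yourself and map its port (for example with `docker run -p 8080:80 <image>`); the host, port, and /health route here follow the defaults discussed above and are assumptions about your setup:

```python
# Sketch: verify locally that the container listens on the expected
# port and that /health follows the 503-until-ready contract.
# Assumes the image is running with port 80 mapped to localhost:8080.
import socket

import requests

HOST, PORT = "127.0.0.1", 8080

# 1. Is anything listening on the mapped port at all?
with socket.socket() as s:
    s.settimeout(5)
    s.connect((HOST, PORT))  # raises if nothing is listening

# 2. Does /health answer, and with a sensible code?
r = requests.get(f"http://{HOST}:{PORT}/health", timeout=10)
print(r.status_code)  # expect 503 while loading, then 200 once ready
```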
4. Treat the image as immutable
Do not rely on an unpinned deployment reference. Use a specific image tag or digest. The wrapper repo’s Dockerfile is pinned to a base image version, but your deployment should also pin the outer image reference you use. That makes failures reproducible. (GitHub)
5. Try a control deployment
Deploy either:
- the same image in another region or instance class, or
- a simpler known-good endpoint in the same region and GPU class.
If the simpler control also hangs on bring-up, that argues for platform-side scheduling or availability rather than your app. This is an inference, but it is the cleanest operational test. The public capacity issue reports are the reason this is worth doing. (Hugging Face Forums)
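A sketch of spinning up such a control, assuming huggingface_hub’s create_inference_endpoint; every name, region, and instance value below is a placeholder you should match to your failing endpoint, and the valid values should be checked against the current docs:

```python
# Sketch: deploy a small known-good control endpoint in the same region
# and GPU class as the failing one. All names and instance values are
# placeholders; confirm valid vendor/region/instance strings in the docs.
from huggingface_hub import create_inference_endpoint

control = create_inference_endpoint(
    "control-endpoint",
    repository="gpt2",            # deliberately simple, known-good model
    framework="pytorch",
    task="text-generation",
    vendor="aws",
    region="us-east-1",           # match your failing endpoint's region
    accelerator="gpu",
    instance_size="x1",
    instance_type="nvidia-a10g",  # match your failing endpoint's GPU class
)
control.wait()  # blocks until running; raises if the endpoint fails
```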
A simple decision rule
Use this:
- Only breaks when waking from zero → suspect scale-to-zero / resume path first. (Hugging Face)
- Breaks on every fresh deployment and every warm restart → suspect port mapping / health route / container boot first. (Hugging Face)
- Shows explicit scheduling or capacity messages → suspect HF infra / region / hardware availability first. (Hugging Face Forums)
Final answer
My best diagnosis is:
This is probably an intermittent cold-start orchestration issue, made more visible by scale-to-zero, with custom-container readiness as the main secondary cause.
So the first thing I would do is disable scale-to-zero. The second is to verify port 80 and /health behavior. The third is to try a control deployment in another region or GPU class. Those three steps separate platform problems from container problems quickly, and they line up with HF’s documented behavior and the closest public failure reports. (Hugging Face)