Inference Endpoint Hanging in "Initializing"

Hello! I’ve been running into a problem with the inference endpoints and I was wondering if anyone has any suggestions.

Problem Description

I’ve had three models deployed on Inference Endpoints for about a year now. They are all using A100 GPUs hosted on AWS and scale to zero when not in use. In the past week the endpoints have not been getting past the initialization step without manual intervention. This problem is intermittent.

I know that scaling up means the endpoints need to wait until AWS hardware is available, but the issue is that the endpoints sometimes never seem to get past initialization on their own. I need to manually go in, force scale to zero, then start them back up. After the manual intervention they always get past initialization, which makes me think this issue is not just waiting for hardware to become available.

At first we saw this just once in a 6-month period, then suddenly it started happening multiple times a day.

Additional Information

Logs

No logs are generated; the endpoint seems to hang without even entering the inference engine container.

System Configuration

Models: Llama 3.1 8B custom variant
GPU: A100
Vendor: AWS
Inference Engine Container: hommayushi3/vllm-huggingface
Scaling: Scale to zero when not in use


If it were purely due to the repository, the frequency of occurrences wouldn’t have suddenly increased recently…
Given that there have been several reports of failures related to “scale-from-zero wake-up,” a bug on the HF side is also a possibility. :thinking:


My view is:

Most likely: a problem in the scale-from-zero wake-up path or the platform scheduler / orchestration layer.
Second most likely: a custom-container readiness or port-alignment problem.
Less likely: a dependency or image drift problem.
Least likely: a recent breaking spec change. (Hugging Face)

The background

When a Hugging Face Inference Endpoint starts, several things must succeed in order:

  1. the platform must allocate the instance,
  2. pull and start the container,
  3. mount the model at /repository,
  4. wait for the app to become healthy,
  5. then mark the endpoint as ready. (Hugging Face)

For custom containers, Hugging Face says the platform probes /health every second, and that route should return 503 until the model is actually ready. Hugging Face also says that if the logs show the app is running but the endpoint still says Initializing, the usual cause is incorrect port mapping. (Hugging Face)

That matters because Initializing is not one bug. It is a phase. Failures can happen before the model server starts, while it starts, or after it starts but before readiness is accepted. (Hugging Face)

Why I think wake-from-zero is the top suspect

Hugging Face’s autoscaling guide says scale-to-zero introduces a cold start, that the proxy can return 503 while a new replica initializes, and that waking from 0 can take a few minutes, which is why request-driven wake-up is “typically not recommended” for applications that need responsiveness. They also provide X-Scale-Up-Timeout specifically for this path. (Hugging Face)

There are also public reports of this exact class of failure:

  • a scaled-to-zero endpoint that stopped waking on HTTP request and did not return the documented 503, which Hugging Face staff said they would investigate, (Hugging Face Forums)
  • and another case where scale-to-zero led to 500 Internal Server Error, the replica did not scale back up automatically, and users worked around it by sending a probe request and waiting before real traffic. (Hugging Face Forums)

So if your symptom is “works normally once it is up, but sometimes gets stuck during bring-up, especially after idling,” the best first explanation is resume-path instability, not “the model code itself is broken.” That is an inference from the documented cold-start behavior plus the similar public cases. (Hugging Face)

Why platform scheduling or infra is also very plausible

Hugging Face forum history shows endpoint startup failures caused by hardware capacity and regional platform issues. In one public case, the error was Scheduling failure: not enough hardware capacity, and HF staff replied there had been a minor issue in eu-west-1. (Hugging Face Forums)

That matters because startup can fail before your application is really serving. When that happens, the endpoint can sit in initialization without much useful application-level evidence. That last sentence is an inference, but it follows from HF’s documented startup stages and from the fact that capacity failures are public, real, and external to user code. (Hugging Face)

Right now, HF’s public status page shows Inference Endpoints, Inference Endpoints UI, and Inference Endpoints API as Operational on April 1, 2026. That makes a broad, public, service-wide outage less likely at this moment. It does not rule out a region-specific, GPU-class-specific, or quota-specific problem. (Hugging Face Status)

Why custom-container config is the main user-side suspect

Hugging Face’s FAQ is very direct:

  • if the app is running in logs but the endpoint is still Initializing, the usual cause is port mapping mismatch, and
  • if you get 500s at deployment start or during scaling, you should make sure the app has a health route that returns 200 only when it is truly ready. (Hugging Face)

That means there are two classic custom-container mistakes:

1. Port mismatch

The app listens on one port, but the endpoint config expects another. HF says the default expectation is port 80, unless you explicitly change it and keep the values aligned. (Hugging Face)

2. Readiness too early

The container process starts, so the platform thinks it is ready, but the model is still loading. HF’s custom-container docs explicitly show the intended pattern: /health should return 503 until the model and tokenizer are fully initialized. (Hugging Face)
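That readiness pattern can be sketched with a minimal standard-library server. This is an illustrative sketch, not HF’s actual probe or the wrapper’s code: the `model_ready` flag stands in for real model/tokenizer loading, and the port is chosen by the OS.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for the real "model and tokenizer fully initialized" state.
model_ready = threading.Event()

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # 503 until the model is actually loaded, 200 afterwards,
            # matching the documented custom-container readiness pattern.
            self.send_response(200 if model_ready.is_set() else 503)
        else:
            self.send_response(404)
        self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

def probe(port):
    """Return the HTTP status a once-per-second health probe would see."""
    try:
        with urllib.request.urlopen(f"http://127.0.0.1:{port}/health") as r:
            return r.status
    except urllib.error.HTTPError as e:
        return e.code

server = HTTPServer(("127.0.0.1", 0), HealthHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

first = probe(port)   # container is up, model still loading -> 503
model_ready.set()     # simulate the model finishing its load
second = probe(port)  # now the endpoint may be marked ready -> 200
server.shutdown()
print(first, second)
```

If readiness flips to 200 as soon as the process starts, rather than when the model finishes loading, the platform can route traffic into a server that is not actually ready, which is exactly the failure mode HF’s FAQ warns about.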

How this maps to hommayushi3/vllm-huggingface

The wrapper itself is simple:

  • its Dockerfile pins FROM vllm/vllm-openai:v0.6.6.post1, (GitHub)
  • the entrypoint uses MODEL_PATH=/repository, which matches HF’s documented mount point, (GitHub)
  • and it launches vllm serve on host 0.0.0.0 and port 80. (GitHub)

So the wrapper is not obviously wrong on the basic HF contract. It serves from /repository, and it uses port 80, which matches HF’s default expectation. (GitHub)
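For reference, that launch contract can be paraphrased in a few lines. This is my restatement of the entrypoint as described above, not the actual script; the flag names follow `vllm serve`’s standard CLI options.

```python
import os

def build_vllm_command(env):
    """Paraphrase of the wrapper's launch contract: serve the model
    mounted at /repository on 0.0.0.0:80 (HF's default expected port)."""
    model_path = env.get("MODEL_PATH", "/repository")  # HF's documented mount point
    return [
        "vllm", "serve", model_path,
        "--host", "0.0.0.0",  # listen on all interfaces
        "--port", "80",       # must match the endpoint's configured port
    ]

print(" ".join(build_vllm_command(os.environ)))
```

The point of writing it out is the alignment check: if you override the image, command, or endpoint port settings, the `--port` value here and the port the endpoint config expects must stay identical.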

But I still see three wrapper-related risks:

A. Old base image

It is pinned to v0.6.6.post1, which is an old vLLM image. Old does not mean broken, but it means you inherit older startup behavior and older bugs. (GitHub)

B. Readiness depends on vLLM’s startup behavior

This wrapper does not add its own richer readiness logic. It mainly shells into vllm serve. That can make the container more sensitive to timing around startup and health checks. This is an inference from the entrypoint design plus HF’s readiness rules. (Hugging Face)

C. One config variable looks weakly wired

The script sets VLLM_ATTENTION_BACKEND, but the shown vllm serve command does not include that variable as a command-line option, and the snippet does not show it being exported before command execution. That suggests the setting may not actually affect the launched process. This is a code-level inference from the entrypoint script, not a confirmed public bug report. (GitHub)
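The underlying shell behavior is easy to verify: a variable assigned without `export` is visible to the shell itself but not to child processes it launches. The sketch below demonstrates that on any POSIX shell; the `FLASHINFER` value is purely illustrative.

```python
import subprocess

def child_sees(script):
    """Run `script` under sh and report what a child process (printenv)
    sees for VLLM_ATTENTION_BACKEND; None if the variable is unset."""
    r = subprocess.run(["sh", "-c", script], capture_output=True, text=True)
    return r.stdout.strip() or None

# Plain assignment: shell-local only, so the child process never sees it.
unexported = child_sees(
    "VLLM_ATTENTION_BACKEND=FLASHINFER; printenv VLLM_ATTENTION_BACKEND")

# Exported assignment: the child process inherits it.
exported = child_sees(
    "export VLLM_ATTENTION_BACKEND=FLASHINFER; printenv VLLM_ATTENTION_BACKEND")

print(unexported, exported)
```

So if the entrypoint assigns the variable without exporting it (and without a `VAR=value command` inline prefix), the launched `vllm serve` process would never see it.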

What I think is happening in plain English

The likely story is this:

  • the endpoint goes idle,
  • a new request arrives,
  • HF tries to wake the deployment,
  • sometimes that wake-up path stalls at the platform or readiness boundary,
  • a retry or restart causes the whole sequence to be attempted again,
  • and then it succeeds. (Hugging Face)

That fits better than “the model image is permanently broken,” because a permanently broken image usually fails the same way every time. The repeated-success-after-retry pattern points more toward intermittent orchestration / cold-start / scheduling behavior. This is an inference, but it is the one most consistent with the docs and similar public cases. (Hugging Face)

My ranking for your case

1. Scale-from-zero wake-up problem

Best fit. HF documents the cold-start path, and there are similar reports where the endpoint did not wake correctly from 0. (Hugging Face)

2. Platform scheduling / capacity problem

Also a strong fit. HF has public cases of startup failures caused by unavailable hardware or regional issues. (Hugging Face Forums)

3. Custom-container readiness or port problem

Real possibility, especially if the endpoint config and container config are not perfectly aligned. HF explicitly calls this out. (Hugging Face)

4. Dependency or image drift

Possible, but weaker. Nothing in the public docs or the wrapper repo strongly points to a new breaking change here. The wrapper image is pinned to an old base rather than obviously changing underneath you. (GitHub)

5. Breaking HF spec change

Least likely. Current HF docs still describe the same basic behavior and requirements. (Hugging Face)

The fixes I would try, in order

1. Turn off scale-to-zero temporarily

Set min replicas = 1 for a while. If the problem disappears, that is strong evidence that the bug is in the wake-from-zero path. HF’s docs say the endpoint stays available with the configured minimum replicas, and their FAQ recommends at least 2 replicas when high availability matters. (Hugging Face)

2. Use X-Scale-Up-Timeout

If you keep scale-to-zero, add X-Scale-Up-Timeout, for example 600, so the proxy holds the request while the replica wakes. HF documents this specifically for scale-up from zero. (Hugging Face)
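Sending that header is just a matter of attaching it to the request. A minimal sketch with the standard library follows; the endpoint URL, token, and payload are placeholders for your own values, and the final `urlopen` call is deliberately left commented out since it would hit a live endpoint.

```python
import urllib.request

# Placeholders: substitute your real endpoint URL and HF token.
ENDPOINT_URL = "https://example.endpoints.huggingface.cloud/v1/completions"
HF_TOKEN = "hf_xxx"

req = urllib.request.Request(
    ENDPOINT_URL,
    data=b'{"prompt": "ping", "max_tokens": 1}',
    headers={
        "Authorization": f"Bearer {HF_TOKEN}",
        "Content-Type": "application/json",
        # Ask the proxy to hold the request for up to 600 s while a
        # replica scales up from zero, instead of failing fast.
        "X-Scale-Up-Timeout": "600",
    },
    method="POST",
)
# urllib.request.urlopen(req, timeout=660) would then block until the
# replica is awake or the timeout expires; not executed here.
print(req.get_header("X-scale-up-timeout"))
```

Make sure the client-side socket timeout is longer than the scale-up timeout, otherwise the client gives up before the proxy does.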

3. Verify port and health settings end to end

Check that:

  • the endpoint config expects the same port the container serves,
  • the container actually exposes that port,
  • the configured health route is correct,
  • and readiness only goes green when the model is truly loaded. (Hugging Face)

For this wrapper specifically, the server command uses port 80, so your HF endpoint config should match that unless you changed the image or command. (GitHub)

4. Treat the image as immutable

Do not rely on an unpinned deployment reference. Use a specific image tag or digest. The wrapper repo’s Dockerfile is pinned to a base image version, but your deployment should also pin the outer image reference you use. That makes failures reproducible. (GitHub)

5. Try a control deployment

Deploy either:

  • the same image in another region or instance class, or
  • a simpler known-good endpoint in the same region and GPU class.

If the simpler control also hangs on bring-up, that argues for platform-side scheduling or availability rather than your app. This is an inference, but it is the cleanest operational test. The public capacity issue reports are the reason this is worth doing. (Hugging Face Forums)

A simple decision rule

Use this:

  • Only breaks when waking from zero
    → suspect scale-to-zero / resume path first. (Hugging Face)

  • Breaks on every fresh deployment and every warm restart
    → suspect port mapping / health route / container boot first. (Hugging Face)

  • Shows explicit scheduling or capacity messages
    → suspect HF infra / region / hardware availability first. (Hugging Face Forums)
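The rule above can be restated as a tiny triage function. This is my paraphrase for quick reference, not official guidance; explicit capacity messages are checked first because they are direct evidence.

```python
def first_suspect(wakes_from_zero_only, fails_every_start, capacity_message_seen):
    """Triage sketch of the decision rule above."""
    if capacity_message_seen:
        # Explicit scheduling/capacity errors point outside your code.
        return "HF infra / region / hardware availability"
    if fails_every_start:
        # Deterministic failure suggests container config, not orchestration.
        return "port mapping / health route / container boot"
    if wakes_from_zero_only:
        # Intermittent, idle-correlated failure fits the resume path.
        return "scale-to-zero / resume path"
    return "collect more evidence"

# The case described in this thread: intermittent, tied to idle wake-up.
print(first_suspect(True, False, False))
```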

Final answer

My best diagnosis is:

This is probably an intermittent cold-start orchestration issue, made more visible by scale-to-zero, with custom-container readiness as the main secondary cause.

So the first thing I would do is disable scale-to-zero. The second is verify port 80 and /health behavior. The third is try a control deployment in another region or GPU class. Those three steps separate platform problems from container problems quickly, and they line up with HF’s documented behavior and the closest public failure reports. (Hugging Face)

I think you might be right about a bug in Hugging Face’s scale-from-zero path. After about a week of it being a fairly common occurrence, it seems to have resolved itself, so a fix was likely pushed by Hugging Face. I haven’t seen any official notice of a patch, however, so I’ll keep an eye on things.

Thanks for the insight!
