Apply for a GPU community grant: Academic project

#1
by davidhendriks - opened
UniverseTBD org

Talkfinder-Astro is an open-science semantic search and discovery platform for astrophysics seminar talks, lectures, and conference presentations. The project indexes thousands of publicly available talks with transcripts and metadata, enabling researchers, students, and educators to search across spoken scientific content using hybrid retrieval (BM25 + embeddings) followed by a reranking step.
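
For illustration, a minimal sketch of this kind of pipeline (the model names, corpus, and fusion weight below are placeholders, not our production configuration):

# Hybrid retrieval sketch: fuse BM25 and embedding scores, then rerank
# the top candidates with a cross-encoder. All names are placeholders.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder

docs = ["transcript of talk one ...", "transcript of talk two ..."]
query = "binary star evolution"

# Sparse scores: BM25 over whitespace-tokenized transcripts
bm25 = BM25Okapi([d.split() for d in docs])
sparse = bm25.get_scores(query.split())

# Dense scores: cosine similarity of normalized embeddings
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = embedder.encode(docs, normalize_embeddings=True)
dense = doc_emb @ embedder.encode(query, normalize_embeddings=True)

# Fuse both signals and keep a candidate pool for the reranking step
fused = 0.5 * (sparse / (sparse.max() + 1e-9)) + 0.5 * dense
pool = np.argsort(fused)[::-1][:50]

# Cross-encoder scores each (query, candidate) pair for the final ranking
reranker = CrossEncoder("BAAI/bge-reranker-base")
scores = reranker.predict([(query, docs[i]) for i in pool])
ranked = [int(pool[i]) for i in np.argsort(scores)[::-1]]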

The platform is fully open, research-oriented, and designed to improve accessibility to scientific knowledge that is otherwise fragmented across institutes and video platforms. It supports exploratory queries, topic discovery, and long-tail scientific questions that are poorly served by traditional keyword search.

Why a GPU is needed:
The main performance bottleneck is the neural reranking stage, which evaluates query–document pairs using a transformer-based cross-encoder. On CPU-only infrastructure, even a highly constrained configuration results in ~10 seconds of latency per query on Hugging Face Spaces. This forces very small candidate sets, directly degrading result quality and user experience.
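
As a rough illustration of the scaling (pool sizes and the padding text below are arbitrary): the cross-encoder needs one transformer forward pass per query–candidate pair, so latency grows linearly with the candidate pool.

# Rough latency illustration: reranking cost grows linearly with the pool,
# since every (query, candidate) pair requires its own forward pass.
import time
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")
query = "gravitational wave detection with pulsar timing"
candidate = "transcript snippet " * 50  # stand-in for a real transcript chunk

for pool_size in (10, 50, 200):
    pairs = [(query, candidate)] * pool_size
    start = time.perf_counter()
    reranker.predict(pairs, batch_size=32)
    print(f"pool={pool_size:4d}: {time.perf_counter() - start:.2f}s")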

Access to a community GPU would:

  • Reduce query latency by an order of magnitude
  • Allow larger candidate pools for reranking, improving retrieval quality
  • Enable fair benchmarking and optimization of retrieval strategies
  • Make the public demo responsive enough for real community use

Without GPU access, meaningful interaction and further open development of the project are severely limited.

Open science impact:

  • Public, free access to scientific talks and knowledge
  • Improves discoverability of long-form academic content
  • Supports students, early-career researchers, and interdisciplinary exploration
  • All code, data processing pipelines, and models will be openly documented

The project is non-commercial, community-driven, and aligned with Hugging Face’s mission to support open research and accessible AI.

Hi @davidhendriks , we've assigned ZeroGPU to this Space. Please check the compatibility and usage sections of this page so your Space can run on ZeroGPU.
If you can, we ask that you upgrade to Enterprise to enjoy higher ZeroGPU quota and other features like Dev Mode, Private Storage, and more: hf.co/enterprise

UniverseTBD org

Hi @hysts ,

Thank you for allocating us ZeroGPU; it will help provide a better experience!

I've tried to set up the codebase to use ZeroGPU, but I keep getting the following error after I see a popup of HF allocating GPUs (initially seemingly successfully, but then the error occurs):

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/spaces/zero/wrappers.py", line 143, in worker_init
    torch.init(nvidia_uuid)
  File "/usr/local/lib/python3.10/site-packages/spaces/zero/torch/patching.py", line 414, in init
    torch.Tensor([0]).cuda()
  File "/usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 410, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available

Currently I am unsure whether this is an infrastructure issue on the HF side, or whether I have somehow misconfigured my codebase.

Hi @davidhendriks

The error is caused by calling a function that uses CUDA outside a function decorated with @spaces.GPU.
On ZeroGPU, any code that touches CUDA must run inside a @spaces.GPU-decorated function. Running CUDA-dependent code outside the decorator doesn't always raise an immediate error (it can appear to succeed via CPU fallback), but it irreversibly corrupts the process-level CUDA state. Once corrupted, all subsequent @spaces.GPU calls will fail with this error.

Since this repo uses a private dataset, I wasn't able to run the app directly to debug. However, I had Claude Code review the codebase and it flagged the SentenceTransformer.encode() call as suspicious: it runs outside @spaces.GPU but internally touches CUDA APIs. I extracted that part into a minimal test Space and confirmed the issue reproduces, so I'd suggest wrapping it in a @spaces.GPU-decorated function. There may be other places in the code with the same pattern, but the same fix should apply.
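
A minimal sketch of the fix, with illustrative names (your actual module layout and model will differ):

# Keep the CUDA-touching call inside a @spaces.GPU-decorated function.
import spaces
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model

@spaces.GPU
def embed(texts):
    # On ZeroGPU, a GPU is only attached while this function runs
    return embedder.encode(texts, normalize_embeddings=True)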

UniverseTBD org

Okay, it seems that wrapping those pieces in semantic.py did resolve the issue. I can now execute my code, but I am running into the following.

Firstly, the ZeroGPU-accelerated execution seemingly didn't really accelerate the reranking step. I suppose that is because on this first run it had to load a bunch of things:

entered rerank_results_gpu
torch.cuda.is_available() = True
torch.cuda.device_count() = 1
[rerank.py:73 -   rerank_results_gpu ] 2026-04-08 17:29:40,571: Loading reranker model on GPU...

config.json:   0%|          | 0.00/799 [00:00<?, ?B/s]
config.json: 100%|██████████| 799/799 [00:00<00:00, 5.31MB/s]

tokenizer_config.json:   0%|          | 0.00/443 [00:00<?, ?B/s]
tokenizer_config.json: 100%|██████████| 443/443 [00:00<00:00, 2.96MB/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]
sentencepiece.bpe.model: 100%|██████████| 5.07M/5.07M [00:00<00:00, 12.2MB/s]

tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]
tokenizer.json: 100%|██████████| 17.1M/17.1M [00:00<00:00, 42.0MB/s]

special_tokens_map.json:   0%|          | 0.00/279 [00:00<?, ?B/s]
special_tokens_map.json: 100%|██████████| 279/279 [00:00<00:00, 1.32MB/s]
Exception ignored in: <function BaseEventLoop.__del__ at 0x7f529a436560>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 690, in __del__
    self.close()
  File "/usr/local/lib/python3.10/asyncio/unix_events.py", line 68, in close
    super().close()
  File "/usr/local/lib/python3.10/asyncio/selector_events.py", line 87, in close
    self._close_self_pipe()
  File "/usr/local/lib/python3.10/asyncio/selector_events.py", line 94, in _close_self_pipe
    self._remove_reader(self._ssock.fileno())
  File "/usr/local/lib/python3.10/asyncio/selector_events.py", line 272, in _remove_reader
    key = self._selector.get_key(fd)
  File "/usr/local/lib/python3.10/selectors.py", line 191, in get_key
    return mapping[fileobj]
  File "/usr/local/lib/python3.10/selectors.py", line 72, in __getitem__
    fd = self._selector._fileobj_lookup(fileobj)
  File "/usr/local/lib/python3.10/selectors.py", line 226, in _fileobj_lookup
    return _fileobj_to_fd(fileobj)
  File "/usr/local/lib/python3.10/selectors.py", line 42, in _fileobj_to_fd
    raise ValueError("Invalid file descriptor: {}".format(fd))
ValueError: Invalid file descriptor: -1

model.safetensors:   0%|          | 0.00/1.11G [00:00<?, ?B/s]
model.safetensors:   0%|          | 0.00/1.11G [00:01<?, ?B/s]
model.safetensors:  48%|████▊     | 537M/1.11G [00:02<00:01, 536MB/s]
model.safetensors: 100%|██████████| 1.11G/1.11G [00:02<00:00, 504MB/s]

Loading weights:   0%|          | 0/201 [00:00<?, ?it/s]
Loading weights: 100%|██████████| 201/201 [00:00<00:00, 27366.59it/s]
XLMRobertaForSequenceClassification LOAD REPORT from: BAAI/bge-reranker-base
Key                             | Status     |  | 
--------------------------------+------------+--+-
roberta.embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED:	can be ignored when loading from different task/architecture; not ok if you expect identical arch.

When I wanted to give it another go, sadly another error popped up, this time about the quota.

[Screenshot: ZeroGPU quota exceeded error, with a 'try again in …' countdown]

The 'try again in' suggestion doesn't really indicate the quota reset time, I think, but more importantly:

  • Is it expected that 1 query burns through the entire ZeroGPU quota that I have been allocated?
  • What is the reset time for this quota for my allocation?
  • Is there a way to have a fallback system? Because as it is, the entire query section of the app is blocked off and doesn't work at all.

Is it expected that 1 query burns through the entire ZeroGPU quota that I have been allocated?

The issue is in how the models are loaded in your code. Currently, SentenceTransformer and the reranker are loaded lazily inside @spaces.GPU-decorated functions. This is an anti-pattern for ZeroGPU. I previously mentioned that CUDA-related code must be inside @spaces.GPU, but that only applies to the inference step. Model instantiation should be done at global scope. The spaces package monkey-patches torch so that ZeroGPU's backend automatically moves models to GPU only when a @spaces.GPU-decorated function runs. If you don't load at global scope, this mechanism can't work properly and you end up with unnecessary overhead.
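
A minimal sketch of the recommended pattern (model name and duration are illustrative):

# ZeroGPU-friendly pattern: instantiate models at global scope so the
# spaces package's torch patching can manage them; keep only the
# inference call inside the @spaces.GPU-decorated function.
import spaces
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")  # loaded once at startup

@spaces.GPU(duration=60)  # per-call GPU time budget in seconds (illustrative)
def rerank(query, candidates):
    # The ZeroGPU backend attaches a GPU and moves the model here automatically
    return reranker.predict([(query, c) for c in candidates], batch_size=32)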

What is the reset time for this quota for my allocation?

Quota resets every 24 hours (see ZeroGPU documentation).
