Embedding 500K Medical Texts with PubMedBERT on RunPod
I recently needed to generate embeddings for over half a million paragraphs of medical/biomedical text. Running this on my laptop would have taken days, so I turned to RunPod for on-demand GPU compute. Here's what I learned, including the gotchas that cost me hours of debugging.
Why PubMedBERT?
When embedding domain-specific text, generic models like all-MiniLM-L6-v2 leave performance on the table. PubMedBERT (microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext) is pre-trained exclusively on biomedical literature—PubMed abstracts and full-text articles from PubMed Central.
What Makes It Different
| Aspect | Generic BERT | PubMedBERT |
|---|---|---|
| Training data | Wikipedia, books | PubMed abstracts + PMC full texts |
| Vocabulary | General English | Biomedical terminology |
| Domain terms | Often split into subwords | Preserved as single tokens |
A term like "cardiomyopathy" might be tokenized as ['card', '##io', '##my', '##op', '##athy'] by generic BERT, but PubMedBERT recognizes it as a single token. This matters for downstream tasks like semantic search, clustering, and classification.
When to Use PubMedBERT
- Medical literature search and retrieval
- Clinical note analysis
- Drug/disease relationship extraction
- Biomedical research clustering
- Grant abstract similarity matching
The model outputs 768-dimensional vectors, which is standard for BERT-base models. You can store these in PostgreSQL with pgvector, Pinecone, Qdrant, or any vector database.
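Before wiring up a vector database, it's easy to prototype search over the raw vectors in memory; normalizing both sides turns a dot product into cosine similarity. A minimal sketch (the random matrix is a stand-in for real 768-dimensional PubMedBERT embeddings):

```python
import numpy as np

def top_k_similar(query_vec: np.ndarray, corpus: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k corpus rows most cosine-similar to the query."""
    # Normalizing both sides makes the dot product equal cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q
    return np.argsort(scores)[::-1][:k]

# Random stand-ins for stored 768-dim PubMedBERT vectors
rng = np.random.default_rng(0)
corpus = rng.standard_normal((1000, 768))
query = corpus[42] + 0.01 * rng.standard_normal(768)  # near-duplicate of row 42

print(top_k_similar(query, corpus)[0])  # 42
```

This is fine for a few thousand vectors; past that, the vector databases above exist precisely to avoid the full scan.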
Why RunPod?
I needed to embed ~500,000 text passages. On CPU, this would take 20+ hours. On an RTX 4090, it takes under 2 hours.
RunPod offers:
- Pay-per-use GPU instances — no commitment, pay by the minute
- Pre-configured PyTorch templates — CUDA drivers pre-installed
- SSH access — work in your familiar terminal environment
- Persistent volumes — keep model weights cached between sessions
Cost Breakdown
For my 500K embedding job:
| GPU | Hourly Rate | Total Time | Total Cost |
|---|---|---|---|
| RTX 4090 (24GB) | ~$0.44/hr | ~1.5 hrs | ~$0.66 |
| RTX 3090 (24GB) | ~$0.31/hr | ~2 hrs | ~$0.62 |
| A100 40GB | ~$1.09/hr | ~1 hr | ~$1.09 |
The RTX 4090 hits the sweet spot for this workload—fast enough that you don't need A100 pricing.
Setting Up RunPod
1. Launch an Instance
- Create a RunPod account and add credits
- Go to Pods → Deploy
- Select RTX 4090 (24GB VRAM)
- Choose template: RunPod PyTorch 2.1
- Set volume size: 20GB (for model cache)
- Deploy
The instance takes 1-2 minutes to initialize.
2. Connect via SSH
Click Connect in the RunPod dashboard to get your SSH command:
```shell
ssh root@<IP> -p <PORT>
```
Or use VS Code's Remote-SSH extension for a better experience.
3. Install Dependencies
```shell
# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install PyTorch with CUDA support
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126

# Install sentence-transformers and friends
pip install sentence-transformers psycopg2-binary tqdm
```
4. Verify GPU Access
```shell
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, GPU: {torch.cuda.get_device_name(0)}')"
```
Expected output:
```
CUDA: True, GPU: NVIDIA GeForce RTX 4090
```
The Embedding Script
Here's a minimal example that batches efficiently:
```python
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
import numpy as np

# Load model (downloads ~500MB on first run).
# PubMedBERT is not a native sentence-transformers model, so
# sentence-transformers adds a mean-pooling layer automatically
# (you'll see a warning about this; it's expected).
model = SentenceTransformer('microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext')

def embed_texts(texts: list[str], batch_size: int = 64) -> np.ndarray:
    """Embed texts in batches with a progress bar."""
    embeddings = []
    for i in tqdm(range(0, len(texts), batch_size), desc="Embedding"):
        batch = texts[i:i + batch_size]
        batch_embeddings = model.encode(
            batch,
            batch_size=batch_size,  # otherwise encode() uses its own default of 32
            show_progress_bar=False,
            convert_to_numpy=True,
        )
        embeddings.append(batch_embeddings)
    return np.vstack(embeddings)

# Example usage
texts = ["Patient presents with acute myocardial infarction...", ...]
vectors = embed_texts(texts, batch_size=64)
print(f"Shape: {vectors.shape}")  # (n_texts, 768)
```
Optimal Batch Size
| GPU VRAM | Recommended Batch Size |
|---|---|
| 8GB | 16 |
| 16GB | 32 |
| 24GB | 64 |
| 40GB+ | 128 |
If you hit CUDA OOM errors, reduce batch size.
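When an OOM does hit mid-run, it's worth catching it and retrying the same chunk at a smaller batch size rather than losing the whole job. Here's a sketch of that backoff pattern; fake_encode is a stand-in so the retry logic is testable without a GPU (in the real script the except clause would catch torch.cuda.OutOfMemoryError, which subclasses RuntimeError):

```python
def encode_with_backoff(texts, encode_fn, batch_size=64, min_batch=4):
    """Encode texts in chunks, halving the chunk size whenever encode_fn raises."""
    results = []
    i = 0
    while i < len(texts):
        chunk = texts[i:i + batch_size]
        try:
            results.extend(encode_fn(chunk))
            i += batch_size
        except RuntimeError:
            if batch_size <= min_batch:
                raise  # can't shrink any further; re-raise the OOM
            batch_size //= 2  # retry the same chunk at half the size
    return results

# Stand-in encoder that "runs out of memory" above 32 texts per call
def fake_encode(chunk):
    if len(chunk) > 32:
        raise RuntimeError("CUDA out of memory (simulated)")
    return [len(t) for t in chunk]

print(len(encode_with_backoff(["abc"] * 100, fake_encode, batch_size=64)))  # 100
```

The halved batch size sticks for the rest of the run, which is usually what you want: if one chunk OOMed, later ones will too.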
Gotchas That Cost Me Hours
1. PyTorch Version Requirements (CVE-2025-32434)
The error:
```
ValueError: Due to a serious vulnerability issue in `torch.load`, even with
`weights_only=True`, we now require users to upgrade torch to at least v2.6
```
The problem: Recent versions of transformers and sentence-transformers refuse to load model weights on anything older than PyTorch 2.6 because of a torch.load security vulnerability. Older CUDA wheel indexes such as cu121 top out at PyTorch 2.5.x, so installing from them trips this check.
The fix: Use the cu126 (or cu130) index:
```shell
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126
```
2. IPv6 vs IPv4 (Supabase/Cloud Database)
The error:
```
psycopg2.OperationalError: connection to server at "db.xxx.supabase.co"
(2600:...), port 5432 failed: Network is unreachable
```
The problem: RunPod instances are IPv4-only, but many cloud database providers (Supabase, some AWS RDS configs) use IPv6 for direct connections.
The fix: Use a connection pooler that supports IPv4. For Supabase, this means using the "Shared Pooler" connection string instead of the direct connection:
```
# Instead of this (IPv6 direct connection):
postgresql://postgres:<password>@db.xxx.supabase.co:5432/postgres

# Use this (IPv4-compatible pooler):
postgresql://postgres.xxx:<password>@aws-0-<region>.pooler.supabase.com:6543/postgres
```
3. Statement Timeouts on Large Queries
The error:
```
psycopg2.errors.QueryCanceled: canceling statement due to statement timeout
```
The problem: Connection poolers often have default statement timeouts (e.g., 60 seconds). Querying 500K rows exceeds this.
The fix: Set a longer timeout at session start:
```python
cursor.execute("SET statement_timeout = '300s'")  # 5 minutes
```
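Raising the timeout works, but for very large tables I'd also consider paging the read by primary key so that no single statement runs long. Here's a sketch of the keyset-pagination pattern; fetch_page is a stand-in for a real `SELECT ... WHERE id > %s ORDER BY id LIMIT %s` query:

```python
def iter_rows(fetch_page, page_size=10_000):
    """Yield all rows in id order, one bounded query per page."""
    last_id = 0
    while True:
        page = fetch_page(last_id, page_size)  # rows with id > last_id, ordered by id
        if not page:
            return
        yield from page
        last_id = page[-1][0]  # first column is the id

# Stand-in for a database table of (id, text) rows
table = [(i, f"text {i}") for i in range(1, 25_001)]

def fake_fetch(after_id, limit):
    return [row for row in table if row[0] > after_id][:limit]

rows = list(iter_rows(fake_fetch, page_size=10_000))
print(len(rows))  # 25000
```

Unlike OFFSET-based paging, each page query here uses the primary-key index directly, so page cost stays flat as you go deeper into the table.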
4. SSH Disconnects Kill Your Job
The problem: Long-running jobs die when your SSH connection drops.
The fix: Use tmux to persist sessions:
```shell
# Install tmux
apt-get update && apt-get install -y tmux

# Start a named session
tmux new -s embed

# Run your script
python embed.py

# Detach: Ctrl+B, then D
# Reattach later: tmux attach -t embed
```
5. Model Download Timeouts
The problem: The first model load downloads ~500MB from HuggingFace, which can time out.
The fix: Pre-download before your main script:
```shell
python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext')"
```
Or use huggingface-cli:
```shell
pip install huggingface_hub
huggingface-cli download microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext
```
Performance Tips
1. Monitor GPU Utilization
In a separate terminal:
```shell
watch -n 1 nvidia-smi
```
You should see 80-95% GPU utilization. If it's lower, your batch size might be too small or you're bottlenecked on data loading.
2. Commit to Database in Batches
Don't commit after every row. Batch your database writes:
```python
COMMIT_BATCH = 500

buffer = []
for text, embedding in generate_embeddings():
    buffer.append((text, embedding))
    if len(buffer) >= COMMIT_BATCH:
        insert_batch(buffer)
        conn.commit()
        buffer = []

# Don't forget the final batch
if buffer:
    insert_batch(buffer)
    conn.commit()
```
3. Use Direct PostgreSQL for Bulk Inserts
ORMs and REST APIs add overhead. For bulk operations, use psycopg2 directly with execute_values:
```python
from psycopg2.extras import execute_values

def insert_embeddings(cursor, data):
    # data: iterable of (id, vector) tuples
    execute_values(
        cursor,
        """
        INSERT INTO embeddings (id, vector)
        VALUES %s
        ON CONFLICT (id) DO NOTHING
        """,
        data,
    )
```
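One related detail if the vector column is pgvector: psycopg2 won't adapt a numpy array on its own, so either register the adapter from the pgvector Python package (pgvector.psycopg2.register_vector) or serialize each vector to pgvector's bracketed text literal yourself. A minimal serializer (the helper name is mine):

```python
def to_pgvector_literal(vec) -> str:
    """Format a sequence of floats as a pgvector input literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"

# One (id, vector) tuple ready for the execute_values call above
row = ("doc-1", to_pgvector_literal([0.25, -1.5, 3.0]))
print(row[1])  # [0.25,-1.5,3.0]
```

The registered adapter is the cleaner option; the literal form is handy when you'd rather not add a dependency to a one-off batch script.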
Final Numbers
For my 500K medical text embedding job:
| Metric | Value |
|---|---|
| Total texts | 506,205 |
| Model | PubMedBERT |
| Embedding dimension | 768 |
| GPU | RTX 4090 |
| Batch size | 64 |
| Total time | ~1.5 hours |
| Total cost | ~$0.66 |
| Throughput | ~5,600 texts/minute |
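The last two rows are just arithmetic on the ones above them, which makes the table easy to sanity-check:

```python
total_texts = 506_205
minutes = 90              # ~1.5 hours of wall-clock time
rate_per_hour = 0.44      # RTX 4090 hourly rate

print(total_texts // minutes)                  # 5624 texts/minute, i.e. ~5,600
print(round(minutes / 60 * rate_per_hour, 2))  # 0.66 dollars
```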
Conclusion
RunPod makes it trivially easy to spin up GPU compute for batch jobs. The key lessons:
- Use domain-specific models — PubMedBERT significantly outperforms generic models on biomedical text
- Watch your PyTorch version — security patches broke older CUDA indexes
- Check IPv4/IPv6 compatibility — cloud instances often don't support IPv6
- Use tmux — don't let SSH disconnects kill hours of work
- Batch everything — GPU encoding, database commits, all of it
The whole job cost less than a dollar and took under two hours. Compare that to days on CPU or hundreds of dollars for always-on GPU instances.
Have questions or found other gotchas? Feel free to reach out.