Medical records often contain scanned text, whether they are physical documents scanned into the system or digital documents from which text cannot be extracted. When building a search engine for medical documents that uses embeddings for retrieval (as part of a larger GenAI application), you have two options:

  1. OCR documents to plain text and create an embedding from text
  2. Use a new class of multimodal models to embed the raw scanned image

nemotron-colembed was released last month and I wanted to see how well it handles raw scanned medical text compared to the OCR approach and to non-late-interaction models. Late interaction models store one embedding per token instead of a single embedding for the whole text, so you pay more in storage but gain accuracy. These models are becoming popular, especially for visual embedding tasks, as you can see in the ViDoRe benchmark published in the ColPali paper.

This post describes my experiments evaluating these models in this domain, including the iterative process of creating my (tiny) dataset and LLM-judge. All the code to reproduce is available at github.com/yonigottesman/visual-retrieval.

Problem Definition

The problem I’m solving is that of a physician asking questions about a patient who has thousands of documents; the search engine should return the documents containing the answer. For example, a question could be “What is the patient’s LVH measurement value?” and the relevant document should contain the answer.

Synthetic Dataset

Queries

I don’t have any real production traces of user queries, and there is no public dataset for what I’m looking for, so I need to create a synthetic dataset. The common approach is giving an LLM a document and asking it to generate questions. I’m using the mimic-iii dataset for the notes, and I started with the ViDoRe-V3 prompt for generating synthetic queries: iterate over patient documents and send each doc + prompt to gemini-3-flash to generate physician queries.
The vanilla prompt worked poorly. The generated queries were garbage and didn’t look like what I wanted. I needed a way to quickly review the generated queries, so I asked Claude Code [CC] to create a function that takes all the generated queries and creates a single-page HTML so I can review and save only relevant queries. Here is the generated HTML:


At first I thought I would use this tool to manually pick only the best queries, but the initial ViDoRe prompt didn’t produce any good ones. I needed to improve the query generation prompt! Instead of manually changing the prompt, running the full generation, and viewing the results, I gave CC this iteration task so it could iterate a few times before I reviewed the results. I gave CC this meta-prompt:

You are helping create synthetic queries for a medical document retrieval system... The questions should be from a perspective of a physician trying to gather information on the patient. optimize prompt slowly and rerun check results and repeat. once you feel its better show me again. either way dont do more than 10 iterations without showing me.

Here are some good and bad examples:

GOOD: What bilirubin level prompted the initiation of double phototherapy for the infant? — general physician perspective
BAD: How soon after leaving the hospital will the follow-up home visit occur? — too administrative, a doctor wouldn’t ask this on a full patient chart

In each iteration, CC tweaked the prompt used by Gemini to generate queries, ran the full query generation script, reviewed the resulting queries, and fixed the Gemini prompt again. I continued iterating with CC, gave it more good and bad examples, and let it iterate on its own. I ended up splitting the queries into three types:

EASY - An easy query uses the EXACT SAME words, abbreviations, and terminology that appear in the document. Every key term in the query must appear verbatim in the document text. A simple keyword search (ctrl+F) on the document would match the important words in the query.

MEDIUM - A medium query asks the SAME KIND of simple, direct clinical question as an easy query, but makes small wording changes so that a simple ctrl+F keyword search would NOT match. Specifically: - Expand abbreviations to their full form (e.g. “bili” → “bilirubin”, “resp” → “respiratory”, “abx” → “antibiotics”, “dopa” → “dopamine”) - Or swap a word for a common synonym (e.g. “feeding” → “nutrition”, “meds” → “medications”, “labs” → “laboratory values”)

HARD - A hard query rephrases or uses medical synonyms so that a retrieval system needs SEMANTIC understanding, not just keyword matching, to find the answer. The wording in the query should differ from the wording in the document.

See the generation prompt and the HTML above to get a feel for how the types differ. I can now run generate_queries.py, which iterates over all patient documents, sends them to Gemini with the generation prompt, and generates all the queries I want. I added a dedup step since these documents contain lots of redundancy.
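The dedup step can be as simple as keeping the first occurrence of each query, compared under a normalized form. A minimal sketch (the function name and normalization are my own, not taken from the repo):

```python
import hashlib

def dedup_queries(queries):
    """Keep the first occurrence of each query, comparing
    case- and whitespace-insensitively."""
    seen, unique = set(), []
    for q in queries:
        # Normalize: lowercase, collapse whitespace, then hash.
        key = hashlib.sha1(" ".join(q.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(q)
    return unique
```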

Scanned Docs

I’m using mimic-iii, which contains plain-text documents, so to produce scanned PDFs I add some noise and render each document’s text as a “scanned”-looking page. create_pdfs.py iterates over all patient documents and creates a PDF version of each. Here is an example text and PDF:

Original text
Scanned PDF
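The “scan” effect amounts to rasterizing the page and degrading the image. Here is a minimal numpy sketch of one plausible degradation step; the function name and parameter values are illustrative, not the ones used in create_pdfs.py:

```python
import numpy as np

def add_scan_noise(page, noise_std=12.0, seed=0):
    """Degrade a grayscale page image (uint8 array) to look scanned:
    dull the contrast slightly and add Gaussian sensor noise."""
    rng = np.random.default_rng(seed)
    img = page.astype(np.float32)
    img = img * 0.9 + 255 * 0.05                # lift blacks, dull whites
    img += rng.normal(0, noise_std, img.shape)  # sensor noise
    return np.clip(img, 0, 255).astype(np.uint8)
```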

OCR

To compare visual embedding on the scanned PDF with the straightforward OCR + text embedding approach, I first need to run OCR on the scanned PDFs I just created. My CLAUDE.md has instructions for Claude to spawn a vllm instance with deepseek-ai/DeepSeek-OCR running on it. Once the instance is running, I run batch_ocr.py which takes all the scanned PDFs, performs OCR on each, and writes the text result.
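Since vllm exposes an OpenAI-compatible endpoint, each OCR call is just a chat completion with the page image attached. A sketch of building that request (the prompt wording is my assumption; DeepSeek-OCR’s expected prompt may differ):

```python
import base64

def ocr_request(image_bytes, model="deepseek-ai/DeepSeek-OCR"):
    """Build an OpenAI-compatible chat payload that sends one page
    image (as a base64 data URL) for transcription via /v1/chat/completions."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": "Transcribe this scanned page to plain text."},
            ],
        }],
    }
```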

Final Dataset

Now I have the dataset I’m going to work with. I have a queries.json with a list of queries, answers, and the document each was taken from. For each document there are three versions:

  1. original text
  2. scanned pdf (image)
  3. OCR text

All documents are from a single patient, patient_id 16118 from the mimic-iii dataset. I chose this patient because he had many notes of different types. In total I have 1149 queries (441 easy, 434 medium, and 274 hard) across 727 documents. I removed documents that were too long (exceeding 1 page as a PDF) and documents too short (just a few words).

Creating Embeddings

The next step is to create embeddings for all document types and queries with the different models I want to evaluate. The models I am going to compare are:

Ideally I would deploy a vllm instance on my gcloud for all these models and run everything against that. But unfortunately vllm does not support late-interaction models for image inputs yet. Also, for the other models I found some discrepancies between the original HuggingFace examples and running through vllm.
All embeddings were instead computed with the standalone scripts in generate_embeddings running on a GPU instance.

Evaluation

The moment of truth! How good is each embedding model at retrieving relevant medical documents? After generating all the embeddings for documents and queries, for each model I created a similarity.npz containing the similarity between each query and all documents. For regular embeddings I use cosine similarity, and for colembed I compute maxsim similarity.
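For reference, maxsim (the ColBERT-style late-interaction score) matches each query token embedding against its best document token embedding and sums the matches. A numpy sketch, assuming both matrices have L2-normalized rows:

```python
import numpy as np

def maxsim(query_emb, doc_emb):
    """query_emb: (num_query_tokens, dim), doc_emb: (num_doc_tokens, dim),
    rows L2-normalized. Returns the late-interaction score."""
    sims = query_emb @ doc_emb.T   # cosine similarity for every token pair
    return sims.max(axis=1).sum()  # best doc token per query token, summed
```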

Now, given similarity.npz from the previous step, I want to compute standard retrieval metrics:

  • hit - Fraction of queries with at least one relevant document in top-K
  • precision - Average proportion of relevant documents in top-K
  • ndcg - Normalized Discounted Cumulative Gain
  • mrr - Mean Reciprocal Rank (1/rank of first relevant document)
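Given a boolean relevance list for one query’s ranked top-K (which is exactly what the LLM judge below produces), these metrics can be computed as in this sketch (my own helper, not the repo’s evaluation script):

```python
import numpy as np

def query_metrics(relevant, k):
    """relevant: booleans for one query's ranked top-k results, best first."""
    rel = np.asarray(relevant[:k], dtype=float)
    hit = float(rel.any())                       # any relevant doc in top-k
    precision = rel.mean()                       # fraction of top-k relevant
    ranks = np.nonzero(rel)[0]
    mrr = 1.0 / (ranks[0] + 1) if len(ranks) else 0.0
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    dcg = (rel * discounts).sum()
    ideal = (np.sort(rel)[::-1] * discounts).sum()
    ndcg = dcg / ideal if ideal > 0 else 0.0
    return {"hit": hit, "precision": precision, "mrr": mrr, "ndcg": ndcg}
```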

Data in these medical documents is repeated again and again, so I cannot easily compute recall: that would require knowing, for each query, the full list of relevant documents. That’s why I’m sticking to metrics that only need the top-K list and a relevance label for each result.

LLM as a Judge

Given a list of top-K results per query, I need to know which of the top-K documents actually answer the query. I can either manually annotate the results or give an LLM a (query, document) tuple and ask whether the document answers the query. To get a good judgment prompt I used the same iterative technique with CC I used for the query generation prompt. I had CC run Gemini with different queries and documents (some relevant, some not) and let it optimize the prompt by validating that the judge returned the correct True/False for each. The resulting prompt is:

You are a medical information retrieval evaluator. Given a clinical query and retrieved documents, determine if each document contains information that directly answers or is highly relevant to the query.

A document is RELEVANT if:

  • It contains specific information that answers the query
  • It discusses the medical concept, condition, or treatment asked about in a way that addresses the query
  • The answer to the query can be found or inferred from the document content

A document is NOT RELEVANT if:

  • It is about the same patient but does not address the specific question
  • It mentions related medical terms but does not contain the answer
  • It is a completely unrelated document

Query: {query}

Documents: {documents}

For each document, provide your judgment as JSON with a “judgments” array. Each item must have doc_id, relevant (boolean), and reason (brief explanation).
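Consuming the judge’s output then reduces to loading the JSON and mapping doc_id to the relevant boolean (field names as specified in the prompt above; the fence-stripping is a defensive assumption in case the model wraps its output in markdown):

```python
import json

def parse_judgments(response_text):
    """Map doc_id -> relevant from the judge's JSON response."""
    text = response_text.strip()
    # Strip markdown fences if the model added them.
    text = text.removeprefix("```json").removesuffix("```").strip()
    data = json.loads(text)
    return {j["doc_id"]: bool(j["relevant"]) for j in data["judgments"]}
```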

Results

Nemotron on scanned images is clearly the best across difficulties and most metrics. It’s surprising (to me) to see text retrieval perform best when the text is actually represented as image pixels. I created the widget below to show top-K results for all models and difficulties to easily debug errors.


Here are some interesting failing queries.
Nemotron scanned
“What medication was prepared to counteract the adverse effects of morphine?” - All top results contain “morphine”. This is a hard question to capture in an embedding. The embedding somehow needs to represent both “morphine” and “counteract the adverse effects”. This seems really hard to me.

Nemotron OCR
“What was the infant’s blood sugar reading after receiving insulin?” - Another hard embedding example. The embedding should capture both “blood sugar” and “after receiving insulin”. The top result does contain “insulin” but does not mention blood sugar. Interestingly, the scanned version did capture this connection, which is impressive: looking at the document, the link is not straightforward. Here is the top note for the scanned result:

...
GI/GU: Abdomen soft, flat, no loops. +BS. Unable to palpate testes in canal.
Genitilia appropriate for gestation. Voiding approximately 5.2cc/kg/hr.
UA as noted in labs.
Note trace glucose but have had increased glucose levels (see FEN).  No stools reported.
FEN: TF 220 now decreased to 200cc/kg/day.
IVF decreased from D10 to D5W.  Last glucose 11am 211.
Decreased from 290- insulin 0.05units given.  Starting TPN and IL today.  Lytes 149/3.6/115/23.  Recheck at 9pm.  UAC rate decreased from 1 to 0.8 to decrease Na amt infant receiving to 2.8 through this line.
...

Final Thoughts

Nemotron is a really good model (according to this modest benchmark), but before jumping to use it there are some things to consider. Late interaction models take 10x more space in your search engine because we store an embedding per token. It’s also a big model: 8B parameters is not trivial and must run on a GPU.
I think the tiny gemma model has great results too. Hit@10 is above 90% across all difficulties, which is great given I usually need just a single document that contains the answer. If I add BM25 text search and some query expansion done by an agent using retrieval as a tool, it might actually be very cost-effective. A text model does require OCR, but I think OCR for plain text is essentially solved. Also, a pipeline solution makes it easier to debug failures than a single end-to-end embedding model.