When I implemented retrieval in our production system a few months ago, one of the most important benchmarks was cross-language retrieval (query in one language, documents in another), which is a common situation in large enterprises (headquarters + branches). I suspect their idea will perform poorly if the source language and the target language are too different from one another, like English and Hindi (grep often will not return anything).
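To make the cross-language point concrete, here's a toy sketch (not our production stack; it assumes the sentence-transformers library and the public paraphrase-multilingual-MiniLM-L12-v2 checkpoint) showing why exact lexical matching finds nothing across scripts while a multilingual embedding model still retrieves:

```python
# Toy illustration: grep-style matching vs. multilingual embeddings for an
# English query against a Hindi document. Model name is a public stand-in.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

query = "annual leave policy for new employees"            # English query
docs = [
    "नए कर्मचारियों के लिए वार्षिक अवकाश नीति",              # Hindi document (relevant)
    "Quarterly revenue report for the Mumbai branch",       # irrelevant
]

# Exact-match search returns nothing: no shared tokens across scripts.
lexical_hits = [d for d in docs if query.lower() in d.lower()]
print("lexical hits:", lexical_hits)  # -> []

# A multilingual embedding model maps both languages into one vector space,
# so cosine similarity still ranks the Hindi document first.
q_emb, d_embs = model.encode(query), model.encode(docs)
print("embedding scores:", util.cos_sim(q_emb, d_embs)[0].tolist())
```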
Another requirement was keeping latency as low as possible (we managed to get < 5 seconds with 85%+ accuracy). Their approach seems to have very unpredictable latencies, sometimes up to thousands of seconds (may be fine for background tasks), and it scales poorly with corpus size.
Interesting research anyway, but I'd still stick with embedding/reranker-based retrieval (+BM25 for hybrid search) because you do not waste time wandering around blindly each time, trying to find the minimal context to start from, which could have been found immediately with an index. Another issue is that research papers often implement subpar baselines for the approaches they compare against. When I was implementing retrieval, the straightforward implementation gave me 40% accuracy, and various tricks/parameter tuning pushed it to 85%+ without changing the overall architecture (took about a month of experimentation).
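For concreteness, this is the shape of what I mean by "found immediately with an index": build the lexical and vector indexes once, then every query is a cheap lookup rather than an exploratory walk. A rough sketch, assuming the rank-bm25 package and a stand-in embed() function; the weights here are illustrative, not our production values:

```python
# Minimal indexed hybrid retrieval: BM25 + embeddings built once, queried many times.
import numpy as np
from rank_bm25 import BM25Okapi

def build_index(docs, embed):
    tokenized = [d.lower().split() for d in docs]
    return {
        "docs": docs,
        "bm25": BM25Okapi(tokenized),
        "vecs": np.array(embed(docs)),   # embedding index, built up front
    }

def hybrid_search(index, query, embed, k=10, w_lex=0.4, w_vec=0.6):
    lex = index["bm25"].get_scores(query.lower().split())
    q = np.array(embed([query]))[0]
    vec = index["vecs"] @ q / (
        np.linalg.norm(index["vecs"], axis=1) * np.linalg.norm(q) + 1e-9
    )
    # Normalize each signal to [0, 1] before mixing so the weights mean something.
    norm = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-9)
    score = w_lex * norm(lex) + w_vec * norm(vec)
    top = np.argsort(-score)[:k]
    return [(index["docs"][i], float(score[i])) for i in top]
```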
Would you mind sharing any lessons learned / which parameters you were experimenting with? I'm working on a Vespa hybrid lexical + HNSW retrieval system at the moment with quite a large corpus (1B+ vectors), so I'd be quite interested to hear what worked well for others.
I experimented with chunking strategies (how to split, what size chunks should be, how much they should overlap, etc.), query rewriting (one query produces several subqueries to explore different possibilities/search paths in parallel), the number of items at each stage (how many documents to retrieve at the embedding stage vs. the reranker stage), what weights to use for BM25 vs. vector search (i.e. what influences the hybrid score more), how to merge subresults from different parallel paths, etc.
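On merging subresults from parallel search paths: one simple option is reciprocal rank fusion, which only looks at ranks and so avoids having to calibrate scores across paths. A sketch, not our exact merge strategy:

```python
# Reciprocal rank fusion over the ranked lists produced by parallel subqueries.
from collections import defaultdict

def rrf_merge(result_lists, k=60, top_n=20):
    """result_lists: one ranked list of doc ids per rewritten subquery."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)   # rank-based, ignores raw scores
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# e.g. three subqueries produced by the query-rewriting step
merged = rrf_merge([
    ["doc7", "doc2", "doc9"],
    ["doc2", "doc1"],
    ["doc9", "doc2", "doc5"],
])
```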
It was tuned for a specific set of open-source models we run ourselves on our own GPUs, so I can't share exact golden numbers (for example, if I replace those small models with Claude Haiku+Cohere Embed, the results get worse). A proper reranker helped tremendously because it removed noise; BM25 helped a lot too, because in many cases you want exact-match searches instead of fuzzy/vector search (so again, less noise).
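The reranking stage itself is simple: retrieve wide with the hybrid index, then let a cross-encoder re-score query/passage pairs and keep only the cleanest hits. A sketch using the public ms-marco MiniLM cross-encoder from sentence-transformers as a stand-in (our actual models are different, self-hosted ones):

```python
# Cross-encoder reranking: re-score candidates against the query, keep the top few.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, keep=5):
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked[:keep]   # drop the noisy tail before it reaches the LLM
```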
For small open-source models (we used them because we wanted speed), prompt engineering mattered too, especially in cross-language benchmarks where the model may get confused about which language it should respond in (the system prompt's language, the user query's language, or the documents' language). Even the order of fields in the output JSON schema mattered (in intermediate steps), because LLMs are autoregressive: if the fields are ordered incorrectly, the model may guess or hallucinate a value that can only be extracted reliably after other dependent fields have already been produced (we don't use reasoning models, to save on speed).
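A hypothetical example of the ordering point (this is not our schema, just an illustration):

```python
# Because generation is left-to-right, fields the answer depends on should be
# emitted before the answer itself.

# Risky order: the model must commit to the answer and its language before it
# has "worked through" which document and passage actually support it.
bad_schema_order = ["answer", "answer_language", "supporting_passage", "source_doc_id"]

# Safer order: extract the evidence first, then the language to respond in,
# then the final answer conditioned on everything already written.
good_schema_order = ["source_doc_id", "supporting_passage", "answer_language", "answer"]
```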
I used LLM-as-a-judge to quickly figure out what improved scores and what didn't. Then humans tested it manually too and calculated scores to see whether their scores diverged from the machine's scores. I think if I had to do it again, I probably would use an agent (like autoresearch) to autonomously find the best configuration for the exact set of models via intelligent bruteforce (dunno if it would work, but interesting to try).
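The judge loop itself is nothing fancy; roughly this shape, where llm() stands in for whatever completion call you use and the prompt/scale are illustrative:

```python
# Minimal LLM-as-a-judge loop for comparing retrieval configurations.
import json

JUDGE_PROMPT = """You are grading a retrieval system.
Question: {question}
Expected answer: {expected}
System answer: {actual}
Reply with JSON: {{"reasoning": "<one sentence>", "score": <0 or 1>}}"""

def judge(llm, question, expected, actual):
    raw = llm(JUDGE_PROMPT.format(question=question, expected=expected, actual=actual))
    return json.loads(raw)["score"]   # note: reasoning field comes before score

def evaluate(llm, system, eval_set):
    scores = [judge(llm, q, expected, system(q)) for q, expected in eval_set]
    return sum(scores) / len(scores)  # compare this number across configurations
```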
We don't have 1B+ vectors: our system is split into tenants (organizations), and a single tenant usually doesn't have that many vectors. Plus, every document in the system has a specific hierarchical structure, so your mileage may vary.
> which is a common situation in large enterprises
How was this done before LLMs and AI? Can you share some examples of these documents?
I hate that some people have started to implement semantic search everywhere without an option for the user to opt out. I want fuzzy search first; then I might switch to semantic search if I feel like it. If I type a phrase and get results that don't match it in any way, I get annoyed.
Rant off. Not really related to the article.
It depends on your data, as well as what you are trying to optimize for: speed, cost, precision, etc.
In many cases cheap methods like grepping and BM25 are just not going to work well, so semantic similarity is the best initial retriever/filter, followed by LLM-as-judge as a second filter/reranker if you need the precision.
In terms of "direct" retrieval strategies, the most powerful approach I've seen so far involves using git.exe. The available commands like grep/log/show/blame/diff/reflog/bisect cover a lot of ground. I've been able to dramatically simplify large parts of my agent harnesses with this realization. Anything that can be stored in text files is suitable.
I think it works great for most cases. For example, everyone at work uses coding agents and I've never heard anyone say that the agent missed something in terms of retrieval (grepping, in this case).
But current IR methods, both lexical and semantic retrieval, definitely have bottlenecks, as pointed out in the obliq-bench paper (https://arxiv.org/abs/2605.06235).
This reminds me a bit of "Command-line Tools can be 235x Faster than your Hadoop Cluster" from 2014 https://adamdrake.com/command-line-tools-can-be-235x-faster-...
The constraint I, and I bet many here, have is just how much data there is. 3 GB, like in the 2014 article, is one .pdf.
An enterprise-level data store is measured in hundreds of GB for a single customer, and you'll get murdered on data egress costs if you try to search an entire corpus, if you can even get through it all before the request times out or the customer decides after 5 minutes that enough is enough.
You'd need a true distributed filesystem to even start attempting what the authors suggest at any scale outside of your local machine.
Maybe the best "index" will just be markdown files fed into a tiny LLM model.
Is anyone using small, low-latency, fast LLMs to implement stuff like search as a RAG alternative? Could be the perfect use case for that Llama3 8B ASIC some company showed off a few months ago.
Makes sense that the agent can refine its search terms/strategy based on discovered context.
But it still has to enumerate synonyms to find things.
I would assume it's very domain-dependent: code or technical docs have more precise terminology that is better suited to fixed-string search, whereas medical or legal text can have many different ways to say the same thing.
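That's the synonym-enumeration burden in a nutshell: for loose-terminology domains the agent (or an LLM call) has to expand the query before any fixed-string search has a chance. A toy sketch with an illustrative, made-up synonym table:

```python
# Query expansion before lexical search; each variant becomes its own
# grep/BM25 pass whose results then get merged.
SYNONYMS = {
    "heart attack": ["myocardial infarction", "MI"],
    "terminate":    ["rescind", "cancel", "void"],
}

def expand_query(query):
    variants = {query.lower()}
    for term, alts in SYNONYMS.items():
        if term in query.lower():
            variants.update(query.lower().replace(term, alt) for alt in alts)
    return sorted(variants)

print(expand_query("terminate the lease agreement"))
```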
Map-reduce as a pattern might be on its way back. Hear me out. High localization wins even when coverage is not super great -- just map over shards of the corpus and reduce the learnings. Rinse and repeat: do as many rounds of map and reduce over the corpus as it takes to converge. This can also work well when the cluster combines different agents; they are tasked identically by prompts anyway.
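Something like this sketch, where llm() is a placeholder for any completion call and the prompts are illustrative: fan a "map" prompt out over shards, reduce the partial findings, and feed them back into the next round until the answer stabilizes.

```python
# Map-reduce over corpus shards with an LLM doing both the map and the reduce.
from concurrent.futures import ThreadPoolExecutor

def map_shard(llm, question, shard_text):
    return llm(f"From the following documents, extract anything relevant to: "
               f"{question}\n\n{shard_text}\n\nRelevant notes:")

def reduce_notes(llm, question, notes):
    joined = "\n---\n".join(notes)
    return llm(f"Question: {question}\n\nPartial findings from different shards:\n"
               f"{joined}\n\nSynthesize a single answer, or reply NEED_MORE.")

def map_reduce_search(llm, question, shards, max_rounds=3):
    learnings = ""
    result = "NEED_MORE"
    for _ in range(max_rounds):
        prompt_q = question if not learnings else f"{question}\nKnown so far: {learnings}"
        with ThreadPoolExecutor() as pool:
            notes = list(pool.map(lambda s: map_shard(llm, prompt_q, s), shards))
        result = reduce_notes(llm, question, notes)
        if "NEED_MORE" not in result:
            return result
        learnings = result   # carry partial findings into the next round
    return result
```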