Some Rust libraries are like oil and water: they just don't mix. With the async runtime tokio and the data-parallelism library rayon, I learned this the hard way.

Adding some async on top of your parallel iterators

Meilisearch v1.6 saw a new iteration of our vector store feature, adding the ability to interact with OpenAI to generate embeddings, rather than requesting that users provide them.

Interacting with OpenAI involves making HTTP requests to its REST API, so out of habit I grabbed reqwest and initialized a new "current thread" tokio runtime.

Async was not strictly necessary to send the requests, but it was convenient: it let me send multiple requests concurrently without spawning multiple threads, and handle transient error conditions with a nicely composable exponential backoff strategy.

So I went my merry way, not taking into account that this newly created tokio runtime would also be used during Meilisearch's document indexing operation. The indexing step makes heavy use of data parallelism, implemented with a rayon thread pool.

Sure enough, a few weeks later I received...

The Bug Report

A mere 10 days after the release of v1.6.0, and in spite of my testing, we received issue #4361. Not a fun one: a panic that the reporter could not reliably reproduce, and that I'd obviously never run into. Also, for reasons related to being in a rayon thread pool, the panic would cause Meilisearch to abort, an issue that we finally mitigated in #4593.

After duckduckgoing the panic error message, it became clear that the condition would mostly appear when trying to initialize a runtime from inside a block_on call from another runtime on the same thread.

The "Yo dawg" meme, but with "block_on"

Initially, this left me flabbergasted. My code made two simple calls to block_on inside of a non-recursive function, so it should have been impossible to yo dawg a block_on inside of a block_on.

Impossible, or was it?

The async rayon sandwich

The block_on calls were made inside of the extract_embeddings function, which was itself called inside of a rayon::spawn invocation. rayon::spawn adds its closure argument to rayon's queue of jobs that will eventually be executed in the thread pool.

As a result, we were initializing the tokio runtime and calling block_on from inside a rayon job.

Now, unbeknownst to me, one of our dependencies used to compute embeddings was actually itself using rayon.

As we were making calls to this dependency from inside the block_on, we were actually sandwiching our async calls between rayon calls.

visual representation of the async rayon sandwich

Fair enough, but what could possibly go wrong?

What colour is your thread?

Precisely, this:

  1. Multiple concurrent extraction jobs are spawned via rayon::spawn.
  2. Thread A from the rayon thread pool picks up an embedding extraction job.
  3. Thread A calls block_on inside of extract_embeddings. At this point, thread A is two-tone: it belongs to the rayon pool on one hand, and it also drives tokio's asynchronous tasks on the other hand.
  4. Inside block_on, thread A calls the dependency, and the dependency calls a parallel iterator from rayon, which spawns multiple additional jobs on the rayon job queue.
  5. While waiting for these additional jobs to complete, thread A yields to the rayon executor. It then steals one of the extraction jobs that were spawned in step (1).
  6. The extraction job that thread A stole happens to be an embedding extraction job. Thread A calls block_on inside of extract_embeddings. But thread A already has a brush of tokio in its colors. Kaboom.
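
To make the sequence concrete, here is a toy model of it that uses only the standard library. It is a sketch under loud assumptions, not tokio's or rayon's actual code: `toy_block_on`, `toy_wait_on_parallel_work` and `run_scenario` are made-up names, the thread-local flag merely stands in for whatever state tokio keeps per thread, and the "stealing" is simulated on a single thread. It also returns an error where the real thing panicked.

```rust
use std::cell::{Cell, RefCell};
use std::collections::VecDeque;

type Job = fn() -> Result<(), &'static str>;

thread_local! {
    // Stand-in for the per-thread "this thread is driving async tasks" marker.
    static IN_BLOCK_ON: Cell<bool> = Cell::new(false);
    // Stand-in for the pool's job queue, as seen from thread A.
    static QUEUE: RefCell<VecDeque<Job>> = RefCell::new(VecDeque::new());
}

/// Toy `block_on`: refuses to be entered twice on the same thread.
fn toy_block_on(body: impl FnOnce() -> Result<(), &'static str>) -> Result<(), &'static str> {
    // `replace` returns the previous value: `true` means we are already inside.
    if IN_BLOCK_ON.with(|flag| flag.replace(true)) {
        return Err("block_on called from inside block_on");
    }
    let result = body();
    IN_BLOCK_ON.with(|flag| flag.set(false));
    result
}

/// Steps 4 and 5: while "waiting" for parallel work, the thread runs another queued job.
fn toy_wait_on_parallel_work() -> Result<(), &'static str> {
    // Pop first so the queue is no longer borrowed while the stolen job runs.
    let stolen = QUEUE.with(|queue| queue.borrow_mut().pop_front());
    match stolen {
        Some(job) => job(),
        None => Ok(()),
    }
}

/// Step 3: the extraction job calls into the dependency from inside `block_on`.
fn extract_embeddings() -> Result<(), &'static str> {
    toy_block_on(toy_wait_on_parallel_work)
}

/// Step 1: queue `queued_jobs` extraction jobs. Step 2: thread A picks up the first.
fn run_scenario(queued_jobs: usize) -> Result<(), &'static str> {
    IN_BLOCK_ON.with(|flag| flag.set(false));
    QUEUE.with(|queue| {
        let mut queue = queue.borrow_mut();
        queue.clear();
        for _ in 0..queued_jobs {
            queue.push_back(extract_embeddings as Job);
        }
    });
    let first = QUEUE.with(|queue| queue.borrow_mut().pop_front());
    match first {
        Some(job) => job(),
        None => Ok(()),
    }
}

fn main() {
    // A single job has nothing to steal, so nothing goes wrong.
    println!("one job:  {:?}", run_scenario(1));
    // With a second job queued, step 6 happens: block_on inside block_on.
    println!("two jobs: {:?}", run_scenario(2));
}
```

In this model the failure only shows up when another extraction job is still queued at the moment the thread waits, which mirrors why the real panic depended on timing.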

This outcome required a pretty specific sequence of events to occur, which is why it was difficult to reproduce.

Still, the underlying issue is that the same thread belonged to both the tokio and the rayon runtimes, with the yields from rayon conflicting with the thread-local state from tokio.

Takeaway

I wish we had a statically verified way of expressing "this thread belongs to rayon already, you cannot use it to start a tokio runtime". The proper way of mixing tokio and rayon is not to: have them communicate via channels instead.
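
As a rough illustration of the channel approach, here is a minimal sketch using only standard library threads and channels. Everything in it is an assumption made for the example: `EmbedRequest`, `fake_embedding` and `embed_all` are hypothetical names, plain threads stand in for rayon jobs, and the dedicated I/O thread is where a real program would own its tokio runtime and call block_on.

```rust
use std::sync::mpsc;
use std::thread;

/// A request from a CPU-bound worker to the I/O side, with a channel for the reply.
struct EmbedRequest {
    text: String,
    reply: mpsc::Sender<Vec<f32>>,
}

/// Hypothetical placeholder for the real network call.
fn fake_embedding(text: &str) -> Vec<f32> {
    vec![text.len() as f32]
}

fn embed_all(texts: Vec<String>) -> Vec<Vec<f32>> {
    let (request_tx, request_rx) = mpsc::channel::<EmbedRequest>();

    // The only thread that would be allowed to own an async runtime.
    let io_thread = thread::spawn(move || {
        // The loop ends once every sender has been dropped.
        for request in request_rx {
            let embedding = fake_embedding(&request.text);
            // Ignore the error if the requesting worker has gone away.
            let _ = request.reply.send(embedding);
        }
    });

    // Plain threads standing in for jobs running in the data-parallel pool.
    let workers: Vec<_> = texts
        .into_iter()
        .map(|text| {
            let request_tx = request_tx.clone();
            thread::spawn(move || {
                let (reply_tx, reply_rx) = mpsc::channel();
                request_tx
                    .send(EmbedRequest { text, reply: reply_tx })
                    .expect("the I/O thread should be running");
                // A plain blocking receive: the worker never enters a runtime itself.
                reply_rx.recv().expect("the I/O thread should reply")
            })
        })
        .collect();

    // Drop the original sender so the I/O thread can exit when the workers finish.
    drop(request_tx);

    let results = workers
        .into_iter()
        .map(|worker| worker.join().expect("worker thread panicked"))
        .collect();
    io_thread.join().expect("I/O thread panicked");
    results
}

fn main() {
    let embeddings = embed_all(vec!["hi".to_string(), "tokio".to_string()]);
    println!("{embeddings:?}");
}
```

The point of the design is that each thread keeps a single colour: workers only ever block on a channel, and the thread that talks to the network never picks up a job from the pool.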

In my case, I first released a quick fix in Meilisearch v1.6.1 that made sure we no longer called rayon from inside block_on, getting rid of the "async rayon sandwich".

Then eventually, as async was merely a convenience to send HTTP requests, I rewrote the code to use the async-less ureq instead.
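
Dropping async does not have to mean dropping the backoff strategy. The following is a hedged sketch of what a blocking retry helper could look like, not the code that shipped in Meilisearch: `retry_with_backoff` is a hypothetical name, and the `operation` closure stands in for a blocking HTTP call (no ureq API is shown here).

```rust
use std::thread;
use std::time::Duration;

/// Calls `operation` until it succeeds or `max_attempts` is reached,
/// sleeping between attempts and doubling the delay after each failure.
/// `operation` receives the 1-based attempt number and always runs at least once.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    initial_delay: Duration,
    mut operation: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = initial_delay;
    let mut attempt = 1;
    loop {
        match operation(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(error) if attempt >= max_attempts => return Err(error),
            Err(_) => {
                thread::sleep(delay);
                delay = delay.saturating_mul(2);
                attempt += 1;
            }
        }
    }
}

fn main() {
    // Simulated transient failure: the first two attempts fail, the third succeeds.
    let result: Result<&str, &str> =
        retry_with_backoff(5, Duration::from_millis(1), |attempt| {
            if attempt < 3 {
                Err("transient failure")
            } else {
                Ok("embedding received")
            }
        });
    println!("{result:?}");
}
```

A real version would likely retry only on errors that are actually transient and cap the delay, but the shape stays the same: a blocking loop on the current thread, with no runtime to collide with.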