Naming is hard

Naming is hard. After all, it is one of the two hard problems of computer science, with cache invalidation and off-by-one errors.

Mostly we would assume that poorly chosen type and variable names would merely make the code harder to read, but turns out they can lead to undefined behaviour, too. And no, I'm not talking about the reserved identifiers in C (I mean, who would use _Foo or __init__ as identifiers? Nonsense!), but rather something a bit more subtle.

When providing an API, if your types are named like types that usually own their data (Vector, String, Map, ...) then users of your API will use your types like types that own their data. When the types are actually reference types that don't own their data, this leads to lifetime errors, either compile time or hilariously runtime, depending on your language of choice. You would think that users know better and that documentation suffices, but listen to this curious example.

Cap'n Proto is an IPC format and RPC system. Think protobuf, but without encoding and decoding of the data in messages. Data is obtained through pointer chasing in the message. Cap'n Proto provides the best interface language usable for cross language RPC, in my opinion. I chose it to replace (using the strangler fig pattern) our then very legacy network layer that was providing communication between our Python API client and our C++ server.

So I embedded the legacy network layer inside the new Cap'n Proto layer, wrote a minimal "pure capnp" service, a first feature benefitting from the new layer, provided documentation and quick starts for my team, and several weeks later, a colleague came to see me with a dismayed look on his face.

"So I tried to use the new stack, but there's an issue in my service. It returns a list of stuff, and when the list contains only a few elements, it works as expected, but when the list is big, the client segfaults. Can you have a look at the code?"

Uh, segfault. Doesn't bode well for Cap'n Proto. Surely the issue lies with our code. So let's take a look indeed. My colleague had written a small interface file for his service, that we could simplify to:

@0x870928ae94aa7710;

interface Stuff {
    items @0 (size: UInt64) -> (list :List(UInt64));
}

A simplified C++ server implementation:

#pragma once

#include <cstdint>

#include "stuff.capnp.h"

class StuffImpl final : public Stuff::Server
{
  auto items(ItemsContext ctx) -> ::kj::Promise<void> final
  {
    auto size = ctx.getParams().getSize();
    auto list = ctx.getResults().initList(size);
    std::uint64_t i = 0;
    for (auto i = 0; i < size; ++i) {
      list.set(i, i); // just put `i` in the element of index `i`
    }
    return kj::READY_NOW;
  }
};

And then a simplified C++ client calling the service:

#include <cstdint>
#include <vector>

#include <capnp/ez-rpc.h>
#include <fmt/core.h>

#include "stuff.capnp.h"

constexpr int read_size = 0xffff;

auto main() -> int
{
  capnp::EzRpcClient client("127.0.0.1:8000");
  Stuff::Client stuff = client.getMain<Stuff>();
  auto& wait_scope = client.getWaitScope();
  auto request = stuff.itemsRequest();
  request.setSize(read_size);
  const capnp::List<std::uint64_t>::Reader list =
      request.send().wait(wait_scope).getList();
  std::vector<std::uint64_t> items;
  items.reserve(list.size());
  for (auto i : list) {
    items.push_back(i);
  }
  for (auto i : items) {
    fmt::print("{}\n", i);
  }
}

Running the main with read_size = 0xff does not segfault, but returns wrong values after a while:

# ...
242
243
244
245 # values are normal up to this point
1057 # hu?
140479108999968 # wtf?
140479108999968
0 # wtf wtf???
0
251 # let's resume.
252
253
254

Running with read_size = 0xffff; causes:

[1]    63094 segmentation fault (core dumped)  ./client

Perhaps you already have it, taking into account the context of the article, but at the time the cause completely eluded me, after a cursory read of the code.

So I turned to tooling.

CodeChecker (clang-tidy, clang static-analyzer, cppcheck)

All three static analyzers reported success, no report created.

Address sanitizer

No longer segfaults, displays correct result. No error reported though.

UB sanitizer

Segfaults, no error reported.

valgrind

Oh! as often, valgrind is the most competent at actually finding an issue (I love valgrind ❤️... btw, valgrind rhymes with "tinned", not "find"), let's see what it found:

==58150== Invalid read of size 8
==58150==    at 0x10C5EE: capnp::_::DirectWireValue<unsigned long>::get() const (endian.h:77)
==58150==    by 0x10D8FB: unsigned long capnp::_::ListReader::getDataElement<unsigned long>(unsigned int) const (layout.h:1212)
==58150==    by 0x10CF58: capnp::List<unsigned long, (capnp::Kind)0>::Reader::operator[](unsigned int) const (list.h:116)
==58150==    by 0x10C187: capnp::_::IndexingIterator<capnp::List<unsigned long, (capnp::Kind)0>::Reader const, unsigned long>::operator*() const (list.h:56)
==58150==    by 0x10A57B: main (client.cpp:22)

OK, so something was up with that list, a use-after-free. Unfortunately while valgrind was dutifully reporting the point where the list's data had been freed, the backtrace only linked to the internals of Cap'n Proto. Why was it freed?

That list was part of a type generated by Cap'n Proto from the interface file, so I had to forage through its generated header to find it. It ultimately boiled down to a kj::ArrayPtr, which was fortunately documented to be a non-owning view of data. So, the list itself hadn't been freed, rather it was referring to data from the response! The response data was freed in the chain of calls .send().wait().getList()

Knowing this, the fix was simple:

auto main() -> int
{
  capnp::EzRpcClient client("127.0.0.1:8000");
  Stuff::Client stuff = client.getMain<Stuff>();
  auto& wait_scope = client.getWaitScope();
  auto request = stuff.itemsRequest();
  request.setSize(read_size);
  auto response = request.send().wait(wait_scope); // 👈 ADDED
  const capnp::List<std::uint64_t>::Reader list =
      response.getList();
      // ☝️ CHANGED
  std::vector<std::uint64_t> items;
  items.reserve(list.size());
  for (auto i : list) {
    items.push_back(i);
  }
  for (auto i : items) {
    fmt::print("{}\n", i);
  }
}

By putting the response in an intermediate local variable, we make sure its lifetime lasts for at least as long as the list object, which indeed references the data from the response.

Now the surprising part with this bug is how recurring it became, despite us knowing very well it could happen.

Oops I did it again

Sometime later, I wrote a Rust client for the same service as a proof-of-concept of cross-language capabilities for Cap'n Proto.

I wrote:

/// Most of the boilerplate retrieved from
/// <https://github.com/capnproto/capnproto-rust/blob/master/capnp-rpc/examples/hello-world/main.rs>
/// see their license

pub mod stuff_capnp {
    include!(concat!(env!("OUT_DIR"), "/stuff_capnp.rs"));
}

use crate::stuff_capnp::stuff;
use capnp_rpc::{rpc_twoparty_capnp, twoparty, RpcSystem};

use futures::AsyncReadExt;

const READ_SIZE: u64 = 0xFFFF;

#[tokio::main(flavor = "current_thread")]
pub async fn main() -> Result<(), Box<dyn std::error::Error>> {
    tokio::task::LocalSet::new()
        .run_until(async move {
            let stream = tokio::net::TcpStream::connect(&"localhost:8000").await?;
            stream.set_nodelay(true)?;
            let (reader, writer) =
                tokio_util::compat::TokioAsyncReadCompatExt::compat(stream).split();
            let rpc_network = Box::new(twoparty::VatNetwork::new(
                reader,
                writer,
                rpc_twoparty_capnp::Side::Client,
                Default::default(),
            ));
            let mut rpc_system = RpcSystem::new(rpc_network, None);
            let stuff: stuff::Client = rpc_system.bootstrap(rpc_twoparty_capnp::Side::Server);

            tokio::task::spawn_local(rpc_system);

            let mut request = stuff.items_request();
            request.get().set_size(READ_SIZE);

            let list = request.send().promise.await?.get()?.get_list()?;

            let v: Vec<u64> = list.iter().collect();

            for i in v {
                println!("{i}");
            }

            Ok(())
        })
        .await
}

Doing so, I accidentally reproduced the issue my colleague had run into... can you spot it?

Now, Rust is not C++, fortunately it is memory safe, so the compiler helpfully said:

❯ cargo check
    Checking cxxnproto v0.1.0 (/home/dureuill/exp/rust/cxxnproto)
error[E0716]: temporary value dropped while borrowed
  --> src/main.rs:34:24
   |
34 |             let list = request.send().promise.await?.get()?.get_list()?;
   |                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                   - temporary value is freed at the end of this statement
   |                        |
   |                        creates a temporary value which is freed while still in use
35 |
36 |             let v: Vec<u64> = list.iter().collect();
   |                               ---- borrow later used here
   |
help: consider using a `let` binding to create a longer lived value
   |
34 ~             let binding = request.send().promise.await?;
35 ~             let list = binding.get()?.get_list()?;
   |

For more information about this error, try `rustc --explain E0716`.
error: could not compile `cxxnproto` (bin "cxxnproto") due to previous error

It even provided the fix 😅. I would then go and do it again in a different service, and have another colleague do it too. All while knowing that it could happen...

So, apart from the fact that switching from C++ to Rust reduced the debugging session from two hours of two engineers to 10 seconds of a single one, what is the moral of this story?

Because the Cap'n Proto type is named List, we assume that it refers to owned data and make wrong assumptions about lifetimes. This is compounded by the fact that other Cap'n Proto types (integers, boolean, and some Text) are immediately copied from the message and so we don't see the reference semantics when using them.

Is this on Cap'n Proto? Honestly, I don't know. On the one hand, the code documents that the type is a view, and the generated method is named after the user-provided method name in the interface file. When you think about it, it is also perfectly normal that the list doesn't own its data, as doing so would incur a copy from the data of the received message to the List object (or some fashion of reference counting). On the other hand, I heard from a friend the other day, who works also in C++ albeit in a totally different field, that he was happening to use Cap'n Proto as well. Can you guess which bug his team ran into?