Saturday, June 7, 2014

Hacking on Receive Side Scaling (RSS) on FreeBSD

RSS is a Microsoft invention that tries to keep a given TCP or UDP flow (and I think IP, but I haven't yet tried that) on a given CPU core. The idea is to try and keep both flow-local data and flow-local locking on a single CPU core, increasing the chances that data is hot in the CPU core cache and reducing the chance of lock overhead.

You can find the RSS overview and programming details here:

http://msdn.microsoft.com/en-us/library/windows/hardware/ff567236(v=vs.85).aspx

RSS and supporting technology has been making its way into FreeBSD for quite some time but it's not in any real shape that application developers can take advantage of.

Firstly, there's "PCBGROUPS", which looks to group PCB (protocol control block) data for a connection local to a CPU. Instead of there being one global PCB table for the system (well, VIMAGE for FreeBSD - each virtual image instance has its own PCB table) with one lock protecting it, there's now multiple PCB tables, one per "thing". Here, the thing is whatever the kernel developer thinks is worth grouping them by.

http://www.ece.rice.edu/~willmann/pubs/paranet_usenix.pdf

Now, until the RSS work went in, this code was in FreeBSD but sat unused. A kernel developer could provide the hooks needed to map TCP (and maybe UDP later) flows to a "thing" and have that map to a PCB group table - but it required some glue to stamp incoming connections and outgoing packets with some identifier (which we call a "flowid" in FreeBSD) with something that can map to said "thing". Then whenever a PCB lookup was needed, it would first try the lookup in the table mapped to by the mapping between the "flowid" and "thing" - if it was successful, it wouldn't have to use the global PCB table to do the lookup.

This is only good for established connections - creating and destroying a connection still requires manipulating that global PCB table and the single PCB table lock. I'm going to ignore fixing that for now, as that is a bigger issue.

Then Robert Watson added the RSS work done under contract to Juniper Networks, Inc. RSS provides one kind of mapping between the flowid from the NIC and which CPU to run work on. So that part worked great - but there wasn't any way for the application user to take advantage of it. Additionally, there's no driver awareness of it yet - I'll discuss this shortly.

So I grabbed a bunch of this work whilst at Netflix and tried to make sense of it. It turns out that if you can keep the work local to a CPU, a lot of the lock contention in the networking stack melts away. Here's what's going on:

  • The receive thread(s) in the NIC driver processing packets are typically doing direct dispatch to the network stack - so they're running the receive side of the TCP stack;
  • .. and the receive side of the network stack includes ACKs, which triggers the transmit side of the network stack;
  • There's typically some deferred thread(s) in the NIC driver transmitting packets to each NIC queue;
  • There's also application threads trying to queue data to the TCP socket, which also can dig into the socket and TCP stack state, which involves grabbing locks;
  • And there's also timers firing to update state, and doing this involves grabbing locks.
Without RSS and without lining everything up on CPU cores, all the above can run on different cores. Whenever any of them try running at the same time, lock contention can occur and that particular task can stop. If the lock contention blocks the transmit or receive NIC threads, then not only is that connection affected - the whole NIC processing is affected.

There's still lock contention in the network stack - especially if you're doing a lot of new, short connections. The good folk at Verisign are working on that particular corner of the problem so I'm happy to defer to them.

So, I ended up doing a bunch of little pieces to get this lined up right:
  • The per-CPU timer callwheels can now be optionally pinned to their CPU cores, so timer events running on CPU X actually do run on CPU X (yes, that was amusing to find..);
  • There's support in the TCP stack for per-CPU timers, but it's not enabled by default;
  • ... and it also didn't query RSS, netisr or anything to figure out how to map a flowid to a given CPU to run a timer on;
  • Then to make matters worse, incoming TCP sessions didn't have a flowid assigned to the PCB until after the first data packet was read - which meant that the initial timer work would all assume CPU 0 and any queries on that particular PCB would return flowid=0 - so it would not find it in the right PCBGROUP.
So those are fixed in FreeBSD-HEAD. The per-CPU TCP timer and pinned-CPU timers aren't enabled by default - I'll only flip that on when I'm confident that the RSS stuff is working.

So that lets all the RSS stuff correctly work. But there wasn't a nice way to query the per-connection flowid or RSS information. So I then extended netstat to have 'R' as a flag - it returns the flowid and the flowid type. I'll add RSS information once I have a nice way to extract it out in bulk. It's still a good diagnostic tool to ensure that the IPv4/IPv6 hashing is working correctly.

Then I had to teach a driver about RSS so I could actually test it all out. I have some igb(4) hardware at home, so I did the minimal work required to teach it about the RSS key and assigning things to the correct CPUs. It's still incomplete but it's good enough to get off the ground. I'll go into more details about the driver requirements in a follow-up blogpost.

Finally, how are application developers supposed to use it? I'll cover that particular bit in another follow-up blog post as there's quite a lot to cover there.