Friday, April 10, 2015

Intel DDIO, LLC cache, buffer alignment, prefetching, shared locks and packet rates.

I've been digging into the low-level behaviour of high throughput packet classification and forwarding for my job. The initial suggestion from everyone was "use netmap!", which is cool, but it only really does fast packet work if you're just flipping packets between receive and transmit rings. Once you start actually looking into the payload, you start taking memory misses and things can slow down quite a bit. An L3 miss (ie, a RAM access) on Sandy Bridge is ~50ns. (There's also the cost of walking the TLB, but I won't cover that here.)

For background: http://7-cpu.com/cpu/SandyBridge.html .

But! Intel has this magical thing called DDIO. In theory (and there's a lot of theory here), DMA is done into a small (~10%) slice of the LLC (L3) cache, which is shared between all cores. If the data is already in cache when the CPU accesses it, the access is quick. Also, if you then wish to DMA out data that's sitting in cache, it doesn't have to get flushed to memory first - it's just DMAed straight out of the cache.

However! When I was doing packet bridge testing (using netmap + the bridge example, 64 byte payloads), I noticed I was using a significant amount of memory bandwidth. It wasn't quite at the rate of 10G worth of bridged data, but DDIO should have been doing almost all of that work for me at 64 byte payloads.

So, to reproduce: run netmap bridge (eg 'bridge -i netmap:ix0 -i netmap:ix1') and run pkt-gen between two nodes.
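If you haven't driven pkt-gen before, the transmit side is something like the following - flag spellings from memory, so check pkt-gen's usage output on your netmap build:

pkt-gen -i netmap:ix0 -f tx -l 64

.. on the sending node's interface, and the same with '-f rx' on the other node to sink the traffic.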

This is the output of 'pcm-memory.x 1' from the intel-pcm toolkit (which is available as a binary package on FreeBSD.)

---------------------------------------||---------------------------------------
--                   System Read Throughput(MB/s):    300.68                  --
--                  System Write Throughput(MB/s):    970.81                  --
--                 System Memory Throughput(MB/s):   1271.48                  --
---------------------------------------||---------------------------------------

The first theory: the bridging isn't occurring fast enough to service what's in the LLC before it gets flushed out by subsequent packets. So, assume:

  1. DDIO gets 1/10th of the LLC - on an 8 core box with 2.5MB of LLC per core (20MB total), that's ~2MB.
  2. 64 byte payloads are being cached.
  3. Perfect (!) LLC use.
That's room for 32,768 64-byte packets at a time. Now, netmap is doing ~1000 packets a batch and it's keeping up with line rate bridging on one core (~14 million packets per second), so it's not likely that the DDIO slice is being overrun.

Ok, so what if it's not perfect LLC usage?

Then I thought back to cache line aliasing and the other issues I've previously written about. What if the buffers are all power-of-two aligned (say, 2048 byte aligned)? The cache line aliasing effects would then show up as low LLC utilisation.
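As a rough sanity check (assuming a simple set-indexed cache with 64 byte lines - the real LLC hashes addresses across slices, so this is only approximate): a 2048 byte stride spans 2048 / 64 = 32 distinct cache line offsets. If every packet buffer starts on a 2048 byte boundary, the first 64 bytes of every packet - the only part being touched when bridging - all land on the same 1/32nd of the cache sets. The rest of the cache goes largely unused while that 1/32nd thrashes.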

Luckily netmap has a twiddle for this - 'dev.netmap.buf_size' / 'dev.netmap.priv_buf_size'. They're both 2048 by default. So yes, the default buffer sizes are power-of-two aligned, and there's likely some very poor LLC utilisation going on.

So, I tried 1920 - that's 2048 - (2 * 64) - ie, two cache lines less than 2048.
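For reference, those are just sysctls (assuming your kernel exposes them as writable, which mine does):

sysctl dev.netmap.buf_size=1920
sysctl dev.netmap.priv_buf_size=1920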


---------------------------------------||---------------------------------------
--                   System Read Throughput(MB/s):    104.92                  --
--                  System Write Throughput(MB/s):    382.32                  --
--                 System Memory Throughput(MB/s):    487.24                  --
---------------------------------------||---------------------------------------

It's now using significantly less memory bandwidth to do the same thing. I'm guessing this is because I'm now using the LLC much more efficiently.

Ok, so that's nice - but what about when it comes time to actually look at the packet contents to make decisions?

I've modified a copy of bridge to do a few things, mostly inspired by netmap-ipfw (a rough sketch of the per-packet path follows the list):
  • It does batch receive from netmap;
  • but it then looks at the ethernet header to decapsulate it;
  • then it gets the IPv4 src/dst addresses;
  • .. and looks them up in a (very large) traditional hash table.
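Very roughly, the per-packet path looks like the sketch below. This isn't the exact code I'm running - the netmap ring walk around it is elided, and lookup_v4() is a hypothetical stand-in for the hash table lookup (there's a sketch of that further down):

#include <sys/types.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <netinet/in_systm.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <net/ethernet.h>
#include <net/netmap_user.h>

/* Hypothetical: returns non-zero if the address is in the hash table. */
int lookup_v4(uint32_t addr);

/*
 * Classify one received frame; returns 1 to forward, 0 to block.
 * (For the tests here the result is computed but ignored - everything
 * gets forwarded regardless.)
 */
static int
classify_one(struct netmap_ring *ring, struct netmap_slot *slot)
{
	char *buf = NETMAP_BUF(ring, slot->buf_idx);
	struct ether_header *eh = (struct ether_header *)buf;
	struct ip *iph;

	/* Only IPv4 gets classified; everything else is just bridged. */
	if (eh->ether_type != htons(ETHERTYPE_IP))
		return (1);

	iph = (struct ip *)(eh + 1);

	/* Source and destination lookups in the (large) hash table. */
	if (lookup_v4(iph->ip_src.s_addr) || lookup_v4(iph->ip_dst.s_addr))
		return (0);
	return (1);
}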
I also have a modified copy of pkt-gen that will use completely random source and destination IPv4 addresses and ports, so as to elicit some very terrible behaviour.

With an empty hash table, but still dereferencing the ethernet header and IPv4 source/destination addresses, handling one packet at a time, with no batching, no prefetching and only one core/thread running:

buf_size=2048:
  • Bridges about 6.5 million pps;
  • .. maxes out the CPU core;
  • Memory access: 1000MB/sec read; 423MB/sec write (~1400MB/sec in total).
buf_size=1920:
  • Bridges around 10 million pps;
  • 98% of a CPU core;
  • Memory access: 125MB/sec read, 32MB/sec write, ~ 153MB/sec in total.
So, it's a significant drop in memory throughput and a massive increase in pps for a single core.

Ok, so most of the CPU time is now spent looking at the ethernet header in the demux routine and in the hash table lookup. It's a blank hash table, so the lookup is just the memory access needed to see whether the bucket has anything in it. I'm guessing the remaining cost is the CPU loading the ethernet and IP headers into a cache line - they're not already sitting there courtesy of DDIO.

I next added in prefetching of the ethernet header. I don't have that version of the code around any more, so I can't report numbers at the moment. What I did there was loop over everything in the netmap RX ring, dereference the ethernet header, and then do the per-packet processing. That was interesting, but I wanted to try batching next. So, after some significant refactoring, I arranged the code to look like this (there's a sketch of the loop after the list):
  1. Pull in up to 1024 entries from the netmap receive ring;
  2. Loop through them, up to 16 at a time, placing them into a batch;
  3. For each batch, run the following stages:
    1. For each packet in the batch: optional prefetch on the ethernet header
    2. For each packet in the batch: decapsulate ethernet/IP header;
    3. For each packet in the batch: optional prefetch on the hash table bucket head;
    4. For each packet in the batch: do hash table lookup, decide whether to forward/block
    5. For each packet in the batch: forward (ie, ignore the forward/block for now.)
I made the prefetch steps optional and the batch size configurable, so I could toggle each independently.
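Roughly, the refactored loop looks like the sketch below. It's simplified - I gather batches straight off the ring here rather than staging up to 1024 slots first - and bucket_for(), lookup_bucket() and forward_slot() are hypothetical stand-ins for my code:

#include <stdint.h>
#include <net/netmap_user.h>

#define	BATCH_SIZE	4		/* 1, 2 or 4 in the tests below */

/* Hypothetical helpers standing in for the real code. */
struct bucket;
struct bucket *bucket_for(const char *pkt);	/* hash the IPv4 addresses */
int lookup_bucket(const struct bucket *b, const char *pkt);
void forward_slot(struct netmap_slot *slot, int allow);

static void
process_rx_ring(struct netmap_ring *ring, int eth_prefetch, int bucket_prefetch)
{
	while (!nm_ring_empty(ring)) {
		struct netmap_slot *batch[BATCH_SIZE];
		struct bucket *bkt[BATCH_SIZE];
		uint32_t cur = ring->cur;
		int i, n = 0;

		/* Gather up to BATCH_SIZE slots from the receive ring. */
		while (n < BATCH_SIZE && cur != ring->tail) {
			batch[n++] = &ring->slot[cur];
			cur = nm_ring_next(ring, cur);
		}

		/* Optional: prefetch the ethernet headers for the batch. */
		if (eth_prefetch)
			for (i = 0; i < n; i++)
				__builtin_prefetch(NETMAP_BUF(ring, batch[i]->buf_idx));

		/* Decapsulate; optionally prefetch the hash bucket heads. */
		for (i = 0; i < n; i++) {
			bkt[i] = bucket_for(NETMAP_BUF(ring, batch[i]->buf_idx));
			if (bucket_prefetch)
				__builtin_prefetch(bkt[i]);
		}

		/* Lookup, then forward (the allow/block result is ignored for now). */
		for (i = 0; i < n; i++)
			forward_slot(batch[i],
			    lookup_bucket(bkt[i], NETMAP_BUF(ring, batch[i]->buf_idx)));

		/* Hand the consumed slots back to the kernel. */
		ring->head = ring->cur = cur;
	}
}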

So, with an empty hash table, no prefetching and only changing the batch size, at buf_size=1920:
  • Batch size of 1: 10 million pps;
  • Batch size of 2: 11.1 million pps;
  • Batch size of 4: 11.7 million pps.
Hm, that's cute. What about with prefetching of the ethernet header? At buf_size=1920:
  • Batch size of 1: 10 million pps;
  • Batch size of 2: 10.8 million pps;
  • Batch size of 4: 11.5 million pps.
Ok, so that's not that useful. Prefetching the hash bucket head here isn't worthwhile either, because the buckets are all empty (and thus just NULL pointers.)

But, I also want to be doing hash table lookups. I loaded in a reasonably large set of entries (~6 million), and I absolutely accept that a traditional hash table is not exactly memory- or cache-footprint friendly - I was specifically after what the performance is like for a traditional hash table. Said hash table has 524,288 buckets, each of which points to an array of IPv4 addresses to search. So yes, not very optimal by any measure, but it's the kind of thing you'd expect to find in an existing project.
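The shape of it is roughly this (names are mine, and the multiplicative hash is just a placeholder - the point is the bucket-of-IPv4-arrays layout, not the hash function):

#include <stddef.h>
#include <stdint.h>

#define	NBUCKETS	524288		/* later bumped to 2097152 */

struct bucket {
	uint32_t	nentries;
	uint32_t	addrs[];	/* array of IPv4 addresses to scan */
};

static struct bucket *buckets[NBUCKETS];	/* NULL when the bucket is empty */

/* Placeholder hash - Knuth-style multiplicative; not necessarily what I use. */
static inline uint32_t
hash32(uint32_t x)
{
	return (x * 2654435761U);
}

int
lookup_v4(uint32_t addr)
{
	struct bucket *b = buckets[hash32(addr) & (NBUCKETS - 1)];
	uint32_t i;

	if (b == NULL)
		return (0);

	/* Linear scan; ~11 entries per bucket with 6M entries / 512K buckets. */
	for (i = 0; i < b->nentries; i++)
		if (b->addrs[i] == addr)
			return (1);
	return (0);
}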

With no prefetching, and a 6 million entry hash table:

At 2048 byte buffers:
  • Batch size of 1: 3.7 million pps;
  • Batch size of 2: 4.5 million pps;
  • Batch size of 4: 4.8 million pps.
At 1920 byte buffers:
  • Batch size of 1: 5 million pps;
  • Batch size of 2: 5.6 million pps;
  • Batch size of 4: 5.6 million pps.
That's a very inefficient hash table - with ~6 million entries spread over 524,288 buckets, each bucket holds around 11 IPv4 addresses, and scanning them means reading ~44 bytes of addresses - most of a 64 byte cache line. Not very nice. But it does fit within a single cache line worth of data, so in theory it's not too terrible.

What about with prefetching? All at 1920 byte buffers:
  • Batch size of 4, ethernet prefetching: 5.5 million pps
  • Batch size of 4, hash bucket prefetching: 7.7 million pps
  • Batch size of 4, ethernet + hash bucket prefetching: 7.5 million pps
So in this instance, there's no real benefit from doing prefetching on both.

For one last test, let's bump the bucket count from 524,288 to 2,097,152. These again are all at buf_size=1920:
  • Batch size of 1, no prefetching: 6.1 million pps;
  • Batch size of 2, no prefetching: 7.1 million pps;
  • Batch size of 4, no prefetching: 7.1 million pps;
  • Batch size of 4, hash bucket prefetching: 8.9 million pps.
Now, I didn't quite predict this. I figured that since I was reading in the full cache line anyway, having up to 11 entries in it to linearly check would be cheap. It turns out that no, that's not exactly true. (With 2,097,152 buckets, each bucket drops to around 3 entries, and the numbers above show that matters.)

The difference between the naive way (no prefetching, no batching) and 4-packet batching with hash bucket prefetching is not trivial - it's ~50% faster. Going all the way to the larger bucket count as well was ~75% faster. Now, this hash implementation is not exactly cache-footprint friendly - it's bigger than the LLC, so with random flows (and thus no useful cache locality) it's going to degrade to quite a few memory accesses per packet.

This has been quite a fun trip down the optimisation peephole. I'm going to spend a bunch of time writing up the hardware performance counters involved in analysing this stuff, and I'll look to write a follow-up post with the details.

One final thing: threads and locking. I wanted to clearly demonstrate the cost of shared read locks on a setup like this. There have been lots of discussions about the right kind of locking and concurrency strategies, so I figured I'd do a simple test in this setup and show how bad it can get.

So, with no read locks between threads on the hash table, a batch size of 4, hash bucket prefetching, and buf_size=1920:
  • 1 thread: 8.9 million pps;
  • 4 threads: 12 million pps.
But with a read lock on the hash table lookups:
  • 1 thread: 7 million pps;
  • 4 threads: 4.7 million pps.
I'm guessing that as I add more threads, the performance will keep dropping.

Even taking an rwlock as a reader lock in pthreads is expensive - on FreeBSD it's essentially just an atomic increment/decrement, but it's still not free. I'm taking the lock once per packet, covering both hash table lookups - ie, the source and destination IP lookups are done under one lock. I'm sure that if I took the lock once for a whole batch of lookups it'd work out a little better on a small number of CPU cores, but I think this demonstrates my point - read locks aren't going to cut it when you have a frequently accessed structure to protect.
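For reference, the locked path is roughly this shape - one pthread read lock taken per packet, covering both lookups (reusing the lookup_v4() sketched earlier; the real code differs in the details):

#include <pthread.h>
#include <stdint.h>

int lookup_v4(uint32_t addr);		/* the hash lookup from earlier */

static pthread_rwlock_t table_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Returns non-zero if either address is in the table. */
static int
classify_locked(uint32_t src, uint32_t dst)
{
	int hit;

	/*
	 * Even uncontended, the rdlock/unlock pair is a couple of atomic
	 * ops on a shared lock word - and that cache line bounces between
	 * cores once multiple threads classify packets concurrently.
	 */
	pthread_rwlock_rdlock(&table_lock);
	hit = lookup_v4(src) || lookup_v4(dst);
	pthread_rwlock_unlock(&table_lock);

	return (hit);
}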

The best bit about this post? The prefetching, terrible (large) hash table performance and general cache abuse is not new. Doing batching on superscalar Intel CPUs is not new. Documenting DDIO effectiveness using non-power-of-two-aligned buffer sizes is new, but it's just a rehash of the existing cache aliasing effect. But, I now have a little test bed to experiment with these things without having to try and involve the rest of a kernel.

Yes, I'll publish code soon.