Friday, March 6, 2015

cache line aliasing effects, or "why is freebsd slower than linux?"

There were some threads on the FreeBSD/DragonflyBSD mailing lists a few years ago (2012?) discussing some math benchmarks that ran much slower on FreeBSD/DragonflyBSD than on Linux.

When the same benchmark is run on FreeBSD/DragonflyBSD under the Linux compatibility layer (i.e., a Linux binary compiled for Linux, but run on BSD), it shows the same or better performance.

Some digging was done, and it turned out to be due to memory allocation patterns and memory layout: the jemalloc library allocates large chunks at page-aligned boundaries, whereas the allocator in glibc under Linux does not.
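
As a quick sanity check (this is just an illustrative snippet, not part of the benchmark), you can print the page offset of a few large allocations; with jemalloc the pointers come back page-aligned (offset 0), while glibc's allocator typically does not do that (assuming 4 KiB pages):

    /* Print the page offset of a few large allocations. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>

    int
    main(void)
    {
        for (int i = 0; i < 4; i++) {
            void *p = malloc(1024 * 1024);  /* a "large" allocation */
            printf("alloc %d: %p (page offset %ju)\n",
                i, p, (uintmax_t)((uintptr_t)p & 4095));
        }
        return (0);     /* leaking is fine for a throwaway test */
    }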

I've put the code online in the hope that others can test and verify this:

https://github.com/erikarn/himenobmtxpa

The branch 'local/freebsd' has my local change to allow the allocator offset to be specified. The offset compounds on each allocation - so with an 'n' byte offset, the first allocation is 0 bytes offset from the page boundary, the next is 'n' bytes offset from the page boundary, the next is '2n' bytes offset, etc.
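
For illustration, the compounding-offset behaviour described above could be implemented with a wrapper along the lines of the sketch below. This is my own hedged sketch, not the actual code in the local/freebsd branch: the name offset_malloc is made up, it assumes 4 KiB pages and posix_memalign(), and it doesn't bother remembering the original pointer for free().

    #include <stdlib.h>

    #define PAGE_SIZE   4096

    static size_t alloc_offset_step;    /* the 'n' byte offset */
    static size_t alloc_count;

    /* Return memory whose start is (alloc_count * n) bytes past a page
     * boundary, so successive allocations land at 0, n, 2n, ... bytes
     * from the start of a page. */
    static void *
    offset_malloc(size_t size)
    {
        size_t off = (alloc_count++ * alloc_offset_step) % PAGE_SIZE;
        void *p;

        /* Over-allocate by a page so the offset pointer still has
         * 'size' usable bytes behind it. */
        if (posix_memalign(&p, PAGE_SIZE, size + PAGE_SIZE) != 0)
            return (NULL);
        return ((char *)p + off);
    }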

You can experiment with different values and get completely different behaviour. The effect is non-trivial: using a 127-byte offset for each allocation gives a 100% speedup versus a 0-byte offset.
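
To give a feel for why the offset matters: with (for example) a 32 KiB, 8-way set-associative L1 data cache with 64-byte lines - 64 sets - every page-aligned buffer starts in set 0, so the corresponding parts of each matrix compete for the same few ways; a compounding 127-byte offset spreads the starting addresses across different sets. The toy program below (my own illustration; the cache geometry is an assumption, not a measurement) just prints which set the start of each buffer maps to under the two schemes:

    #include <stdio.h>

    #define LINE_SIZE   64
    #define NUM_SETS    64      /* 32 KiB / 8 ways / 64-byte lines */

    int
    main(void)
    {
        for (int i = 0; i < 4; i++) {
            unsigned long base = 0x100000UL * i;        /* page-aligned chunk */
            unsigned long plain = base;                 /* 0-byte offset */
            unsigned long skewed = base + 127UL * i;    /* compounding 127-byte offset */

            printf("alloc %d: set %lu (offset 0) vs set %lu (offset 127)\n",
                i, (plain / LINE_SIZE) % NUM_SETS,
                (skewed / LINE_SIZE) % NUM_SETS);
        }
        return (0);
    }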

I'd like to investigate cache line aliasing effects further. There was work done a few years ago to offset mbuf headers in the FreeBSD kernel so they weren't all page-aligned or 256/512/1024-byte aligned - and apparently this gave a significant performance improvement, but it was never folded into FreeBSD. What I'd like to do is come up with some better strategies / profiling guides for identifying when this is actually happening, so the underlying objects being accessed can be adjusted.

So - if anyone out there has any tips, hints or suggestions on how to do this, please let me know. I'd like to document and automate this testing.

4 comments:

  1. What you're saying is that well-aligned chunks of memory are more likely to map to the same cache line, and thus they kick each other out of the cache frequently?

    That would mean the optimal strategy is CPU-specific; I figure that would mean beefing up kern.sched.topology_spec with information about cache sizes, line sizes and strategy.
    Then you'd have to figure out which cache level to optimise for (is there a function that works well for multiple levels?).

    I don't know anything about the amd64 cache layouts, but I'm assuming they aren't composed of LRU caches, at least at level 1, because those wouldn't exhibit this kind of behaviour.

    The cheap strategy would be to turn off alignment and just squeeze allocated chunks of memory as close together as possible (word aligned, maybe).

  2. I'm surprised that a 127-byte offset worked at all, let alone improved performance, as every single memory access (except for strings) would be unaligned. The offset needs to be a multiple of 16 or whatever the minimum stack alignment is on your machine, i.e. the lowest common multiple of the alignments of all supported types.

    The unaligned accesses for classic register stuff (i.e., non-SSE, etc.) work; they're just slow.

    The point is that although doing all the accesses unaligned is slower, the gain in cache efficiency trumps that entirely.

  4. Not sure if my last post made it through. Assuming cache utilization isn't horrible to begin with, unaligned memory accesses have negligible to no impact on Sandy Bridge+ either way.
