Monday, September 16, 2013

The gymnastics required to just do a "HALT" for MIPS..

So, it turns out there's no nice, guaranteed way to implement a HALT style setup for MIPS in the idle thread in the kernel. I'll braindump what happened about two years ago to try and address this.

(And I do hope that the implementation actually works!)

When the kernel isn't running any work, it will schedule the "idle" thread to run. This has the simple(!) task of entering a sleep state, only to wake up when another interrupt fires. This saves power consumption and generates less heat.

However, it's a little trickier than just "enter idle state."

The only way to exit it is to receive an interrupt. Now, this can be device interrupts; this can be timer interrupts (and yes; the timer is a kind of device, tsk..) but without it, the CPU will stay halted. This isn't a big deal - the UNIX kernel tends to be one big event processing "thing" and these include both software and hardware events. If it's a software event that needs to happen "now", the kernel won't go into the idle loop - it'll just run the needed event. If it has to happen in the future, it'll schedule it to occur after some time has elapsed - and this will be driven by a timer interrupt. An interrupt would wake up the system from the HALT state. If the interrupt occured just before the HALT instruction was executed, there'd be a very small window where the interrupt would occur - the interrupt routine would be called, complete, and then the HALT instruction would next execute.

Now, imagine you're an ethernet driver on FreeBSD-10 where the interrupt handler just scheduled some ithread to run in the future. Here's one example of what may happen:

  • The idle thread is executed;
  • An interrupt occurs, say to signal completion of some ethernet frame transmission;
  • The CPU kicks off the interrupt code;
  • The interrupt code doesn't find a fast interrupt handler but finds an ithread; so it schedules the ithread to run;
  • The scheduler goes to choose another thread to run and finds an ithread scheduled at a higher priority, so it schedules that;
  • The ethernet driver code in the ithread runs;
  • The scheduler then exits and re-enters the idle loop.
Everything is fine, right?

But, with the event timer changes that came in during the 9.x time frame, the halt code is now in a critical section.

So now, the following happens:

  • The idle thread is executed;
  • critical_enter() is called;
  • An interrupt occurs, which calls the fast handler, which schedules the ithread to run;
  • .. but critical_enter() has set a flag that says "no preemption", so the ithread doesn't get to run;
  • Control is returned to the idle function;
  • The idle thread gets to the point where it executes "HALT";
  • The idle thread continues to run, executing HALT.
In this instance, the ithread is scheduled but can't run before HALT runs. The ithread only runs after the next interrupt occurs.

I noticed this was happening when doing traffic tests on FreeBSD-HEAD between 802.11 and ethernet interfaces. The atheros 802.11 hardware implemented interrupt moderation so there wouldn't be tens of thousands of interrupts a second being generated. But what I saw was occasionally the receive queue being filled by packets and not drained fast enough. When digging into it, I found that due to interrupt moderation, if the interrupt came in just before WAIT was executed, the 802.11 receive function was taking a long time (sometimes up to milliseconds) to run after the interrupt came in and the ithread was actually scheduled. If I had an interrupt for each received packet, the amount of time between interrupts would have been very small (20,000 packets a second, so around 1/20000 sec per interrupt) and this problem would've been masked. But with moderated interrupts, it would be 750 microseconds or so before the next receive interrupt was generated.

Now, this is messy. There's some hacks in the idle loop code to try and skip the halt bit if the scheduler detects there's something to run. But there's still a small race window there which needs to be closed.

How can this be solved?

Apparently - the two instructions STI;HLT on x86 are atomic. Ie, there's no race window between them. If an interrupt comes through, HLT doesn't run. This doesn't happen for MWAIT or the ACPI sleep states and I am concerned we're still possibly hitting this race window from time to time. The specific behaviour is that the STI causes interrupts to be enabled following the next instruction. So yes, there's no window.

All that's left now is to make sure that interrupts are disabled before you do the scheduler check so no new interrupt processing on that thread can be scheduled. Ie, when entering cpu_idle():

  • Call critical_enter();
  • Disable interrupts;
  • See if the scheduler has anything to do - if so, enable interrupts and skip calling the idle loop
  • If there's nothing to do - and since interrupts are disabled, nothing new will have happened (like an interrupt scheduling an ithread) then just call the idle function
    • Which may or may not enable interrupts before entering the idle loop (you enter ACPI with interrupts disabled, but you enter HLT with interrupts enabled)
  • .. and then call critical_exit(), which will let the kernel continue preempting.
For MIPS however, there's a clever hack. (No, not from me.)

Here's how it works:
  • In the idle loop, it calls mips_wait()
  • mips_wait() is a bit of assembly code that will:
    • disable interrupts
    • see if the scheduler sees anything running
    • .. and if so, it doesn't bother running the WAIT instruction! Just enable interrupts and jump over the WAIT;
  • .. but the bit of code that re-enables interrupts and calls WAIT is aligned to a 16 byte boundary and the address is a symbol (MipsWaitStart).
  • Then in the exception handling code (MipsKernIntr), it sees if the instruction pointer where the exception occured is in the 16 bytes (4 instructions) at MipsWaitStart.
  • .. if it is, it adjusts the return address from the interrupt to be after the WAIT instruction.
It's totally dirty and to be quite honest, I haven't at all tested it. Yes, it should be tested.


Tuesday, September 3, 2013

Finding low hanging fruit with PMC, or "O(wtf)" ?

I've lately been focusing on performance counter stuff on Sandy Bridge (Xeon and non-Xeon.) Part of this has been fixing some of the counters that were wrong. Part has been digesting the Intel tuning guides and the Intel micro-architecture for Sandy Bridge. It's a little different to the older school pipeline driven architecture that rules the MIPS world.

So, I fired up some of my scripts (at http://github.com/erikarn/hwpmc) on a live cache pushing a whole lot of live video netflix traffic. The scripts use the PMC framework in global counter mode rather than sampling mode, so it's cheap to do and doesn't affect performance.

What I found:

  1. The pipeline slots per cycle metric is around 16% - so there's a lot of stalling going on.
  2. There's a lot of memory traffic going on - around 50% of clock cycles are spent in LLC_MISS - ie, it wasn't in L1, L2 or L3/LLC (last-level cache) and thus has to be fetched from memory.
So, I started digging into why there were so many memory accesses. It turns out the biggest abuser was the cross-CPU IPI involved in synchronising page mapping tables - there are a few places calling pmap_invalidate_range() as part of sendfile() buffer completion and this was causing issues. I pointed this out, someone else has addressed it internally. (Ideally if the IO path uses unmapped buffers on amd64, there shouldn't be any need to map them in and out of KVA.) I think that saved about 4% of total clock cycles spent being stalled.

Then I found a lot of stalling going on in the mwait and ACPI sleep path. It turns out that these paths seem to involve doing ISA space IO port accesses. These are .. very slow. I've just flipped my testing over to use no mwait and use HLT.

Next - flowtable had been turned on during experimentation and I had noticed that the flowtable expire/flush code would periodically spike up. It spiked up more when more clients and more TCP flows were connected. It showed up in both memory accesses and clock cycles busy PMCs - and the reason made me laugh out loud.

The flowtable uses a bitstring_t - effectively an array of bytes treated as a bitmap, like select() FD_SET's - and would walk this to look for flows to expire.

The expiry code would walk the list looking for flows to expire - it would loop over the entire set, calling ffs() over the whole set to look for the next new flow to check.

.. so looping over looping over the whole set. O(n^2). Right there, in the flow cleaning path. Doing byte offset fetches, rather than 32-bit fetches. Everything about it was ridiculous. As we scaled up to serve more flows the flowcleaner CPU cycle count was spiking really, really hard.

I pointed this out in an email to my coworkers and fell asleep. It was fixed when I awoke - a co-worker fixed it to be correctly O(n) whilst I was sleeping. It's now totally disappeared from the CPU cycle and stall analysis.

So, I've just been chipping away at things here and there. There are some larger scale issues that I really want to address but I'd like to make sure all the odd, silly and remaining low hanging fruit are addressed. Then comes the fun stuff.