Monday, June 30, 2008

Initial Profiling!


Here's the output trace: I'm running it on the Sun X2100 running a flavour of Ubuntu; this is doing ~300mbit FDX at about 9,000 req/sec (tiny transactions!) w/ 1,000 concurrent connections. I'm specifically trying to trace the management overhead versus the data copying overhead. This has maxed out both thttpd on the server side and the TCP proxy itself.



Gah, look at all of those mallocs and stdio calls doing "stuff"..



root@rachelle:/home/adrian/work/cacheboy/branches/CACHEBOY_PRE/app/tcptest# opreport -l ./tcptest | less
CPU: AMD64 processors, speed 2613.43 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name     symbol name
96851    11.3738  libc-2.6.1.so  vfprintf
62317     7.3182  libc-2.6.1.so  _int_malloc
37556     4.4104  tcptest        comm_select
35405     4.1578  tcptest        commSetEvents
32901     3.8638  libc-2.6.1.so  _int_free
30245     3.5518  tcptest        commSetSelect
29890     3.5102  tcptest        commUpdateEvents
28812     3.3836  libc-2.6.1.so  _IO_default_xsputn
20360     2.3910  tcptest        sslSetSelect
17279     2.0292  libc-2.6.1.so  malloc_consolidate
16610     1.9506  libc-2.6.1.so  epoll_ctl
16307     1.9150  tcptest        sslReadServer
16154     1.8971  libc-2.6.1.so  fcntl
14601     1.7147  tcptest        xstrncpy
12003     1.4096  libc-2.6.1.so  memset
11617     1.3643  tcptest        memPoolAlloc
10931     1.2837  libc-2.6.1.so  calloc

First milestone - code reuse!

I've spent the last couple of evenings committing code to break out the last bits of the core event loop. I then added in a chopped up copy of src/ssl.c (the SSL CONNECT tunneling stuff) and voila! I now have a TCP proxy.

A (comparatively) slow TCP proxy (3000 small obj/sec instead of where it should be: ~10,000 small obj/sec). A slow, single-threaded TCP proxy, but a TCP proxy nonetheless.

I can now instrument just the core libraries to find out where they perform and scale poorly, separate from the rest of the Squid codebase. I count this as a pretty big milestone.
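
For the curious, the data path in the proxy is just the comm callbacks shuffling bytes between the two sockets (the chopped-up ssl.c tunnel code). Here's a rough sketch of one direction of the relay, assuming the Squid-2 style commSetSelect()/comm_write() API that the broken-out comm code exposes - RelayState, the buffer size and the handler names are mine, for illustration only:

/* Sketch only: assumes the libiapp comm headers. RelayState and the
 * handler names are invented; commSetSelect()/comm_write()/comm_close()
 * are the existing Squid-2 style calls. EAGAIN handling omitted for brevity. */
#include <unistd.h>

#define RELAY_BUFSZ 16384

typedef struct {
    int client_fd;
    int server_fd;
    char buf[RELAY_BUFSZ];
} RelayState;

static void relayWriteDone(int fd, char *buf, size_t size, int errflag, void *data);

static void
relayReadClient(int fd, void *data)
{
    RelayState *r = data;
    int len = read(fd, r->buf, RELAY_BUFSZ);

    if (len <= 0) {
        /* EOF or error: tear down both sides */
        comm_close(r->client_fd);
        comm_close(r->server_fd);
        return;
    }
    /* Write the chunk to the server; resume reading once the write completes */
    comm_write(r->server_fd, r->buf, len, relayWriteDone, r, NULL);
}

static void
relayWriteDone(int fd, char *buf, size_t size, int errflag, void *data)
{
    RelayState *r = data;

    if (errflag) {
        comm_close(r->client_fd);
        comm_close(r->server_fd);
        return;
    }
    /* Ask for the next read event on the client side */
    commSetSelect(r->client_fd, COMM_SELECT_READ, relayReadClient, r, 0);
}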

Thursday, June 26, 2008

Why memory allocation is a pain in the ass..

The memory allocator routines in particular are annoying - a lot of work has gone into malloc implementations over the last few years to make them perform -very- well in threaded applications, as long as you know what you are doing. This means doing things like allocating/freeing memory in the same thread and limiting memory exchange between threads (mostly an issue with very small allocations).

Unfortunately, the mempools implementation saves a noticeable amount of CPU because it hides all of the repetitive small memory allocations which Squid does for a variety of things. It's hard to profile too - I see that the CPU spends a lot of time in the allocator, but figuring out which functions are causing the CPU usage is difficult. Sure, I can find out the biggest malloc users by call - but they're not the biggest CPU users according to the oprofile callgraphs. I think I'll end up having to spend a month or so rewriting a few areas of code that account for the bulk of the malloc'ing to see what effect it has on CPU before I decide what to do here.
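
To be clear about what mempools is buying: freed objects go onto a per-pool free list and get handed straight back out, so the repetitive small allocations never touch malloc at all. A toy sketch of the general idea - not the actual MemPool code, names and layout are mine:

#include <stdlib.h>

/* Toy fixed-size pool, illustrative only - not the Squid MemPool code.
 * obj_size must be >= sizeof(pool_item) so a freed object can hold the link. */
typedef struct pool_item {
    struct pool_item *next;
} pool_item;

typedef struct {
    size_t obj_size;       /* size of each object handed out */
    pool_item *free_list;  /* freed objects cached for reuse */
} toy_pool;

static void *
toy_pool_alloc(toy_pool *p)
{
    if (p->free_list) {
        /* Reuse a cached object: no malloc(), no allocator bookkeeping */
        pool_item *it = p->free_list;
        p->free_list = it->next;
        return it;
    }
    /* Cache empty: fall back to the system allocator */
    return malloc(p->obj_size);
}

static void
toy_pool_free(toy_pool *p, void *obj)
{
    /* Never returned to libc; pushed onto the pool's free list instead */
    pool_item *it = obj;
    it->next = p->free_list;
    p->free_list = it;
}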

I just don't see the point in trying to thread the mempools codebase for anything other than per-pool statistics when others have been doing a much better job of understanding memory allocation contention on massively parallel machines.

Cacheboy-1.3 (.1) released; short-term future

I've just merged the latest Squid-2.HEAD changes into Cacheboy and released 1.3.1.

1.3 and 1.3.1 fix the Vary-related issues which affect hit rates.

1.3.1 fixes the SNMP counter bugs.

This ends the first set of mostly non-intrusive changes which have been made to the codebase. The next area of work will be pulling out the rest of the event/communications/signal code from src/ and into libiapp/ so I can begin treating "Squid" as a client of "libiapp" - i.e., the libiapp code handles event, fd, communication and event scheduling (disk stuff is still in src/ for now), making callbacks into the Squid application. I can then begin writing a few test applications to give the core and support libraries a good thrashing.

I'll start planning out threading and IPv6 support in the libraries themselves, with the minimum amount of Squid changes required to continue functioning (but still staying in IPv4/non-threaded land.) The plan is to take something like a minimalistic TCP proxy that's been fully debugged and use it as the basis for testing out potential IPv6 and threading related changes, separate from the rest of the application.

My tentative aim is to run the current "Squid" application in just one thread but have the support libraries support threading (either by explicitly supporting concurrency or being labelled as "not locking" and thus callers must guarantee nothing quirky will happen.) The three areas that strike me as being problematic right now are the shared fd/comm state (fd_table[]), the statistics being kept all over the place and the memory allocator routines. (I'll write up the malloc stuff in a different post.)

Tuesday, June 24, 2008

Current CPU usage

Where's the CPU going?

Here's an oprofile output from a very naive custom polygraph workload: ~1000 requests a second, ~14 kbyte objects. MemPools are disabled; zero buffers are off, so the majority of the allocations aren't zeroed.


Note that somehow, memPoolAlloc takes 4% of CPU even with memory pools switched off. The allocations still go via the pool code but deallocations aren't "cached". What the hell is taking the 4% of CPU time?



CPU: AMD64 processors, speed 2613.43 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples  %       image name     symbol name
176014   6.7039  libc-2.6.1.so  _int_malloc
160779   6.1236  libc-2.6.1.so  memcpy
128371   4.8893  libc-2.6.1.so  malloc_consolidate
123734   4.7127  squid          memPoolAlloc
101514   3.8664  libc-2.6.1.so  free
76772    2.9240  libc-2.6.1.so  _int_free
55696    2.1213  libc-2.6.1.so  malloc
55681    2.1207  libc-2.6.1.so  vfprintf
50245    1.9137  libc-2.6.1.so  calloc
48095    1.8318  squid          httpHeaderIdByName
41172    1.5681  libm-2.6.1.so  floor
37573    1.4310  libc-2.6.1.so  re_search_internal
37434    1.4258  libc-2.6.1.so  memchr
36536    1.3916  squid          xfree
30646    1.1672  libc-2.6.1.so  memset
30576    1.1646  squid          memPoolFree
30108    1.1467  squid          headersEnd
28626    1.0903  squid          httpHeaderGetEntry
26668    1.0157  squid          storeKeyHashCmp
...

Thursday, June 19, 2008

Updates - comm code, etc

I've finally managed to divorce the comm code from the base system. It's proving to be a pain in the butt for a few reasons:

  • The DNS code is involved in the socket connection path - most users just pass a hostname in to the comm connect call and it gets diverted via the ipcache/dns code. Tsk!
  • There's quite a bit of statistics gathering which goes on - the code is very monolithic and the statistics code keeps 5/60 minute histograms as well as raw counters
  • The event loop needs to be sorted out quite a bit better - right now the event loop is still stuck in src/main.c and this needs to change
The statistics gathering and reporting for the network/disk syscalls and events will have to change - I don't feel like trying to make the histogram code more generic and modular. I don't think that Squid should be maintaining the histograms - that's the job of a reporting suite. Squid should just export raw counters for a reporting suite to record and present as appropriate. I'll add in a new cachemgr option to report the "application core" statistics in a machine-parsable manner and leave it at that for now. (As a side-note, I also don't think that Squid should have SNMP code integrated. It should have an easier, cleaner way of grabbing statistics and modifying the configuration, and an external SNMP daemon to do the SNMP stuff.)
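
Something as dumb as one key=value line per counter would do for the machine-parsable output. A rough sketch of the sort of cachemgr handler I have in mind - the counter table and all the names here are hypothetical; storeAppendPrintf() is the existing cachemgr output helper:

#include <stdint.h>

/* Hypothetical raw counter table exported by the application core;
 * only StoreEntry and storeAppendPrintf() are existing Squid bits. */
struct iapp_counter {
    const char *name;
    uint64_t value;
};

extern struct iapp_counter iapp_counters[];  /* assumed to be kept by libiapp */
extern int iapp_counter_count;

static void
iappStatsDump(StoreEntry *e)
{
    int i;

    /* One counter per line, trivially parsable by an external reporting suite */
    for (i = 0; i < iapp_counter_count; i++)
        storeAppendPrintf(e, "%s=%llu\n",
            iapp_counters[i].name,
            (unsigned long long) iapp_counters[i].value);
}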

I then need to extract out the main event loop somewhat from src/main.c and turn it into something that can be reused. The main loop handles the following:
  • comm events
  • store dir events
  • timed/immediate registered events
  • signals - which basically just set global variables!
  • checking signal global variables - for rotate, shutdown, etc
I think I'll implement a libevent-like setup of sorts - some methods in libiapp to register callbacks which run when certain signals are set, while the storedir and signal global variable handlers will just be functions called from the src/main.c loop. I'd like to implement a Squid-3 like method of registering event dispatchers, but I will leave all of that alone until this is all stable and I've done some planning around concurrency and SMP.

It's also possible that the reasons for registering dispatchers will go away with a slightly more sensible event abstraction (e.g., if I convert the signal handlers to proper events (exactly like libevent!) which get pushed onto the head of the event queue and called at the beginning of the next loop iteration - this however assumes the global variables that are set in the current signal handlers are only checked in the main loop and not elsewhere..!)
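
The classic shape of that signal-to-event conversion, as a rough sketch - the registration API and all the names here are made up; the point is that the handler only sets a flag, and the main loop turns the flag into an ordinary callback at the top of the next iteration:

#include <signal.h>

/* Hypothetical: signal handlers just set flags; the main loop converts
 * the flags into ordinary event callbacks at the start of each iteration. */
#define MAX_SIGNALS 32

typedef void SIGCB(void *data);

static volatile sig_atomic_t sig_pending[MAX_SIGNALS];
static SIGCB *sig_callbacks[MAX_SIGNALS];
static void *sig_cbdata[MAX_SIGNALS];

static void
sigCatcher(int signo)
{
    /* async-signal-safe: just record that the signal fired */
    if (signo >= 0 && signo < MAX_SIGNALS)
        sig_pending[signo] = 1;
}

void
iappSignalRegister(int signo, SIGCB *cb, void *data)
{
    sig_callbacks[signo] = cb;
    sig_cbdata[signo] = data;
    signal(signo, sigCatcher);
}

/* Called at the start of each main-loop iteration */
void
iappSignalDispatch(void)
{
    int i;

    for (i = 0; i < MAX_SIGNALS; i++) {
        if (sig_pending[i] && sig_callbacks[i]) {
            sig_pending[i] = 0;
            sig_callbacks[i](sig_cbdata[i]);
        }
    }
}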
 

Wednesday, June 11, 2008

Async IO related hackery

I've been staring at the async IO code in preparation to migrate stuff out of the aufs directory and into a separate library.

It requires the fd tracking code (understandable) and the comm code to monitor a notification pipe. This notification pipe is used by the worker threads to wake up the main process if it's waiting inside a select()/poll()/etc call, so it can immediately work on some disk IO.

Squid checks the aio completion queues each time through the comm loop. For aio, there isn't a per-storedir queue, there's just a global queue for all storedirs and other users, so aioCheckCallbacks() is called for each storedir.

There are two problems. Firstly, select()/poll() take a while to run on a busy cache, so aioCheckCallbacks() isn't called that often; but the event notification based mechanisms end up running very often, returning a handful of file descriptors each pass through the comm loop - and so the storedir checks run far more frequently. Secondly, it's called once per storedir, so if you have 10 storedirs (like I have for testing!) aioCheckCallbacks() is called 10 times per IO loop.

This is a bit silly!

Instead, I've modified the async IO code to only call aioCheckCallbacks() when that pipe is written to. This ends up being the "normal" hack that UNIX thread programmers do to wake up a thread stuck waiting for both network and thread events. This cuts back substantially on the number of aioCheckCallbacks() calls without impacting performance (as far as I can see.)
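
The shape of the change, roughly - the handler name and the wiring are illustrative; aioCheckCallbacks() is the real completion-queue walker (its real signature and per-storedir wiring differ) and the notification pipe already exists in the aufs code:

#include <unistd.h>

/* Sketch: the worker threads write a byte to the notification pipe after
 * queueing a completion; the comm loop only calls aioCheckCallbacks() when
 * that pipe becomes readable. aio_done_fd and aioPipeRead() are my names. */
static int aio_done_fd = -1;    /* read end of the notification pipe, assumed non-blocking */

static void
aioPipeRead(int fd, void *data)
{
    char junk[256];

    /* Drain the pipe; we only care that the workers wrote something */
    while (read(fd, junk, sizeof(junk)) > 0)
        ;

    /* Walk the completion queue once, instead of once per storedir per loop */
    aioCheckCallbacks();

    /* Re-register for the next wakeup */
    commSetSelect(fd, COMM_SELECT_READ, aioPipeRead, NULL, 0);
}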

Next! By default, the aufs store code only does async IO for open() and read() - write() and close() don't run asynchronously. Apparently this is due to testing under Linux - unless you're stressing the buffer cache too hard, write() to a disk FD didn't block, so there wasn't a reason to run write() and close() async. Apparently Solaris close() will block as metadata writes are done synchronously, and it's possible FreeBSD + softupdates may do something similar. This is all "apparently" - I haven't sat down and instrumented any of this!

FreeBSD and Solaris users have reported that diskd performs better than aufs - something I don't understand, as diskd only handles one outstanding disk IO at a time and has similar write() and close() issues to aufs (namely, if the calls block, the whole diskd process stops handling disk IO); but the difference there is that the main process won't hang whilst these syscalls complete. Perhaps this is a reason for the behaviour. It's difficult for me to test; aufs has always performed fantastically for me.

There's so much to tidy up and reorganise, I still can't sit down and begin implementing any of the new features I want to!

Thursday, June 5, 2008

More reorganisation..

I've moved cbdata, mempools, fd and legacy disk (file_*) routines out of src/. I also shuffled the comm related -definitions- out, but not the code. I hit a bit of a snag - the comm code path used for connecting a socket to a remote site actually uses the DNS code. Fixing this will involve divorcing the DNS lookup stuff so sockets can be connected to a remote IP directly - and the DNS lookup path will just live in another module.

This however is more intrusive than "code reorganisation", so it's going to have to wait a while. Unfortunately, this means that my grand plans for 1.1 will have to be put on hold a little until I've thought this out a little more and implemented it in a separate branch.

Thus, things will change a little. 1.1 will be released shortly, with the current set of changes included. I'll then concentrate on planning out the next set of changes required to properly divorce the core event/disk code from src/.

Why do this? Well, the biggest reason is to be able to build "other" bits of code which reuse the Squid core. I can write unit tests for a lot of stuff, sure, but it also means I can write simple network and disk applications which reuse the Squid core and find exactly how hard I can push them. I can also break off a branch and hack up the code to see what impact changes make without worrying that said changes expose strange stuff in the rest of the Squid codebase.

The four main things that I'd like to finally sort out are:

  • IPv6 socket support - support v4/v6 in the base core, and make sure that it works properly
  • Sort out the messy disk related code and reintegrate async IO as a top-level disk producer (like it was in Squid-2.2 and it almost is in Squid-3) so it can be again used for things like logfile writing!
  • Begin looking at scatter/gather disk and network IO - gather disk IO should work out great for writing logfile buffers and objects to disk, for example (see the writev() sketch after this list)
  • Design a parallelism model which allows multiple threads to cooperate on tasks - worker threads implementing callback type stuff for some work; entire separate network event threads (look at memcached as an example.) "Squid" as it stands will simply run as one thread, but some CPU intensive stuff can be pushed into worker threads for some cheap parallelism gains (as on the roadmap, ACLs and rewriting/content manipulation are two easy targets.)
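
On the gather IO point, the logfile case really is just writev(): instead of copying queued log buffers into one flat buffer (or issuing one write() per buffer), hand the kernel the whole list in a single syscall. A minimal sketch - the buffer list structure is invented for illustration, writev() is the real syscall, and the caller still has to cope with short writes:

#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define LOG_IOV_MAX 64   /* arbitrary batch size for this sketch */

/* Invented queued-buffer layout, for illustration only */
struct log_buf {
    char *data;
    size_t len;
    struct log_buf *next;
};

/* Flush up to LOG_IOV_MAX queued buffers with a single gather write.
 * Returns whatever writev() returns; short-write handling is the caller's job. */
static ssize_t
logFlushQueue(int fd, struct log_buf *head)
{
    struct iovec iov[LOG_IOV_MAX];
    int n = 0;

    for (; head != NULL && n < LOG_IOV_MAX; head = head->next) {
        iov[n].iov_base = head->data;
        iov[n].iov_len = head->len;
        n++;
    }
    /* One syscall writes all the queued buffers back-to-back */
    return n ? writev(fd, iov, n) : 0;
}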
So there's a lot of work to do, and not a lot of time to do it in.