Thursday, April 9, 2009

Fixing the disk code, part 4

My changes now occasionally crash in malloc(), which means something is writing to memory it shouldn't and corrupting the heap. It's a shame that I can't run this under valgrind: the traffic load currently needed to trigger the issue makes valgrind impractical to use.

My guess is that something is ripping the read buffer out from underneath the pread() in the worker thread. Figuring out exactly where this is happening is going to be tricky.

The prime candidate at the moment is where the read is cancelled indirectly via a call to aioClose(). aioClose() calls aioCancel(), which attempts to cancel all of the pending events, but I'm absolutely sure that a read event can already be in progress at that point. This was previously hidden because the read() buffer was a temporary local buffer: the read() would complete (error or not) and the result would then be tossed away.
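
To make the suspected race concrete, here is a minimal sketch of one way a request's lifetime could be tracked so that a cancel never frees a buffer a worker is still reading into. It is not Squid's aiops code: io_req_t, req_start(), req_cancel() and req_complete() are hypothetical names invented for illustration, and a real threaded version would need a mutex around every state change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical request states; none of these names come from Squid. */
typedef enum { REQ_QUEUED, REQ_IN_PROGRESS, REQ_DONE } req_state_t;

typedef struct {
    req_state_t state;
    bool cancelled;  /* set by the close path, checked on completion */
    char *buf;       /* destination the worker's pread() would write into */
    size_t len;
} io_req_t;

io_req_t *req_new(size_t len) {
    io_req_t *r = calloc(1, sizeof(*r));
    if (r == NULL)
        return NULL;
    r->buf = malloc(len);
    if (r->buf == NULL) {
        free(r);
        return NULL;
    }
    r->len = len;
    r->state = REQ_QUEUED;
    return r;
}

static void req_free(io_req_t *r) {
    free(r->buf);
    free(r);
}

/* A worker thread has picked the request up and is about to pread(). */
void req_start(io_req_t *r) {
    assert(r->state == REQ_QUEUED);
    r->state = REQ_IN_PROGRESS;
}

/* Close path. Returns true if the request was freed immediately.
 * An in-progress request is only flagged, so its buffer stays valid
 * until the worker has finished with it. */
bool req_cancel(io_req_t *r) {
    if (r->state == REQ_IN_PROGRESS) {
        r->cancelled = true;
        return false;
    }
    req_free(r);
    return true;
}

/* The worker's read has returned. Returns true if the result should be
 * delivered to the caller; a cancelled request is freed here instead. */
bool req_complete(io_req_t *r) {
    assert(r->state == REQ_IN_PROGRESS);
    r->state = REQ_DONE;
    if (r->cancelled) {
        req_free(r);
        return false;
    }
    return true;
}
```

The point of the sketch is only the ownership rule: while a request is in progress, the worker owns the buffer, and the close path may flag the request but not free it.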

The solution? I'm not sure. I have a feeling I'm going to be introducing something like AsyncCalls in Squid-3 to delay the IO callbacks from occurring. Then again, for those keeping score, the AsyncCalls stuff in Squid-3 (in its current form) has made some code unstable and other bits of code difficult to debug. I wonder what repercussions there would be to splitting out the generation of IO callbacks from their execution; it would certainly help with diskd.
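
As a rough illustration of what splitting callback generation from execution could look like, the sketch below queues completed IO results and only runs the callbacks later from the main loop. This is not how AsyncCalls is implemented in Squid-3; cb_queue_t, cbq_post(), cbq_invalidate(), cbq_run() and the add_result() example callback are all hypothetical names, and the fixed-size queue is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*io_cb_t)(void *data, int result);

typedef struct {
    io_cb_t cb;
    void *data;
    int result;
    int valid;  /* cleared if the owner goes away before dispatch */
} pending_cb_t;

#define CBQ_MAX 64

typedef struct {
    pending_cb_t items[CBQ_MAX];
    size_t count;
} cb_queue_t;

/* Generation: record that an IO finished. Nothing is called yet.
 * Returns 0 on success, -1 if the queue is full. */
int cbq_post(cb_queue_t *q, io_cb_t cb, void *data, int result) {
    if (q->count >= CBQ_MAX)
        return -1;
    pending_cb_t *p = &q->items[q->count++];
    p->cb = cb;
    p->data = data;
    p->result = result;
    p->valid = 1;
    return 0;
}

/* Called when the object behind `data` is closed: any queued callback
 * for it is dropped instead of running against freed state. */
void cbq_invalidate(cb_queue_t *q, void *data) {
    for (size_t i = 0; i < q->count; i++) {
        if (q->items[i].data == data)
            q->items[i].valid = 0;
    }
}

/* Execution: run the still-valid callbacks from the main loop.
 * Returns how many callbacks were actually called. */
size_t cbq_run(cb_queue_t *q) {
    size_t ran = 0;
    for (size_t i = 0; i < q->count; i++) {
        pending_cb_t *p = &q->items[i];
        if (p->valid) {
            p->valid = 0;
            p->cb(p->data, p->result);
            ran++;
        }
    }
    q->count = 0;
    return ran;
}

/* Example callback: adds the IO result to an int owned by the caller. */
void add_result(void *data, int result) {
    *(int *)data += result;
}
```

With this shape, a close between completion and dispatch only has to invalidate the queued entry, which is the kind of window the aioClose() path above appears to fall into.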
