Re: [PATCH 1/6] dm raid45 target: export region hash functions and add a needed one

On Jul 5, 2009, at 11:21 PM, Neil Brown wrote:
> Here your code seems to be 2-3 times faster!
> Can you check which function xor_block is using?
> If it is:
>  xor: automatically using best checksumming function: ....
> then it might be worth disabling that test in calibrate_xor_blocks and
> seeing if it picks one that ends up being faster.
>
> There is still the fact that by using the cache for data that will be
> accessed once, we are potentially slowing down the rest of the system.
> i.e. the reason to avoid the cache is not just because it won't
> benefit the xor much, but because it will hurt other users.
> I don't know how to measure that effect :-(
> But if avoiding the cache makes xor 1/3 the speed of using the cache
> even though it is cold, then it would be hard to justify not using the
> cache I think.

So, Heinz and I are actually both looking at xor speed issues, but from two different perspectives. While he's comparing some of the dm raid45 xor code to the xor_blocks routine in crypto/, I'm looking specifically at that "automatically using best checksumming function" routine. For the last nine or so years, we've automatically opted for the SSE + non-temporal store routine specifically because it's not supposed to pollute the cache. However, after even a cursory reading of the current Intel architecture optimization guide, it's obvious that our SSE routine has aged rather badly, and I think it's in serious need of an overhaul. That's something I'm currently looking into. But that raises the question of how to decide whether or not to use it, in either its current form or any new form it might take. As you point out, the tradeoff between cache-polluting and non-cache-polluting routines is hard to quantify.
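
To make the two strategies concrete, here's a rough user-space sketch with SSE2 intrinsics (not the kernel's actual asm from arch/x86; the function names are mine, and the buffers are assumed 16-byte aligned):

#include <emmintrin.h>	/* SSE2 intrinsics; compile with -msse2 */
#include <stddef.h>

/* Cached variant: ordinary stores, so the result lands in (and evicts) cache. */
static void xor_2_cached(size_t bytes, unsigned long *p1, const unsigned long *p2)
{
	__m128i *d = (__m128i *)p1;
	const __m128i *s = (const __m128i *)p2;

	for (size_t i = 0; i < bytes / sizeof(__m128i); i++) {
		__m128i x = _mm_xor_si128(_mm_load_si128(d + i),
					  _mm_load_si128(s + i));
		_mm_store_si128(d + i, x);	/* normal, cache-allocating store */
	}
}

/* Non-temporal variant: movntdq stores bypass the cache entirely. */
static void xor_2_nontemporal(size_t bytes, unsigned long *p1, const unsigned long *p2)
{
	__m128i *d = (__m128i *)p1;
	const __m128i *s = (const __m128i *)p2;

	for (size_t i = 0; i < bytes / sizeof(__m128i); i++) {
		__m128i x = _mm_xor_si128(_mm_load_si128(d + i),
					  _mm_load_si128(s + i));
		_mm_stream_si128(d + i, x);	/* non-temporal: no cache pollution */
	}
	_mm_sfence();	/* order the streaming stores before returning */
}

The catch you raise is exactly that the first version's raw throughput can look much better in a micro-benchmark while quietly costing every other user of the shared cache.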

We made a significant error when we originally wrote the SSE routines, and Heinz has just duplicated it: we tested performance on a quiescent system. For the SSE routines, I think this is a *major* error. The prefetch instructions need to be scheduled so that the data arrives in L1/L2 cache just before the CPU needs it, compensating for memory latency. Unfortunately, memory latency on a quiescent system is drastically different from latency on a system with several CPUs actively competing for RAM on top of 100MB/s+ of DMA traffic, etc. By optimizing the routines in a quiescent state, I think we placed our prefetches too close to the point where the data is needed under real-world conditions, and that's hurting the routines today. (Or maybe we did get it right at the time, and changes in CPU speed relative to memory latency have since moved the best prefetch point; either way, the current SSE xor routine appears to be seriously underperforming in my benchmark tests.)
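
In code terms, the tunable at issue looks something like this (again just a user-space sketch; PREFETCH_AHEAD stands in for whatever distance we baked into the real routines years ago):

#include <emmintrin.h>
#include <stddef.h>

/*
 * How far ahead of the current position to issue prefetches.  The best
 * value depends on memory latency *under load*; tuning it on a quiescent
 * system is exactly the mistake described above.  256 bytes is a made-up
 * placeholder, not a recommendation.  Prefetching a little past the end
 * of the buffer is harmless: prefetch hints never fault.
 */
#define PREFETCH_AHEAD	256

static void xor_2_prefetched(size_t bytes, unsigned long *p1, const unsigned long *p2)
{
	char *d = (char *)p1;
	const char *s = (const char *)p2;

	for (size_t off = 0; off < bytes; off += 16) {
		if ((off & 63) == 0) {	/* one hint per 64-byte cache line */
			_mm_prefetch(d + off + PREFETCH_AHEAD, _MM_HINT_NTA);
			_mm_prefetch(s + off + PREFETCH_AHEAD, _MM_HINT_NTA);
		}
		__m128i x = _mm_xor_si128(
			_mm_load_si128((const __m128i *)(d + off)),
			_mm_load_si128((const __m128i *)(s + off)));
		_mm_stream_si128((__m128i *)(d + off), x);
	}
	_mm_sfence();
}

If memory latency under load is, say, double what it is on an idle box, the hint arrives too late and the loads stall anyway, which would explain the underperformance I'm seeing.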

Likewise, Heinz's tests compared cold cache to hot cache, trying to find a crossover point where we should switch from one routine to the other. But that crossover necessarily depends on other factors in the system, including what the other cores on the same die are doing, since they share that cache.
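
For what it's worth, the cold-versus-hot measurement itself is easy to sketch in user space with clflush, though per the above I wouldn't trust the number it produces on an idle machine very far:

#include <emmintrin.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_BYTES	(256 * 1024)	/* arbitrary test size */

/* Evict a buffer line by line so the next pass sees a cold cache. */
static void flush_buf(const void *p, size_t bytes)
{
	for (size_t off = 0; off < bytes; off += 64)
		_mm_clflush((const char *)p + off);
	_mm_mfence();
}

static double time_xor_pass(unsigned long *p1, const unsigned long *p2, int cold)
{
	struct timespec t0, t1;

	if (cold) {
		flush_buf(p1, BUF_BYTES);
		flush_buf(p2, BUF_BYTES);
	}
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (size_t i = 0; i < BUF_BYTES / sizeof(unsigned long); i++)
		p1[i] ^= p2[i];	/* plain C xor; swap in an SSE routine to compare */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void)
{
	unsigned long *a = aligned_alloc(64, BUF_BYTES);
	unsigned long *b = aligned_alloc(64, BUF_BYTES);

	memset(a, 0x5a, BUF_BYTES);	/* fault the pages in */
	memset(b, 0xa5, BUF_BYTES);

	time_xor_pass(a, b, 0);		/* warm-up pass */
	printf("hot:  %f s\n", time_xor_pass(a, b, 0));
	printf("cold: %f s\n", time_xor_pass(a, b, 1));
	free(a);
	free(b);
	return 0;
}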

So if the error was not testing and optimizing these routines under load, then the right course of action is to do the opposite. That leads me to believe that quantifying the difference between cache-polluting and non-cache-polluting routines likewise should not be done on a quiescent system with a micro-benchmark. Instead, we need a holistic performance test to find the truly best xor algorithm.

In my current setup, the disks are so much faster than the single-threaded xor thread that xor speed is the bottleneck. So what does it matter that the xor routine doesn't pollute the cache if the raid is so slow that programs sit in I/O wait all the time while the raid5 thread runs non-stop? Likewise, who cares what the top speed of a cache-polluting xor routine is if, in the process, it evicts so many cache lines belonging to the processes doing real work that cache reload becomes the new bottleneck? The ultimate goal of either approach is overall *system* speed, not micro-benchmark speed.

I would suggest a specific, system-wide workload test: put a filesystem on a device that uses the particular raid level and parity routine you want to test, run a fixed workload against it, and record the total time to complete that work set, the CPU time versus idle+I/O-wait time in completing it, and so on. Repeat for each algorithm you want to test, then analyze the results and go from there. I don't think you're going to get a valid run-time calibration test out of this; instead, we would likely need a few heuristic rules that, combined with specific CPU properties, choose the right routine for the machine.
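
As a strawman for what those heuristic rules might look like (the function names and the half-of-last-level-cache threshold are invented, not measured; the real thresholds would have to come out of the workload testing above):

#include <stddef.h>
#include <unistd.h>	/* sysconf; _SC_LEVEL*_CACHE_SIZE are glibc extensions */

typedef void (*xor_fn)(size_t bytes, unsigned long *p1, const unsigned long *p2);

/* The cached/non-temporal sketches from earlier in this mail. */
extern void xor_2_cached(size_t, unsigned long *, const unsigned long *);
extern void xor_2_nontemporal(size_t, unsigned long *, const unsigned long *);

/*
 * Pick a routine once, from static CPU properties, instead of from a
 * quiescent-system micro-benchmark at boot.  "Working set larger than
 * half the last-level cache" is a placeholder rule, not a measured one.
 */
static xor_fn choose_xor(size_t stripe_bytes)
{
	long llc = sysconf(_SC_LEVEL3_CACHE_SIZE);

	if (llc <= 0)
		llc = sysconf(_SC_LEVEL2_CACHE_SIZE);

	if (llc > 0 && stripe_bytes > (size_t)llc / 2)
		return xor_2_nontemporal;	/* would trash the shared cache */
	return xor_2_cached;			/* small enough to live in cache */
}

Something like this would run once when the array is started, with the routine chosen from the stripe geometry and cache size rather than re-benchmarked on every boot.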

--

Doug Ledford <dledford@xxxxxxxxxx>

GPG KeyID: CFBFF194
http://people.redhat.com/dledford

InfiniBand Specific RPMS
http://people.redhat.com/dledford/Infiniband





