Thanks Dave. I don't normally top post, but I just wanted to quickly
say I _really_ enjoyed reading your reply below. It was seriously
educational. I particularly enjoyed your note about the 24p Altix
system. I've been a fan of SGI NUMA machines since the Origin 2k
because of their unique scalable interconnect, though I've never been
an SGI user. :( Keep up the great work, and keep us educated, as
you've done so very well here. :)

--
Stan

Dave Chinner put forth on 9/2/2010 6:29 AM:
> On Thu, Sep 02, 2010 at 03:41:59AM -0500, Stan Hoeppner wrote:
>> Dave Chinner put forth on 9/2/2010 2:01 AM:
>>
>>> No, that's definitely not the case. A different kernel in the
>>> same 8p VM, 12x2TB SAS storage, w/ 4 threads, mount options
>>> "logbsize=262144":
>>>
>>>  FSUse%        Count         Size    Files/sec     App Overhead
>>>       0       800000            0      39554.2          7590355
>>>
>>> 4 threads with mount options "logbsize=262144,delaylog":
>>>
>>>  FSUse%        Count         Size    Files/sec     App Overhead
>>>       0       800000            0      67269.7          5697246
>>
>> What happens when you bump each of these to 8 threads, 1 per core? If
>
>  FSUse%        Count         Size    Files/sec     App Overhead
>       0      1600000            0     127979.3         13156823
>
> So, 1 thread does 19k files/s, 2 threads do 37k files/s, 4 threads
> get 67k, and 8 get 128k. I'd say that's almost linear scaling and
> CPU bound at each load point ;)
>
>> the test consumes all cpus/cores, what instrumentation are you
>> viewing that tells you the cpu utilization _isn't_ due to memory
>> b/w starvation?
>
> 1) Profiling like 'perf top' or oprofile, using hardware counters
> to profile on cpu cycles, l1/l2 cache misses, etc.
>
> 2) The delayed logging code uses significantly more memory bandwidth
> than the original code because it copies changed information twice
> (instead of once) before it is written to disk. Given that single
> threaded performance of delayed logging is identical to the original
> code and scalability from 1 to 8 cores is almost linear, it cannot
> be memory bandwidth bound....
>
> The code might be memory _latency_ bound (i.e. on cache misses), but
> it is certainly not stressing pure memory bandwidth.
>
>> A modern 64 bit 2 GHz core from AMD or Intel has an L1 instruction
>> issue rate of 8 bytes/cycle * 2,000 MHz = 16,000 MB/s = 16 GB/s per
>> core. An 8 core machine would therefore have an instruction issue
>> rate of 8 * 16 GB/s = 128 GB/s. A modern dual socket system is going
>> to top out at 24-48 GB/s, well short of the instruction issue rate.
>> Now, this doesn't even take the b/w of data load/store operations
>> into account, but I'm guessing the data size per directory operation
>> is smaller than the total instruction sequence, which operates on
>> the same variable(s).
>>
>> So, if the CPUs are pegging and we're not running out of memory b/w,
>> this would lead me to believe that the hot kernel code, the core
>> fs_mark code, and the filesystem data are fully, or nearly fully,
>> contained in the level 2 and 3 CPU caches. Is this correct, more or
>> less?
>
> Probably.
>
> However (and it is a big however!), I generally don't care to
> analyse performance at this level because it's getting into
> micro-optimisation territory. Sure, it will get you a few percent
> here and there, but then you lose focus on improving the algorithms.
> An algorithmic change can provide an order of magnitude improvement,
> not a few percent. The delayed logging code is a clear example of
> that.
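
For reference, here is a minimal standalone sketch of the back-of-envelope
arithmetic quoted above. The 2 GHz clock, 8 bytes/cycle issue width, 8
cores, and 24-48 GB/s dual-socket memory bandwidth figures are the ones
from the discussion, not measurements of any particular machine:

/*
 * Back-of-envelope check: aggregate instruction issue rate vs. typical
 * dual-socket memory bandwidth.  All figures are taken from the quoted
 * discussion above, not measured.
 */
#include <stdio.h>

int main(void)
{
    const double clock_ghz      = 2.0;   /* per-core clock              */
    const double issue_bytes    = 8.0;   /* L1 instruction bytes/cycle  */
    const int    cores          = 8;
    const double mem_bw_low_gb  = 24.0;  /* dual-socket memory b/w, low  */
    const double mem_bw_high_gb = 48.0;  /* dual-socket memory b/w, high */

    double per_core_gb = clock_ghz * issue_bytes;   /* 16 GB/s  */
    double total_gb    = per_core_gb * cores;       /* 128 GB/s */

    printf("instruction issue: %.0f GB/s per core, %.0f GB/s for %d cores\n",
           per_core_gb, total_gb, cores);
    printf("memory bandwidth : %.0f-%.0f GB/s for a dual-socket system\n",
           mem_bw_low_gb, mem_bw_high_gb);
    printf("issue/bandwidth ratio: %.1fx-%.1fx\n",
           total_gb / mem_bw_high_gb, total_gb / mem_bw_low_gb);
    return 0;
}

The roughly 3x-5x gap between aggregate issue rate and memory bandwidth
is the basis of the argument quoted above: pegged CPUs by themselves do
not imply bandwidth starvation, so the hot code and data must be coming
largely out of cache.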
>
> Another example - perf top shows this on the above 8p load on
> a plain 2.6.36-rc3 kernel (and it gets about 40k files/s):
>
>   426043.00  27.4%  _xfs_buf_find
>    87491.00   5.6%  __ticket_spin_lock
>    67204.00   4.3%  xfs_dir2_node_addname
>    60434.00   3.9%  dso__find_symbol
>    48407.00   3.1%  kmem_cache_alloc
>    37006.00   2.4%  __d_lookup
>    31625.00   2.0%  xfs_trans_buf_item_match
>    20036.00   1.3%  xfs_log_commit_cil
>    18728.00   1.2%  _raw_spin_unlock_irqrestore
>    18428.00   1.2%  __memset
>    18001.00   1.2%  __memcpy
>    17781.00   1.1%  xfs_da_do_buf
>    17732.00   1.1%  xfs_iflush_cluster
>    16831.00   1.1%  kmem_cache_free
>    14836.00   1.0%  __kmalloc
>
> It is clear that buffer lookup is consuming the most CPU of any
> operation. Why? Because the buffer hash table is too small. I've
> already posted patches for a short term solution (increase the size
> of the hash table), and the above 127k files/s result is using that
> patch. Hence it is clear that the micro-optimisation works, but at
> the cost of a 16x increase in memory usage for the hash table. And
> that still isn't really large enough, because the load is already
> pushing the limits of the enlarged hash table.
>
> As Christoph has already suggested, the correct way to fix the
> problem is to change the caching algorithm to something that is
> self-scaling (e.g. an rb-tree or a btree). That will keep memory
> usage low on small filesystems, yet scale efficiently to large
> numbers of buffers, something a fixed-size hash cannot easily do.
>
> IOWs, an algorithmic change will solve the problem far better for
> more situations than the micro-optimisation of tweaking the hash
> sizes. Reduced to the simplest argument, scalability is all about
> choosing the right algorithm so you don't have to care about minute
> details to obtain the performance you require.
>
>> I'll have to dig around. I've never even looked for the archives of
>> this list. It's hopefully mirrored in the usual places.
>>
>> Out of curiosity, have you ever run into memory b/w starvation
>> before peaking all CPUs while running this test?
>
> No. The last time I ran out of bandwidth on an IO workload was doing
> 6-7GB/s of buffered writes to disk on a 24p ia64 Altix. The disk
> subsystem could handle 11GB/s, and we got 10GB/s with direct IO, but
> buffered IO was limited by the cross-sectional memory bandwidth of
> the machine (25GB/s) because of the extra copy buffered IO
> requires....
>
>> I could see that maybe occurring with dual 1GHz+ P3 class systems
>> with their smallish caches and lowly single channel PC100, back
>> before the switch to DDR memory, but those machines were probably
>> gone before XFS was open sourced, IIRC, so you may not have had the
>> pleasure (if you could call it that).
>
> The ratio between CPU cycles and memory bandwidth really hasn't
> changed much since then. The CPUs weren't powerful enough then,
> either, to run enough metadata ops to get near memory bandwidth
> limits...
>
> Cheers,
>
> Dave.
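
As a rough, userspace-only illustration of the fixed-hash scaling problem
described above: the bucket count, buffer count, and key distribution
below are invented for the example, and this is not the XFS buffer cache
code or the rb-tree change being discussed.

/*
 * Toy demo: with a fixed number of buckets, chain length (and hence
 * lookup cost) grows linearly with the number of cached buffers, while
 * a balanced tree's lookup depth grows only logarithmically.
 */
#include <stdio.h>
#include <stdlib.h>

#define NBUCKETS 1024                   /* stand-in for a small, fixed hash */

struct buf {
    unsigned long blkno;                /* key: block number */
    struct buf *hash_next;              /* hash chain link */
};

static struct buf *buckets[NBUCKETS];

int main(void)
{
    const unsigned long nbufs = 1000000;  /* cached buffers (arbitrary) */
    unsigned long i, maxchain = 0, depth = 0;

    /* Insert nbufs synthetic buffers into the fixed-size chained hash. */
    for (i = 0; i < nbufs; i++) {
        struct buf *bp = malloc(sizeof(*bp));
        if (!bp)
            return 1;
        bp->blkno = i;
        unsigned int b = (unsigned int)(bp->blkno % NBUCKETS);
        bp->hash_next = buckets[b];
        buckets[b] = bp;
    }

    /* Longest chain = worst-case comparisons for one hash lookup. */
    for (i = 0; i < NBUCKETS; i++) {
        unsigned long len = 0;
        for (struct buf *bp = buckets[i]; bp; bp = bp->hash_next)
            len++;
        if (len > maxchain)
            maxchain = len;
    }

    /* Depth bound for a balanced tree holding the same population. */
    while ((1UL << depth) < nbufs)
        depth++;

    printf("%lu buffers in %d buckets: avg chain %.1f, max chain %lu\n",
           nbufs, NBUCKETS, (double)nbufs / NBUCKETS, maxchain);
    printf("balanced tree over the same buffers: ~%lu comparisons per lookup\n",
           depth);
    return 0;
}

With 1024 fixed buckets, a million cached buffers mean chains nearly a
thousand entries long, so every lookup degenerates into a long linear
scan, while a balanced tree over the same population needs only about 20
comparisons and its depth keeps growing logarithmically as the cache
grows - which is the sense in which an rb-tree or btree is self-scaling.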