Thanks Dave. I don't normally top post, but I just wanted to quickly
say I _really_ enjoyed reading your reply below. It was seriously
educational. I particularly enjoyed your note about the 24p Altix
system. I've been a fan of SGI NUMA machines since the Origin 2k
because of their unique scalable interconnect, though I've never been
an SGI user. :( Keep up the great work, and keep us educated, as
you've done so very well here. :)

--
Stan

Dave Chinner put forth on 9/2/2010 6:29 AM:
> On Thu, Sep 02, 2010 at 03:41:59AM -0500, Stan Hoeppner wrote:
>> Dave Chinner put forth on 9/2/2010 2:01 AM:
>>
>>> No, that's definitely not the case. A different kernel in the
>>> same 8p VM, 12x2TB SAS storage, w/ 4 threads, mount options
>>> "logbsize=262144":
>>>
>>>  FSUse%        Count         Size    Files/sec     App Overhead
>>>       0       800000            0      39554.2          7590355
>>>
>>> 4 threads with mount options "logbsize=262144,delaylog":
>>>
>>>  FSUse%        Count         Size    Files/sec     App Overhead
>>>       0       800000            0      67269.7          5697246
>>
>> What happens when you bump each of these to 8 threads, 1 per core? If
>
>  FSUse%        Count         Size    Files/sec     App Overhead
>       0      1600000            0     127979.3         13156823
>
> So, 1 thread does 19k files/s, 2 threads do 37k files/s, 4 threads
> get 67k, and 8 get 128k. I'd say that's almost linear scaling and
> CPU bound at each load point ;)
>
>> the test consumes all cpus/cores, what instrumentation are you
>> viewing that tells you the cpu utilization _isn't_ due to memory
>> b/w starvation?
>
> 1) Profiling like 'perf top' or oprofile, using hardware counters
> to profile on cpu cycles, l1/l2 cache misses, etc.
>
> 2) The delayed logging code uses significantly more memory bandwidth
> than the original code because it copies changed information twice
> (instead of once) before it is written to disk. Given that single
> threaded performance of delayed logging is identical to the original
> code and scalability from 1 to 8 cores is almost linear, it cannot
> be memory bandwidth bound....
>
> The code might be memory _latency_ bound (i.e. on cache misses), but
> it is certainly not stressing pure memory bandwidth.
>
>> A modern 64 bit 2 GHz core from AMD or Intel has an L1 instruction
>> issue rate of 8 bytes/cycle * 2,000 MHz = 16,000 MB/s = 16 GB/s per
>> core. An 8 core machine would therefore have an instruction issue
>> rate of 8 * 16 GB/s = 128 GB/s. A modern dual socket system is going
>> to top out at 24-48 GB/s, well short of the instruction issue rate.
>> Now, this doesn't even take the b/w of data load/store operations
>> into account, but I'm guessing the data size per directory operation
>> is smaller than the total instruction sequence, which operates on
>> the same variable(s).
>>
>> So, if the CPUs are pegging and we're not running out of memory b/w,
>> this would lead me to believe that the hot kernel code, the core
>> fs_mark code, and the filesystem data are fully, or nearly fully,
>> contained in the level 2 and 3 CPU caches. Is this correct, more or
>> less?
>
> Probably.
>
> However (and it is a big however!), I generally don't care to
> analyse performance at this level because it's getting into
> micro-optimisation territory. Sure, it will get you a few percent
> here and there, but then you lose focus on improving the algorithms.
> An algorithmic change can provide an order of magnitude improvement,
> not a few percent. The delayed logging code is a clear example of
> that.
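
For reference, here is a minimal standalone sketch of the back-of-envelope
arithmetic quoted above. The 2 GHz clock, 8 bytes/cycle issue width, 8
cores, and 24-48 GB/s dual-socket memory bandwidth figures are the ones
from the discussion, not measurements of any particular machine:

/*
 * Back-of-envelope check: aggregate instruction issue rate vs. typical
 * dual-socket memory bandwidth.  All figures are taken from the quoted
 * discussion above, not measured.
 */
#include <stdio.h>

int main(void)
{
    const double clock_ghz      = 2.0;   /* per-core clock              */
    const double issue_bytes    = 8.0;   /* L1 instruction bytes/cycle  */
    const int    cores          = 8;
    const double mem_bw_low_gb  = 24.0;  /* dual-socket memory b/w, low  */
    const double mem_bw_high_gb = 48.0;  /* dual-socket memory b/w, high */

    double per_core_gb = clock_ghz * issue_bytes;   /* 16 GB/s  */
    double total_gb    = per_core_gb * cores;       /* 128 GB/s */

    printf("instruction issue: %.0f GB/s per core, %.0f GB/s for %d cores\n",
           per_core_gb, total_gb, cores);
    printf("memory bandwidth : %.0f-%.0f GB/s for a dual-socket system\n",
           mem_bw_low_gb, mem_bw_high_gb);
    printf("issue/bandwidth ratio: %.1fx-%.1fx\n",
           total_gb / mem_bw_high_gb, total_gb / mem_bw_low_gb);
    return 0;
}

The roughly 3x-5x gap between aggregate issue rate and memory bandwidth
is the basis of the argument quoted above: pegged CPUs by themselves do
not imply bandwidth starvation, so the hot code and data must be coming
largely out of cache.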
>
> Another example - perf top shows this on the above 8p load on
> a plain 2.6.36-rc3 kernel (and it gets about 40k files/s):
>
>   426043.00  27.4%  _xfs_buf_find
>    87491.00   5.6%  __ticket_spin_lock
>    67204.00   4.3%  xfs_dir2_node_addname
>    60434.00   3.9%  dso__find_symbol
>    48407.00   3.1%  kmem_cache_alloc
>    37006.00   2.4%  __d_lookup
>    31625.00   2.0%  xfs_trans_buf_item_match
>    20036.00   1.3%  xfs_log_commit_cil
>    18728.00   1.2%  _raw_spin_unlock_irqrestore
>    18428.00   1.2%  __memset
>    18001.00   1.2%  __memcpy
>    17781.00   1.1%  xfs_da_do_buf
>    17732.00   1.1%  xfs_iflush_cluster
>    16831.00   1.1%  kmem_cache_free
>    14836.00   1.0%  __kmalloc
>
> It is clear that buffer lookup is consuming the most CPU of any
> operation. Why? Because the buffer hash table is too small. I've
> already posted patches for a short term solution (increase the size
> of the hash table), and the above 127k files/s result is using that
> patch. Hence it is clear that the micro-optimisation works, but at
> the cost of a 16x increase in memory usage for the hash table. And
> that still isn't really large enough, because the load is already
> pushing the limits of the enlarged hash table.
>
> As Christoph has already suggested, the correct way to fix the
> problem is to change the caching algorithm to something that is
> self-scaling (e.g. an rb-tree or a btree). That will keep memory
> usage low on small filesystems, yet scale efficiently to large
> numbers of buffers, something a fixed-size hash cannot easily do.
>
> IOWs, an algorithmic change will solve the problem far better for
> more situations than the micro-optimisation of tweaking the hash
> sizes. Reduced to the simplest argument, scalability is all about
> choosing the right algorithm so you don't have to care about minute
> details to obtain the performance you require.
>
>> I'll have to dig around. I've never even looked for the archives of
>> this list. It's hopefully mirrored in the usual places.
>>
>> Out of curiosity, have you ever run into memory b/w starvation
>> before peaking all CPUs while running this test?
>
> No. The last time I ran out of bandwidth on an IO workload was doing
> 6-7GB/s of buffered writes to disk on a 24p ia64 Altix. The disk
> subsystem could handle 11GB/s, and we got 10GB/s with direct IO, but
> buffered IO was limited by the cross-sectional memory bandwidth of
> the machine (25GB/s) because of the extra copy buffered IO
> requires....
>
>> I could see that maybe occurring with dual 1GHz+ P3 class systems
>> with their smallish caches and lowly single channel PC100, back
>> before the switch to DDR memory, but those machines were probably
>> gone before XFS was open sourced, IIRC, so you may not have had the
>> pleasure (if you could call it that).
>
> The ratio between CPU cycles and memory bandwidth really hasn't
> changed much since then. The CPUs weren't powerful enough then,
> either, to run enough metadata ops to get near memory bandwidth
> limits...
>
> Cheers,
>
> Dave.
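
As a rough, userspace-only illustration of the fixed-hash scaling problem
described above: the bucket count, buffer count, and key distribution
below are invented for the example, and this is not the XFS buffer cache
code or the rb-tree change being discussed.

/*
 * Toy demo: with a fixed number of buckets, chain length (and hence
 * lookup cost) grows linearly with the number of cached buffers, while
 * a balanced tree's lookup depth grows only logarithmically.
 */
#include <stdio.h>
#include <stdlib.h>

#define NBUCKETS 1024                   /* stand-in for a small, fixed hash */

struct buf {
    unsigned long blkno;                /* key: block number */
    struct buf *hash_next;              /* hash chain link */
};

static struct buf *buckets[NBUCKETS];

int main(void)
{
    const unsigned long nbufs = 1000000;  /* cached buffers (arbitrary) */
    unsigned long i, maxchain = 0, depth = 0;

    /* Insert nbufs synthetic buffers into the fixed-size chained hash. */
    for (i = 0; i < nbufs; i++) {
        struct buf *bp = malloc(sizeof(*bp));
        if (!bp)
            return 1;
        bp->blkno = i;
        unsigned int b = (unsigned int)(bp->blkno % NBUCKETS);
        bp->hash_next = buckets[b];
        buckets[b] = bp;
    }

    /* Longest chain = worst-case comparisons for one hash lookup. */
    for (i = 0; i < NBUCKETS; i++) {
        unsigned long len = 0;
        for (struct buf *bp = buckets[i]; bp; bp = bp->hash_next)
            len++;
        if (len > maxchain)
            maxchain = len;
    }

    /* Depth bound for a balanced tree holding the same population. */
    while ((1UL << depth) < nbufs)
        depth++;

    printf("%lu buffers in %d buckets: avg chain %.1f, max chain %lu\n",
           nbufs, NBUCKETS, (double)nbufs / NBUCKETS, maxchain);
    printf("balanced tree over the same buffers: ~%lu comparisons per lookup\n",
           depth);
    return 0;
}

With 1024 fixed buckets, a million cached buffers mean chains nearly a
thousand entries long, so every lookup degenerates into a long linear
scan, while a balanced tree over the same population needs only about 20
comparisons and its depth keeps growing logarithmically as the cache
grows - which is the sense in which an rb-tree or btree is self-scaling.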