On Thu, Sep 02, 2010 at 03:41:59AM -0500, Stan Hoeppner wrote:
> Dave Chinner put forth on 9/2/2010 2:01 AM:
>
> > No, that's definitely not the case. A different kernel in the same
> > 8p VM, 12x2TB SAS storage, w/ 4 threads, mount options
> > "logbsize=262144":
> >
> > FSUse%        Count         Size    Files/sec     App Overhead
> >      0       800000            0      39554.2          7590355
> >
> > 4 threads with mount options "logbsize=262144,delaylog":
> >
> > FSUse%        Count         Size    Files/sec     App Overhead
> >      0       800000            0      67269.7          5697246
>
> What happens when you bump each of these to 8 threads, 1 per core?

FSUse%        Count         Size    Files/sec     App Overhead
     0      1600000            0     127979.3         13156823

So, 1 thread does 19k files/s, 2 threads do 37k files/s, 4 get 67k, and
8 get 128k. I'd say that's almost linear scaling and CPU bound at each
load point ;)

> If the test consumes all cpus/cores, what instrumentation are you
> viewing that tells you the cpu utilization _isn't_ due to memory b/w
> starvation?

1) profiling like 'perf top' or oprofile, using hardware counters to
   profile on cpu cycles, l1/l2 cache misses, etc.

2) the delayed logging code uses significantly more memory bandwidth
   than the original code because it copies changed information twice
   (instead of once) before it is written to disk. Given that single
   threaded performance of delayed logging is identical to the original
   code and that scalability from 1 to 8 cores is almost linear, it
   cannot be memory bandwidth bound.

The code might be memory _latency_ bound (i.e. on cache misses), but it
is certainly not stressing pure memory bandwidth.

> A modern 64 bit 2 GHz core from AMD or Intel has an L1 instruction
> issue rate of 8 bytes/cycle * 2,000 MHz = 16,000 MB/s = 16 GB/s per
> core. An 8 core machine would therefore have an instruction issue
> rate of 8 * 16 GB/s = 128 GB/s. A modern dual socket system is going
> to top out at 24-48 GB/s, well short of the instruction issue rate.
> Now, this doesn't even take the b/w of data load/store operations
> into account, but I'm guessing the data size per directory operation
> is smaller than the total instruction sequence, which operates on the
> same variable(s).
>
> So, if the CPUs are pegging, and we're not running out of memory b/w,
> then this would lead me to believe that the hot kernel code, core
> fs_mark code and the filesystem data are fully, or near fully,
> contained in level 2 and 3 CPU caches. Is this correct, more or less?

Probably. However (and it is a big however!), I generally don't care to
analyse performance at this level because it gets into
micro-optimisation territory. Sure, it will buy you a few percent here
and there, but then you lose focus on improving the algorithms. An
algorithmic change can provide an order of magnitude improvement, not a
few percent; the delayed logging code is a clear example of that.

Another example: perf top shows this for the above 8p load on a plain
2.6.36-rc3 kernel (which gets about 40k files/s):

            426043.00 27.4% _xfs_buf_find
             87491.00  5.6% __ticket_spin_lock
             67204.00  4.3% xfs_dir2_node_addname
             60434.00  3.9% dso__find_symbol
             48407.00  3.1% kmem_cache_alloc
             37006.00  2.4% __d_lookup
             31625.00  2.0% xfs_trans_buf_item_match
             20036.00  1.3% xfs_log_commit_cil
             18728.00  1.2% _raw_spin_unlock_irqrestore
             18428.00  1.2% __memset
             18001.00  1.2% __memcpy
             17781.00  1.1% xfs_da_do_buf
             17732.00  1.1% xfs_iflush_cluster
             16831.00  1.1% kmem_cache_free
             14836.00  1.0% __kmalloc

It is clear that buffer lookup (_xfs_buf_find) is consuming more CPU
than any other operation. Why? Because the buffer hash table is too
small.

I've already posted patches for a short term solution (increase the
size of the hash table), and the above 127k files/s result is using
that patch. Hence it is clear that the micro-optimisation works, but at
the cost of a 16x increase in memory usage for the hash table. And even
that still isn't really large enough, because this load is already
pushing the limits of the enlarged hash table.

As Christoph has already suggested, the correct way to fix the problem
is to change the caching algorithm to something that is self-scaling
(e.g. an rb-tree or a btree). That will keep memory usage low on small
filesystems, yet scale efficiently to large numbers of buffers,
something a hash cannot easily do.
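As a rough illustration of what "self-scaling" means here, a lookup in
an rbtree needs no up-front sizing decision at all. The sketch below
uses the kernel's generic rbtree API purely as an example; the
structure and function names are invented for this illustration and
are not the actual XFS buffer cache code or the patches under
discussion:

#include <linux/rbtree.h>
#include <linux/types.h>

/* Illustrative only - not the real XFS buffer structures. */
struct cached_buf {
        struct rb_node  rb_node;        /* links into the tree */
        u64             blkno;          /* lookup key: disk address */
        /* ... the rest of the buffer ... */
};

/*
 * Walk an rbtree keyed by block number.  Lookup is O(log n) no matter
 * how many buffers are cached, and no memory is wasted on empty hash
 * buckets when the filesystem is small.  The caller holds whatever
 * lock protects the tree.
 */
static struct cached_buf *
buf_cache_find(struct rb_root *root, u64 blkno)
{
        struct rb_node  *n = root->rb_node;

        while (n) {
                struct cached_buf *bp =
                        rb_entry(n, struct cached_buf, rb_node);

                if (blkno < bp->blkno)
                        n = n->rb_left;
                else if (blkno > bp->blkno)
                        n = n->rb_right;
                else
                        return bp;
        }
        return NULL;
}

Insertion is the same walk followed by rb_link_node() and
rb_insert_color(), so the index grows and shrinks with the working set
instead of being sized in advance like a hash table.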
IOWs, an algorithmic change will solve the problem far better, and in
far more situations, than the micro-optimisation of tweaking hash
sizes. Reduced to the simplest argument: scalability is all about
choosing the right algorithm so that you don't have to care about
minute details to obtain the performance you require.

> I'll have to dig around. I've never even looked for the archives for
> this list.

It's hopefully mirrored in the usual places.

> Out of curiosity, have you ever run into memory b/w starvation before
> peaking all CPUs while running this test?

No. The last time I ran out of bandwidth on an IO workload was while
doing 6-7GB/s of buffered writes to disk on a 24p ia64 Altix. The disk
subsystem could handle 11GB/s and we got 10GB/s with direct IO, but
buffered IO was limited by the cross-sectional memory bandwidth of the
machine (25GB/s) because of the extra copy buffered IO requires.

> I could see that maybe occurring with dual 1GHz+ P3 class systems
> with their smallish caches and lowly single channel PC100, back
> before the switch to DDR memory, but those machines were probably
> gone before XFS was open sourced, IIRC, so you may not have had the
> pleasure (if you could call it that).

The ratio between CPU cycles and memory bandwidth really hasn't changed
much since then. The CPUs weren't powerful enough back then, either, to
run enough metadata ops to get near the memory bandwidth limits.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs