On Wed, Nov 17, 2010 at 06:09:12PM -0800, Andrew Morton wrote:
> On Thu, 18 Nov 2010 13:06:40 +1100 Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> > On Wed, Nov 17, 2010 at 03:03:30PM -0800, Andrew Morton wrote:
> > > On Wed, 17 Nov 2010 12:27:20 +0800
> > > Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote:
> > >
> > > > On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%, and
> > > > improves IO throughput from 38MB/s to 42MB/s.
> > >
> > > The changes in CPU consumption are remarkable. I've looked through the
> > > changelogs but cannot find mention of where all that time was being
> > > spent?
> >
> > In the writeback path, mostly because every CPU is trying to run
> > writeback at the same time and causing contention on locks and
> > shared structures in the writeback path. That no longer happens
> > because writeback is only happening from one thread instead of from
> > all CPUs at once.
>
> It'd be nice to see this quantified. Partly because handing things
> over to kernel threads incurs extra overhead - scheduling cost and CPU
> cache footprint.

Sure, but in this case, the scheduling cost is much lower than
actually doing writeback of 1500 pages. The CPU cache footprint of
the syscall is also greatly reduced because we don't go down the
writeback path. That shows up in the fact that the "app overhead"
measured by fs_mark goes down significantly with this patch series
(30-50% reduction) - it's doing the same work, but it's taking much
less wall time....

And if you are after lock contention numbers, I have quantified it,
though I do not have saved lock_stat numbers at hand. Running the
current inode_lock breakup patchset and the fs_mark workload (8-way
parallel create of 1 byte files), lock_stat shows the
inode_wb_list_lock as the hottest lock in the system (more trafficked
and much more contended than the dcache_lock), along with the
inode->i_lock being the most trafficked.
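
To make the two models discussed above concrete, here is a toy
userspace sketch in Python. It is not kernel code and is not taken
from the patchset: `foreground_model`, `flusher_model`, `wb_lock` and
the writer count are hypothetical names and values chosen purely for
illustration; only the 1500-page figure comes from the text above. It
counts how often writers take a shared lock when each one flushes its
own pages, versus how many handoffs occur when a single flusher thread
does all the flushing. It measures no timings.

```python
import queue
import threading

PAGES_PER_WRITER = 1500  # figure borrowed from the mail above
NUM_WRITERS = 8          # assumption: arbitrary writer count for the toy


def foreground_model():
    """Every writer flushes its own pages under one shared lock."""
    wb_lock = threading.Lock()  # hypothetical stand-in for a shared writeback lock
    state = {"flushed": 0, "acquisitions": 0}

    def writer():
        for _ in range(PAGES_PER_WRITER):
            with wb_lock:  # all writers compete for this lock, once per page
                state["flushed"] += 1
                state["acquisitions"] += 1

    threads = [threading.Thread(target=writer) for _ in range(NUM_WRITERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["flushed"], state["acquisitions"]


def flusher_model():
    """Writers hand off one request each; a single flusher does the flushing."""
    requests = queue.Queue()
    state = {"flushed": 0, "handoffs": 0}

    def flusher():
        while True:
            pages = requests.get()
            if pages is None:  # sentinel: no more requests
                break
            state["flushed"] += pages  # only the flusher touches this state
            state["handoffs"] += 1

    def writer():
        requests.put(PAGES_PER_WRITER)  # one handoff instead of 1500 lock trips

    flusher_thread = threading.Thread(target=flusher)
    flusher_thread.start()
    writers = [threading.Thread(target=writer) for _ in range(NUM_WRITERS)]
    for t in writers:
        t.start()
    for t in writers:
        t.join()
    requests.put(None)  # queued after every writer request, so nothing is lost
    flusher_thread.join()
    return state["flushed"], state["handoffs"]
```

Both functions flush the same number of pages. The first takes the
shared lock once per page from every writer, the second needs only one
queue handoff per writer. The queue has an internal lock of its own,
so this only illustrates the reduction in shared-lock traffic, not the
actual kernel behaviour or its cost.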

Running `perf top -p <pid of bdi-flusher>` showed it spending 30-40%
of its time in __ticket_spin_lock. I saw the same thing with every
fs_mark process also showing 30-40% of its time in
__ticket_spin_lock. Every process also showed a good chunk of time in
the writeback path. Overall, the fs_mark processes showed a CPU
consumption of ~620% CPU, with the bdi-flusher at 80% of a CPU and
kswapd at 80% of a CPU.

With the patchset, all that spin lock time is gone from the profiles
(down to about 2%) as is the writeback path (except for the
bdi-flusher, which is all writeback path). Overall, we have fs_mark
processes showing 250% CPU, the bdi-flusher at 80% of a CPU, and
kswapd at about 20% of a CPU, with over 400% idle time.

IOWs, we've traded off 3-4 CPUs' worth of spinlock contention and a
flusher thread running at 80% CPU for a flusher thread that runs at
80% CPU doing the same amount of work. To me, that says the cost of
scheduling is orders of magnitude lower than the cost of the current
code...

> But mainly because we're taking the work accounting away from the user
> who caused it and crediting it to the kernel thread instead, and that's
> an actively *bad* thing to do.

The current foreground writeback is doing work on behalf of the
system (i.e. doing background writeback) and crediting it to the
user process. That seems wrong to me; it's hiding the overhead of
system tasks in user processes.

IMO, time spent doing background writeback should not be credited to
user processes - writeback caching is a function of the OS and its
overhead should be accounted as such. Indeed, nobody has realised
(until now) just how inefficient it really is because the overhead is
mostly hidden in user process system time.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx