On Fri, May 16, 2014 at 09:55:14AM +1000, Dave Chinner wrote:
> On Thu, May 15, 2014 at 05:41:53PM +0200, Jan Kara wrote:
> > Hello,
> >
> > so I was recently thinking about how the writeback code shuffles inodes
> > between lists and also how redirty_tail() clobbers the dirtied_when
> > timestamp (which broke my sync(2) optimization). This patch series came
> > out of that. Patch 1 is a clear win and just needs an independent review
> > that I didn't forget about something. Patch 3 changes writeback list
> > handling - IMHO it makes the logic somewhat more straightforward as we
> > don't have to bother shuffling inodes between lists and we also don't
> > need to clobber the dirtied_when timestamp. But opinions may differ...
> >
> > The patches passed an xfstests run and I did some basic writeback tests
> > using tiobench and some artificial sync livelock tests to verify nothing
> > regressed. So I'd be happy if people could have a look.
>
> Performance regresses significantly.
>
> The test is on a 16p/16GB VM with a sparse 100TB XFS filesystem backed
> by a pair of SSDs in RAID0:
>
> ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 \
>         -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 \
>         -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 \
>         -d /mnt/scratch/6 -d /mnt/scratch/7
>
> That creates 10 million 4k files with 16 threads and 10000 files per
> directory. No sync/fsync is done, so it's a pure background writeback
> workload.
>
> For 0-400,000 files it runs in memory, at 400-800k files background
> writeback is occurring, and at > 800k files foreground throttling is
> occurring.
>
> The file create rates and write IOPS/bandwidth are:
>
>                        vanilla                 patched
> load point       files   iops   bw        files   iops   bw
> < bg thres        120k      0   0          110k      0   0
> < fg thres        120k    37k   210MB/s     60k    20k   130MB/s
> sustained          36k    37k   210MB/s     25k    28k   150MB/s
>
> The average create rate is 40k (vanilla) vs 28k (patched). Wall times:
>
>            vanilla       patched
> real       4m27.475s     6m29.364s
> user       1m7.072s      1m3.590s
> sys        10m0.836s     22m34.362s
>
> The new code burns more than twice the system CPU whilst going
> significantly slower.

OK, so it looks like we've got a case of lock contention:

    48.37%  [kernel]  [k] _raw_spin_lock
            - _raw_spin_lock
               - 96.32% __mark_inode_dirty
                  - 99.99% __set_page_dirty
                       mark_buffer_dirty
                       __block_commit_write.isra.25
                       block_write_end
                       generic_write_end
                       xfs_vm_write_end
                       generic_perform_write
                       xfs_file_buffered_aio_write
                       xfs_file_aio_write
                       do_sync_write
                       vfs_write
                       sys_write
                       tracesys
                       __GI___libc_write
                  + 1.80% xfs_log_commit_cil

Which is almost certainly the bdi->wb.list_lock.

  -  6.70%  [kernel]  [k] tag_for_io
        tag_for_io
        wb_writeback
        bdi_writeback_workfn
        process_one_work
        worker_thread
        kthread
        ret_from_fork

But that's still a lot of CPU in tag_for_io(). Ah, that's the cause:
tag_for_io() holds the bdi->wb.list_lock for the entire list walk while
tagging inodes. That explains why performance doesn't start to drop
until writeback begins. Ouch:

writeback_tag_for_io: bdi 253:32: older=4295532768 age=56 tagged=14092 reason=background
writeback_tag_for_io: bdi 253:32: older=4295532808 age=56 tagged=13556 reason=background
writeback_tag_for_io: bdi 253:32: older=4295532848 age=56 tagged=12387 reason=background
writeback_tag_for_io: bdi 253:32: older=4295532888 age=60 tagged=11414 reason=background
writeback_tag_for_io: bdi 253:32: older=4295532929 age=60 tagged=11128 reason=background
writeback_tag_for_io: bdi 253:32: older=4295532970 age=60 tagged=10457 reason=background

We're holding the lock while we walk 10,000+ inodes at a time and the
other 15 CPUs are spinning on that lock.
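To make the contention pattern concrete, here's a rough sketch of what a
walk-and-tag under a single list_lock looks like. This is an illustration
only, not the actual code from the patch series; the structure names and
the tag flag below are made up.

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/jiffies.h>

/* Illustrative sketch only -- not the patch's actual tag_for_io(). */
struct sketch_wb {
	spinlock_t	 list_lock;	/* stands in for bdi->wb.list_lock */
	struct list_head b_dirty;	/* inodes ordered by dirtied_when */
};

struct sketch_inode {			/* stand-in for struct inode */
	struct list_head i_io_list;
	unsigned long	 dirtied_when;
	unsigned long	 i_state;
};

#define I_TAGGED_FOR_IO	(1 << 0)	/* hypothetical "selected for IO" flag */

static void tag_for_io_sketch(struct sketch_wb *wb,
			      unsigned long older_than_this)
{
	struct sketch_inode *inode;

	spin_lock(&wb->list_lock);	/* held across the entire walk */
	list_for_each_entry(inode, &wb->b_dirty, i_io_list) {
		if (time_after(inode->dirtied_when, older_than_this))
			break;		/* list is time ordered, rest is newer */
		inode->i_state |= I_TAGGED_FOR_IO;
	}
	spin_unlock(&wb->list_lock);
	/*
	 * With 10,000+ inodes tagged per call, every other CPU dirtying
	 * pages via __mark_inode_dirty() spins on list_lock until the
	 * walk completes -- which matches the profile above.
	 */
}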
List walking and tagging like this is never going to scale.

I think we do need multiple lists, but not like we currently have. Right
now the problem is that inodes with different dirtied_when values sit on
the same ordered list, and we need to efficiently find the first inode
with a specific dirtied_when value within that list to determine which
inodes need to be written back.

Seems to me that this problem has been solved elsewhere in the kernel,
e.g. for tracking timers that expire on a given jiffy (tvec, tvec_base,
timer_list). Perhaps we should be looking to move to a set of time-based
lists for efficiently tracking which inodes should be written back in a
given background writeback interval, rather than trying to keep
everything on the one list. A rough sketch of the idea follows below.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
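For illustration, here is a rough sketch of what such time-based lists
could look like, loosely modelled on the timer wheel. All names, the
bucket count and the per-second granularity are made up for the sketch;
this is the shape of the idea, not a proposed implementation.

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/jiffies.h>

#define WB_TIME_BUCKETS	64

struct wb_time_buckets {
	spinlock_t	 lock;
	unsigned long	 base_time;	/* dirtied_when of bucket 0 */
	struct list_head bucket[WB_TIME_BUCKETS];
};

/* Newly dirtied inodes go straight into the bucket for their dirtied_when. */
static struct list_head *wb_bucket_for(struct wb_time_buckets *b,
				       unsigned long dirtied_when)
{
	unsigned long idx = 0;

	if (time_after(dirtied_when, b->base_time))
		idx = (dirtied_when - b->base_time) / HZ;
	if (idx >= WB_TIME_BUCKETS)
		idx = WB_TIME_BUCKETS - 1;	/* clamp far-future dirtying */
	return &b->bucket[idx];
}

/*
 * Expiry splices whole buckets older than the cutoff onto a private work
 * list -- O(buckets) under the lock, independent of how many inodes are
 * dirty, instead of an O(inodes) walk-and-tag.
 */
static void wb_expire_sketch(struct wb_time_buckets *b,
			     unsigned long older_than_this,
			     struct list_head *work_list)
{
	unsigned long i, nr;

	if (time_before(older_than_this, b->base_time))
		return;				/* nothing old enough yet */

	nr = (older_than_this - b->base_time) / HZ;
	if (nr > WB_TIME_BUCKETS)
		nr = WB_TIME_BUCKETS;

	spin_lock(&b->lock);
	for (i = 0; i < nr; i++)
		list_splice_tail_init(&b->bucket[i], work_list);
	spin_unlock(&b->lock);
}

A real implementation would also have to advance base_time and handle
redirtying, but the point is that the work done under the lock scales
with the number of buckets rather than the number of dirty inodes.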