On Fri 16-05-14 10:47:54, Dave Chinner wrote:
> On Fri, May 16, 2014 at 09:55:14AM +1000, Dave Chinner wrote:
> > On Thu, May 15, 2014 at 05:41:53PM +0200, Jan Kara wrote:
> > > Hello,
> > >
> > > so I was recently thinking about how writeback code shuffles inodes between
> > > lists and also how redirty_tail() clobbers dirtied_when timestamp (which broke
> > > my sync(2) optimization). This patch series came out of that. Patch 1 is a
> > > clear win and just needs an independent review that I didn't forget about
> > > something. Patch 3 changes writeback list handling - IMHO it makes the logic
> > > somewhat more straightforward as we don't have to bother shuffling inodes
> > > between lists and we also don't need to clobber dirtied_when timestamp.
> > > But opinions may differ...
> > >
> > > Patches passed an xfstests run and I did some basic writeback tests using
> > > tiobench and some artificial sync livelock tests to verify nothing regressed.
> > > So I'd be happy if people could have a look.
> >
> > Performance regresses significantly.
> >
> > Test is on a 16p/16GB VM with a sparse 100TB XFS filesystem backed
> > by a pair of SSDs in RAID0:
> >
> > ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d
> > /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d
> > /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d
> > /mnt/scratch/6 -d /mnt/scratch/7
> >
> > That creates 10 million 4k files with 16 threads and 10000 files per
> > directory. No sync/fsync is done, so it's a pure background
> > writeback workload.
> >
> > For 0-400,000 files, it runs in memory, at 400-800k files background
> > writeback is occurring, at > 800k files foreground throttling is
> > occurring.
> >
> > The file create rates and write IOPS/bw are:
> >
> >                     vanilla               patched
> > load point      files  iops      bw   files  iops      bw
> > < bg thres       120k     0       0    110k     0       0
> > < fg thres       120k   37k 210MB/s    60k   20k 130MB/s
> > sustained         36k   37k 210MB/s    25k   28k 150MB/s
> >
> > The average create rate is 40k (vanilla) vs 28k (patched). Wall
> > times:
> >
> >           vanilla      patched
> > real    4m27.475s    6m29.364s
> > user     1m7.072s     1m3.590s
> > sys     10m0.836s   22m34.362s
> >
> > The new code burns more than twice the system CPU whilst going
> > significantly slower.
>
> OK, so it looks like we've got a case of lock contention:
>
>    48.37%  [kernel]  [k] _raw_spin_lock
>     - _raw_spin_lock
>        - 96.32% __mark_inode_dirty
>           - 99.99% __set_page_dirty
>                mark_buffer_dirty
>                __block_commit_write.isra.25
>                block_write_end
>                generic_write_end
>                xfs_vm_write_end
>                generic_perform_write
>                xfs_file_buffered_aio_write
>                xfs_file_aio_write
>                do_sync_write
>                vfs_write
>                sys_write
>                tracesys
>                __GI___libc_write
>        + 1.80% xfs_log_commit_cil
>
> Which is almost certainly the bdi->wb.list_lock.
>
>   -  6.70%  [kernel]  [k] tag_for_io
>        tag_for_io
>        wb_writeback
>        bdi_writeback_workfn
>        process_one_work
>        worker_thread
>        kthread
>        ret_from_fork
>
> But that's still a lot of CPU in tag_for_io().
>
> Ah, that's the cause. tag_for_io() holds the bdi->wb.list_lock
> for the entire list walk tagging inodes. That explains why
> performance doesn't start to drop until writeback begins.
>
> Ouch:
>
> writeback_tag_for_io: bdi 253:32: older=4295532768 age=56 tagged=14092 reason=background
> writeback_tag_for_io: bdi 253:32: older=4295532808 age=56 tagged=13556 reason=background
> writeback_tag_for_io: bdi 253:32: older=4295532848 age=56 tagged=12387 reason=background
> writeback_tag_for_io: bdi 253:32: older=4295532888 age=60 tagged=11414 reason=background
> writeback_tag_for_io: bdi 253:32: older=4295532929 age=60 tagged=11128 reason=background
> writeback_tag_for_io: bdi 253:32: older=4295532970 age=60 tagged=10457 reason=background
>
> We're holding the lock while we walk 10,000+ inodes at a time and
> the other 15 CPUs are spinning on that.
>
> List walking and tagging like this is never going to scale. I think
> we do need multiple lists, but not like we currently have. Right now
> the problem is that inodes of different dirtied_when sit on the same
> ordered list, and we need to find the first inode with a specific
> dirtied_when value within the list efficiently to be able to
> determine what inodes need to be written back.
  Thanks for the data and the analysis. I was aware of this problem, but
previously we walked the list in move_expired_inodes(), so I didn't expect
my patches to make it any worse... After some thought I guess the inode
tagging, with its i_lock acquisition, ends up being more expensive than the
plain list_move() we did previously.

> Seems to me that this problem has been solved elsewhere in the
> kernel, like for tracking of timers to expire on a given jiffie
> (tvec, tvec_base, timer_lists). Perhaps we should be looking to move
> to a set of time based lists for efficiently tracking what inodes
> should be written back at a given background writeback interval
> rather than trying to keep everything on the one list.
  Thanks for the idea. I'll think about this - the trouble with writeback
lists is that we need to select inodes either based on time, or based on
superblock (depending on the type of writeback).
  But we surely can do better than now :)

								Honza
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR