On Thu, Jun 04, 2015 at 10:35:47AM +1000, Dave Chinner wrote:
> > - Trace cmd report
> >
> > Too big to attach. Here's a link:
> >
> > https://www.dropbox.com/s/3xxe2chsv4fsrv8/trace_report.txt.zip?dl=0
>
> Downloading now.

AIL pushing is occurring every 30s, yes. Across all filesystems, there
are roughly 23-25,000 metadata objects being pushed on every 30s flush.

Think about that for a moment. You have a write-once workload, so inode
metadata is journalled and written only once. Hence if you are creating
1000 files/s, then you have at least 30,000 inodes to push every 30s.

But that's not actually the big problem. Across the two AIL push events
in the trace, this is how many objects we attempt to push:

$ wc -l t.t
45149 t.t

And this many inodes:

$ grep INODE t.t | wc -l
11512

Now, XFS has inode clustering on writeback and that is active; it is
reducing the number of inode IOs by a factor of roughly 10. So that
means that every 30s we've only got ~600 IOs across 8 disks to write
back dirty inodes (~5,750 inodes per push, clustered down to ~600 IOs),
i.e. less than a second's worth of random IO. That's not the problem we
are looking for.

Buffers, OTOH:

$ grep BUF t.t | wc -l
33637

So call it 17,000 every 30 seconds. That requires 17,000 4k IOs.
Across 8 disks at 170 IOPS each, that is *exactly* 12.5 seconds worth
of IO (17,000 / (8 x 170) = 12.5s).

Looks to me like the buffers are mostly inode btree, free space btree
and directory buffers.

Directory buffers, well, that's where increasing the directory block
size might help (e.g. to 8k). That may well reduce the number of
directory buffers by more than a factor of 2 due to the structure of
the directories. Depends on how many files you have in each
directory....

The number of inode and alloc btree buffers can be reduced by reducing
the number of AGs - probably by a factor of 10 by bringing the AG count
down to 4. And, because the active inode and freespace btree buffers
will be hotter, they are more likely just to be relogged than written
back, further reducing IOs.

Indeed, this looks to me like the smoking gun. To allocate a block, you
have to lock the AGF buffer of the AG the allocation is going to take
place in. The problem is that when the xfsaild pushes the AGF buffers
to the writeback queue, they sit there with the buffer locked until the
IO completes. In the traces, the xfsailds all run at 509385s, and
immediately I see a ~10s gap in the trace where almost no
xfs_read_agf() traces occur. It's not until 509396s that the traces
really start to appear at normal speed again.

Again, reducing the number of AGs will help with this problem, simply
because the AG headers are more likely to be locked or pinned when the
xfsaild sweep runs, because they are active rather than sitting idle
waiting for the next operation in that AG to require allocation....

Remember, a single AG can sustain thousands of allocations every
second - if you are only creating a few thousand files every second,
you don't need tens of AGs to sustain that; the default of 4 AGs will
do that just fine...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
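
To make the directory block size suggestion above concrete, here is a
rough sketch (not from the original mail): the mount point /mnt/data
and device /dev/sdX are placeholders, and the directory block size can
only be set at mkfs time, so applying it means rebuilding the
filesystem.

$ # current directory ("naming") block size, as reported by xfs_info
$ xfs_info /mnt/data | grep naming
$ # rebuild with 8k directory blocks; -f forces mkfs and destroys
$ # whatever is currently on the device
$ mkfs.xfs -f -n size=8192 /dev/sdX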
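
Similarly, a sketch of the AG count change (again, /mnt/data and
/dev/sdX are placeholders, and this too is an mkfs-time change):

$ # how many allocation groups the filesystem currently has
$ xfs_info /mnt/data | grep agcount
$ # rebuild with 4 AGs (destroys existing data)
$ mkfs.xfs -f -d agcount=4 /dev/sdX

Since both changes wipe the device, they would need to be tried on a
scratch filesystem and measured against the same workload before
touching the production disks.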