On Thu, Jun 04, 2015 at 11:25:30AM +1000, Dave Chinner wrote:
> On Thu, Jun 04, 2015 at 10:35:47AM +1000, Dave Chinner wrote:
> > > - Trace cmd report
> > > Too big to attach. Here's a link:
> > > https://www.dropbox.com/s/3xxe2chsv4fsrv8/trace_report.txt.zip?dl=0
> >
> > Downloading now.
>
> AIL pushing is occurring every 30s, yes. Across all filesystems, there
> are roughly 23-25,000 metadata objects being pushed every 30s flush.
...
> Indeed, this looks to me like the smoking gun. To allocate a block,
> you have to lock the AGF buffer that the allocation is going to take
> place in. Problem is, when the xfsaild pushes the AGF buffers to the
> writeback queue, they sit there with the buffer locked until the IO
> completes.
>
> In the traces, the xfsailds all run at 509385s, and immediately I
> see a ~10s gap in the trace where almost no xfs_read_agf() traces
> occur. It's not until 509396s that the traces really start to appear
> at normal speed again.
>
> Again, reducing the number of AGs will help with this problem,
> simply because the AG headers are more likely to be locked or
> pinned when the xfsaild sweep runs because they are active rather
> than sitting idle waiting for the next operation in that AG to
> require allocation....
>
> Remember, a single AG can sustain thousands of allocations every
> second - if you are only creating a few thousand files every second,
> you don't need tens of AGs to sustain that - the default of 4 AGs
> will do that just fine...

And in looking deeper into the issue, I think there are some code
changes we need to make to minimise it. Allocation requires a locked
AGF buffer, but the buffers also need to be locked for IO. The
underlying issue looks like we hold the lock for too long during IO
submission, i.e. a list gets passed to the delayed write submission
code, which then walks the list locking the buffers, then we sort and
issue the IO on the list.
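
The lock-everything, then sort, then issue pattern described above can
be sketched as a toy userspace model. This is an illustration only, not
the kernel code: struct toy_buf, toy_delwri_submit() and the batch
parameter are hypothetical names, and the toy drops each lock at issue
time whereas the real buffers stay locked until the IO completes.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for a metadata buffer; NOT the kernel's struct xfs_buf. */
struct toy_buf {
    long blkno;   /* disk address, used only for sorting */
    int  locked;  /* 1 while the submitter holds the buffer lock */
};

/* Order buffers by ascending disk address. */
static int cmp_blkno(const void *a, const void *b)
{
    const struct toy_buf *x = a, *y = b;

    return (x->blkno > y->blkno) - (x->blkno < y->blkno);
}

/*
 * Hypothetical model of delayed write submission.
 *
 * batch == 0 locks the whole list before issuing anything, which is the
 * behaviour described in the text. A non-zero batch is an assumed
 * variation that locks, sorts and issues at most `batch` buffers at a
 * time.
 *
 * Returns the peak number of buffers held locked at once.
 */
size_t toy_delwri_submit(struct toy_buf *bufs, size_t n, size_t batch)
{
    size_t peak = 0;
    size_t start = 0;

    if (batch == 0)
        batch = n;

    while (start < n) {
        size_t len = (n - start < batch) ? n - start : batch;
        size_t i;

        /* Pass 1: lock every buffer in this chunk. */
        for (i = start; i < start + len; i++)
            bufs[i].locked = 1;
        if (len > peak)
            peak = len;

        /* Pass 2: sort the chunk by disk address. */
        qsort(bufs + start, len, sizeof(*bufs), cmp_blkno);

        /* Pass 3: issue the IO; the toy unlocks at issue time. */
        for (i = start; i < start + len; i++)
            bufs[i].locked = 0;

        start += len;
    }
    return peak;
}
```

With batch set to 0 the model holds all n buffers locked before the
first IO is issued; a non-zero batch bounds that number, at the cost of
sorting only within each chunk.
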
If the writeback queue is long enough, submission gets blocked on the
request queue and we wait with locked buffers, and hence don't allow
modifications to take place on the buffers while we are waiting for
submission.

Fixing this requires a tweak to the algorithm in
__xfs_buf_delwri_submit() so that we don't lock an entire list of
thousands of IOs before starting submission. In the meantime, reducing
the number of AGs will reduce the impact of this because the delayed
write submission code will skip buffers that are already locked or
pinned in memory, and hence an AG under modification at the time
submission occurs will be skipped by the delwri code.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx