On Wed, Apr 19, 2017 at 11:47:43PM +0200, Michael Weissenbacher wrote:
> OK, do I understand you correctly that the xfsaild does all the actual
> work of updating the inodes? And it's doing that single-threaded,
> reading both the log and the inodes themselves?

The problem is that the backing buffers that are used for flushing
inodes have been reclaimed due to memory pressure, but the inodes in
cache are still dirty. Hence, to write the dirty inodes, we first have
to read the inode buffer back into memory. Essentially, the log pins
more dirty inodes than writeback can keep up with given the memory
pressure that is occurring.

> > ...500MB / 200 bytes per dirty inode record = ~2.5 million inodes.
> > ~2.5m inodes / 10 minutes = ~4200 inodes per second...
>
> Yes, these numbers sound plausible. When everything is running smoothly
> (i.e. most of the time) I do see up to 4500 combined reads+writes per
> second. But when the problem arises, all I see is xfsaild doing around
> 120 reads/second, which is roughly the performance of a single 7200rpm
> drive. Directly after this "read only phase" (which lasts for about 5-10
> minutes most of the time) there is a short burst of up to 10000+
> writes/sec (writing to the RAID controller cache I suppose).

Yes. Inode buffer reads from the xfsaild are sequential, so they block
the xfsaild and log tail pushing while they are ongoing. The xfsaild
then queues the writes on a delayed write list, and they are issued all
in one go. Essentially, the system can dirty far more metadata in
memory than it can effectively cache and flush to disk in a timely
manner.

> And then the system goes back to the "normal phase" where it is doing
> both reads and writes and all those blocked D state processes continue
> working.

Yup, because the xfsaild has moved the tail of the log, so there is now
log space available and ongoing transactions can run at CPU speed
again. The xfsaild will still be doing RMW inode writeback, but nothing
is blocking waiting for it to reach its target now...

I'd suggest that the only thing you can realistically do to mitigate
this stall without changing the log size is to lower
/proc/sys/fs/xfs/xfssyncd_centisecs to one second, so that the CIL is
flushed to disk every second and the AIL push target is updated
immediately after the CIL flush is done. This will result in the
xfsaild trying to push metadata almost as quickly as it is committed to
the journal, hence it should be trying to flush inodes while the inode
buffer is still cached in memory and so avoid the read IO. This will
help prevent AIL pushing from getting so far behind that inode buffers
are reclaimed before the dirty inodes are written. It may not
completely avoid the stalls, but it should make them less frequent...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
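
P.S. For reference, here's the back-of-envelope arithmetic quoted above
worked through as a quick Python sketch. The 500MB of log space, ~200
bytes per dirty inode record and ~10 minute stall window are the
figures from the thread, not measured values:

    # Sketch of the log-space arithmetic: how many dirty inodes can the
    # log pin, and how fast must the xfsaild flush them to drain it in
    # the observed stall window?
    log_space_bytes = 500 * 1000 * 1000   # ~500MB of log space pinning dirty inodes
    bytes_per_inode_record = 200          # approx. dirty inode log record size
    stall_seconds = 10 * 60               # the ~10 minute stall window

    dirty_inodes = log_space_bytes // bytes_per_inode_record
    print(f"dirty inodes pinned by the log: ~{dirty_inodes:,}")          # ~2,500,000
    print(f"flush rate needed: ~{dirty_inodes // stall_seconds:,}/sec")  # ~4,166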
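
And a minimal sketch of applying the suggested tuning at runtime. The
sysctl is in centiseconds, so one second is a value of 100; this needs
root and does not persist across reboots (use sysctl.conf or similar
for that):

    # Lower the XFS sync interval to one second, so the CIL is flushed
    # and the AIL push target updated every second. Runtime-only change.
    SYSCTL = "/proc/sys/fs/xfs/xfssyncd_centisecs"

    with open(SYSCTL, "w") as f:
        f.write("100")   # 100 centiseconds == 1 second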