Re: xfsaild in D state seems to be blocking all other i/o sporadically

On 20.04.2017 01:48, Dave Chinner wrote:
> On Wed, Apr 19, 2017 at 11:47:43PM +0200, Michael Weissenbacher wrote:
>> OK, do I understand you correctly that the xfsaild does all the actual
>> work of updating the inodes? And it's doing that single-threaded,
>> reading both the log and the inodes themselves?
> 
> The problem is that the backing buffers that are used for flushing
> inodes have been reclaimed due to memory pressure, but the inodes in
> cache are still dirty. Hence to write the dirty inodes, we first
> have to read the inode buffer back into memory.
> 
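If I've understood that, each flush can degenerate into a read-modify-write
cycle. A rough user-space sketch of my mental model (path, offset and sizes
are made up; this is not the actual kernel code):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
        char buf[4096];                 /* stand-in inode cluster buffer */
        off_t off = 128 * 4096;         /* hypothetical on-disk location */
        int fd = open("/tmp/fake-metadata.img", O_RDWR);

        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* READ: the buffer was reclaimed, so it has to come back from
         * disk before anything can be flushed into it. */
        if (pread(fd, buf, sizeof(buf), off) < 0) {
                perror("pread");
                return 1;
        }

        /* MODIFY: copy the dirty in-core inode into the buffer. */
        memset(buf + 256, 0xab, 256);   /* stand-in for that copy */

        /* WRITE: the buffer goes back out (XFS defers this onto a
         * delayed write list rather than writing it immediately). */
        if (pwrite(fd, buf, sizeof(buf), off) < 0) {
                perror("pwrite");
                return 1;
        }

        close(fd);
        return 0;
}
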
Interesting find. Is there a way to prevent those buffers from getting
reclaimed? Would adjusting vfs_cache_pressure help? Or adding more
memory to the system? In fact, the best thing would be to disable file
content caching completely: given the use case (backup server), caching
file content is worthless to us.
My primary objective is to avoid those stalls and reduce latency, even
at the expense of throughput.
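
One thing we could try on our side: have the backup tool drop its own
cached file data as it goes. A minimal sketch using
posix_fadvise(POSIX_FADV_DONTNEED), assuming we can patch the tool's
reader loop (the path is hypothetical):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        char buf[1 << 16];
        int fd = open("/backup/some-file", O_RDONLY);

        if (fd < 0) {
                perror("open");
                return 1;
        }

        while (read(fd, buf, sizeof(buf)) > 0)
                ;                       /* consume the file */

        /* Hint: drop all cached pages for this file (len 0 = to EOF). */
        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

        close(fd);
        return 0;
}

That only drops the read-side cache one file at a time, but on a backup
source that is exactly the data we don't want competing with metadata
buffers.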

> Essentially, the log pins more dirty inodes than writeback
> can keep up with given the memory pressure that is occurring.
> 
>>
>>> ...500MB / 200 bytes per dirty inode record = ~2.5 million inodes.
>>> ~2.5m inodes / 10 minutes = ~4200 inodes per second...
>> Yes, these numbers sound plausible. When everything is running smoothly
>> (i.e. most of the time) I do see up to 4500 combined reads+writes per
>> second. But when the problem arises, all I see is xfsaild doing around
>> 120 reads/second, which is roughly the performance of a single 7200rpm
>> drive. Directly after this "read-only phase" (which lasts for about 5-10
>> minutes most of the time) there is a short burst of up to 10000+
>> writes/sec (writing to the RAID controller cache I suppose).
> 
> Yes. Inode buffer reads from the xfsaild are sequential, so they block
> the aild and log tail pushing while they are ongoing. It then queues
> the writes on a delayed write list and they are issued all in one
> go. Essentially, the system can dirty far more metadata in memory
> than it can effectively cache and flush to disk in a timely manner.
> 
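So the pattern is: blocking, seek-bound reads one buffer at a time, then
all the deferred writes issued at the end. A toy model of that shape
(hypothetical file and scattered offsets, not XFS code):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define NBUFS   8
#define BUFSZ   4096

int main(void)
{
        static char bufs[NBUFS][BUFSZ];
        off_t offs[NBUFS];
        int fd = open("/tmp/fake-metadata.img", O_RDWR);
        int i;

        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* Phase 1: blocking, seek-bound reads -- the 5-10 minute
         * "read only phase" at single-disk random-read IOPS. */
        for (i = 0; i < NBUFS; i++) {
                offs[i] = (off_t)i * 997 * BUFSZ;  /* scattered on disk */
                if (pread(fd, bufs[i], BUFSZ, offs[i]) < 0)
                        perror("pread");
                memset(bufs[i], 0, 64);            /* apply dirty inode */
        }

        /* Phase 2: the delayed write list is flushed in one go -- the
         * short 10000+ writes/sec burst into the controller cache. */
        for (i = 0; i < NBUFS; i++)
                if (pwrite(fd, bufs[i], BUFSZ, offs[i]) < 0)
                        perror("pwrite");

        close(fd);
        return 0;
}
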
>> And then
>> the system goes back to the "normal phase" where it is doing both reads
>> and writes and all those blocked D state processes continue working.
> 
> Yup, because the xfsaild has moved the tail of the log, so there is
> now log space available and ongoing transactions can run at CPU
> speed again. The xfsaild will still be doing RMW inode writeback,
> but nothing is blocking waiting for it to reach its target now...
> 
> I'd suggest that the only thing you can realistically do to mitigate
> this stall without changing the log size is change
> /proc/sys/fs/xfs/xfssyncd_centisecs down to one second so that the
> CIL is flushed to disk every second, and the AIL push target is
> updated immediately after the CIL flush is done. This will result in
> the xfsaild trying to push metadata almost as quickly as it is
> committed to the journal, hence it should be trying to flush inodes
> while the inode buffer is still cached in memory and hence avoid the
> read IO. This will help avoid AIL pushing from getting so far behind
> that inode buffers are reclaimed before the dirty inodes are
> written. It may not completely avoid the stalls, but it should make
> them less frequent...
> 
OK, I just did that (set xfssyncd_centisecs=100). Will see if it helps.
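
For the archives, the change amounts to this (a plain echo does the same;
needs root; 100 centiseconds = a CIL flush every second):

#include <stdio.h>

int main(void)
{
        /* Same effect as writing "100" into the tunable by hand. */
        FILE *f = fopen("/proc/sys/fs/xfs/xfssyncd_centisecs", "w");

        if (!f) {
                perror("fopen");
                return 1;
        }
        fprintf(f, "100\n");
        return fclose(f) ? 1 : 0;
}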

thanks,
Michael


