[XFS SUMMIT] AIL Behavioural Issues

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 13 May 2020 13:42:21 +1000

Topic:	AIL Behavioural Issues

Scope:
	Performance
	Efficiency
	Scalability

Proposal:

The XFS AIL numerous issues on high performance storage and
highly concurrent workloads. These can be broken down into several
categories:

1. Tracks too many redundant objects

We track inodes and dquots individually in the AIL, but we push
them as groups based on the backing buffer for the objects. For
inodes, this means we can have 32 inode items in the AIL for a
single buffer. We push the first inode and queue the buffer to the
delwri list, then we have to skip the other 31 inodes on that buffer
in the AIL.

This typically shows up as having a "success to flushing" ratio of
~30:1 in the AIL stats. We are doing 30x more work that we need to
inserting, iterating, and removing inodes in the AIL than required
to flush them. dquots have a similar problem.

So the question becomes this: how do we efficiently track related
objects that share a common backing object through the AIL? For
inodes, I'm attempting to do this with ordered buffers, but that is
proving to have a lot of lock ordering issues associated with it.

Perhaps there is a better way? Discuss.

2. Creates massive IO lists and so iteration latency can be multiple
  seconds as IO submission blocks

When tail pushing, the AIL will just keep building a delayed write
buffer list as long as it doesn't get stuck or hit it's target. This
can result in a submission list tens of thousands of buffers long
and hence the IO submission process ends up blocking on full request
queues and being throttled to the rate at which IO completions
occur. On slow storage, this can result in the AIL being blocked for
seconds (or more) while it works through the submission list.

If the item that pins the tail of the log is at the end of the
submission list, then the transaction reservations also block
waiting for the tail of the log to move, and hence everything in the
filesystem stops until submissions and IO completes. Then the AIL
does another scan, creates another huge list, and blocks again....

The long lists have the advantage that sorting them allows good IO
merging to take place. This reduces the IO time for slow disks
because it reduces seeks, but then the long list still takes a long
time to work through. It's a catch-22 situation.

Several things to discuss:
	- shorter lists to reduce submission latency
	- 2-dimensional AIL lists
		- existing LSN ordered list
		- additional daddr ordered list
	- no sorting?
	- immediate dispatch rather than delwri lists?
	- IO submission in a separate thread

3. Can block on CIL pushing

When the AIL hits pinned items, it will eventually issue a log force
to try to get them unpinned and so make progress. It does this when
it reaches it's target and there pinned objects that couldn't be
pushed prior to reaching the target.

The log force, while not a synchronous log force, still ends up
blocking on the CIL push to flush everything to the iclogs. This,
unfortunately, typically results in the CIL push blocking waiting
for iclog submission and IO completion as the amount of dirty
metadata in the CIL is greater than iclogs * iclogs_size. Hence tail
pushing can actually block behind log writes rather than just
triggering a CIL push to occur in the background and continuing to
scan for items to push.

discussion points:
	- do we even care if we hit pinned buffers in the AIL push
	  anymore?
	- do we actually need to wait for the CIL flush in the
	  AIL code?
	- do any async log forces need to wait for the CIL flush?
	- do we need a separate "background CIL push" API for
	  the AIL to use?

4. Single threaded

Because the AIL is single threaded, there is a hard limit on the
amount of scanning and IO submission work it can do, and any place
it blocks takes away from it's capacity to flush dirty metadata to
disk.

Discussion points:
	- how long can we get away with efficiency improvements to
	  keep ahead of tens to hundreds of CPUs dirtying metadata?
	- how do break up multi-threaded pushing?
	- can we use multiple AILs (e.g. 1 per AG) and co-ordinate
	  flush targets across them?
	- if we can parallelise the AIL, how much do we care about
	  other improvements?

5. Does not control it's own activity

Typically the AIL is invoked to push on dirty metadata items when
the log reservations pass 75% of the size of the log. This works
great when the log is small because we hit the threshold quickly and
frequently and so the AIL is essentially being told to "just keep
working".

However, when we have large journals, 75% of the log can be 1.5GB of
space, and so it can take some time for the AIL to be woken to do
work. If lots of modifications are taking place, then the log can
run out of space before the AIL has taken any significant action to
start writeback, and because there are so many objects in the AIL it
struggles initially to reach a steady state where tail pushing keeps
enough log space free that incoming transaction reservations don't
have to use the slow wait path.

To make matters more complex, other events will cause the AIL to try
to push everything to disk. Every 30s the log worker comes along and
tells the AIL to "push everything". And every time memory reclaim
runs the shrinker tells the AIL to "push everything". This often
cause the AIL to walk into recently committed and relogged objects
to try to push them, often resulting finding pinned objects and
hence issuing frequent (and unnecessary) log forces to try to unpin
them.

Hence under heavy loads, the AIL really ends up just trying to push
everything all the time, and that ends up exacerbating the
inefficiencies and blocking issues that result from long IO lists
and log forces.

The question I'm asking here is whether this should be turned upside
down. That is, rather than have random callers tell the AIL what it
should do, the AIL itself should decide when are where to push
items. The AIL knows exactly how much space is active in the log -
the head and tail of the AIL define the on-disk log head and tail -
nad when combined with visibility of the amount of unused
reservation space we have a pretty good idea of:

	a) how quickly pushing and IO completions is moving the
	   tail forwards;
	b) how quicly the head is being moved forwards by the
	   workload committing transactions; and
	c) how much demand there is for log space by the workload
	   transaction reservation rate.

>From this information we should be able to determine how much work
we need to do, and how long we can leave items in the AIL before we
need to flush them. If there is no reservation pressure, we don't
need to rush to push the tail of the log. If we are moving the tail
of the log faster than the workload is moving the head, then we
don't have to flush the AIL flat out all the time. And so on.

So this is a discussion opportunity to design a better tail pushing
algorithm that is not dependent on being told what to do, but
instead use observation of the rate of change of critical metrics to
determine the rate at which we need to do push work....

-- 
Dave Chinner
david@xxxxxxxxxxxxx