Re: [LSF/MM TOPIC][ATTEND] Improving async io, specifically io_submit latencies

Kent Overstreet <koverstreet@xxxxxxxxxx> · Thu, 28 Feb 2013 13:03:18 -0800

On Fri, Mar 01, 2013 at 01:37:55AM +0530, Ankit Jain wrote:
> Hi,
> 
> I'm interested in discussing how to improve async io api in the kernel,
> specifically io_submit latencies.
> 
> I am working on trying to make io_submit non-blocking. I had posted a
> patch[1] for this earlier on fsdevel and there was some discussion on
> it. I have made some of the improvements suggested there.
> 
> The approach attempted in that patch essentially tries to service the
> requests on a separate kernel thread. It was pointed out that this would
> need to ensure that there aren't any unknown task_struct references or
> dependencies under f_op->aio* which might get confused because of the
> kernel thread. Would this kind of full audit be enough, or would it be
> considered too fragile?

Was just talking about this.  Completely agreed that we need to do
something about it, but personally I don't think punting everything to
workqueue is a realistic solution.

One problem with the approach is that sometimes we _do_ need to block.
The primary reason we block in submit_bio if the request queue is too
full is that our current IO schedulers can't cope with unbounded queue
depth; other processes will be starved and see unbounded IO latencies.
This is even worse when a filesystem is involved and metadata operations
get stuck at the end of a huge queue.  By punting everything to
workqueue, all that's been accomplished is to hide the queueing and
shove it up a layer.
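
To make the "shove it up a layer" point concrete, here is a toy user-space
model, not kernel code. The function name, the tick-based timing and all the
numbers are made up for illustration. It compares a submitter that blocks
when the request queue is full with one that hands every request to a worker
and never blocks:

```python
def simulate(ticks, want_per_tick, done_per_tick, depth, punt):
    """Toy model of a bounded request queue (all parameters hypothetical).

    punt=False: the submitter blocks once the queue holds `depth` requests.
    punt=True:  the submitter never blocks; requests that do not fit wait
                in worker threads instead (the "backlog").
    """
    device = 0    # requests sitting in the device's request queue
    backlog = 0   # requests parked in blocked workers (punt mode only)
    accepted = 0  # requests the submitter has been allowed to issue
    for _ in range(ticks):
        # The device completes up to done_per_tick requests.
        device -= min(device, done_per_tick)
        if punt:
            # Everything is accepted immediately ...
            backlog += want_per_tick
            accepted += want_per_tick
            # ... but only as much as fits reaches the device queue.
            moved = min(backlog, depth - device)
            backlog -= moved
            device += moved
        else:
            # The submitter is throttled to the free space in the queue.
            moved = min(want_per_tick, depth - device)
            device += moved
            accepted += moved
    return {"device": device, "backlog": backlog, "accepted": accepted}


blocking = simulate(100, want_per_tick=10, done_per_tick=4, depth=128, punt=False)
punted = simulate(100, want_per_tick=10, done_per_tick=4, depth=128, punt=True)
print(blocking)
print(punted)
```

In both runs the device queue ends up full. In the punted run the excess has
not gone away; it sits in the backlog and keeps growing for as long as the
submitter outpaces the device, so anything queued behind it waits longer.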

A similar problem exists with kernel memory usage, but it's even worse
there because most users aren't using memcg. If we're short on memory,
the process doing aio really needs to be throttled in io_submit() ->
get_user_pages(); if it's punting everything to workqueue, now the other
processes may have to compete against 1000 worker threads calling
get_user_pages() simultaneously instead of just the process doing aio.

Also, punting everything to workqueue introduces a real performance
cost. Workqueues are fast, and it's not going to be noticed with hard
drives or even SATA SSDs - but high end SSDs are pushing over a million
iops these days and automatically punting everything to workqueue is
going to be unacceptable there.
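
A back-of-the-envelope way to see why the cost only matters at the high end.
The per-request handoff cost below (2 us) and the hard drive and SATA SSD
rates are placeholder assumptions, not measurements; only the "over a
million iops" figure comes from the text above:

```python
def punt_cpu_cores(iops, punt_overhead_us):
    """CPU time spent purely on the handoff, in cores' worth of work.

    iops:             requests per second the device can sustain
    punt_overhead_us: assumed extra CPU cost per request, in microseconds
    """
    return iops * punt_overhead_us / 1_000_000


ASSUMED_OVERHEAD_US = 2.0  # hypothetical; measure on real hardware

for name, iops in [("hard drive", 200),          # assumed rate
                   ("SATA SSD", 50_000),         # assumed rate
                   ("high end SSD", 1_000_000)]:
    print(name, punt_cpu_cores(iops, ASSUMED_OVERHEAD_US))
```

With these assumptions the handoff is a rounding error for the hard drive,
but at a million iops every microsecond of per-request overhead costs a full
core.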

That said, I think for filesystems blocking in get_blocks() another
kernel thread is probably the only practical solution.

What I'd really like is a way to spawn a worker thread automagically
only if and when we block. The thought of trying to implement that
scares me though, I'm pretty sure it'd require deep magic.
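
The automagic version is well beyond a short example. What can be sketched
is a much simpler approximation: try a non-blocking fast path first, and
start a worker only when that path reports it would block. This is a
user-space illustration under the assumption that the operation has a
non-blocking "try" variant; the submit() helper and the queue standing in
for the request queue are hypothetical:

```python
import queue
import threading


def submit(q, request, workers):
    """Submit inline if possible; start a worker only if we would block."""
    try:
        q.put_nowait(request)  # fast path: no thread handoff at all
        return "inline"
    except queue.Full:
        # Slow path: only now pay for a worker, which blocks on our behalf.
        t = threading.Thread(target=q.put, args=(request,), daemon=True)
        t.start()
        workers.append(t)
        return "punted"


q = queue.Queue(maxsize=2)  # stand-in for a bounded request queue
workers = []
print([submit(q, i, workers) for i in range(3)])
```

Only the request that finds the queue full pays for a thread. This does not
address the throttling concern, though: under sustained overload the punted
requests still pile up in workers.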

In the short term though, Ted implemented a hack in ext4 to pin all
metadata for a given file in memory, and bumping up the request queue
depth shouldn't be a big deal if that's an issue (at least
configurably).