On Fri, Mar 01, 2013 at 01:37:55AM +0530, Ankit Jain wrote:
> Hi,
>
> I'm interested in discussing how to improve async io api in the kernel,
> specifically io_submit latencies.
>
> I am working on trying to make io_submit non-blocking. I had posted a
> patch[1] for this earlier on fsdevel and there was some discussion on
> it. I have made some of the improvements suggested there.
>
> The approach attempted in that patch essentially tries to service the
> requests on a separate kernel thread. It was pointed out that this would
> need to ensure that there aren't any unknown task_struct references or
> dependencies under f_op->aio* which might get confused because of the
> kernel thread. Would this kinda full audit be enough or would it be
> considered too fragile?

Was just talking about this. Completely agreed that we need to do
something about it, but personally I don't think punting everything to
workqueue is a realistic solution.

One problem with the approach is that sometimes we _do_ need to block.
The primary reason we block in submit_bio() when the request queue is
too full is that our current IO schedulers can't cope with unbounded
queue depth; other processes will be starved and see unbounded IO
latencies.

This is even worse when a filesystem is involved and metadata operations
get stuck at the end of a huge queue. By punting everything to
workqueue, all that's been accomplished is to hide the queueing and
shove it up a layer.

A similar problem exists with kernel memory usage, but it's even worse
there because most users aren't using memcg. If we're short on memory,
the process doing aio really needs to be throttled in io_submit() ->
get_user_pages(); if it's punting everything to workqueue, now the other
processes may have to compete against 1000 worker threads calling
get_user_pages() simultaneously instead of just the process doing aio.

Also, punting everything to workqueue introduces a real performance cost.
Workqueues are fast, and it's not going to be noticed with hard drives
or even SATA SSDs, but high-end SSDs are pushing over a million IOPS
these days, and automatically punting everything to workqueue is going
to be unacceptable there.

That said, I think for filesystems blocking in get_blocks() another
kernel thread is probably the only practical solution.

What I'd really like is a way to spawn a worker thread automagically
only if and when we block. The thought of trying to implement that
scares me though, I'm pretty sure it'd require deep magic.

In the short term though, Ted implemented a hack in ext4 to pin all
metadata for a given file in memory, and bumping up the request queue
depth shouldn't be a big deal if that's an issue (at least
configurably).
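
To make the "hide the queueing and shove it up a layer" point concrete,
here's a toy model in Python. It is purely illustrative and not kernel
code. The assumptions are all mine: a single FIFO queue, a device that
completes one request per tick, one heavy submitter issuing `burst`
requests as fast as it is allowed to, and one other process (the
"victim") submitting a single request at tick `victim_at`. Passing
`queue_depth=None` stands in for a submit path that never blocks; a
number stands in for blocking once that many requests are queued. All
names are hypothetical.

```python
from collections import deque


def victim_latency(burst, victim_at, queue_depth=None):
    """Ticks the victim's single request waits before it completes.

    Toy model only: one FIFO, one completion per tick, and the victim
    is assumed to get first pick of a free slot when the queue is bounded.
    """
    queue = deque()
    remaining = burst          # heavy submitter's requests not yet queued
    victim_queued = False
    tick = 0
    while True:
        # The device completes the request at the head of the queue.
        if queue:
            done = queue.popleft()
            if done == "victim":
                return tick - victim_at

        def has_room():
            return queue_depth is None or len(queue) < queue_depth

        # The victim submits once, as soon as it has arrived and fits.
        if not victim_queued and tick >= victim_at and has_room():
            queue.append("victim")
            victim_queued = True

        # The heavy submitter fills whatever room is left; with a bounded
        # queue this is where it would block in submit.
        while remaining and has_room():
            queue.append("heavy")
            remaining -= 1

        tick += 1


if __name__ == "__main__":
    print("bounded  :", victim_latency(1000, 10, queue_depth=32))
    print("unbounded:", victim_latency(1000, 10, queue_depth=None))
```

In this model the victim's wait is capped by the queue depth in the
bounded case and grows with the heavy submitter's backlog in the
unbounded case. Real IO schedulers aren't plain FIFOs, so take the
shape of the result, not the numbers.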