On Tue, Jul 24, 2012 at 06:04:23PM +0530, Rajat Sharma wrote:
> >
> > Currently, io_submit tries to execute the io requests on the
> > same thread, which could block because of various reasons (eg.
> > allocation of disk blocks). So, essentially, io_submit ends
> > up being a blocking call.
>
> Ideally filesystem should take care of it e.g. by deferring such time
> consuming allocations and return -EIOCBQUEUED immediately. But have
> you seen such cases?

Oh, it happens all the time if you are using AIO.  If the file system
needs to read or write any metadata block, AIO can become distinctly
non-"A".  The workaround that I've chosen is to create a way to cache
the information needed for the bmap() operation, triggered via an
ioctl() issued at open time, so that this is not an issue.  But that
only works if the file is pre-allocated and there is no need to do
any block allocations.

It's all very well and good to say, "the file system should handle
it", but that just pushes the problem onto the file system.  And since
you potentially need to issue block I/O requests, which you can't do
from an interrupt context (i.e., a block I/O completion handler), you
really do need to create a workqueue in order to make things work.  If
you do it in the fs/direct-io.c layer, at least that way you can solve
the problem once for all file systems....

> With lots of application threads firing continuous IOs, workqueue
> threads might become bottleneck and you might have to eventually
> develop a priority scheduling. This workqueue was originally designed
> for IO retries which is an error path, now consumers of workqueue
> might easily increase by 100x.

Yes, you definitely need to throttle how many AIOs are allowed to be
outstanding, either globally, or on a per-superblock/process/user/cgroup
basis, and return EAGAIN if there are too many outstanding requests.
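To make the throttling idea concrete, here is a minimal userspace sketch
of a counter that caps outstanding requests and reports -EAGAIN once the
cap is hit.  This is only an illustration, not code from any patch: the
struct aio_throttle type and the aio_throttle_*() helpers are
hypothetical names, the limit is arbitrary, and a real kernel version
would use the kernel's own atomics and keep one such counter per
superblock, process, user, or cgroup.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical throttle state; one instance per accounting domain. */
struct aio_throttle {
	atomic_int outstanding;	/* requests submitted but not yet completed */
	int limit;		/* assumed cap, chosen by the caller */
};

void aio_throttle_init(struct aio_throttle *t, int limit)
{
	atomic_init(&t->outstanding, 0);
	t->limit = limit;
}

/*
 * Reserve a slot for one request.  Returns 0 on success, or -EAGAIN if
 * the limit has already been reached, so the submitter can back off
 * instead of queueing unbounded work.
 */
int aio_throttle_acquire(struct aio_throttle *t)
{
	int cur = atomic_load(&t->outstanding);

	do {
		if (cur >= t->limit)
			return -EAGAIN;
		/* On failure the CAS reloads cur, and the limit is rechecked. */
	} while (!atomic_compare_exchange_weak(&t->outstanding, &cur, cur + 1));

	return 0;
}

/* Give the slot back; meant to be called from the completion path. */
void aio_throttle_release(struct aio_throttle *t)
{
	int prev = atomic_fetch_sub(&t->outstanding, 1);

	assert(prev > 0);	/* a release without a matching acquire is a bug */
	(void)prev;
}
```

The compare-and-swap loop keeps the check and the increment atomic, so
concurrent submitters cannot overshoot the limit between the test and
the update.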

Speaking of cgroups, one of the other challenges with running the AIO
out of a workqueue is trying to respect cgroup restrictions.  In
particular, this applies to the io-throttle cgroup (which is needed to
provide Proportional I/O support), but also to the memory cgroup.

All of these complications are why I decided to simply go with the
"pin metadata" approach, since I didn't need to worry (at least
initially) about the allocating write case.  (These patches to ext4
haven't yet been published upstream, mainly because they need a lot of
cleanup work and I haven't had time to do that cleanup; my intention is
to get the "big extents" patchset upstream, though.)

						- Ted