Hi Brian,
Existing fuse implementation always processes direct IO
synchronously: it
submits next request to userspace fuse only when previous is
completed. This
is suboptimal because: 1) libaio DIO works in blocking way; 2)
userspace fuse
can't achieve parallelism processing several requests simultaneously
(e.g.
in case of distributed network storage); 3) userspace fuse can't merge
requests before passing it to actual storage.
The idea of the patch-set is to submit fuse requests in non-blocking way
(where it's possible) and either return -EIOCBQUEUED or wait for their
completion synchronously. The patch-set to be applied on top of
for-next of
Miklos' git repo.
To estimate performance improvement I used slightly modified fusexmp
over
tmpfs (clearing O_DIRECT bit from fi->flags in xmp_open). For
synchronous
operations I used 'dd' like this:
dd of=/dev/null if=/fuse/mnt/file bs=2M count=256 iflag=direct
dd if=/dev/zero of=/fuse/mnt/file bs=2M count=256 oflag=direct
conv=notrunc
For AIO I used 'aio-stress' like this:
aio-stress -s 512 -a 4 -b 1 -c 1 -O -o 1 /fuse/mnt/file
aio-stress -s 512 -a 4 -b 1 -c 1 -O -o 0 /fuse/mnt/file
The throughput on some commodity (rather feeble) server was (in MB/sec):
original / patched
dd reads: ~322 / ~382
dd writes: ~277 / ~288
aio reads: ~380 / ~459
aio writes: ~319 / ~353
Thanks for posting this, first of all. I was reading through the code
and found some of the logic a bit hard to follow, particularly due to
controlling the behavior of underlying read/write functions based on the
'async' kiocb. For example: conditionally updating inode size in a
couple different places; submitting sync requests async and waiting on
them (causing the need to duplicate the update size call); and the
existence of a kiocb called 'async' (when a kiocb can be either sync or
async).
Anyways, I couldn't quite put my finger on how to potentially clean that
up without messing around with the code, so I've inlined the diff that I
came up with that (IMO) cleans things up a bit. This should apply on top
of your set and makes the following tweaks:
- Always pass a kiocb down through __fuse_direct_write()/read() (instead
of struct file).
- Trigger the "async" behavior in fuse_direct_io() and
fuse_send_read/write() based on the sync/async nature of the kiocb. This
means that sync requests and async requests are sent as such based on
the nature of the kiocb (as opposed to whether a kiocb exists).
- Update the various other callers of __fuse_direct_write()/read and
fuse_direct_io() to create and pass a sync kiocb where necessary.
- Use the same approach in fuse_direct_IO() to turn an extending, async
write into a sync write.
The tradeoff with this approach is that we slightly uglify the callers
that have to now create a sync kiocb. That said, I think some of those
codepaths (i.e., fuse_file_aio_write()->fuse_perform_write()->...) could
now just pass the kiocb from the vfs straight down. Thoughts?
Thanks a lot for review and further efforts. Your clean-up patch looks
fine, but it constrains the scope of patch-set too much: ordinary
synchronous dio-s (generated by read(2) and write(2)) cease to be
optimized. I think such a loss doesn't justify the clean-up. To be sure
that we are on the same page, I'll try to explain this optimization in
more details:
Imagine direct read or write of 1MB size. Having
FUSE_MAX_PAGES_PER_REQ==32, this can be fulfilled by 8 fuse requests.
Generic linux-kernel (e.g. generic_file_aio_read) passes synchronous
iocb to fuse_direct_IO(). From fuse perspective, it only means that the
caller expects to get control back when that 1MB request is completed.
It's up to fuse how to process that 1MB request.
Existing fuse implementation (w/o my patch-set) processes it like this:
- allocate, fill and send 1st fuse request, wait for its completion
(ACK-ing by userspace);
- allocate, fill and send 2nd fuse request, wait for its completion
(ACK-ing by userspace);
- ...
- allocate, fill and send 8th fuse request, wait for its completion
(ACK-ing by userspace);
With my patch-set applied, it becomes like this:
- allocate, fill and send 1st fuse request;
- allocate, fill and send 2nd fuse request;
- ...
- allocate, fill and send 8th fuse request;
- wait till all of them are completed (ACK-ed by userspace)
This is significant optimization for non-trivial fuse userspace
implementations because: 1) fuse requests might be processed
concurrently; 2) fuse requests might be merged before processing.
Now, let's proceed to technical details... Your idea to use iocb
everywhere instead of file and rely on its type (sync vs. async) is
cool. I like it! Unfortunately, to make it usable in both libaio and
sync dio cases we'll need to put our hands on generic linux-kernel code
as well. Let me explain why:
Currently, iocb has only one binary attribute - it's either sync or
async. aio_complete() derives its ancestry (libaio vs. read(2)/write(2))
from its sync/async type:
if (is_sync_kiocb(iocb)) {
BUG_ON(iocb->ki_users != 1);
iocb->ki_user_data = res;
iocb->ki_users = 0;
wake_up_process(iocb->ki_obj.tsk);
return 1;
}
In the other words, if it's sync, it's enough to wake up someone in
kernel (who called wait_on_sync_kiocb()) and return. Otherwise,
aio_complete() performs some steps to wake up user (who called
io_getevents()). What we need for fuse is another binary attribute: iocb
was generated by libaio vs. read(2)/write(2). Then we could use
aio_complete() and wait_on_sync_kiocb() for any iocb which was not
generated by libaio. E.g. to process sync dio in async way, we'd
allocate iocb in fuse_direct_IO on stack (as you suggested), but
initialize it as async iocb and pass it to __fuse_direct_read/write,
then call wait_on_sync_kiocb(), not to return -EIOCBQUEUED.
The problem is that FUSE tends to be the only user of this 'feature'.
I'm considering one-line change of aio_complete():
- if (is_sync_kiocb(iocb)) {
+ if (!iocb->ki_ctx) {
Though, not sure whether Miklos, Al Viro, Benjamin and others will
accept it... Let's try if you're OK about this approach. At least this
will cover all fuse dio use-cases (including libaio extending file) and
makes fuse code significantly cleaner.
Thanks,
Maxim
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html