Re: Options for improving f_fs.c performance?

Felipe Balbi <balbi@xxxxxxxxxx> · Fri, 21 Apr 2017 10:14:44 +0300

Hi Jerry,

Jerry Zhang <zhangjerry@xxxxxxxxxx> writes:
> (reposting this question in its own subject, instead of buried in a patchset)
>
> Android has migrated to using f_fs for the mtp function, but in the
> process we ran into some performance hiccups in comparison to f_mtp
> during bulk transfer of large files.
>
> Allocating memory at boot allowed f_mtp to use large buffers,
> around 250kB. Compared to using a 16 or 32 kB buffer, the larger buffer

you don't really need to allocate it at boot time :-) but fair
enough. Also, several (say 200) small buffers (PAGE_SIZE -> 4096 bytes)
would probably be better, but read on.

> is faster by almost 50%. f_fs can't use io sizes that large due
> to memory fragmentation, since there is a kmalloc on each call.

right

> One option here would be to add O_DIRECT as a flag to open
> (similar to what is discussed here:
> http://www.makelinux.net/ldd3/chp-15-sect-3).

We *could* use O_DIRECT, but I'm not sure that's what's causing
issues. I've been considering how to improve f_fs.c myself for the past
few months, but no time to do it yet.

> Then on each io, we could get_user_pages, vmap them together,

get_user_pages_fast() + sg_alloc_table_from_pages() would've been the
best. That way, there would be no copy_to/from_user needed, but that's
something that should happen later. There are lower hanging fruits in
f_fs.c IMO.

> and pass that address into ep_queue (for sync io only, non direct io
> does not change). This would would allow larger io sizes
> to be used and eliminates a user-kernel copy. Having the flag
> allows users to decide which endpoints they choose to use this way.

Correct :-)

> Does this sound like something that people want or find acceptable?

yes, it _is_ definitely acceptable. But f_fs.c needs a cleanup
first. Also, IMHO a better flag to implement would be O_NONBLOCK. Here's
how f_fs.c works today:

write(ep2, buf, length);
 ffs_epfile_write_iter()
  ffs_epfile_io()
   usb_ep_queue()
   wait_for_completion_interruptible()

That wait_for_completion_interruptible() is what's killing
performance. Each and every read/write waits for the USB side to
complete. It would've been much better to have something like:

if (flags & O_NONBLOCK)
	wait_for_completion_interruptible()

This would help the write() side of things by a long margin. For reads,
what we could do is have a kernel ring buffer with pre-allocated and
pre-queued usb_requests pointing to different pages in this ring
buffer. When a read() comes, instead of queueing the request right
there, we check if there's data in the internal ring buffer, if there
is, we just copy_to_user(), otherwise we either return 0 or return
-EAGAIN (depending on O_NONBLOCK).

After that, it becomes easy to implement poll() for endpoint files.

I really think this would give us better result in the end because the
real problem is with that wait_for_completion_interruptible(). If you
put a sniffer on the traffic, I'm sure you're gonna see several SOFs
going by without data just because of the latency of completing the
current request and, finally, returning control to userspace and so on.

> If there's other suggestions for fixing this issue, I'd be willing to help
> implement.

see above.

-- 
balbi
Attachment:
signature.asc

Description: PGP signature