Re: Buffered I/O broken on s390x with page faults disabled (gfs2)

On Wed, Mar 09, 2022 at 10:08:32PM +0100, Andreas Gruenbacher wrote:
> On Wed, Mar 9, 2022 at 8:08 PM Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> > On Wed, Mar 9, 2022 at 10:42 AM Andreas Gruenbacher <agruenba@xxxxxxxxxx> wrote:
> > > With a large enough buffer, a simple malloc() will return unmapped
> > > pages, and reading into such a buffer will result in fault-in.  So page
> > > faults during read() are actually pretty normal, and it's not the user's
> > > fault.
> >
> > Agreed. But that wasn't the case here:
> >
> > > In my test case, the buffer was pre-initialized with memset() to avoid
> > > those kinds of page faults, which meant that the page faults in
> > > gfs2_file_read_iter() only started to happen when we were out of memory.
> > > But that's not the common case.
> >
> > Exactly. I do not think this is a case that we should - or need to -
> > optimize for.
> >
> > And doing too much pre-faulting is actually counter-productive.
> >
> > > * Get rid of max_size: it really makes no sense to second-guess what the
> > >   caller needs.
> >
> > It's not about "what caller needs". It's literally about latency
> > issues. If you can force a busy loop in kernel space by having one
> > unmapped page and then do a 2GB read(), that's a *PROBLEM*.
> >
> > Now, we can try this thing, because I think we end up having other
> > size limitations in the IO subsystem that mean the filesystem
> > won't actually do that, but the moment I hear somebody talk about
> > latencies, that max_size goes back.
> 
> Thanks, this puts fault_in_safe_writeable() in line with
> fault_in_readable() and fault_in_writeable().
> 
> There currently are two users of
> fault_in_safe_writeable()/fault_in_iov_iter_writeable(): gfs2 and
> btrfs.
> In gfs2, we cap the size at BIO_MAX_VECS pages (256). I don't see an
> explicit cap in btrfs; adding Filipe.

On btrfs, for buffered writes we do have a cap (applied in btrfs_buffered_write()).
For buffered reads we don't have any control over that, as we use filemap_read().
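
Roughly, the shape of that cap is something like this (a simplified sketch,
not the actual btrfs code; 'from' is the source iov_iter and the 256-page
chunk is an arbitrary example):

    /*
     * Sketch: cap how much of the source buffer we fault in per loop
     * iteration, so a huge write() never tries to fault in gigabytes
     * of user memory in one go.
     */
    size_t chunk = min_t(size_t, iov_iter_count(from),
                         256UL << PAGE_SHIFT);

    if (fault_in_iov_iter_readable(from, chunk) == chunk) {
        /* Nothing could be faulted in, bail out with -EFAULT. */
        ret = -EFAULT;
        break;
    }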

For direct IO we don't have any cap; we try to fault in everything that's left.
However, we keep track of whether we are making any progress, and if we aren't,
we fall back to the buffered IO path. That prevents infinite or long loops.
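
The retry logic for direct IO reads looks roughly like this (simplified
sketch; btrfs_do_dio() is a stand-in name and error handling is trimmed):

    size_t prev_left = 0;
    ssize_t ret;

again:
    ret = btrfs_do_dio(iocb, to);
    if (ret == -EFAULT) {
        const size_t left = iov_iter_count(to);

        /* No progress since the previous attempt? Stop retrying. */
        if (left == prev_left)
            goto buffered;

        /* Fault in everything that's left (uncapped today) and retry. */
        fault_in_iov_iter_writeable(to, left);
        prev_left = left;
        goto again;
    }
    return ret;

buffered:
    /* ... fall back to the buffered IO path ... */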

There's really no good reason not to cap how much we try to fault in on the
direct IO paths. We should add a cap; faulting in everything that's left
probably has a negative performance impact for very large direct IO
reads/writes.
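
The change could then be as small as bounding the fault-in in that retry
loop, e.g. (hypothetical and untested; SZ_1M is an arbitrary cap):

    /* Fault in at most a fixed window instead of everything left. */
    fault_in_iov_iter_writeable(to, min_t(size_t, left, SZ_1M));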

Thanks.

> 
> Andreas
> 
