Re: [PATCH v3 56/68] afs: Handle len being extending over page end in write_begin/write_end

Matthew Wilcox <willy@xxxxxxxxxxxxx> · Thu, 16 Dec 2021 20:20:35 +0000

On Thu, Dec 16, 2021 at 11:46:18AM -0800, Linus Torvalds wrote:
> On Thu, Dec 16, 2021 at 11:28 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> >
> > Since ->write_begin is the place where we actually create folios, it
> > needs to know what size folio to create.  Unless you'd rather we do
> > something to actually create the folio before calling ->write_begin?
> 
> I don't think we can create a folio before that, because the
> filesystem may not even want a folio (think persistent memory or
> whatever).
> 
> Honestly, I think you need to describe more what you actually want to
> happen. Because generic_perform_write() has already decided to use a
> PAGE_SIZE by the time write_begin() is called,
> 
> Right now the world order is "we chunk things by PAGE_SIZE", and
> that's just how it is.

Right.  And we could leave it like that.  There's a huge amount of win
that comes from just creating large folios as part of readahead, and
anything we do for writes is going to be a smaller win.

That said, I would like it if a program which does:

fd = creat("foo", 0644);
write(fd, buf, 64 * 1024);
close(fd);

used a single 64kB folio.

> I can see other options - like the filesystem passing in the chunk
> size when it calls generic_perform_write().

I'm hoping to avoid that.  Ideally filesystems don't know what the
"chunk size" is that's being used; they'll see a mixture of sizes
being used for any given file (potentially).  Depends on access
patterns, availability of higher-order memory, etc.

> Or we make the rule be that ->write_begin() simply always is given the
> whole area, and the filesystem can decide how it wants to chunk things
> up, and return the size of the write chunk in the status (rather than
> the current "success or error").

We do need to be slightly more limiting than "always gets the whole
area", because we do that fault_in_iov_iter_readable() call first,
and if the user has been mean and asked to write() 2GB of memory on
a (virtual) machine with 256MB, I'd prefer it if we didn't swap our way
through 2GB of address space before calling into ->write_begin.
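
To make the shape of that concrete, here is a rough user-space sketch of
a write loop in which ->write_begin is offered a capped length and
reports how much it will take.  Everything in it is hypothetical:
SKETCH_MAX_CHUNK, sketch_write_begin() and sketch_perform_write() are
illustrative names, not kernel interfaces, and the 2MB cap and the
64kB folio are arbitrary assumptions.

```c
#include <stddef.h>

/* Hypothetical cap on how much is offered (and faulted in) per pass. */
#define SKETCH_MAX_CHUNK ((size_t)2 * 1024 * 1024)

/*
 * Hypothetical stand-in for ->write_begin(): it is offered `len` bytes
 * at `pos` and returns how many bytes it will take in this pass.  This
 * one pretends it always gets a 64kB-aligned 64kB folio.
 */
static size_t sketch_write_begin(unsigned long long pos, size_t len)
{
	const size_t folio_size = 64 * 1024;
	size_t room = folio_size - (size_t)(pos % folio_size);

	return len < room ? len : room;
}

/* Returns the number of sketch_write_begin() calls the write needed. */
static unsigned int sketch_perform_write(unsigned long long pos, size_t total)
{
	unsigned int calls = 0;

	while (total > 0) {
		/* Never offer more than the cap, however large the write is. */
		size_t offered = total < SKETCH_MAX_CHUNK ? total : SKETCH_MAX_CHUNK;
		/* Real code would fault in only `offered` bytes of user memory here. */
		size_t taken = sketch_write_begin(pos, offered);

		pos += taken;
		total -= taken;
		calls++;
	}
	return calls;
}
```

With these assumptions an aligned 64kB write completes in one pass,
while the same write starting 4kB into the file needs two.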

> But at no point will this *EVER* be a "afs will limit the size to the
> folio size" issue. Nothing like that will ever make sense. Allowing
> bigger chunks will not be about any fscache issues, it will be about
> every single filesystem that uses generic_perform_write().

I agree that there should be nothing here that is specific to fscache.
David has in the past tried to convince me that he should always get
256kB folios, and I've done my best to explain that the MM just isn't
going to make that guarantee.

That said, this patch seems to be doing the right thing; it passes
the entire length into netfs_write_begin(), and then truncates the
length to stop at the end of the folio that it got back.
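
The truncation step amounts to something like the following minimal
sketch.  clamp_to_folio() is a hypothetical helper written for
illustration, not the code in the patch; it assumes the caller already
knows the file position and size of whatever folio it was handed, and
that pos lies inside that folio.

```c
#include <stddef.h>

/*
 * Hypothetical helper: given a write of `len` bytes at file position
 * `pos`, and a folio covering [folio_pos, folio_pos + folio_size),
 * return how many of those bytes land inside the folio.
 * Assumes folio_pos <= pos < folio_pos + folio_size.
 */
static size_t clamp_to_folio(unsigned long long pos, size_t len,
			     unsigned long long folio_pos, size_t folio_size)
{
	size_t offset = (size_t)(pos - folio_pos);	/* where pos sits in the folio */
	size_t room = folio_size - offset;		/* bytes left before folio end */

	return len < room ? len : room;
}
```

The caller does not pick the folio size; it only reacts to whatever
size came back, so the same code copes with 4kB and 64kB folios alike.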