On Sun, May 17, 2020 at 02:54:07PM -0700, Matthew Wilcox wrote:
> I'm currently looking at the truncate path for large pages and I suspect
> you have thought about the situation with block size > page size more
> than I have.
>
> Let's say you have a fs with 8kB blocks and a CPU with 4kB PAGE_SIZE.
> If you have a 32kB file with all its pages in the cache, and the user
> truncates it down to 10kB, should we leave three pages in the page cache
> or four?

Hmmm. I don't recall changing any of the truncate code in my prototypes,
and fsx worked just fine, so having truncate cull the pages beyond EOF
is fine. One prototype was posted here:

https://lore.kernel.org/linux-xfs/20181107063127.3902-1-david@xxxxxxxxxxxxx/

IMO, it doesn't matter if we don't zero entire blocks on truncate here.
The only place this zeroing matters is for mmap(), when it faults the
EOF page and the range beyond EOF is exposed to userspace. We need to
ensure that part of the page is zero, but we don't need to zero any
further into the block. If the app writes into that part of the page,
it gets zeroed at writeback time anyway, so the app data never hits
the disk.

Keep in mind that we can do partial block IO for the EOF write - we
don't need to write out the entire block, just the pages that are dirty
before/over EOF. Also remember that if the page beyond EOF is faulted,
even though the block is allocated, the app will be SIGBUS'd because
mmap() cannot extend files.

The fundamental architectural principle I've been working from is that
block size > page size is entirely invisible to the page cache. The mm
and page cache just work on pages like they always have, and the
filesystem just does extra mapping via "zero-around" where necessary to
ensure stale data is not exposed in newly allocated blocks and beyond
EOF.
This "filesystem does IO mapping, mm does page cache mapping"
architecture is one of the reasons we introduced the iomap
infrastructure in the first place - stuff like filesystem block size
should not be known or assumed -anywhere- in the page cache or mm/
subsystem. The filesystem should just be instantiating pages for IO
across the range that it requires to be read or written.

You can't do this when driving IO from the page cache - the filesystem
has to map the range for the IO before the page cache is instantiated
to know what needs to be done for any given IO. Hence if there isn't a
page in the page cache over an extent in the filesystem, the filesystem
knows exactly how that page should be instantiated for the operation
being performed.

Therefore it doesn't matter if entire pages beyond EOF are truncated
away without being zeroed - a request to read or write that section of
the block will be beyond EOF, and so the filesystem will zero
appropriately on read, or extend EOF on write, at page cache
instantiation time. That's what the IOMAP_F_ZERO_AROUND functionality
in the above patchset was for - making sure page cache instantiation
was done correctly for all the different IO operations so that stale
data was never exposed to userspace. 6 billion fsx ops tends to find
most cache instantiation problems in the IO path :)

> Three pages means (if the last page of the file is dirty) we'd need to
> add in either a freshly allocated zero page or the generic zero page to
> the bio when writing back the last page.
>
> Four pages mean we'll need to teach the truncate code to use the larger
> of page size and block size when deciding the boundary to truncate the
> page cache to, and zero the last page(s) of the file if needed.

No. It's beyond EOF, so we have a clear mechanism for
zero-on-instantiation behaviour. We don't have to touch truncate at
all.
> Depending on your answer, I may have some follow-up questions about how
> we handle reading a 10kB file with an 8kB block size on a 4kB PAGE_SIZE
> machine (whether we allocate 3 or 4 pages, and what we do about the
> extra known-to-be-zero bytes that will come from the device).

IOMAP_F_ZERO_AROUND handles all that.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx