On Sun, May 17, 2020 at 02:54:07PM -0700, Matthew Wilcox wrote:
> I'm currently looking at the truncate path for large pages and I suspect
> you have thought about the situation with block size > page size more
> than I have.
>
> Let's say you have a fs with 8kB blocks and a CPU with 4kB PAGE_SIZE.
> If you have a 32kB file with all its pages in the cache, and the user
> truncates it down to 10kB, should we leave three pages in the page cache
> or four?

Hmmm. I don't recall changing any of the truncate code in my prototypes,
and fsx worked just fine, so having truncate cull the pages beyond EOF
is fine. One prototype was posted here:

https://lore.kernel.org/linux-xfs/20181107063127.3902-1-david@xxxxxxxxxxxxx/

IMO, it doesn't matter if we don't zero entire blocks on truncate here.
The only place this zeroing matters is for mmap(), when it faults the
EOF page and the range beyond EOF is exposed to userspace. We need to
ensure that part of the page is zero, but we don't need to zero any
further into the block. If the app writes into that part of the page,
it gets zeroed at writeback time anyway, so the app data never hits
the disk.

Keep in mind that we can do partial block IO for the EOF write - we
don't need to write out the entire block, just the pages that are dirty
before/over EOF. Also remember that if the page beyond EOF is faulted,
even though the block is allocated, the app will be SIGBUS'd because
mmap() cannot extend files.

The fundamental architectural principle I've been working from is that
block size > page size is entirely invisible to the page cache. The mm
and page cache just work on pages like they always have, and the
filesystem just does extra mapping via "zero-around" where necessary to
ensure stale data is not exposed in newly allocated blocks and beyond
EOF.
This "filesystem does IO mapping, mm does page cache mapping"
architecture is one of the reasons we introduced the iomap
infrastructure in the first place - stuff like filesystem block size
should not be known or assumed -anywhere- in the page cache or mm/
subsystem. The filesystem should just be instantiating pages for IO
across the range that it requires to be read or written.

You can't do this when driving IO from the page cache - the filesystem
has to map the range for the IO before the page cache is instantiated
to know what needs to be done for any given IO. Hence if there isn't a
page in the page cache over an extent in the filesystem, the filesystem
knows exactly how that page should be instantiated for the operation
being performed.

Therefore it doesn't matter if entire pages beyond EOF are truncated
away without being zeroed - a request to read or write that section of
the block will be beyond EOF, and so the filesystem will zero
appropriately on read, or extend EOF on write, at page cache
instantiation time. That's what the IOMAP_F_ZERO_AROUND functionality
in the above patchset was for - making sure page cache instantiation
was done correctly for all the different IO operations so that stale
data was never exposed to userspace. 6 billion fsx ops tends to find
most cache instantiation problems in the IO path :)

> Three pages means (if the last page of the file is dirty) we'd need to
> add in either a freshly allocated zero page or the generic zero page to
> the bio when writing back the last page.
>
> Four pages mean we'll need to teach the truncate code to use the larger
> of page size and block size when deciding the boundary to truncate the
> page cache to, and zero the last page(s) of the file if needed.

No. It's beyond EOF, so we have a clear mechanism for
zero-on-instantiation behaviour. We don't have to touch truncate at
all.
> Depending on your answer, I may have some follow-up questions about how
> we handle reading a 10kB file with an 8kB block size on a 4kB PAGE_SIZE
> machine (whether we allocate 3 or 4 pages, and what we do about the
> extra known-to-be-zero bytes that will come from the device).

IOMAP_F_ZERO_AROUND handles all that.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx