On Thu, Jul 30, 2020 at 09:05:03AM +1000, Dave Chinner wrote:
> On Wed, Jul 29, 2020 at 07:50:35PM +0100, Matthew Wilcox wrote:
> > I had a bit of a misunderstanding.  Let's discard that proposal
> > and discuss what we want to optimise for, ignoring THPs.  We don't
> > need to track any per-block state, of course.  We could implement
> > __iomap_write_begin() by reading in the entire page (skipping the last
> > few blocks if they lie outside i_size, of course) and then marking the
> > entire page Uptodate.
> 
> __iomap_write_begin() already does read-around for sub-page writes.
> And, if necessary, it does zeroing of unwritten extents, newly
> allocated ranges and ranges beyond EOF and marks them uptodate
> appropriately.

But it doesn't read in the entire page, just the blocks in the page
which will be touched by the write.

> > Buffer heads track several bits of information about each block:
> >  - Uptodate (contents of cache at least as recent as storage)
> >  - Dirty (contents of cache more recent than storage)
> >  - ... er, I think all the rest are irrelevant for iomap
> 
> Yes, it is. And we optimised out the dirty tracking by just using
> the single dirty bit in the page.
> 
> > I think I just talked myself into what you were arguing for -- that we
> > change the ->uptodate bit array into a ->dirty bit array.
> > 
> > That implies that we lose the current optimisation that we can write at
> > a blocksize alignment into the page cache and not read from storage.
> 
> iomap does not do that. It always reads the entire page in, even for
> block aligned sub-page writes. IIRC, we even allocate on page
> granularity for sub-page block size filesystems so that we fill
> holes and can do full page writes in writeback because this tends to
> significantly reduce worst case file fragmentation for random sparse
> writes...

That isn't what __iomap_write_begin() does today.  Consider a 1kB
block size filesystem and a 4kB page size host.
Trace through writing 1kB at a 2kB offset into the file.

We call iomap_write_begin() with pos of 2048, len 1024.
Allocate a new page.
Call __iomap_write_begin(2048, 1024)
  block_start = 2048
  block_end = 3072
  iomap_adjust_read_range() sets poff and plen to 2048 & 1024
  from == 2048, to == 3072, so we continue
  block_start + plen == block_end, so the loop terminates
We didn't read anything.

> Modern SSDs really don't care about runs of zeros being written.
> They compress and/or deduplicate such things on the fly as part of
> their internal write-amplification reduction strategies. Pretty much
> all SSDs on the market these days - consumer or enterprise - do this
> sort of thing in their FTLs and so writing more than the exact
> changed data really doesn't make a difference.

You're clearly talking to different SSD people than I am.