On Fri, Aug 30, 2024 at 07:57:41AM -0400, Brian Foster wrote:
> On Thu, Aug 29, 2024 at 02:48:00PM -0700, Darrick J. Wong wrote:
> > On Thu, Aug 29, 2024 at 11:05:59AM -0400, Brian Foster wrote:
> > > On Wed, Aug 28, 2024 at 10:41:56PM -0700, Christoph Hellwig wrote:
> > > > On Wed, Aug 28, 2024 at 08:35:47AM -0400, Brian Foster wrote:
> > > > > Yeah, it was buried in a separate review around potentially
> > > > > killing off iomap_truncate_page():
> > > > >
> > > > > https://lore.kernel.org/linux-fsdevel/ZlxUpYvb9dlOHFR3@bfoster/
> > > > >
> > > > > The idea is pretty simple: use the same kind of check this
> > > > > patch does for doing a flush, but instead open code it and
> > > > > isolate it to iomap_truncate_page() so we can just default to
> > > > > doing the buffered write instead.
> > > > >
> > > > > Note that I don't think this replaces the need for patch 1,
> > > > > but it might arguably make further optimization of the flush
> > > > > kind of pointless, because I'm not sure zero range would ever
> > > > > be called from somewhere that doesn't flush already.
> > > > >
> > > > > The tradeoffs I can think of are that this might introduce
> > > > > some false positives, where an EOF folio might be dirty but a
> > > > > sub-folio size block backing EOF might be clean, and again
> > > > > that callers like truncate and write extension would need to
> > > > > both truncate the EOF page and zero the broader post-EOF
> > > > > range. Neither of those seems all that significant to me, but
> > > > > just my .02.
> > > >
> > > > Looking at that patch and your current series I kinda like not
> > > > having to deal with the dirty caches in the loop, and in fact
> > > > I'd also prefer not to do any writeback from the low-level zero
> > > > helpers if we can. That is not doing your patch 1, but instead
> > > > auditing the callers to see if any of them need the writeback
> > > > and documenting the expectation.
> >
> > I looked, and was pretty sure that XFS is the only one that has
> > that expectation.
> >
> > > I agree this seems better in some ways, but I don't like
> > > complicating or putting more responsibility on the callers. I
> > > think if we had a high level iomap function that wrapped a
> > > combination of this proposed variant of truncate_page() and
> > > zero_range() for general inode size changes, that might alleviate
> > > that concern.
> > >
> > > Otherwise IME even if we audited and fixed all callers today,
> > > over time we'll just reintroduce the same sorts of errors if the
> > > low level mechanisms aren't made to function correctly.
> >
> > Yeah. What /are/ the criteria for needing the flush and wait?
> > AFAICT, a filesystem only needs the flush if it's possible to have
> > dirty pagecache backed either by a hole or an unwritten extent,
> > right?
>
> Yeah, but this flush behavior shouldn't be a caller consideration at
> all. It's just an implementation detail. All the caller should care
> about is that zero range works As Expected (tm).
>
> The pre-iomap way of doing this in XFS was xfs_zero_eof() ->
> xfs_iozero(), which was an internally coded buffered write loop that
> wrote zeroes into pagecache. That was ultimately replaced with
> iomap_zero_range() with the same sort of usage expectations, but
> iomap_zero_range() just didn't work quite correctly in all cases.
>
> > I suppose we could amend the iomap ops so that filesystems could
> > signal that they allow either of those things, and then we wouldn't
> > have to query the mapping for filesystems that don't, right? IOWs,
> > one can opt out of safety features if there's no risk of garbage,
> > right?
>
> Not sure I parse... In general I think we could let ops signal
> whether they want certain checks. This is how I used the
> IOMAP_F_DIRTY_CACHE flag mentioned in the other thread. If the
> operation handler is interested in pagecache state, set an
> IOMAP_DIRTY_CACHE flag in ops to trigger a pre iomap_begin() check,
> and then set the corresponding _F_DIRTY_CACHE flag on the mapping if
> dirty. But I'm not sure if that's the same concept you're alluding
> to here.

Nope. I was thinking about adding a field to iomap_ops so that
filesystems could declare which types of mappings they could return:

	const struct iomap_ops xfs_buffered_write_iomap_ops = {
		.iomap_begin	= xfs_buffered_write_iomap_begin,
		.iomap_end	= xfs_buffered_write_iomap_end,
		/* xfs allows sparse holes and unwritten extents */
		.iomap_types	= (1U << IOMAP_UNWRITTEN) |
				  (1U << IOMAP_HOLE),
	};
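The zero range code could then skip the flush for filesystems that
never return holes or unwritten extents. Something like this totally
untested sketch, where the ->iomap_types field and the helper are
both made up:

	/*
	 * Dirty pagecache can only be newer than the mapping if the
	 * fs can hand back holes or unwritten extents, so only those
	 * filesystems need the flush before zeroing.
	 */
	static bool iomap_zero_needs_flush(const struct iomap_ops *ops)
	{
		return ops->iomap_types &
		       ((1U << IOMAP_HOLE) | (1U << IOMAP_UNWRITTEN));
	}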
But given your statement below about dirtying a post-EOF region for
which the filesystem does not allocate a block, I suspect we just have
to enable the flush thing for everyone and don't need the flags thing.

> > (Also: does xfs allow dirty page cache backed by a hole? I didn't
> > think that was possible.)
>
> It's a corner case. A mapped write can write to any portion of a
> folio so long as it starts within EOF. So if you have a mapped write
> that writes past EOF, there's no guarantee that range of the folio
> is mapped by blocks.
>
> That post-EOF part of the folio would be zeroed at writeback time,
> but that assumes i_size doesn't change before writeback. If it does,
> and the size change operation doesn't do the zeroing itself (enter
> zero range via write extension), then we end up with a dirty folio
> at least partially backed by a hole, with non-zero data within EOF.
> There's nothing written back to disk in this hole-backed example,
> but the pagecache is still inconsistent with what's on disk, and
> therefore I suspect data corruption is possible if the folio is
> redirtied before being reclaimed.

Ah. Yikes.

--D

> Brian
>
> > > > But please let Dave and Darrick chime in first before
> > > > investing any work into this.
> > >
> > > Based on the feedback to v2, it sounds like there's general
> > > consensus on the approach modulo some code factoring discussion.
> > > Unless there is objection, I think I'll stick with that for now
> > > for the sake of progress and keep this option in mind on the
> > > back burner. None of this is really that hard to change if we
> > > come up with something better.
> > >
> > > Brian
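P.S. To make sure I'm following the corner case, in userspace terms I
think the sequence is roughly the below. Untested sketch, error
handling omitted; it assumes a block size smaller than the page size
(say 1k blocks, 4k pages) so the post-EOF part of the dirty page can
sit over a hole:

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		unsigned char c;
		int fd = open("testfile", O_CREAT|O_RDWR|O_TRUNC, 0644);

		/* i_size = 1k, so only the first 1k block backs EOF */
		ftruncate(fd, 1024);

		/*
		 * Mapped write that starts within EOF but dirties the
		 * whole page, including bytes 1024-4095 past EOF that
		 * are not mapped by blocks.
		 */
		char *p = mmap(NULL, 4096, PROT_READ|PROT_WRITE,
				MAP_SHARED, fd, 0);
		memset(p, 0xab, 4096);

		/*
		 * Extend i_size before writeback runs. If the zero
		 * range done on the extension path trusts the (hole)
		 * mapping and skips the dirty folio, the 0xab bytes
		 * end up inside the new EOF.
		 */
		ftruncate(fd, 4096);

		pread(fd, &c, 1, 2048);
		printf("byte 2048: %#x (should be 0)\n", c);
		return 0;
	}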