On Tue 04-05-21 16:35:53, Junxiao Bi wrote: > On 5/4/21 2:02 AM, Jan Kara wrote: > > On Mon 03-05-21 10:25:31, Junxiao Bi wrote: > > > On 5/3/21 3:29 AM, Jan Kara wrote: > > > > On Fri 30-04-21 14:18:15, Junxiao Bi wrote: > > > > > On 4/30/21 5:47 AM, Jan Kara wrote: > > > > > > > > > > > On Thu 29-04-21 11:07:15, Junxiao Bi wrote: > > > > > > > On 4/29/21 10:14 AM, Andreas Gruenbacher wrote: > > > > > > > > On Tue, Apr 27, 2021 at 4:44 AM Junxiao Bi <junxiao.bi@xxxxxxxxxx> wrote: > > > > > > > > > When doing truncate/fallocate for some filesytem like ocfs2, it > > > > > > > > > will zero some pages that are out of inode size and then later > > > > > > > > > update the inode size, so it needs this api to writeback eof > > > > > > > > > pages. > > > > > > > > is this in reaction to Jan's "[PATCH 0/12 v4] fs: Hole punch vs page > > > > > > > > cache filling races" patch set [*]? It doesn't look like the kind of > > > > > > > > patch Christoph would be happy with. > > > > > > > Thank you for pointing the patch set. I think that is fixing a different > > > > > > > issue. > > > > > > > > > > > > > > The issue here is when extending file size with fallocate/truncate, if the > > > > > > > original inode size > > > > > > > > > > > > > > is in the middle of the last cluster block(1M), eof part will be zeroed with > > > > > > > buffer write first, > > > > > > > > > > > > > > and then new inode size is updated, so there is a window that dirty pages is > > > > > > > out of inode size, > > > > > > > > > > > > > > if writeback is kicked in, block_write_full_page will drop all those eof > > > > > > > pages. > > > > > > I agree that the buffers describing part of the cluster beyond i_size won't > > > > > > be written. But page cache will remain zeroed out so that is fine. So you > > > > > > only need to zero out the on disk contents. Since this is actually > > > > > > physically contiguous range of blocks why don't you just use > > > > > > sb_issue_zeroout() to zero out the tail of the cluster? It will be more > > > > > > efficient than going through the page cache and you also won't have to > > > > > > tweak block_write_full_page()... > > > > > Thanks for the review. > > > > > > > > > > The physical blocks to be zeroed were continuous only when sparse mode is > > > > > enabled, if sparse mode is disabled, unwritten extent was not supported for > > > > > ocfs2, then all the blocks to the new size will be zeroed by the buffer > > > > > write, since sb_issue_zeroout() will need waiting io done, there will be a > > > > > lot of delay when extending file size. Use writeback to flush async seemed > > > > > more efficient? > > > > It depends. Higher end storage (e.g. NVME or NAS, maybe some better SATA > > > > flash disks as well) do support WRITE_ZERO command so you don't actually > > > > have to write all those zeros. The storage will just internally mark all > > > > those blocks as having zeros. This is rather fast so I'd expect the overall > > > > result to be faster that zeroing page cache and then writing all those > > > > pages with zeroes on transaction commit. But I agree that for lower end > > > > storage this may be slower because of synchronous writing of zeroes. That > > > > being said your transaction commit has to write those zeroes anyway so the > > > > cost is only mostly shifted but it could still make a difference for some > > > > workloads. Not sure if that matters, that is your call I'd say. > > > Ocfs2 is mostly used with SAN, i don't think it's common for SAN storage to > > > support WRITE_ZERO command. > > > > > > Anything bad to add a new api to support eof writeback? > > OK, now that I reread the whole series you've posted I think I somewhat > > misunderstood your original problem and intention. So let's first settle > > on that. As far as I understand the problem happens when extending a file > > (either through truncate or through write beyond i_size). When that > > happens, we need to make sure that blocks (or their parts) that used to be > > above i_size and are not going to be after extension are zeroed out. > > Usually, for simple filesystems such as ext2, there is only one such block > > - the one straddling i_size - where we need to make sure this happens. And > > we achieve that by zeroing out tail of this block on writeout (in > > ->writepage() handler) and also by zeroing out tail of the block when > > reducing i_size (block_truncate_page() takes care of this for ext2). So the > > tail of this block is zeroed out on disk at all times and thus we have no > > problem when extending i_size. > > > > Now what I described doesn't work for OCFS2. As far as I understand the > > reason is that when block size is smaller than page size and OCFS2 uses > > cluster size larger than block size, the page straddling i_size can have > > also some buffers mapped (with underlying blocks allocated) that are fully > > outside of i_size. These blocks are never written because of how > > __block_write_full_page() currently behaves (never writes buffers fully > > beyond i_size) so even if you zero out page cache and dirty the page, > > racing writeback can clear dirty bits without writing those blocks and so > > they are not zeroed out on disk although we are about to expand i_size. > Correct. > > > > Did I understand the problem correctly? But what confuses me is that > > ocfs2_zero_extend_range() (ocfs2_write_zero_page() in fact) actually does > > extend i_size to contain the range it zeroes out while still holding the > > page lock so it should be protected against the race with writeback I > > outlined above. What am I missing? > > Thank you for pointing this. I didn't realize ocfs2_zero_extend() will > update inode size, > > with it, truncate to extend file will not suffer this issue. The original > issue happened with > > qemu that used the following fallocate to extend file size. The first > fallocate punched > > hole beyond the inode size(2276196352) but not update isize, the second one > updated > > isize, the first one will do some buffer write to zero eof blocks in > ocfs2_remove_inode_range > > ->ocfs2_zero_partial_clusters->ocfs2_zero_range_for_truncate. > > fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2276196352, > 65536) = 0 > fallocate(11, 0, 2276196352, 65536) = 0 OK, I see. And AFAICT it is not about writeback racing with the zeroing in ocfs2_zero_range_for_truncate() but rather the filemap_fdatawrite_range() there not writing out zeroed pages if they are beyond i_size. And honestly, rather than trying to extend block_write_full_page() for this odd corner case, I'd use sb_issue_zeroout() or code something similar to __blkdev_issue_zero_pages() inside OCFS2. Because making pages in the page cache beyond i_size work is always going to be fragile... Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR