Hi NeilBrown,

> though if your kernel is older than 6.3, that will be
>     redirty_for_writepage(wbc, page);

Things are looking good. I have run it on 15 machines for a good couple
of hours and I do not see the problem. Usually I would see it after 1-3
iterations, but now they are reaching 20 iterations without the problem.

Thank you for the fix.

Regards.
Jacek Tomaka

Subject: Re: NFS data corruption on congested network
Date: 2024-02-26 0:19
From: "NeilBrown" <neilb@xxxxxxx>
To: "Jacek Tomaka" <Jacek.Tomaka@xxxxxxxxx>
Cc: trond.myklebust@xxxxxxxxxxxxxxx; anna.schumaker@xxxxxxxxxx; linux-nfs@xxxxxxxxxxxxxxx

> On Mon, 26 Feb 2024, NeilBrown wrote:
>> On Fri, 23 Feb 2024, Jacek Tomaka wrote:
>>> Hello,
>>> I ran into an issue where an NFS file ends up being corrupted on
>>> disk. We started noticing it on certain, quite old hardware after
>>> upgrading the OS from CentOS 6 to Rocky 9.2. We also see it on
>>> Rocky 9.3, but not on 9.1.
>>>
>>> After some investigation we have reason to believe that the problem
>>> was introduced by the following commit:
>>> https://github.com/torvalds/linux/commit/6df25e58532be7a4cd6fb15bcd85805947402d91
>>
>> Thanks for the report.
>> Can you try a change to your kernel?
>>
>> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
>> index bb79d3a886ae..08a787147bd2 100644
>> --- a/fs/nfs/write.c
>> +++ b/fs/nfs/write.c
>> @@ -668,8 +668,10 @@ static int nfs_writepage_locked(struct folio *folio,
>>  	int err;
>>  
>>  	if (wbc->sync_mode == WB_SYNC_NONE &&
>> -	    NFS_SERVER(inode)->write_congested)
>> +	    NFS_SERVER(inode)->write_congested) {
>> +		folio_redirty_for_writepage(wbc, folio);
>>  		return AOP_WRITEPAGE_ACTIVATE;
>> +	}
>>  
>>  	nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGE);
>>  	nfs_pageio_init_write(&pgio, inode, 0, false,
>
> Actually this is only needed before Linux 6.8, as only nfs_writepage()
> can call nfs_writepage_locked() with a sync_mode of WB_SYNC_NONE.
> So v5.18 through v6.7 might need fixing.
>
> NeilBrown
>
>> though if your kernel is older than 6.3, that will be
>>     redirty_for_writepage(wbc, page);
>>
>> Thanks,
>> NeilBrown
>>
>>> We write a number of files on a single thread. Each file is up to
>>> 4GB. Before closing we call fdatasync. Sometimes a file ends up
>>> being corrupted. The corruption takes the form of a number of
>>> zero-filled pages (more than 3k pages in one case).
>>> When this happens, the file cannot be deleted from the client
>>> machine which created it, even when the process which wrote the
>>> file completed successfully.
>>>
>>> The machines have about 128GB of memory, I think, and probably a
>>> network that leaves something to be desired.
>>>
>>> My reproducer is currently tied to our internal software, but I
>>> suspect that setting the write_congested flag randomly should make
>>> it possible to reproduce the issue.
>>>
>>> Regards.
>>> Jacek Tomaka
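
On the point about kernels older than 6.3: there nfs_writepage_locked()
still operates on a struct page rather than a folio, and the page-based
counterpart of folio_redirty_for_writepage() is
redirty_page_for_writepage(). A sketch of the equivalent hunk under that
assumption (untested; the surrounding context differs between versions):

	if (wbc->sync_mode == WB_SYNC_NONE &&
	    NFS_SERVER(inode)->write_congested) {
		/* The VM clears the dirty bit before calling ->writepage,
		 * so if we decline to write the page here we must redirty
		 * it, or the still-unwritten data is treated as clean and
		 * can be reclaimed. */
		redirty_page_for_writepage(wbc, page);
		return AOP_WRITEPAGE_ACTIVATE;
	}

That lost redirty is consistent with the reported symptom: pages that
are reclaimed without ever being written back read back from the server
as zeroes.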
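
The write pattern from the original report is also simple to sketch, in
case someone wants a starting point for a standalone reproducer
(hypothetical code: the mount path, file size, and fill pattern are
placeholders, and actually hitting the bug still requires the client to
see the server as write-congested):

	/* repro.c - write a large patterned file over NFS with a final
	 * fdatasync(), then re-read it and flag any all-zero page. */
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	#define PAGE_SZ 4096
	#define NPAGES  (1024L * 1024L)		/* ~4GB of pages */

	int main(void)
	{
		const char *path = "/mnt/nfs/testfile";	/* placeholder */
		static char buf[PAGE_SZ], zero[PAGE_SZ];
		long i;
		int fd;

		fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
		if (fd < 0) { perror("open"); return 1; }
		memset(buf, 0xa5, PAGE_SZ);	/* non-zero pattern */
		for (i = 0; i < NPAGES; i++) {
			if (write(fd, buf, PAGE_SZ) != PAGE_SZ) {
				perror("write"); return 1;
			}
		}
		if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }
		close(fd);

		/* Scan for corruption: any all-zero page is a hit. */
		fd = open(path, O_RDONLY);
		if (fd < 0) { perror("open"); return 1; }
		for (i = 0; i < NPAGES; i++) {
			if (read(fd, buf, PAGE_SZ) != PAGE_SZ) {
				perror("read"); return 1;
			}
			if (memcmp(buf, zero, PAGE_SZ) == 0)
				printf("zero page at index %ld\n", i);
		}
		close(fd);
		return 0;
	}

Reading the file back on the same client can be fooled by the page
cache, so checking from a second client (or after a remount) gives a
more trustworthy result.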