Re: NFS data corruption on congested network

On Mon, 26 Feb 2024, Jacek Tomaka wrote:
> Hi NeilBrown, 
> 
> > though if your kernel is older than 6.3, that will be
> >          redirty_page_for_writepage(wbc, page);
> 
> Things are looking good. I have run it on 15 machines for a good couple of hours and I do not see the problem. Usually I would see it after 1-3 iterations, but now they are reaching 20 iterations without the problem.
> 
> Thank you for the fix.

Thanks for testing!  I'll get the fix submitted.

NeilBrown
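
For readers following the thread, here is a toy user-space model of why the redirty call in the patch quoted below matters. It is only a sketch, not kernel code: the struct, the function names and the TOY_WRITEPAGE_ACTIVATE constant are all invented for illustration. It assumes that the writeback path clears a page's dirty flag before calling the writepage routine, so a routine that skips the write under congestion has to mark the page dirty again or nothing will ever retry it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for AOP_WRITEPAGE_ACTIVATE; not the kernel value. */
#define TOY_WRITEPAGE_ACTIVATE 1

/* Invented model of a cached page, for illustration only. */
struct toy_page {
	bool dirty;     /* still queued for writeback */
	bool on_server; /* data has reached the server */
};

/*
 * Toy writepage. The caller has already cleared p->dirty (assumed to
 * mirror what the writeback path does before calling ->writepage).
 * Under congestion the write is skipped; redirty_on_skip selects the
 * patched behaviour (true) or the unpatched behaviour (false).
 */
static int toy_writepage(struct toy_page *p, bool congested,
			 bool redirty_on_skip)
{
	if (congested) {
		if (redirty_on_skip)
			p->dirty = true; /* keep the page queued for a retry */
		return TOY_WRITEPAGE_ACTIVATE;
	}
	p->on_server = true;
	return 0;
}

/* One writeback pass over n pages; returns how many writes were skipped. */
static size_t toy_writeback_pass(struct toy_page *pages, size_t n,
				 bool congested, bool redirty_on_skip)
{
	size_t skipped = 0;

	for (size_t i = 0; i < n; i++) {
		if (!pages[i].dirty)
			continue;
		pages[i].dirty = false; /* models clearing dirty before I/O */
		if (toy_writepage(&pages[i], congested, redirty_on_skip) ==
		    TOY_WRITEPAGE_ACTIVATE)
			skipped++;
	}
	return skipped;
}

/* Pages that are neither dirty nor on the server have been silently lost. */
static size_t toy_lost_pages(const struct toy_page *pages, size_t n)
{
	size_t lost = 0;

	for (size_t i = 0; i < n; i++)
		if (!pages[i].dirty && !pages[i].on_server)
			lost++;
	return lost;
}
```

In this model, a congested pass with redirty_on_skip set to false leaves every skipped page counted by toy_lost_pages(), and a later uncongested pass does not recover them. With redirty_on_skip set to true the same pages stay dirty and are written by the next uncongested pass.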


> Regards.
> Jacek Tomaka
> 
> Subject: Re: NFS data corruption on congested network
> Date: 2024-02-26 0:19
> From: "NeilBrown" <neilb@xxxxxxx>
> To: "Jacek Tomaka" <Jacek.Tomaka@xxxxxxxxx>; 
> Cc: trond.myklebust@xxxxxxxxxxxxxxx; anna.schumaker@xxxxxxxxxx; linux-nfs@xxxxxxxxxxxxxxx; 
> 
> > 
> >> On Mon, 26 Feb 2024, NeilBrown wrote:
> >> On Fri, 23 Feb 2024, Jacek Tomaka wrote:
> >>> Hello,
> >>> I ran into an issue where the NFS file ends up being corrupted on
> >>> disk. We started noticing it on certain, quite old hardware after
> >>> upgrading the OS from CentOS 6 to Rocky 9.2. We do see it on
> >>> Rocky 9.3 but not on 9.1.
> >>> 
> >>> After some investigation we have reason to believe that the
> >>> change was introduced by the following commit:
> >>>
> >>> https://github.com/torvalds/linux/commit/6df25e58532be7a4cd6fb15bcd85805947402d91
> >> 
> >> Thanks for the report.
> >> Can you try a change to your kernel?
> >> 
> >> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> >> index bb79d3a886ae..08a787147bd2 100644
> >> --- a/fs/nfs/write.c
> >> +++ b/fs/nfs/write.c
> >> @@ -668,8 +668,10 @@ static int nfs_writepage_locked(struct folio *folio,
> >>  	int err;
> >>  
> >>  	if (wbc->sync_mode == WB_SYNC_NONE &&
> >> -	    NFS_SERVER(inode)->write_congested)
> >> +	    NFS_SERVER(inode)->write_congested) {
> >> +		folio_redirty_for_writepage(wbc, folio);
> >>  		return AOP_WRITEPAGE_ACTIVATE;
> >> +	}
> >>  
> >>  	nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGE);
> >>  	nfs_pageio_init_write(&pgio, inode, 0, false,
> > 
> > Actually this is only needed before linux 6.8 as only nfs_writepage()
> > can call nfs_writepage_locked() with sync_mode of WB_SYNC_NONE.
> > So v5.18 through v6.7 might need fixing.
> > 
> > NeilBrown
> > 
> > 
> >> 
> >> 
> >> though if your kernel is older than 6.3, that will be
> >>          redirty_page_for_writepage(wbc, page);
> >> 
> >> Thanks,
> >> NeilBrown
> >> 
> >> 
> >>> 
> >>> We write a number of files on a single thread. Each file is up to
> >>> 4GB. Before closing we call fdatasync. Sometimes the file ends up
> >>> being corrupted. The corruption takes the form of a number (more
> >>> than 3k pages in one case) of zero-filled pages.
> >>> When this happens the file cannot be deleted from the client
> >>> machine which created the file, even when the process which wrote
> >>> the file completed successfully.
> >>> 
> >>> The machines have about 128GB of memory, I think, and probably a
> >>> network that leaves something to be desired.
> >>> 
> >>> My reproducer is currently tied to our internal software, but I
> >>> suspect that setting the write_congested flag randomly should
> >>> make it possible to reproduce the issue.
> >>> 
> >>> Regards.
> >>> Jacek Tomaka
> >>> 
> >> 
> >> 
> >> 
> > 
> > 
> > 
> 
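
The symptom in the original report (runs of zero-filled pages in a file that was written and synced successfully) could be checked for with something like the following. This is a minimal sketch, not the reporter's tooling: the function names and the fixed 4096-byte page size are assumptions, and a zero-filled block is only suspicious if the application never writes one on purpose. It is probably most useful when the file is read back from the server, for example on a different client, rather than from the writer's own page cache.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumed page size for the scan; real systems may differ. */
#define PAGE_SZ 4096

/* Return 1 if the first len bytes of buf are all zero, else 0. */
static int is_zero_block(const unsigned char *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (buf[i] != 0)
			return 0;
	return 1;
}

/*
 * Count the zero-filled PAGE_SZ blocks in the file at path.
 * A short final block is checked over the bytes actually read.
 * Returns the count, or -1 if the file cannot be opened or read.
 */
static long count_zero_pages(const char *path)
{
	unsigned char buf[PAGE_SZ];
	long zero_pages = 0;
	size_t n;
	FILE *f = fopen(path, "rb");

	if (!f)
		return -1;
	while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
		if (is_zero_block(buf, n))
			zero_pages++;
	}
	if (ferror(f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return zero_pages;
}
```

A caller would pass the path of a freshly written file to count_zero_pages() and treat any non-zero result as a reason to inspect the file, given the caveats above.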





