On Thu, 2017-03-09 at 10:04 +0100, Jan Kara wrote: > On Wed 08-03-17 21:57:25, Ted Tso wrote: > > On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote: > > > On a more general note (DAX is actually fine here), I find the current > > > practice of clearing page dirty bits on error and reporting it just once > > > problematic. It keeps the system running but data is lost and possibly > > > without getting the error anywhere where it is useful. We get away with > > > this because it is a rare event but it seems like a problematic behavior. > > > But this is more for the discussion at LSF. > > > > I'm actually running into this in the last day or two because some MM > > folks at $WORK have been trying to push hard for GFP_NOFS removal in > > ext4 (at least when we are holding some mutex/semaphore like > > i_data_sem) because otherwise it's possible for the OOM killer to be > > unable to kill processes because they are holding on to locks that > > ext4 is holding. > > > > I've done some initial investigation, and while it's not that hard to > > remove GFP_NOFS from certain parts of the writepages() codepath (which > > is where we had been are running into problems), a really, REALLY big > > problem is if any_filesystem->writepages() returns ENOMEM, it causes > > silent data loss, because the pages are marked clean, and so data > > written using buffered writeback goes *poof*. > > > > I confirmed this by creating a test kernel with a simple patch such > > that if the ext4 file system is mounted with -o debug, there was a 1 > > in 16 chance that ext4_writepages will immediately return with ENOMEM > > (and printk the inode number, so I knew which inodes had gotten the > > ENOMEM treatment). The result was **NOT** pretty. > > > > What I think we should strongly consider is at the very least, special > > case ENOMEM being returned by writepages() during background > > writeback, and *not* mark the pages clean, and make sure the inode > > stays on the dirty inode list, so we can retry the write later. This > > is especially important since the process that issued the write may > > have gone away, so there might not even be a userspace process to > > complain to. By converting certain page allocations (most notably in > > ext4_mb_load_buddy) from GFP_NOFS to GFP_KMALLOC, this allows us to > > release the i_data_sem lock and return an error. This should allow > > allow the OOM killer to do its dirty deed, and hopefully we can retry > > the writepages() for that inode later. > > Yeah, so if we can hope the error is transient, keeping pages dirty and > retrying the write is definitely better option. For start we can say that > ENOMEM, EINTR, EAGAIN, ENOSPC errors are transient, anything else means > there's no hope of getting data to disk and so we just discard them. It > will be somewhat rough distinction but probably better than what we have > now. > > Honza I'm not sure about ENOSPC there. That's a return code that is specifically expected to be returned by fsync. It seems like that ought not be considered a transient error? -- Jeff Layton <jlayton@xxxxxxxxxx>