Re: Uninitialized extent races

"Theodore Ts'o" <tytso@xxxxxxx> · Fri, 21 Dec 2012 18:03:35 -0500

On Fri, Dec 21, 2012 at 11:49:47PM +0100, Jan Kara wrote:
>   It's actually simpler than that. We wait for any pending DIO using
> inode_dio_wait() and i_mutex protects from new writes to be submitted. So
> that takes care of one possibility. truncate_inode_pages() waits for
> PageWriteback bit so that handles waiting for IO itself. 

Hmm, yes, I should have known/remembered that.  I've seen cases where
very rarely, it's possible for an unlink() or truncate() call to stall
for multiple minutes(!).  This can happen if you have writeback
happening in a container which has a very small (low priority)
constraint on its block I/O bandwidth.  If you try to delete an inode
which has writeback work pending, it's possible for the writeback to
take a looong time, which in turn causes the unlink to take a long
time.

It becomes worse if the process doing the unlink is a high priority
process (say, the cluster management daemon which is cleaning up after
said low-priority job has completed), but the writeback is happening
in the context of a low priority cgroup.  You can end up with a nasty
priority inversion.

And there's not a lot we can do at the kernel level.  We could
dispatch the truncate to a workqueue and just make sure the file name
has disappeared from the file system name space before the unlink()
returns to userspace, but then the disk space gets released after the
unlink() call returns, which can cause other problems.
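
For illustration only, the same trade-off can be sketched from
userspace in Python.  This is not kernel code and not a proposed
fix: deferred_unlink() is a hypothetical helper, and it assumes the
usual POSIX behavior that an unlinked file's storage is kept until
the last open descriptor is closed.  The name is gone by the time the
call returns, but the space is only released later, from a background
thread, so any wait at that point would presumably land there rather
than in the caller.

```python
import os
import threading


def deferred_unlink(path):
    """Remove the name now; let the storage be released later.

    Hypothetical helper, for illustration only. Returns a
    threading.Event that is set once the last reference is dropped.
    """
    # Hold an open descriptor so the inode outlives its name.
    fd = os.open(path, os.O_RDONLY)

    # The name disappears from the namespace immediately.
    os.unlink(path)

    released = threading.Event()

    def release():
        # Dropping the last reference is what lets the file system
        # free the blocks; this runs in the background thread.
        os.close(fd)
        released.set()

    threading.Thread(target=release, daemon=True).start()
    return released
```

A caller that needs the disk space back (for example, before starting
the next job) would still have to wait on the returned event, which
is exactly the "space gets released after unlink() returns" problem.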

						- Ted