On Fri, Dec 21, 2012 at 11:49:47PM +0100, Jan Kara wrote:
> It's actually simpler than that. We wait for any pending DIO using
> inode_dio_wait() and i_mutex protects from new writes to be submitted. So
> that takes care of one possibility. truncate_inode_pages() waits for
> PageWriteback bit so that handles waiting for IO itself.

Hmm, yes, I should have known/remembered that.

I've seen cases where, very rarely, an unlink() or truncate() call can
stall for multiple minutes(!). This can happen if you have writeback
happening in a container which has a very small (low priority)
constraint on its block I/O bandwidth. If you try to delete an inode
which has writeback work pending, it's possible for the writeback to
take a looong time, which in turn causes the unlink to take a long
time.

It becomes worse if the process doing the unlink is a high priority
process (say, the cluster management daemon which is cleaning up after
said low-priority job has completed), but the writeback is happening in
the context of a low priority cgroup. You can end up with a nasty
priority inversion.

And there's not a lot we can do at the kernel level. We could dispatch
the truncate to a workqueue and just make sure the file name has
disappeared from the file system name space before the unlink() returns
to userspace, but then the disk space gets released after the unlink()
call returns, which can cause other problems.

- Ted
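
For illustration only, here is a userspace analogue of the deferral idea discussed above. It is not kernel code and not something proposed in this thread: a cleanup process renames the file so the original name disappears immediately, then performs the real unlink() in a background thread. The helper name deferred_unlink and the ".deleting-" prefix are hypothetical. It shows the same trade-off described above: the caller returns quickly, but the disk space is only released later, when the background unlink completes.

```python
import os
import tempfile
import threading


def deferred_unlink(path):
    """Hide `path` right away, then unlink it in a background thread.

    Returns the worker thread so the caller can join() it when it needs
    to know that the disk space has actually been released.
    """
    directory, name = os.path.split(os.path.abspath(path))
    # Hypothetical naming scheme for the hidden entry. It stays in the same
    # directory, so the rename never crosses a file system boundary.
    hidden = os.path.join(directory, ".deleting-%d-%s" % (os.getpid(), name))
    os.rename(path, hidden)

    # The unlink, and any wait for pending writeback it implies, now happens
    # off the caller's critical path.
    worker = threading.Thread(target=os.unlink, args=(hidden,), daemon=True)
    worker.start()
    return worker


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    victim = os.path.join(workdir, "job-output.dat")
    with open(victim, "wb") as f:
        f.write(b"x" * 4096)

    worker = deferred_unlink(victim)
    print("original name gone:", not os.path.exists(victim))
    worker.join()  # only after this point is the space really freed
    os.rmdir(workdir)
```

The caller still has to join() the worker (or otherwise wait) before assuming the space is free, which is exactly the "released after unlink() returns" problem, just moved to userspace.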