On Wed, 2020-03-11 at 09:14 +1100, Dave Chinner wrote: > On Tue, Mar 10, 2020 at 08:45:58AM +0000, Rantala, Tommi T. (Nokia - > FI/Espoo) wrote: > > Hello, > > > > One of my GitLab CI jobs stopped working after upgrading server > > 5.4.18- > > 100.fc30.x86_64 -> 5.5.7-100.fc30.x86_64. > > (tested 5.5.8-100.fc30.x86_64 too, no change) > > The server is fedora30 with XFS rootfs. > > The problem reproduces always, and takes only couple minutes to run. > > > > The CI job fails in the beginning when doing "git clean" in docker > > container, and failing to rmdir some directory: > > "warning: failed to remove > > .vendor/pkg/mod/golang.org/x/net@v0.0.0-20200114155413-6afb5195e5aa/in > > tern > > al/socket: Directory not empty" > > > > Quick google search finds some other people reporting similar problems > > with 5.5.0: > > https://gitlab.com/gitlab-org/gitlab-runner/issues/3185 > > Which appears to be caused by multiple gitlab processes modifying > the directory at the same time. i.e. something is adding an entry to > the directory at the same time something is trying to rm -rf it. > That's a race condition, and would lead to the exact symptoms you > see here, depending on where in the directory the new entry is > added. OK traced "execve" with strace too, and it shows that it's "git clean -ffdx" command (single process) that is being executed in the container, which is doing the cleanup. Tested with 5.6-rc5, it's failing the same way. Spent some time to bisect this, and the problem is introduced by this: commit 263dde869bd09b1a709fd92118c7fff832773689 Author: Christoph Hellwig <hch@xxxxxx> Date: Fri Nov 8 15:05:32 2019 -0800 xfs: cleanup xfs_dir2_block_getdents Use an offset as the main means for iteration, and only do pointer arithmetics to find the data/unused entries. Signed-off-by: Christoph Hellwig <hch@xxxxxx> Reviewed-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx> Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx> Hmmmmm, looking at that commit, I think it slighty changed how the "offset" is used compared to how the pointers were used. This cures the issue for me, tested (briefly) on top of 5.6-rc5. Does it make sense...? (Email client probably damages white-space, sorry, I'll send this properly signed-off with git-send-email if it's OK) diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c index 0d3b640cf1cc..af945ec9df3b 100644 --- a/fs/xfs/xfs_dir2_readdir.c +++ b/fs/xfs/xfs_dir2_readdir.c @@ -179,6 +179,7 @@ xfs_dir2_block_getdents( struct xfs_dir2_data_unused *dup = bp->b_addr + offset; struct xfs_dir2_data_entry *dep = bp->b_addr + offset; uint8_t filetype; + unsigned int dep_offset; /* * Unused, skip it. @@ -188,18 +189,21 @@ xfs_dir2_block_getdents( continue; } + dep_offset = offset; + /* - * Bump pointer for the next iteration. + * Bump offset for the next iteration. */ offset += xfs_dir2_data_entsize(dp->i_mount, dep->namelen); /* * The entry is before the desired starting point, skip it. */ - if (offset < wantoff) + if (dep_offset < wantoff) continue; - cook = xfs_dir2_db_off_to_dataptr(geo, geo->datablk, offset); + cook = xfs_dir2_db_off_to_dataptr(geo, geo->datablk, + dep_offset); ctx->pos = cook & 0x7fffffff; filetype = xfs_dir2_data_get_ftype(dp->i_mount, dep); > > Collected some data with strace, and it seems that getdents is not > > returning all entries: > > > > 5.4 getdents64() returns 52+50+1+0 entries > > => all files in directory are deleted and rmdir() is OK > > > > 5.5 getdents64() returns 52+50+0+0 entries > > => rmdir() fails with ENOTEMPTY > > Yup, that's a classic userspace TOCTOU race. > > Remember, getdents() is effectively a sequential walk through the > directory data - subsequent calls start at the offset (cookie) where > the previous one left off. New entries can be added between > getdents() syscalls. > > If that new entry is put at the tail of the directory, then the last > getdents() call will return that entry rather than none because it > was placed at an offset in the directory that the getdents() sweep > has not yet reached, and hence will be found by a future getdents() > call in the sweep. > > > However, if there is a hole in the directory structure before the > current getdents cookie offset, a new entry can be added in that > hole. i.e. at an offset in the directory that getdents has already > passed over. That dirent will never be reported by the current > getdents() sequence - a directory rewind and re-read is required to > find it. i.e. there's an inherent userspace TOUTOC race condition in > 'rm -rf' operations. > > IOWs, this is exactly what you'd expect to see when there are > concurrent userspace modifications to a directory that is currently > being read. Hence you need to rule out an application and userspace > level issues before looking for filesystem level problems. > > Cheers, > > Dave.