On Tue, Mar 10, 2020 at 08:45:58AM +0000, Rantala, Tommi T. (Nokia - FI/Espoo) wrote: > Hello, > > One of my GitLab CI jobs stopped working after upgrading server 5.4.18- > 100.fc30.x86_64 -> 5.5.7-100.fc30.x86_64. > (tested 5.5.8-100.fc30.x86_64 too, no change) > The server is fedora30 with XFS rootfs. > The problem reproduces always, and takes only couple minutes to run. > > The CI job fails in the beginning when doing "git clean" in docker > container, and failing to rmdir some directory: > "warning: failed to remove > .vendor/pkg/mod/golang.org/x/net@v0.0.0-20200114155413-6afb5195e5aa/intern > al/socket: Directory not empty" > > Quick google search finds some other people reporting similar problems > with 5.5.0: > https://gitlab.com/gitlab-org/gitlab-runner/issues/3185 Which appears to be caused by multiple gitlab processes modifying the directory at the same time. i.e. something is adding an entry to the directory at the same time something is trying to rm -rf it. That's a race condition, and would lead to the exact symptoms you see here, depending on where in the directory the new entry is added. > Collected some data with strace, and it seems that getdents is not > returning all entries: > > 5.4 getdents64() returns 52+50+1+0 entries > => all files in directory are deleted and rmdir() is OK > > 5.5 getdents64() returns 52+50+0+0 entries > => rmdir() fails with ENOTEMPTY Yup, that's a classic userspace TOCTOU race. Remember, getdents() is effectively a sequential walk through the directory data - subsequent calls start at the offset (cookie) where the previous one left off. New entries can be added between getdents() syscalls. If that new entry is put at the tail of the directory, then the last getdents() call will return that entry rather than none because it was placed at an offset in the directory that the getdents() sweep has not yet reached, and hence will be found by a future getdents() call in the sweep. However, if there is a hole in the directory structure before the current getdents cookie offset, a new entry can be added in that hole. i.e. at an offset in the directory that getdents has already passed over. That dirent will never be reported by the current getdents() sequence - a directory rewind and re-read is required to find it. i.e. there's an inherent userspace TOUTOC race condition in 'rm -rf' operations. IOWs, this is exactly what you'd expect to see when there are concurrent userspace modifications to a directory that is currently being read. Hence you need to rule out an application and userspace level issues before looking for filesystem level problems. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx