On Tue, Aug 07, 2018 at 09:57:45AM -0400, Josef Bacik wrote:
> Hello,
>
> Our automated test infrastructure has been throwing errors while running
> generic/388 on upstream kernels for a little bit. We run the same tests on our
> internal kernels and it doesn't fail on any of our variations, the most recent
> of which is 4.16 based, so it's relatively new. I reproduced locally to make
> sure it wasn't a fluke, and it took 3 runs but I hit it as well.

It's been around ever since we added FS_IOC_SHUTDOWN support. You
might not have noticed it because it's a race which very much depends
on the speed of the device. generic/388 runs in a loop (N times): it
runs fsstress, forces a shutdown, and then runs fsck. Sometimes the
file system ends up corrupted.

It's on my todo list to fix, but the original use case was for scratch
file systems that are mounted over a remote block device like iSCSI. If
for some reason the iSCSI server stops responding, we use the shutdown
ioctl to take down the mount more quickly. Since it's for a scratch
file system where the iSCSI device is ephemeral (and by the time we
shut it down, it's toast), the question of whether the file system
will be consistent afterwards really doesn't matter.

Also, the obvious fixes would destroy ext4's scalability, and I'm not
aware of anyone except for us at Google using the shutdown ioctl in
production (at least not for ext4, and I doubt it's commonly used for
most file systems), so it's been low priority for me to set aside
time to tackle it.

> I'm not sure where it got introduced, I'm running a bisect now to try and figure
> out where it happened but I wanted to let you know ASAP. Thanks,

I just want to save you some time when I say: don't bother. The
failure was known when the shutdown code was first added, and I have
records of it failing going back to 4.10. The race doesn't always
trigger, so trying to bisect it will probably lead to a lot of
frustration.

Cheers,

					- Ted