On Mon, May 22, 2017 at 06:38:12PM -0600, Jon Derrick wrote:
> Hello,
>
> I've encountered a BUG that I've experienced during hot removal on an
> ext4-formatted nvme device undergoing writes. I have been able to verify
> that 4.5, 4.6, 4.10.12, 4.11, and 4.12-rc1 show similar issues (the v4.6
> trace below shows issues with block that have already been fixed). I'm
> using VMD hardware for my hotplug controller so 4.5 is as far back as I
> can go (maybe someone else can verify on non-VMD hardware?).
>
> To reproduce:
> 1) mkfs.ext4 <nvme>
> 2) mount <nvme> <mnt>
> 3) dd if=/dev/zero of=<mnt>/file bs=1M count=10000
> 4) Hot remove the drive while above is writing
>
> From what I can tell, the ext4 sb is trying to be committed in the error
> path. There is supposed to be a check if the device is still alive via
> block_device_ejected(), but my guess is that there is a race between the
> removal/deletion in genhd and this check. I would appreciate any help
> resolving this.
>

Recently I ran fio directly on an NVMe partition with hot-remove too,
and found that d3cfb2a0ac0b8487d28 (block: block new I/O just after
queue is set as dying) is helpful for this kind of issue.

Also, the following patch fixes one issue in the remove path:

	http://marc.info/?l=linux-block&m=149498450028434&w=2

So could you test v4.12-rc1 (d3cfb2a0 is merged) with the above patch?

With these patches in, the block layer & NVMe should make sure that,
once hot-remove is triggered, all I/O is finished with -EIO before
del_gendisk() returns. After that, the failure handling in the fs might
need further investigation.

Thanks,
Ming