On Fri, Sep 18, 2020 at 06:44:01PM -0700, Keith Busch wrote:
> On Fri, Sep 18, 2020 at 11:47:27PM +0000, Meng Wang wrote:
> > Hi,
> > We found kernel panics today when testing hot removal of a U.2 nvme
> > disk. After hot-removing the nvme disk (formatted as ext4), the system
> > freezes and all services are stuck. Lots of kernel messages flooded the
> > syslog, including the CPU soft lockup, ext4 NULL pointer dereference
> > and ib nic transmission timeout. The kernel panics and configuration
> > are shown below. The used kernel is 5.4.0-050400-generic and OS is
> > Ubuntu 16.04. Not sure whether it's a known bug or configuration
> > error. Any advice is welcome.
>
> [cc'ing ext4 mailing list]
>
> The NULL dereference occurred before the soft lockup, so I'm guessing the
> Oops'ed process is holding the same lock the removal task wants.
>
> Your kernel is a bit older, so it may be worth verifying if your
> observation still occurs on the current stable or current mainline, but
> the ext4 developers may have a better idea as this doesn't at least
> initially appear specific to nvme.

The problem is the crazy __invalidate_device stuff that calls into file
system eviction from all kinds of super critical block paths. While I
haven't debugged the root cause, this kind of thing just causes problems
without really helping anyone.

I have a half-finished series that kills this crap and instead allows the
file system (or other block device user) to pass shutdown and resize
callbacks when they exclusively open a block device. That way the file
system driver can just mark the file system shutdown to prevent any
further damage without all this mess.
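
To make the callback idea concrete, here is a minimal userspace sketch of
what such an interface might look like. It is not the series mentioned
above and not kernel code: every name in it (holder_ops, toy_bdev, toy_fs
and the toy_* functions) is hypothetical and invented for illustration. It
only assumes what the paragraph states: the holder hands shutdown and
resize callbacks to the block device at exclusive-open time, and on removal
the block side just invokes the callback so the file system can flag itself
as shut down.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callbacks a holder registers at exclusive-open time. */
struct holder_ops {
    void (*shutdown)(void *holder);                          /* device went away */
    void (*resize)(void *holder, unsigned long long sectors); /* capacity changed */
};

/* Toy block device: tracks at most one exclusive holder. */
struct toy_bdev {
    unsigned long long sectors;
    void *holder;
    const struct holder_ops *ops;
};

/* Toy file system state kept by the holder. */
struct toy_fs {
    bool shut_down;
    unsigned long long dev_sectors;
};

/* The file system only sets a flag; no eviction runs from the block path. */
static void toy_fs_shutdown(void *holder)
{
    struct toy_fs *fs = holder;
    fs->shut_down = true;
}

static void toy_fs_resize(void *holder, unsigned long long sectors)
{
    struct toy_fs *fs = holder;
    fs->dev_sectors = sectors;
}

static const struct holder_ops toy_fs_ops = {
    .shutdown = toy_fs_shutdown,
    .resize   = toy_fs_resize,
};

/* Exclusive open: returns 0 on success, -1 if someone already holds it. */
int toy_bdev_open_excl(struct toy_bdev *bdev, void *holder,
                       const struct holder_ops *ops)
{
    assert(holder != NULL && ops != NULL);
    if (bdev->holder)
        return -1;
    bdev->holder = holder;
    bdev->ops = ops;
    return 0;
}

/* "Mount": claim the device and pass the callbacks along. */
int toy_fs_mount(struct toy_fs *fs, struct toy_bdev *bdev)
{
    if (toy_bdev_open_excl(bdev, fs, &toy_fs_ops) != 0)
        return -1;
    fs->shut_down = false;
    fs->dev_sectors = bdev->sectors;
    return 0;
}

/* Hot-removal path: notify the holder instead of reaching into it. */
void toy_bdev_removed(struct toy_bdev *bdev)
{
    if (bdev->holder && bdev->ops->shutdown)
        bdev->ops->shutdown(bdev->holder);
}

/* Capacity change: update the device, then tell the holder. */
void toy_bdev_set_capacity(struct toy_bdev *bdev, unsigned long long sectors)
{
    bdev->sectors = sectors;
    if (bdev->holder && bdev->ops->resize)
        bdev->ops->resize(bdev->holder, sectors);
}

/* File system entry point: refuse new work once shut down. */
int toy_fs_write(const struct toy_fs *fs)
{
    return fs->shut_down ? -1 : 0;
}
```

In this model a caller would mount with toy_fs_mount(), and a later
toy_bdev_removed() makes every subsequent toy_fs_write() fail with -1. The
point of the design is that the removal path does nothing more than call
into the holder, and the holder decides how to stop.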