Re: 3.12 BUG() on ext4, kernel crash on nbd-client when nbd server rebooting

Jan Kara <jack@xxxxxxx> · Thu, 14 Nov 2013 08:58:27 +0100

On Wed 13-11-13 05:59:11, Denys Fedoryshchenko wrote:
> Hi
> 
> On 2013-11-12 23:46, Jan Kara wrote:
> >Hello,
> >
> >On Tue 12-11-13 16:34:07, Denys Fedoryshchenko wrote:
> >>I just did some fault testing for test nbd setup, and found that if
> >>i reboot nbd server i will get immediately BUG() message on nbd
> >>client and filesystem that i cannot unmount, and any operations on
> >>it will freeze and lock processes trying to access it.
> >  So how exactly did you do the fault testing? Because it seems something
> >has discarded the block device under filesystem's toes and the superblock
> >buffer_head got unmapped. Didn't something call NBD_CLEAR_SOCK ioctl?
> >Because that calls kill_bdev() which would do exactly that...
> 
> Client side:
> modprobe nbd
> nbd-client 2.2.2.29 /dev/nbd0 -name export1
> nbd-client 2.2.2.29 /dev/nbd1 -name export2
> nbd-client 2.2.2.29 /dev/nbd2 -name export3
> mount /dev/nbd0 /mnt/disk1
> mount /dev/nbd1 /mnt/disk2
> mount /dev/nbd2 /mnt/disk3
> 
> On server i have config:
> [generic]
> [export1]
>         exportname = /dev/sda1
> [export2]
>         exportname = /dev/sdb1
> [export3]
>         exportname = /dev/sdc1
> 
> Steps to reproduce:
> 1) Start a large file copy on the client side to /mnt/disk1/
> 2) Reboot the server. It reboots quite fast, in just a few seconds; the
> server system will get its IP before the nbd-server process starts
> listening, so nbd-client will probably see connection refused.
> 3) It seems that when the client gets connection refused, it goes mad.
> 
> I can try to capture traffic dump, or do any other debug operation,
> please let me know, what i should run :)
> P.S. I noticed maybe i should run persist mode, but anyway it should
> not crash like this i think.
  OK, no need for further debugging. I see what's going on. In the NBD_DO_IT
ioctl, nbd calls kill_bdev() after the kthread has returned, and that is what
happens in your case, as we can see from the "queue cleared" messages.

Now the question is how to fix this. Filesystems don't really expect
device buffers to disappear from under them, as they do when nbd calls
kill_bdev(). That also never happens with normal block devices: if a similar
situation happens to a SCSI / SATA disk, the corresponding block devices hang
around refusing any IO until the filesystem is unmounted, and at that point
they disappear (the device's refcount, bd_openers, reaches zero). It would be
good if NBD behaved the same way. Maybe we should return from the NBD_DO_IT
ioctl only after bd_openers drops to 1 (not zero, because the nbd client has
the device open as well for the ioctl, if I'm right)?
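
To make the proposed ordering concrete, here is a toy userspace model of it.
This is only a sketch and not kernel code: struct toy_bdev and every toy_*
function are made up for illustration, "openers" merely stands in for
bd_openers, and TOY_EIO is a placeholder error value. The point it tries to
show is that after a disconnect the device keeps refusing IO, while the
cached buffers are dropped only once the nbd client is the sole remaining
opener.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_EIO 5 /* placeholder error code for this sketch */

/* Hypothetical stand-in for a block device; not the kernel's structure. */
struct toy_bdev {
    int openers;        /* models bd_openers: filesystem + nbd client */
    bool connected;     /* is the connection to the server alive? */
    bool buffers_valid; /* are cached buffers (e.g. superblock) intact? */
};

/* The server went away: mark the device as disconnected. */
void toy_disconnect(struct toy_bdev *d)
{
    d->connected = false;
}

/* IO path: a disconnected device refuses requests instead of vanishing. */
int toy_submit_io(const struct toy_bdev *d)
{
    return d->connected ? 0 : -TOY_EIO;
}

/* One opener (e.g. the filesystem at unmount) closes the device. */
void toy_release(struct toy_bdev *d)
{
    assert(d->openers > 0);
    d->openers--;
}

/*
 * Teardown attempt: buffers are dropped only when the device is
 * disconnected and the client is the last opener (openers == 1).
 * Returns true if the teardown actually happened.
 */
bool toy_try_teardown(struct toy_bdev *d)
{
    if (d->connected || d->openers > 1)
        return false;
    d->buffers_valid = false;
    return true;
}
```

With a mounted filesystem (openers == 2), toy_try_teardown() refuses to drop
the buffers after toy_disconnect(); once the filesystem calls toy_release(),
the teardown goes through. A real implementation would have to sleep in the
ioctl until that condition holds rather than poll like this.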

								Honza
---
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR