On 13/06/18 11:46, Daniel P. Berrangé wrote:
> On Sun, Jun 10, 2018 at 12:14:22PM +0100, Radostin Stoyanov wrote:
>> Hi all,
>>
>> This patch series aims to resolve
>> https://bugzilla.redhat.com/show_bug.cgi?id=1328946
>>
>> For background information about the issue see v1 of this RFC.
>> https://www.redhat.com/archives/libvir-list/2018-April/msg01270.html
>>
>> The current state of this series enables the start of LXC container with NBD
>> file system and enabled user namespace.
>>
>> However, container shutdown causes "kernel BUG at fs/buffer.c:3058!"
>> https://pastebin.com/raw/y0ycSM0H
>>
>> The reason for this is because qemu-nbd process is terminated/killed without
>> unmounting the container root file system.
>>
>> This issue has been reported in [1] and [2].
>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1356110
>> [2] http://lkml.iu.edu/hypermail/linux/kernel/1509.3/00027.html
> This is not really a kernel bug at the end of the day. We have a filesystem
> backed by NBD block device, and we're killing the NBD block device. So there's
> nothing the kernel can really do here if there's outstanding I/O pending at
> this time.
>
> There is also this BZ reported against libvirt that has more info:
>
>   https://bugzilla.redhat.com/show_bug.cgi?id=1570902
>
>> As a workaround we could unmount the root file system of container before shutdown.
>>
>> For example with:
>> $ CT_PID=$(pidof libvirt_lxc)
>> $ sudo nsenter \
>>     --mount=/proc/$CT_PID/task/$CT_PID/ns/mnt \
>>     /bin/bash -c "umount /var/run/libvirt/lxc/guest.root/"
>>
>> I noticed that we already have the functions lxcContainerUnmountSubtree
>> and virProcessRunInMountNamespace.
>>
>> Any suggestions on how to properly implement this?
> We can't unmount the filesystem directly because we don't have any process
> running inside the container's mount namespace at this time. The libvirt_lxc
> controller is running in a custom mount namespace that is different from what
> the container has.
>
> The first thing we need to do is take qemu-nbd out of the cgroups. This will
> ensure that it doesn't get killed at the same time as we're killing off all
> the container PIDs. It will also fix the OOM deadlocks we see when the memory
> controller prevents qemu-nbd allocating RAM needed to process I/O.
>
> Then, we can kill all processes in the container as normal. Once they are
> all gone, we know the kernel will have cleaned up the mount namespace. We
> can thus safely kill qemu-nbd at this point.
Thank you for the pointers!

> Ideally qemu-nbd would automatically exit when the last use of /dev/nbdNNN
> was released (i.e. when the filesystem was unmounted). This is something you can
> enable for loopback devices, but I'm not sure it works for NBD. This would
> be a useful kernel enhancement if someone feels adventurous.
It seems like qemu-nbd terminates automatically when the last client
disconnects.
https://git.qemu.org/?p=qemu.git;a=blob;f=qemu-nbd.c;h=51b9d38c72732c821cb4ee5bf362533406ce2494;hb=HEAD#l341

I will send a patch that takes qemu-nbd out of the cgroups and disconnects
qemu-nbd on container shutdown.

Radostin
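
For illustration only, the shutdown ordering discussed in this thread could be
sketched in shell roughly as follows. This is not libvirt code and not the
promised patch. The function name teardown_container, its arguments, and the
cgroup paths are hypothetical, and a cgroup hierarchy exposing a cgroup.procs
file is assumed. The only external command relied on is
"qemu-nbd --disconnect". By default the script only prints the steps (DRY_RUN=1).

```sh
#!/bin/sh
# Sketch only: names and paths are assumptions, not libvirt's implementation.
# DRY_RUN=1 (the default) prints each step instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

teardown_container() {
    nbd_pid="$1"      # PID of the qemu-nbd process (hypothetical input)
    nbd_dev="$2"      # NBD device backing the root fs, e.g. /dev/nbd0
    ct_cgroup="$3"    # container cgroup directory (hypothetical path)
    root_cgroup="$4"  # root directory of the same cgroup hierarchy

    # 1. Take qemu-nbd out of the container cgroup so it is neither killed
    #    with the container PIDs nor limited by the memory controller.
    #    With several cgroup v1 hierarchies this would be repeated per
    #    controller.
    run sh -c "echo $nbd_pid > $root_cgroup/cgroup.procs"

    # 2. Kill the processes left in the container cgroup. A real
    #    implementation must repeat this and wait until cgroup.procs is
    #    empty, so that the kernel has torn down the mount namespace.
    if [ -r "$ct_cgroup/cgroup.procs" ]; then
        while read -r pid; do
            run kill -KILL "$pid"
        done < "$ct_cgroup/cgroup.procs"
    fi

    # 3. Only now disconnect the NBD device; qemu-nbd exits afterwards.
    run qemu-nbd --disconnect "$nbd_dev"
}
```

Called as "teardown_container 1234 /dev/nbd0 <container cgroup dir> <root cgroup dir>",
it prints the commands in the order they would run. The point of the ordering
is that qemu-nbd outlives every process that could still hold the filesystem
mounted.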