Re: reproducible rbd-nbd crashes


 



On Thu, Sep 12, 2019 at 3:31 AM Marc Schöchlin <ms@xxxxxxxxxx> wrote:
>
> Hello Jason,
>
> Yesterday I started rbd-nbd in foreground mode to see if there is any additional information.
>
> root@int-nfs-001:/etc/ceph# rbd-nbd map rbd_hdd/int-nfs-001_srv-ceph -d --id nfs
> 2019-09-11 13:07:41.444534 7ffff7fe1040  0 ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable), process rbd-nbd, pid 14735
> 2019-09-11 13:07:41.444555 7ffff7fe1040  0 pidfile_write: ignore empty --pid-file
> /dev/nbd0
> ---------
>
>
> 2019-09-11 21:31:03.126223 7fffc3fff700 -1 rbd-nbd: failed to read nbd request header: (33) Numerical argument out of domain
>
> What's that, have we seen that before? ("Numerical argument out of domain")

It's the error that rbd-nbd prints when the kernel prematurely closes
the socket ... and as we have already discussed, it's closing the
socket because the IO timeout is being hit ... and it's hitting the IO
timeout because of a deadlock: memory pressure from rbd-nbd causes IO
to be pushed from the XFS cache back down into rbd-nbd itself.
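
For what it's worth, errno 33 is EDOM, which Ceph's exact-length read
helper returns when the read of the next nbd request header comes back
short, i.e. the kernel has already torn the socket down mid-request.
If you want to confirm the kernel side of that (nbd0 assumed here; the
exact wording of the kernel message differs between kernel versions),
check dmesg for an nbd timeout/disconnect at the same timestamp:

  dmesg -T | grep -i nbd
  # look for nbd0 reporting a connection timeout / shutdown around
  # 2019-09-11 21:31, matching the userspace error above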

> Am 10.09.19 um 16:10 schrieb Jason Dillaman:
> > [Tue Sep 10 14:46:51 2019]  ? __schedule+0x2c5/0x850
> > [Tue Sep 10 14:46:51 2019]  kthread+0x121/0x140
> > [Tue Sep 10 14:46:51 2019]  ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
> > [Tue Sep 10 14:46:51 2019]  ? kthread+0x121/0x140
> > [Tue Sep 10 14:46:51 2019]  ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
> > [Tue Sep 10 14:46:51 2019]  ? kthread_park+0x90/0x90
> > [Tue Sep 10 14:46:51 2019]  ret_from_fork+0x35/0x40
> > Perhaps try it w/ ext4 instead of XFS?
>
> I can try that, but I am skeptical; I am not sure that we are searching in the right place...
>
> Why?
> - we have been running hundreds of heavily used rbd-nbd instances in our xen dom-0 systems for 1.5 years now
> - we have never experienced problems like that in the xen dom0 systems
> - as described, these instances run 12.2.5 ceph components with kernel 4.4.0+10
> - the domUs (virtual machines) that interact heavily with that dom0 use various filesystems
>    -> probably the architecture of the blktap components leads to a different IO scenario: https://wiki.xenproject.org/wiki/Blktap

Are you running an XFS (or any) file system on top of the NBD block
device in dom0? I suspect you are just passing raw block devices to
the VMs, and therefore they cannot see the same IO back-pressure
feedback loop.
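
To double-check which scenario you are in, something like this (nbd0
assumed as the device name) will show whether a filesystem sits
directly on top of the nbd device and where it is mounted:

  lsblk -f /dev/nbd0            # filesystem type on the nbd device, if any
  findmnt --source /dev/nbd0    # mountpoint(s) backed by the nbd device

And purely as a hedge against that writeback feedback loop (an
untested suggestion, not something verified in this thread), capping
how much dirty page cache the kernel may accumulate limits how much
data ever has to be flushed back through rbd-nbd at once:

  sysctl -w vm.dirty_background_bytes=67108864   # start background writeback at 64 MiB
  sysctl -w vm.dirty_bytes=268435456             # force writeback / throttle writers at 256 MiB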

> Nevertheless, I will try EXT4 on another system...
>
> Regards
> Marc
>


-- 
Jason
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



