On Thu, Sep 12, 2019 at 3:31 AM Marc Schöchlin <ms@xxxxxxxxxx> wrote:
>
> Hello Jason,
>
> Yesterday I started rbd-nbd in foreground mode to see if it produces any additional information.
>
> root@int-nfs-001:/etc/ceph# rbd-nbd map rbd_hdd/int-nfs-001_srv-ceph -d --id nfs
> 2019-09-11 13:07:41.444534 7ffff7fe1040 0 ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable), process rbd-nbd, pid 14735
> 2019-09-11 13:07:41.444555 7ffff7fe1040 0 pidfile_write: ignore empty --pid-file
> /dev/nbd0
> ---------
>
> 2019-09-11 21:31:03.126223 7fffc3fff700 -1 rbd-nbd: failed to read nbd request header: (33) Numerical argument out of domain
>
> What's that, have we seen that before? ("Numerical argument out of domain")

It's the error that rbd-nbd prints when the kernel prematurely closes
the socket. As we have already discussed, the kernel closes the socket
because the IO timeout is hit, and the IO timeout is hit because of a
deadlock: memory pressure from rbd-nbd causes IO to be pushed from the
XFS cache back down into rbd-nbd.

> On 10.09.19 at 16:10, Jason Dillaman wrote:
> > [Tue Sep 10 14:46:51 2019] ? __schedule+0x2c5/0x850
> > [Tue Sep 10 14:46:51 2019] kthread+0x121/0x140
> > [Tue Sep 10 14:46:51 2019] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
> > [Tue Sep 10 14:46:51 2019] ? kthread+0x121/0x140
> > [Tue Sep 10 14:46:51 2019] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
> > [Tue Sep 10 14:46:51 2019] ? kthread_park+0x90/0x90
> > [Tue Sep 10 14:46:51 2019] ret_from_fork+0x35/0x40
> >
> > Perhaps try it w/ ext4 instead of XFS?
>
> I can try that, but I am skeptical; I am not sure that we are searching in the right place...
>
> Why?
> - we have been running hundreds of heavy-use rbd-nbd instances in our Xen dom0 systems for 1.5 years now
> - we have never experienced problems like that in Xen dom0 systems
> - as described, these instances run 12.2.5 Ceph components with kernel 4.4.0+10
> - the domUs (virtual machines) that interact heavily with that dom0 use various filesystems
>   -> probably the architecture of the blktap components leads to a different IO scenario: https://wiki.xenproject.org/wiki/Blktap

Are you running an XFS (or any) file system on top of the NBD block
device in dom0? I suspect you are just passing raw block devices to
the VMs, and therefore dom0 never sees the same IO back-pressure
feedback loop.

> Nevertheless, I will try EXT4 on another system...
>
> Regards
> Marc

--
Jason
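For readers hitting the same log line: errno 33 is EDOM, and the message can
be reproduced with standard tools. The timeout option below is an assumption
on my part; luminous-era rbd-nbd builds document a --timeout flag, but check
"rbd-nbd --help" on your version before relying on it.

    # Confirm what errno 33 means on this platform:
    python3 -c 'import os; print(os.strerror(33))'
    # -> Numerical argument out of domain

    # Possible mitigation (flag presence assumed; verify with rbd-nbd --help):
    # give stalled IO more time before the kernel closes the nbd socket.
    rbd-nbd map rbd_hdd/int-nfs-001_srv-ceph --id nfs --timeout 120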
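The deadlock Jason describes (dirty XFS pages flushed back into the userspace
rbd-nbd process under memory pressure) can sometimes be made less likely by
bounding how much dirty page cache is allowed to build up. A hedged sketch
using standard Linux sysctls; the values are illustrative, not taken from
this thread:

    # Start background writeback early and cap dirty pages in absolute terms,
    # so reclaim never has to flush a huge backlog through /dev/nbd0 at once.
    # Illustrative values; tune for the host's RAM.
    sysctl -w vm.dirty_background_bytes=$((64 * 1024 * 1024))   # 64 MiB
    sysctl -w vm.dirty_bytes=$((256 * 1024 * 1024))             # 256 MiB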
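For the ext4 experiment Marc agrees to run, a minimal test sequence could look
like the following. The mount point is hypothetical, and mkfs is destructive,
so this is only for a scratch image:

    # Map the image (prints the nbd device, e.g. /dev/nbd0):
    rbd-nbd map rbd_hdd/int-nfs-001_srv-ceph --id nfs

    # DESTRUCTIVE: formats the device. Use a scratch image only.
    mkfs.ext4 /dev/nbd0
    mkdir -p /srv/ceph            # hypothetical mount point
    mount /dev/nbd0 /srv/ceph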
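Jason's question about whether a filesystem sits directly on the NBD device in
dom0 can be answered with standard util-linux tools:

    # Show partitions, LVM volumes and filesystems stacked on the device:
    lsblk -f /dev/nbd0

    # Show whether dom0 itself has the device mounted anywhere:
    findmnt --source /dev/nbd0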