Hi Jason,

I installed kernel 4.4.0-154.181 (from the Ubuntu package sources) and performed the crash reproduction. The problem also reappeared with that kernel release: a run with 10 parallel gunzip processes generated 1600 write and 330 read IOPS against the cluster/the rbd_ec volume at a transfer rate of 290 MB/s for 10 minutes, and after that the same problem showed up again.

What should we do now?

Testing with 10.2.5 librbd/rbd-nbd is currently not that easy for me, because the Ceph apt source does not contain that version. Do you know a package source?

How can I support you?

Regards
Marc

On 24.07.19 at 07:55, Marc Schöchlin wrote:
> Hi Jason,
>
> On 24.07.19 at 00:40, Jason Dillaman wrote:
>>> Sure, which kernel do you prefer?
>> You said you have never had an issue w/ rbd-nbd 12.2.5 in your Xen environment. Can you use a matching kernel version?
>
> That's true, the virtual machines of our Xen environments run completely on rbd-nbd devices.
> Every host runs dozens of rbd-nbd maps which are visible as Xen disks in the virtual systems.
> (https://github.com/vico-research-and-consulting/RBDSR)
>
> It seems that XenServer has a special behavior regarding device timings, because 1.5 years ago we had an outage of our Ceph cluster of 1.5 hours which blocked all write requests
> (overfull disks because of huge usage growth). In that situation all virtual machines continued their work without problems after the cluster was back.
> We haven't set any timeouts using nbd_set_timeout.c on these systems.
>
> We have never experienced problems with these rbd-nbd instances.
>
> [root@xen-s31 ~]# rbd nbd ls
> pid pool image snap device
> 10405 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 RBD-72f4e61d-acb9-4679-9b1d-fe0324cb5436 - /dev/nbd3
> 12731 RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 RBD-88f8889a-05dc-49ab-a7de-8b5f3961f9c9 - /dev/nbd4
> 13123 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 RBD-37243066-54b0-453a-8bf3-b958153a680d - /dev/nbd5
> 15342 RBD_XenStorage-PROD-SSD-1-cb933ab7-a006-4046-a012-5cbe0c5fbfb5 RBD-2bee9bf7-4fed-4735-a749-2d4874181686 - /dev/nbd6
> 15702 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 RBD-5b93eb93-ebe7-4711-a16a-7893d24c1bbf - /dev/nbd7
> 27568 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 RBD-616a74b5-3f57-4123-9505-dbd4c9aa9be3 - /dev/nbd8
> 21112 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 RBD-5c673a73-7827-44cc-802c-8d626da2f401 - /dev/nbd9
> 15726 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 RBD-1069a275-d97f-48fd-9c52-aed1d8ac9eab - /dev/nbd10
> 4368 RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 RBD-23b72184-0914-4924-8f7f-10868af7c0ab - /dev/nbd11
> 4642 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 RBD-bf13cf77-6115-466e-85c5-aa1d69a570a0 - /dev/nbd12
> 9438 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 RBD-a2071aa0-5f63-4425-9f67-1713851fc1ca - /dev/nbd13
> 29191 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 RBD-fd9a299f-dad9-4ab9-b6c9-2e9650cda581 - /dev/nbd14
> 4493 RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 RBD-1bbb4135-e9ed-4720-a41a-a49b998faf42 - /dev/nbd15
> 4683 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 RBD-374cadac-d969-49eb-8269-aa125cba82d8 - /dev/nbd16
> 1736 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 RBD-478a20cc-58dd-4cd9-b8b1-6198014e21b1 - /dev/nbd17
> 3648 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 RBD-6e28ec15-747a-43c9-998d-e9f2a600f266 - /dev/nbd18
> 9993 RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 RBD-61ae5ef3-9efb-4fe6-8882-45d54558313e - /dev/nbd19
> 10324 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 RBD-f7d27673-c268-47b9-bd58-46dcd4626bbb - /dev/nbd20
> 19330 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 RBD-0d4e5568-ac93-4f27-b24f-6624f2fa4a2b - /dev/nbd21
> 14942 RBD_XenStorage-PROD-SSD-1-cb933ab7-a006-4046-a012-5cbe0c5fbfb5 RBD-69832522-fd68-49f9-810f-485947ff5e44 - /dev/nbd22
> 20859 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 RBD-5025b066-723e-48f5-bc4e-9b8bdc1e9326 - /dev/nbd23
> 19247 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 RBD-095292a0-6cc2-4112-95bf-15cb3dd33e9a - /dev/nbd24
> 22356 RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 RBD-f8229ea0-ad7b-4034-9cbe-7353792a2b7c - /dev/nbd25
> 22537 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 RBD-e8c0b841-50ec-4765-a3cb-30c78a4b9162 - /dev/nbd26
> 15105 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 RBD-6d3d3503-2b45-45e9-a17b-30ab65c2be3d - /dev/nbd27
> 28192 RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 RBD-e04ec9e6-da4c-4b7a-b257-2cf7022ac59f - /dev/nbd28
> 28507 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 RBD-e6d213b3-89d6-4c09-bc65-18ed7992149d - /dev/nbd29
> 23206 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 RBD-638ef476-843e-4c26-8202-377f185d9d26 - /dev/nbd30
>
>
> [root@xen-s31 ~]# uname -a
> Linux xen-s31 4.4.0+10 #1 SMP Wed Dec 6 13:56:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
>
> [root@xen-s31 ~]# rpm -qa|grep -P "ceph|rbd"
> librbd1-12.2.5-0.el7.x86_64
> python-rbd-12.2.5-0.el7.x86_64
> ceph-common-12.2.5-0.el7.x86_64
> python-cephfs-12.2.5-0.el7.x86_64
> rbd-fuse-12.2.5-0.el7.x86_64
> libcephfs2-12.2.5-0.el7.x86_64
> rbd-nbd-12.2.5-0.el7.x86_64
>
> Therefore I will try to use a 4.4 release - but I suppose that there are some patch differences between my Ubuntu 4.4 kernel and the XenServer 4.4 kernel.
> I will test with "4.4.0-154".
>
> Regards
> Marc
>
>
>>> I can test with the following releases:
>>>
>>> # apt-cache search linux-image-4.*.*.*-*-generic 2>&1|sed '~s,\.[0-9]*-[0-9]*-*-generic - .*,,;~s,linux-image-,,'|sort -u
>>> 4.10
>>> 4.11
>>> 4.13
>>> 4.15
>>> 4.4
>>> 4.8
>>>
>>> We can also perform tests using another filesystem (e.g. ext4).
>>>
>>> From my point of view I suppose that there is something wrong with nbd.ko or with rbd-nbd (excluding the rbd-cache functionality) - therefore I do not think that this is very promising...
>> Agreed. I would also attempt to see if you have blocked ops on the OSD during these events (see Mykola's ticket comment).
>>
>>> Regards
>>> Marc
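
P.S.: For completeness, the reproduction workload is essentially the following: 10 gunzip processes running in parallel on the filesystem that sits on the rbd_ec/rbd-nbd device. The mount point and file names below are only placeholders for our real test data:

   # start 10 parallel decompressions on the rbd-nbd backed filesystem
   # (/mnt/rbd_ec and testdata-*.gz are placeholders, not the real names)
   for i in $(seq 1 10); do
       gunzip -c /mnt/rbd_ec/testdata-$i.gz > /mnt/rbd_ec/testdata-$i.out &
   done
   wait   # wait for all background gunzip processes to finish

This is what generates the ~1600 write / ~330 read IOPS at ~290 MB/s mentioned above.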
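
Regarding the blocked ops on the OSDs: should I capture that during the next reproduction? While the gunzip workload is running I would check roughly like this (osd.<id> stands for the OSDs that serve the rbd_ec pool):

   # cluster-wide view of slow/blocked requests
   ceph health detail
   # per-OSD view via the admin socket, executed on the OSD host
   ceph daemon osd.<id> dump_blocked_ops
   ceph daemon osd.<id> dump_ops_in_flight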