Hello Igor. Thanks for the answer. There are a lot of changes for me to
read and test, but I will plan an upgrade to Octopus when I'm available.
Is there any problem upgrading from 14.2.16 -> 15.2.15?

On Wed, 10 Nov 2021 at 17:50, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:

> I would encourage you to upgrade to at least the latest Nautilus (and
> preferably to Octopus).
>
> There were a bunch of allocator bugs fixed since 14.2.16. I'm not even
> sure all of them landed in N since it's EOL.
>
> A couple of examples (both are present in the latest Nautilus):
>
> https://github.com/ceph/ceph/pull/41673
>
> https://github.com/ceph/ceph/pull/38475
>
>
> Thanks,
>
> Igor
>
>
> On 11/8/2021 4:31 PM, mhnx wrote:
> > Hello.
> >
> > I'm using Nautilus 14.2.16.
> > I have 30 SSDs in my cluster and I use them as BlueStore OSDs for the
> > RGW index. Almost every week I'm losing an OSD (it goes down), and
> > when I check the OSD log I see:
> >
> >    -6> 2021-11-06 19:01:10.854 7fa799989c40  1 bluefs _allocate
> > failed to allocate 0xf4f04 on bdev 1, free 0xb0000; fallback to bdev 2
> >    -5> 2021-11-06 19:01:10.854 7fa799989c40  1 bluefs _allocate
> > unable to allocate 0xf4f04 on bdev 2, free 0xffffffffffffffff;
> > fallback to slow device expander
> >    -4> 2021-11-06 19:01:10.854 7fa799989c40 -1
> > bluestore(/var/lib/ceph/osd/ceph-218) allocate_bluefs_freespace
> > failed to allocate on 0x80000000 min_size 0x100000 > allocated total
> > 0x0 bluefs_shared_alloc_size 0x10000 allocated 0x0 available
> > 0xa497aab000
> >    -3> 2021-11-06 19:01:10.854 7fa799989c40 -1 bluefs _allocate
> > failed to expand slow device to fit +0xf4f04
> >
> > Full log: https://paste.ubuntu.com/p/MpJfVjMh7V/plain/
> >
> > And the OSD does not start without offline compaction.
> > Offline compaction log: https://paste.ubuntu.com/p/vFZcYnxQWh/plain/
> >
> > After the offline compaction I tried to start the OSD with the bitmap
> > allocator, but it does not come up because of "FAILED
> > ceph_assert(available >= allocated)".
> > Log: https://paste.ubuntu.com/p/2Bbx983494/plain/
> >
> > Then I started the OSD with the hybrid allocator and let it recover.
> > When the recovery was done I stopped the OSD and started it with the
> > bitmap allocator. This time it came up, but I got "80 slow ops,
> > oldest one blocked for 116 sec, osd.218 has slow ops", so I increased
> > osd_recovery_sleep to 10 to give the cluster a breather, and the
> > cluster marked the OSD down (it was still working); after a while the
> > OSD was marked up again and the cluster became normal. But while it
> > was recovering, other OSDs started to report slow ops, and I played
> > around with osd_recovery_sleep between 0.1 and 10 to keep the cluster
> > stable until recovery finished.
> >
> > Ceph osd df tree before: https://paste.ubuntu.com/p/4K7JXcZ8FJ/plain/
> > Ceph osd df tree after osd.218 = bitmap:
> > https://paste.ubuntu.com/p/5SKbhrbgVM/plain/
> >
> > If I want to change all the other OSDs' allocator to bitmap, I need
> > to repeat the process 29 times and it will take too much time.
> > I don't want to heal OSDs with offline compaction anymore, so I will
> > do that if it's the solution, but I want to be sure before doing a
> > lot of work, and maybe with this issue I can provide helpful logs and
> > information for the developers.
> >
> > Have a nice day.
> > Thanks.
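
For anyone else hitting the same bluefs _allocate failures: the OSD admin
socket can show how much space bluefs thinks it has on each device before
things get to the point of crashing. A minimal sketch, assuming osd.218
from the log above and that it is run on the host carrying that OSD:

    # bluefs perf counters: bytes used/free on the db and slow devices
    ceph daemon osd.218 perf dump bluefs

    # on Nautilus there should also be a dedicated admin socket command
    # reporting free space as bluefs sees it (name may vary per release)
    ceph daemon osd.218 bluestore bluefs available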
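
For reference, the offline compaction and the switch to the bitmap
allocator described above can be done roughly like this. This is only a
sketch, assuming osd.218 with data path /var/lib/ceph/osd/ceph-218 (from
the log), systemd-managed OSDs and the centralized config database;
adjust ids and paths per node:

    # stop the OSD and compact its RocksDB offline
    systemctl stop ceph-osd@218
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-218 compact

    # switch the allocator to bitmap; the options are read at OSD start.
    # On 14.2.x bluefs has its own allocator option as well.
    ceph config set osd.218 bluestore_allocator bitmap
    ceph config set osd.218 bluefs_allocator bitmap

    systemctl start ceph-osd@218

Whether an OSD comes up straight on bitmap after compaction clearly
varies (see the ceph_assert above), so the hybrid-recover-then-switch
sequence may still be needed on some OSDs.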
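
On the slow-ops side, osd_recovery_sleep can be changed at runtime, which
makes the back-and-forth tuning between 0.1 and 10 a bit less painful.
A sketch with example values only:

    # slow recovery down cluster-wide while the OSD catches up...
    ceph tell osd.* injectargs '--osd_recovery_sleep 10'

    # ...and speed it up again once the slow ops clear
    ceph tell osd.* injectargs '--osd_recovery_sleep 0.1'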