Let me try that on the already dead OSDs. For now we have put these values
into the config db, but it doesn't seem to help much 🙁

osd  advanced  bluefs_shared_alloc_size       32768
osd  advanced  osd_max_backfills              1
osd  advanced  osd_op_thread_suicide_timeout  2000
osd  advanced  osd_op_thread_timeout          90
osd  advanced  osd_recovery_max_active        1
osd  advanced  osd_recovery_op_priority       1

________________________________
From: Stefan Kooman <stefan@xxxxxx>
Sent: Thursday, September 12, 2024 7:29 PM
To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>; igor.fedotov@xxxxxxxx <igor.fedotov@xxxxxxxx>
Cc: Ceph Users <ceph-users@xxxxxxx>
Subject: Re: Re: bluefs _allocate unable to allocate on bdev 2

On 12-09-2024 11:40, Szabo, Istvan (Agoda) wrote:
> Thank you, so quincy should be ok right?

Yes.

> The problem was a spillover, which is why we went from separate rocksdb
> and wal back to a non-separated setup with 4 OSDs per SSD.
> My OSDs are only 53% full; would it be possible to increase the default
> 4% to 8% on an existing OSD?

If you still have space left on the device this should be possible
[https://docs.ceph.com/en/latest/man/8/ceph-bluestore-tool/]:

ceph-bluestore-tool bluefs-bdev-expand --path osd path

Gr. Stefan
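(A rough sketch of how that expansion could be applied, not taken from the
thread itself: it assumes an LVM-backed OSD with free space left in its
volume group, a non-containerized deployment managed by systemd, and $ID
standing for the OSD id.)

ceph osd set noout                    # avoid rebalancing while the OSD is down
systemctl stop ceph-osd@$ID           # bluefs-bdev-expand needs the OSD offline
lvextend -L +50G <vg_name>/<lv_name>  # grow the underlying LV; size and names are placeholders
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-$ID
systemctl start ceph-osd@$ID
ceph osd unset noout

bluefs-bdev-expand only lets the OSD grow into space the underlying device
has actually gained, so the LV (or partition) has to be enlarged first.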
> ------------------------------------------------------------------------
> From: Stefan Kooman <stefan@xxxxxx>
> Sent: Thursday, September 12, 2024 3:54 PM
> To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>; igor.fedotov@xxxxxxxx <igor.fedotov@xxxxxxxx>
> Cc: Ceph Users <ceph-users@xxxxxxx>
> Subject: Re: Re: bluefs _allocate unable to allocate on bdev 2
>
> On 12-09-2024 06:43, Szabo, Istvan (Agoda) wrote:
> > Maybe we are running into this bug, Igor?
> > https://github.com/ceph/ceph/pull/48854
>
> That would be a solution for the bug you might be hitting (unable to
> allocate 64K aligned blocks for RocksDB).
>
> I would not be surprised if you hit this issue if you are using a
> workload that needs a lot of small IO (4K alloc) to be stored, which
> fragments the OSD; based on your earlier posts to this list, I believe
> you have such a workload. We found out (with help from Igor) that there
> is also a huge performance penalty involved when OSDs are heavily
> fragmented, as the RocksDB allocator will take a long time to find free
> blocks to use. Note that a disk can get (heavily) fragmented within
> days. Like you, we had single-disk OSDs with no separate WAL/DB. We
> re-provisioned all OSDs to create a separate DB on the same disk
> (separate LVM volume). Make sure you make this volume big enough (check
> the current size of the RocksDB). You could create the OSD in the
> following way:
>
> ceph-volume lvm create --bluestore --dmcrypt --data osd.$id/osd.$id --block.db /dev/osd.$id/osd.$id_db
>
> (omit --dmcrypt if you do not want encryption). The documentation states
> between 1-4% of disk size (when RocksDB is not using compression) [1].
>
> Your other option is upgrading to a release that has 4K support for
> BlueFS. But note that it is not a default setting in Ceph releases as of
> yet (AFAIK), and might be a bit more risky (yet unknown side effects /
> drawbacks).
>
> Gr.
> Stefan
>
> [1]: https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing
>
> > ________________________________
> > From: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
> > Sent: Thursday, September 12, 2024 6:50 AM
> > To: Ceph Users <ceph-users@xxxxxxx>
> > Subject: Re: bluefs _allocate unable to allocate on bdev 2
> >
> > This is the end of a manual compaction, and the OSD actually can't
> > start even after compaction:
> >
> > Meta: https://gist.github.com/Badb0yBadb0y/f918b1e4f2d5966cefaf96d879c52a6e
> > Log: https://gist.github.com/Badb0yBadb0y/054a0cefd4a56f0236b26479cc1a5290
> >
> > ________________________________
> > From: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
> > Sent: Thursday, September 12, 2024 6:34 AM
> > To: Ceph Users <ceph-users@xxxxxxx>
> > Subject: bluefs _allocate unable to allocate on bdev 2
> >
> > Hi,
> >
> > Since yesterday, on Ceph Octopus, multiple OSDs in the cluster have
> > started to crash, and I can see this error in most of the logs:
> >
> > 2024-09-12T06:13:35.805+0700 7f98b8b27700 1 bluefs _allocate failed to allocate 0xf0732 on bdev 1, free 0x40000; fallback to bdev 2
> > 2024-09-12T06:13:35.805+0700 7f98b8b27700 1 bluefs _allocate unable to allocate 0xf0732 on bdev 2, free 0xffffffffffffffff; fallback to slow device expander
> >
> > The OSDs are only 57% full, so I don't think this should be a space
> > issue.
> >
> > I'm using SSDs, so I don't have a separate WAL/RocksDB.
> >
> > I'm running some compaction now, but I don't think that will help in
> > the long run. What could this issue be, and how can it be fixed?
> >
> > Ty
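(To put numbers on the fragmentation and RocksDB footprint discussed above,
something along these lines can be run on the OSD host for an OSD that is
still up; a sketch only, assuming $ID is the OSD id, jq is installed, and
the allocator score command exists in the release in use.)

# BlueFS/RocksDB space usage in bytes; on setups with a separate DB device,
# slow_used_bytes > 0 would indicate spillover to the slow device
ceph daemon osd.$ID perf dump | jq '.bluefs | {db_total_bytes, db_used_bytes, slow_used_bytes}'

# Fragmentation score of the main-device allocator:
# 0 means unfragmented, values close to 1 mean heavily fragmented
ceph daemon osd.$ID bluestore allocator score block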
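(For the compaction mentioned in the original post, a sketch of the two
usual variants, again assuming a non-containerized deployment and $ID as
the OSD id.)

# Online: ask a running OSD to compact its RocksDB
ceph tell osd.$ID compact

# Offline: stop the OSD first, then compact its embedded RocksDB directly
systemctl stop ceph-osd@$ID
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$ID compact
systemctl start ceph-osd@$ID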