Thank you, so Quincy should be OK then, right? The problem was spillover, which is why we went from separate RocksDB and WAL devices back to a non-separated setup with 4 OSDs per SSD. My OSDs are only 53% full; would it be possible to somehow increase the default 4% to 8% on an existing OSD?

________________________________
From: Stefan Kooman <stefan@xxxxxx>
Sent: Thursday, September 12, 2024 3:54 PM
To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>; igor.fedotov@xxxxxxxx <igor.fedotov@xxxxxxxx>
Cc: Ceph Users <ceph-users@xxxxxxx>
Subject: Re: Re: bluefs _allocate unable to allocate on bdev 2

On 12-09-2024 06:43, Szabo, Istvan (Agoda) wrote:
> Maybe we are running into this bug, Igor?
> https://github.com/ceph/ceph/pull/48854

That would be a solution for the bug you might be hitting (unable to allocate 64K-aligned blocks for RocksDB). I would not be surprised if you hit this issue, since a workload that stores a lot of small IO (4K allocations) fragments the OSD, and based on your earlier posts to this list I believe that is the kind of workload you have. We found out (with help from Igor) that there is also a huge performance penalty when OSDs are heavily fragmented, as the RocksDB allocator will take a long time to find free blocks to use. Note that a disk can get (heavily) fragmented within days.

Like you, we had single-disk OSDs with no separate WAL/DB. We re-provisioned all OSDs to create a separate DB on the same disk (a separate LVM volume). Make sure you make this volume big enough (check the current size of the RocksDB). You could create the OSD in the following way:

ceph-volume lvm create --bluestore --dmcrypt --data osd.$id/osd.$id --block.db /dev/osd.$id/osd.$id_db

(omit --dmcrypt if you do not want encryption). The documentation states between 1% and 4% of the disk size (when RocksDB is not using compression) [1].

Your other option is upgrading to a release that has 4K support for BlueFS. But note that it is not a default setting in Ceph releases as of yet (AFAIK), and it might be a bit more risky (as yet unknown side effects / drawbacks).

Gr. Stefan

[1]: https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing
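As a rough sketch of the sizing and re-provisioning steps described above: the VG/LV naming follows the osd.$id scheme from the example command, and the 8% figure simply mirrors the question at the top of this thread, so both are assumptions to adapt rather than values confirmed here.

    # Check how much space RocksDB currently occupies in BlueFS (bytes)
    ceph daemon osd.$id perf dump bluefs | grep -E 'db_(total|used)_bytes'

    # Carve out a DB LV of roughly 8% of the volume group for the new block.db
    lvcreate -l 8%VG -n osd.${id}_db osd.$id

    # Re-provision the OSD with a separate block.db, as in the example above
    # (add --dmcrypt if encryption is wanted)
    ceph-volume lvm create --bluestore --data osd.$id/osd.$id --block.db /dev/osd.$id/osd.${id}_db

If a separate DB LV already exists and only needs to grow from roughly 4% to 8%, extending it with lvextend and then running ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-$id with the OSD stopped should let BlueFS pick up the extra space; treat that as an assumption to verify against the BlueStore documentation for your release.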
> ________________________________
> From: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
> Sent: Thursday, September 12, 2024 6:50 AM
> To: Ceph Users <ceph-users@xxxxxxx>
> Subject: Re: bluefs _allocate unable to allocate on bdev 2
>
> This is the end of a manual compaction, and the OSD actually cannot start even after compacting:
> Meta: https://gist.github.com/Badb0yBadb0y/f918b1e4f2d5966cefaf96d879c52a6e
> Log: https://gist.github.com/Badb0yBadb0y/054a0cefd4a56f0236b26479cc1a5290
>
> ________________________________
> From: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
> Sent: Thursday, September 12, 2024 6:34 AM
> To: Ceph Users <ceph-users@xxxxxxx>
> Subject: bluefs _allocate unable to allocate on bdev 2
>
> Hi,
>
> Since yesterday, on Ceph Octopus, multiple OSDs in the cluster have started to crash, and I can see this error in most of the logs:
>
> 2024-09-12T06:13:35.805+0700 7f98b8b27700 1 bluefs _allocate failed to allocate 0xf0732 on bdev 1, free 0x40000; fallback to bdev 2
> 2024-09-12T06:13:35.805+0700 7f98b8b27700 1 bluefs _allocate unable to allocate 0xf0732 on bdev 2, free 0xffffffffffffffff; fallback to slow device expander
>
> The OSDs are only 57% full, so this should not be a space issue.
>
> I'm using SSDs, so I don't have a separate WAL/RocksDB.
>
> I'm running some compaction now, but I don't think that will help in the long run.
> What could this issue be, and how can it be fixed?
>
> Ty
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
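As an illustration of the compaction and fragmentation checks discussed in this thread, something along the following lines could be used; the exact admin-socket commands available can differ per release, so their presence on Octopus is an assumption to verify.

    # Online RocksDB compaction through the admin socket (OSD running)
    ceph daemon osd.$id compact

    # Offline compaction with the OSD stopped
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$id compact

    # Fragmentation score of the data device (0 = unfragmented, 1 = fully fragmented)
    ceph daemon osd.$id bluestore allocator score block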