Re: BlueStore fragmentation woes

On 5/29/23 15:52, Igor Fedotov wrote:
> Hi Stefan,
>
> given that allocation probes include every allocation (including short 4K ones), your stats look pretty high indeed.
>
> Although you omitted historic probes, so it's hard to tell whether there is a negative trend in them..

I did not omit them. We (currently) don't store logs for longer than 7 days. I will increase the interval at which the probes get created (every hour).


> As I mentioned in my reply to Hector, one might want to investigate further by e.g. building a histogram (chunk size, num chunks) from the output of the 'ceph tell osd.N bluestore allocator dump block' command and monitoring how it evolves over time. A script to build such a histogram is still to be written. ;)

We started work on such a script. But when we issue a "ceph tell osd.N bluestore allocator dump block" on OSDs that are primary for three or more CephFS metadata PGs, it causes a massive amount of slow ops (thousands), osd_op_tp threads time out (2023-05-31T11:52:35.454+0200 7fee13285700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fedf6fa5700' had timed out after 15.000000954s) and the OSD reboots itself. This is true for SSD as well as NVMe OSDs. So it seems the whole OSD is just busy processing this data, and production IO (client / rep ops) is simply starved. Ideally this call would be asynchronous, processed in batches, and not hinder IO in any way. Should I open a tracker for this?

So for us this is not a suitable way of obtaining this data. The offline way of doing this, ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-$id/ --allocator block free-dump > /root/osd.${id}_free_dump, did work and resulted in a 2.7 GiB file of JSON data. So that's quite a bit of data to process ...
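For what it's worth, a first rough cut of the (chunk size, num chunks) histogram script could look like the sketch below. The "extents" and "length" field names are assumptions about the free-dump JSON layout, so check them against an actual dump before relying on it:

```python
import json
import sys
from collections import Counter


def chunk_histogram(extents):
    """Bucket free extents into power-of-two size bins.

    Assumes each extent is a dict with a hex-string "length" field,
    e.g. {"offset": "0x1000", "length": "0x3000"} -- adjust the field
    names if your free-dump JSON differs.
    """
    hist = Counter()
    for ext in extents:
        length = int(ext["length"], 16)
        # Bin by the largest power of two <= length: a 12 KiB (0x3000)
        # extent lands in the 8 KiB bin.
        hist[1 << (length.bit_length() - 1)] += 1
    return hist


if __name__ == "__main__" and len(sys.argv) > 1:
    # Caveat: json.load pulls the whole dump into memory; for a 2.7 GiB
    # file a streaming parser (e.g. the third-party ijson) is kinder.
    with open(sys.argv[1]) as f:
        dump = json.load(f)
    for size, count in sorted(chunk_histogram(dump["extents"]).items()):
        print(f"{size:>12} B: {count}")
```

Running this on successive dumps of the same OSD should show whether the small bins grow over time.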


> As for the Pacific release being the culprit - likely it is. But there were two major updates which could have had an impact. Both came in the same PR (https://github.com/ceph/ceph/pull/34588):

> 1. 4K allocation unit for spinners

@Kevin: what drive types do you use in the clusters that are suffering from this problem? Did only HDDs suffer from this after upgrading to Pacific?

> 2. Switch to the avl/hybrid allocator.
>
> Honestly I'd rather bet on 1.

We have no spinners. We have used a 4K alloc size since Luminous, and the bitmap allocator since Luminous as well (12.2.13?). Not sure if we are suffering (more or less) on the 3 nodes that got provisioned / filled with the hybrid allocator in use. We plan to do some experiments though: fill an OSD with PGs using the bitmap allocator, and at certain amounts of PGs dump the free extents, until all PGs are present. Then repeat this process with the same PGs on an OSD using the hybrid allocator. My bet is on #2 ;-)
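To compare the bitmap and hybrid dumps at matching PG counts, a single per-dump number is handy. A minimal sketch of one such metric follows; note this is an illustrative stand-in, not BlueStore's own fragmentation score, and the (offset, length) tuple layout is an assumption about how you'd pre-parse the dump:

```python
def fragmentation_score(extents):
    """Crude fragmentation metric: 0.0 when all free space sits in one
    extent, approaching 1.0 as free space splinters into many small
    pieces. NOT BlueStore's internal score -- just a simple way to
    compare two allocator dumps.

    `extents` is assumed to be a list of (offset, length) pairs in bytes.
    """
    total = sum(length for _, length in extents)
    if total == 0:
        return 0.0
    largest = max(length for _, length in extents)
    return 1.0 - largest / total


# Hypothetical comparison of a bitmap-allocator dump vs. a hybrid one:
bitmap_extents = [(0x0, 0x100000)]                           # one 1 MiB extent
hybrid_extents = [(i * 0x2000, 0x1000) for i in range(256)]  # 256 x 4 KiB
print(fragmentation_score(bitmap_extents))  # -> 0.0
print(fragmentation_score(hybrid_extents))
```

Plotting this score against the number of PGs placed should make any divergence between the two allocators visible early in the experiment.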


>> BlueFS 4K allocation unit will not be backported to Pacific [3]. Would it make sense to skip re-provisioning OSDs in Pacific altogether and do the re-provisioning in the Quincy release with BlueFS 4K alloc size support [4]?

> IIRC this feature doesn't require OSD redeployment - the new superblock format is applied on the fly and 4K allocations are enabled immediately. So there is no specific requirement to re-provision OSDs at Quincy+. Hence you're free to go with Pacific and enable 4K for BlueFS later in Quincy.

Ah, that's good to know.

Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



