Re: BlueStore fragmentation woes

On 31/05/2023 15:26, Stefan Kooman wrote:
On 5/29/23 15:52, Igor Fedotov wrote:
Hi Stefan,

Given that allocation probes include every allocation (including short 4K ones), your stats look pretty high indeed.

You omitted the historic probes, though, so it's hard to tell whether there is a negative trend in them...

I did not omit them. We (currently) don't store logs for longer than 7 days. I will increase the interval at which the probes get created (every hour).

Allocation probe contains historic data on its own, e.g.

allocation stats probe 33: cnt: 8148921 frags: 10958186 size: 1704348508
probe -1: 35168547,  46401246, 1199516209152
probe -3: 27275094,  35681802, 200121712640
probe -5: 34847167,  52539758, 271272230912
probe -9: 44291522,  60025613, 523997483008
probe -17: 10646313,  10646313, 155178434560

In the snippet above, probes -1 through -17 are historic data from 1 through 17 days (or, more correctly, probe attempts) back.

The major idea behind this representation is to visualize how allocation fragmentation evolved without the need to grep through all the logs.
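A rough sketch of a script to pull these probe lines out of an OSD log and follow the frags/cnt ratio over time (just a minimal illustration, assuming the three columns of the historic lines are cnt, frags and size, in the same order as the current-day line) might look like this:

import re
import sys

# current-day line, e.g.:
#   allocation stats probe 33: cnt: 8148921 frags: 10958186 size: 1704348508
CUR = re.compile(r'allocation stats probe (\d+): cnt: (\d+) frags: (\d+) size: (\d+)')
# historic line, e.g.:
#   probe -1: 35168547,  46401246, 1199516209152
HIST = re.compile(r'probe -(\d+):\s+(\d+),\s+(\d+),\s+(\d+)')

def ratio(cnt, frags):
    # average number of fragments per allocation request; 1.0 means no fragmentation
    return frags / cnt if cnt else 0.0

# usage: grep 'allocation stats\|probe -' ceph-osd.N.log | python3 probe_trend.py
for line in sys.stdin:
    m = CUR.search(line)
    if m:
        probe, cnt, frags, size = (int(x) for x in m.groups())
        print(f"probe {probe}: {ratio(cnt, frags):.2f} frags/alloc, {size} bytes allocated")
        continue
    m = HIST.search(line)
    if m:
        back, cnt, frags, size = (int(x) for x in m.groups())
        print(f"  {back} probe(s) back: {ratio(cnt, frags):.2f} frags/alloc, {size} bytes allocated")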

From the info you shared it's unclear which records were for the current day and which were historic ones, if any.

Hence there is no way to estimate the degradation over time.

Please note that probes are collected since the last OSD restart. Hence some historic records might be void if the restart occurred not long ago.



As I mentioned in my reply to Hector, one might want to investigate further by e.g. building a histogram (chunk-size, num chunks) using the output of the 'ceph tell osd.N bluestore allocator dump block' command and monitoring how it evolves over time. A script to build such a histogram is still to be written. ;)

We started to investigate such a script. But when we issue a "ceph tell osd.N bluestore allocator dump block" on an OSD that is primary for three or more CephFS metadata PGs, it causes a massive amount of slow ops (thousands), the osd_op_tp threads time out (2023-05-31T11:52:35.454+0200 7fee13285700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fedf6fa5700' had timed out after 15.000000954s), and the OSD restarts itself. This is true for SSD as well as NVMe OSDs. So it seems that the whole OSD is just busy processing this data, and production IO (client / rep ops) is simply starved. Ideally this call would be asynchronous, processed in batches, and would not hinder IO in any way. Should I open a tracker for this?

Ah... this makes sense, good to know. I knew that this dump might be huge, but I never heard of it causing such a drastic impact. Perhaps it's really big this time, or you're writing it to a slow device...

Unfortunately there is no simple way to process that in batches, since we need to collect a complete, consistent snapshot taken at a given point in time. Processing in batches would create potentially inconsistent chunks, since the allocation map is constantly being updated by the OSD while it processes regular user ops...

So for us this is not a suitable way of obtaining this data. The offline way of doing this, "ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-$id/ --allocator block free-dump > /root/osd.$id_free_dump", did work and resulted in a 2.7 GiB file of JSON data. So that's quite a bit of data to process ...
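A first rough sketch of such a histogram script (a minimal illustration only, assuming the free-dump JSON contains an "extents" array with hexadecimal "offset"/"length" values, and bucketing extent sizes into power-of-two bins to keep the output readable; for a 2.7 GiB dump a streaming JSON parser such as ijson would probably be needed instead of json.load):

import json
import sys
from collections import Counter

# usage: python3 free_dump_histogram.py /root/osd.N_free_dump
with open(sys.argv[1]) as f:
    dump = json.load(f)   # caution: pulls the whole multi-GiB dump into memory

hist = Counter()
free_bytes = 0
for ext in dump.get("extents", []):
    length = int(ext["length"], 16)              # lengths assumed to be hex strings
    free_bytes += length
    bucket = 1 << (length - 1).bit_length()      # round up to the next power of two
    hist[bucket] += 1

print(f"{sum(hist.values())} free extents, {free_bytes / 2**30:.1f} GiB free in total")
print("chunk-size (<= bytes)  num chunks")
for bucket in sorted(hist):
    print(f"{bucket:>21}  {hist[bucket]}")

Running this against dumps taken at different fill levels (or on OSDs using different allocators) should show how the free-extent size distribution shifts over time.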

Yeah, the offline method is fine too. In fact, the Ceph codebase has a way to convert this JSON file to a binary format, which might drastically improve processing time and save disk space.

The tool's name is ceph_test_alloc_replay; it's primarily intended for dev purposes, hence it's not very user-friendly. And I'm not sure it's included in the regular ceph packages, so perhaps you'll need to build it yourself.



As for the Pacific release being the culprit - likely it is. But there were two major updates which could have had this impact. Both came in the same PR (https://github.com/ceph/ceph/pull/34588):

1. 4K allocation unit for spinners

@Kevin: what drive types do you use in the clusters that are suffering from this problem? Did only HDDs suffer from this after upgrading to Pacific?

2. Switch to avl/hybrid allocator.

Honestly I'd rather bet on 1.

We have no spinners. We have had a 4K alloc size since Luminous, and the bitmap allocator since Luminous (12.2.13?). Not sure if we are suffering (more or less) on the 3 nodes that got provisioned / filled with the hybrid allocator in use. We plan to do some experiments though: fill an OSD with PGs using the bitmap allocator, dump the free extents at certain numbers of PGs until all PGs are present, and then repeat the process with the same PGs on an OSD using the hybrid allocator. My bet is on # 2 ;-)

Looking forward to the results... ;) Knowing the internal design of both the bitmap and the hybrid allocator, I'd be very surprised if the latter were worse in this regard...



>BlueFS 4K allocation unit will not be backported to Pacific [3]. Would it make sense to skip re-provisioning OSDs in Pacific altogether and do the re-provisioning in the Quincy release with BlueFS 4K alloc size support [4]?

IIRC this feature doesn't require OSD redeployment - the new superblock format is applied on-the-fly and 4K allocations are enabled immediately. So there is no specific requirement to re-provision OSDs at Quincy+. Hence you're free to go with Pacific now and enable 4K for BlueFS later in Quincy.

Ah, that's good to know.

Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



