Re: BlueStore fragmentation woes

On 31/05/2023 15:26, Stefan Kooman wrote:
On 5/29/23 15:52, Igor Fedotov wrote:
Hi Stefan,

Given that allocation probes include every allocation (including short 4K ones), your stats look pretty high indeed.

You omitted the historic probes, though, so it's hard to tell whether there is a negative trend in them...

I did not omit them. We (currently) don't store logs for longer than 7 days. I will increase the interval at which the probes get created (every hour).

Allocation probe contains historic data on its own, e.g.

allocation stats probe 33: cnt: 8148921 frags: 10958186 size: 1704348508
probe -1: 35168547,  46401246, 1199516209152
probe -3: 27275094,  35681802, 200121712640
probe -5: 34847167,  52539758, 271272230912
probe -9: 44291522,  60025613, 523997483008
probe -17: 10646313,  10646313, 155178434560

In the snippet above, probes -1 through -17 are historic data from 1 through 17 days (or, more correctly, probe attempts) back.

The major idea behind this representation is to visualize how allocation fragmentation evolved without the need to grep through all the logs.
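A rough sketch of a script to pull these probe lines out of an OSD log and follow the frags/cnt ratio over time (just a minimal illustration, assuming the three columns of the historic lines are cnt, frags and size, in the same order as the current-day line) might look like this:

import re
import sys

# current-day line, e.g.:
#   allocation stats probe 33: cnt: 8148921 frags: 10958186 size: 1704348508
CUR = re.compile(r'allocation stats probe (\d+): cnt: (\d+) frags: (\d+) size: (\d+)')
# historic line, e.g.:
#   probe -1: 35168547,  46401246, 1199516209152
HIST = re.compile(r'probe -(\d+):\s+(\d+),\s+(\d+),\s+(\d+)')

def ratio(cnt, frags):
    # average number of fragments per allocation request; 1.0 means no fragmentation
    return frags / cnt if cnt else 0.0

# usage: grep 'allocation stats\|probe -' ceph-osd.N.log | python3 probe_trend.py
for line in sys.stdin:
    m = CUR.search(line)
    if m:
        probe, cnt, frags, size = (int(x) for x in m.groups())
        print(f"probe {probe}: {ratio(cnt, frags):.2f} frags/alloc, {size} bytes allocated")
        continue
    m = HIST.search(line)
    if m:
        back, cnt, frags, size = (int(x) for x in m.groups())
        print(f"  {back} probe(s) back: {ratio(cnt, frags):.2f} frags/alloc, {size} bytes allocated")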

From the info you shared it's unclear which records were for the current day and which were historic ones, if any.

Hence there is no way to estimate the degradation over time.

Please note that probes are collected since the last OSD restart. Hence some historic records might be void if the restart occurred not long ago.



As I mentioned in my reply to Hector, one might want to investigate further by e.g. building a histogram (chunk-size, num chunks) using the output of the 'ceph tell osd.N bluestore allocator dump block' command and monitoring how it evolves over time. A script to build such a histogram is still to be written. ;)

We started to investigate such a script. But when we issue a "ceph tell osd.N bluestore allocator dump block" on an OSD that is primary for three or more CephFS metadata PGs, it causes a massive amount of slow ops (thousands), the osd_op_tp threads time out (2023-05-31T11:52:35.454+0200 7fee13285700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fedf6fa5700' had timed out after 15.000000954s), and the OSD restarts itself. This is true for SSD as well as NVMe OSDs. So it seems that the whole OSD is just busy processing this data, and production IO (client / rep ops) is simply starved. Ideally this call would be asynchronous, processed in batches, and would not hinder IO in any way. Should I open a tracker for this?

Ah... this makes sense, good to know. I knew that this dump might be huge, but I never heard of it causing such a drastic impact. Perhaps it's really big this time, or you're writing it to a slow device...

Unfortunately there is no simple way to process that in batches, since we need to collect a complete, consistent snapshot taken at a given point in time. Processing in batches would create potentially inconsistent chunks, since the allocation map is constantly being updated by the OSD while it processes regular user ops...

So for us this is not a suitable way of obtaining this data. The offline way of doing this, "ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-$id/ --allocator block free-dump > /root/osd.$id_free_dump", did work and resulted in a 2.7 GiB file of JSON data. So that's quite a bit of data to process ...
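A first rough sketch of such a histogram script (a minimal illustration only, assuming the free-dump JSON contains an "extents" array with hexadecimal "offset"/"length" values, and bucketing extent sizes into power-of-two bins to keep the output readable; for a 2.7 GiB dump a streaming JSON parser such as ijson would probably be needed instead of json.load):

import json
import sys
from collections import Counter

# usage: python3 free_dump_histogram.py /root/osd.N_free_dump
with open(sys.argv[1]) as f:
    dump = json.load(f)   # caution: pulls the whole multi-GiB dump into memory

hist = Counter()
free_bytes = 0
for ext in dump.get("extents", []):
    length = int(ext["length"], 16)              # lengths assumed to be hex strings
    free_bytes += length
    bucket = 1 << (length - 1).bit_length()      # round up to the next power of two
    hist[bucket] += 1

print(f"{sum(hist.values())} free extents, {free_bytes / 2**30:.1f} GiB free in total")
print("chunk-size (<= bytes)  num chunks")
for bucket in sorted(hist):
    print(f"{bucket:>21}  {hist[bucket]}")

Running this against dumps taken at different fill levels (or on OSDs using different allocators) should show how the free-extent size distribution shifts over time.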

Yeah, the offline method is fine too. In fact, the Ceph codebase has a way to convert this JSON file to a binary format, which might drastically improve processing time and save disk space.

The tool's name is ceph_test_alloc_replay; it's primarily intended for dev purposes, hence it's not very user-friendly. And I'm not sure it's included in the regular ceph packages, so perhaps you'll need to build it yourself.



As for the Pacific release being the culprit - likely it is. But there were two major updates which could have had this impact. Both came in the same PR (https://github.com/ceph/ceph/pull/34588):

1. 4K allocation unit for spinners

@Kevin: what drive types do you use in the clusters that are suffering from this problem? Did only HDDs suffer from this after upgrading to Pacific?

2. Switch to avl/hybrid allocator.

Honestly I'd rather bet on 1.

We have no spinners. We have had a 4K alloc size since Luminous, and the bitmap allocator since Luminous (12.2.13?). Not sure if we are suffering (more or less) on the 3 nodes that got provisioned / filled with the hybrid allocator in use. We plan to do some experiments though: fill an OSD with PGs using the bitmap allocator, dump the free extents at certain numbers of PGs until all PGs are present, and then repeat the process with the same PGs on an OSD using the hybrid allocator. My bet is on # 2 ;-)

Looking forward to the results... ;) Knowing the internal design of both the bitmap and the hybrid allocator, I'd be very surprised if the latter were worse in this regard...



>BlueFS 4K allocation unit will not be backported to Pacific [3]. Would it make sense to skip re-provisioning OSDs in Pacific altogether and do the re-provisioning in the Quincy release with BlueFS 4K alloc size support [4]?

IIRC this feature doesn't require OSD redeployment - the new superblock format is applied on-the-fly and 4K allocations are enabled immediately. So there is no specific requirement to re-provision OSDs at Quincy+. Hence you're free to go with Pacific now and enable 4K for BlueFS later in Quincy.

Ah, that's good to know.

Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



