Re: BlueStore fragmentation woes

On 5/29/23 15:52, Igor Fedotov wrote:
> Hi Stefan,
>
> given that allocation probes include every allocation (including short 4K ones), your stats look pretty high indeed.
>
> Although you omitted historic probes, so it's hard to tell whether there is a negative trend in them..

I did not omit them. We (currently) don't store logs for longer than 7 days. I will increase the interval at which the probes get created (every hour).


> As I mentioned in my reply to Hector, one might want to investigate further by e.g. building a histogram (chunk size, num chunks) from the output of the 'ceph tell osd.N bluestore allocator dump block' command and monitoring how it evolves over time. A script to build such a histogram is still to be written. ;)

We started work on such a script. But when we issue a "ceph tell osd.N bluestore allocator dump block" on OSDs that are primary for three or more CephFS metadata PGs, it causes a massive amount of slow ops (thousands), osd_op_tp threads time out (2023-05-31T11:52:35.454+0200 7fee13285700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fedf6fa5700' had timed out after 15.000000954s) and the OSD reboots itself. This is true for SSD as well as NVMe OSDs. So it seems the whole OSD is just busy processing this data, and production IO (client / rep ops) is simply starved. Ideally this call would be asynchronous, processed in batches, and not hinder IO in any way. Should I open a tracker for this?

So for us this is not a suitable way of obtaining this data. The offline way of doing this, ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-$id/ --allocator block free-dump > /root/osd.${id}_free_dump, did work and resulted in a 2.7 GiB file of JSON data. So that's quite a bit of data to process ...
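For what it's worth, a first rough cut of the (chunk size, num chunks) histogram script could look like the sketch below. The "extents" and "length" field names are assumptions about the free-dump JSON layout, so check them against an actual dump before relying on it:

```python
import json
import sys
from collections import Counter


def chunk_histogram(extents):
    """Bucket free extents into power-of-two size bins.

    Assumes each extent is a dict with a hex-string "length" field,
    e.g. {"offset": "0x1000", "length": "0x3000"} -- adjust the field
    names if your free-dump JSON differs.
    """
    hist = Counter()
    for ext in extents:
        length = int(ext["length"], 16)
        # Bin by the largest power of two <= length: a 12 KiB (0x3000)
        # extent lands in the 8 KiB bin.
        hist[1 << (length.bit_length() - 1)] += 1
    return hist


if __name__ == "__main__" and len(sys.argv) > 1:
    # Caveat: json.load pulls the whole dump into memory; for a 2.7 GiB
    # file a streaming parser (e.g. the third-party ijson) is kinder.
    with open(sys.argv[1]) as f:
        dump = json.load(f)
    for size, count in sorted(chunk_histogram(dump["extents"]).items()):
        print(f"{size:>12} B: {count}")
```

Running this on successive dumps of the same OSD should show whether the small bins grow over time.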


> As for the Pacific release being the culprit - likely it is. But there were two major updates which could have had an impact. Both came in the same PR (https://github.com/ceph/ceph/pull/34588):

> 1. 4K allocation unit for spinners

@Kevin: what drive types do you use in the clusters that are suffering from this problem? Did only HDDs suffer from this after upgrading to Pacific?

> 2. Switch to the avl/hybrid allocator.
>
> Honestly I'd rather bet on 1.

We have no spinners. We have used a 4K alloc size since Luminous, and the bitmap allocator since Luminous as well (12.2.13?). Not sure if we are suffering (more or less) on the 3 nodes that got provisioned / filled with the hybrid allocator in use. We plan to do some experiments though: fill an OSD with PGs using the bitmap allocator, and at certain amounts of PGs dump the free extents, until all PGs are present. Then repeat this process with the same PGs on an OSD using the hybrid allocator. My bet is on #2 ;-)
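To compare the bitmap and hybrid dumps at matching PG counts, a single per-dump number is handy. A minimal sketch of one such metric follows; note this is an illustrative stand-in, not BlueStore's own fragmentation score, and the (offset, length) tuple layout is an assumption about how you'd pre-parse the dump:

```python
def fragmentation_score(extents):
    """Crude fragmentation metric: 0.0 when all free space sits in one
    extent, approaching 1.0 as free space splinters into many small
    pieces. NOT BlueStore's internal score -- just a simple way to
    compare two allocator dumps.

    `extents` is assumed to be a list of (offset, length) pairs in bytes.
    """
    total = sum(length for _, length in extents)
    if total == 0:
        return 0.0
    largest = max(length for _, length in extents)
    return 1.0 - largest / total


# Hypothetical comparison of a bitmap-allocator dump vs. a hybrid one:
bitmap_extents = [(0x0, 0x100000)]                           # one 1 MiB extent
hybrid_extents = [(i * 0x2000, 0x1000) for i in range(256)]  # 256 x 4 KiB
print(fragmentation_score(bitmap_extents))  # -> 0.0
print(fragmentation_score(hybrid_extents))
```

Plotting this score against the number of PGs placed should make any divergence between the two allocators visible early in the experiment.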


>> BlueFS 4K allocation unit will not be backported to Pacific [3]. Would it make sense to skip re-provisioning OSDs in Pacific altogether and do the re-provisioning in the Quincy release with BlueFS 4K alloc size support [4]?

> IIRC this feature doesn't require OSD redeployment - the new superblock format is applied on the fly and 4K allocations are enabled immediately. So there is no specific requirement to re-provision OSDs at Quincy+. Hence you're free to go with Pacific and enable 4K for BlueFS later in Quincy.

Ah, that's good to know.

Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



