On 5/31/23 09:15, Igor Fedotov wrote:
On 31/05/2023 15:26, Stefan Kooman wrote:
On 5/29/23 15:52, Igor Fedotov wrote:
Hi Stefan,
Given that allocation probes include every allocation (including
short 4K ones), your stats look pretty high indeed.
Although you omitted the historic probes, so it's hard to tell whether
there is a negative trend in them.
I did not omit them. We (currently) don't store logs for longer than
7 days. I will increase the interval at which the probes get created
(every hour).
Allocation probe contains historic data on its own, e.g.
allocation stats probe 33: cnt: 8148921 frags: 10958186 size: 1704348508
probe -1: 35168547, 46401246, 1199516209152
probe -3: 27275094, 35681802, 200121712640
probe -5: 34847167, 52539758, 271272230912
probe -9: 44291522, 60025613, 523997483008
probe -17: 10646313, 10646313, 155178434560
In the snippet above, probes -1 through -17 are historic data from 1 through
17 days (or, more correctly, probe attempts) back.
The major idea behind this representation is to visualize how
allocation fragmentation evolved without the need to grep through all
the logs.
From the info you shared it's unclear which records were for the
current day and which were historic ones, if any.
Hence there is no way to estimate the degradation over time.
Please note that probes are collected since the last OSD restart. Hence some
historic records might be void if the restart occurred not long ago.
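If it helps, a quick-and-dirty parser along these lines (untested sketch,
assuming the exact line format from the snippet above) could pull the numbers
out of the OSD log and print fragments-per-allocation plus average fragment
size per probe, which are the figures I'd watch over time:

#!/usr/bin/env python3
# Sketch: extract "allocation stats probe" records from an OSD log (stdin)
# and print frags/alloc plus average fragment size per probe.
# Assumes the exact line format shown in the snippet above.
import re
import sys

CURRENT = re.compile(r'allocation stats probe (-?\d+): cnt: (\d+) frags: (\d+) size: (\d+)')
HISTORIC = re.compile(r'probe (-\d+): (\d+), (\d+), (\d+)')

for line in sys.stdin:
    m = CURRENT.search(line) or HISTORIC.search(line)
    if not m:
        continue
    probe = m.group(1)
    cnt, frags, size = (int(g) for g in m.groups()[1:])
    if cnt == 0 or frags == 0:
        continue
    print(f"probe {probe}: {frags / cnt:.2f} frags/alloc, "
          f"{size / frags / 1024:.1f} KiB avg fragment")

(Feed it the whole OSD log on stdin; the historic "probe -N:" lines don't
contain the "allocation stats" marker, so a plain grep for that string would
drop them.)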
As I mentioned in my reply to Hector, one might want to investigate
further by e.g. building a histogram (chunk size, number of chunks)
using the output of the 'ceph tell osd.N bluestore allocator dump
block' command and monitoring how it evolves over time. A script to
build such a histogram is still to be written. ;)
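Something along these lines could serve as a starting point. It's an untested
sketch, and it assumes the dump is a JSON object with a top-level 'extents'
array whose entries carry 'offset' and 'length' (as hex strings or plain
integers); adjust the field names if your output differs:

#!/usr/bin/env python3
# Sketch: build a (chunk-size bucket -> number of chunks) histogram from the
# JSON produced by 'ceph tell osd.N bluestore allocator dump block'.
# Assumes a top-level "extents" array with "offset"/"length" entries
# (hex strings like "0x15000" or plain integers).
import json
import sys
from collections import Counter


def to_int(v):
    # accept both hex strings ("0x15000") and plain integers
    return v if isinstance(v, int) else int(str(v), 0)


def bucket(length):
    # smallest power-of-two bucket (starting at 4 KiB) the chunk fits into
    b = 4096
    while b < length:
        b *= 2
    return b


def main(path):
    with open(path) as f:
        dump = json.load(f)
    hist = Counter()
    free_bytes = 0
    for ext in dump["extents"]:
        length = to_int(ext["length"])
        hist[bucket(length)] += 1
        free_bytes += length
    for size in sorted(hist):
        print(f"<= {size // 1024:>8} KiB : {hist[size]}")
    print(f"total free: {free_bytes / 2**30:.1f} GiB in {sum(hist.values())} extents")


if __name__ == "__main__":
    main(sys.argv[1])

Redirect the 'ceph tell' output to a file and pass that file as the single
argument; running it periodically and comparing the bucket counts should show
how fragmentation evolves.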
We started to investigate such a script. But when we issue a "ceph
tell osd.N bluestore allocator dump block" on OSDs that are primary
for three or more CephFS metadata PGs, it causes a massive
amount of slow ops (thousands), osd_op_tp threads time out
(2023-05-31T11:52:35.454+0200 7fee13285700 1 heartbeat_map
is_healthy 'OSD::osd_op_tp thread 0x7fedf6fa5700' had timed out after
15.000000954s) and the OSD restarts itself. This is true for SSD
as well as NVMe OSDs. So it seems that the whole OSD is just busy
processing this data, and production IO (client / rep ops) is just
starved. Ideally this call would be asynchronous, processed in
batches, and not hinder IO in any way. Should I open a tracker for this?
Ah... this makes sense, good to know. I knew that this dump might be
huge, but I had never heard of it causing such a drastic impact. Perhaps it's
really big this time, or you're writing it to a slow device.
Unfortunately there is no simple way to process that in batches,
since we need to collect a complete, consistent snapshot taken at a given
point in time. Processing in batches would create potentially
inconsistent chunks, since the allocation map is constantly updated by the
OSD while it processes regular user ops.
So for us this is not a suitable way of obtaining this data. The
offline way of doing this, 'ceph-bluestore-tool --path
/var/lib/ceph/osd/ceph-$id/ --allocator block free-dump >
/root/osd.${id}_free_dump', did work and resulted in a 2.7 GiB file of
JSON data. So that's quite a bit of data to process ...
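For a file that size we'll probably stream the JSON rather than load it all
at once. A rough sketch with the third-party ijson module (pip install ijson),
assuming the same top-level 'extents' array with 'length' fields as above:

#!/usr/bin/env python3
# Sketch: stream a multi-GiB free-dump JSON file without loading it into
# memory, using the third-party 'ijson' module.
# Assumes a top-level "extents" array with "length" fields (hex strings or ints).
import sys
import ijson


def to_int(v):
    return v if isinstance(v, int) else int(str(v), 0)


def main(path):
    count = 0
    free_bytes = 0
    largest = 0
    with open(path, "rb") as f:
        for ext in ijson.items(f, "extents.item"):
            length = to_int(ext["length"])
            count += 1
            free_bytes += length
            largest = max(largest, length)
    if count:
        print(f"{count} free extents, {free_bytes / 2**30:.1f} GiB free, "
              f"avg {free_bytes / count / 1024:.1f} KiB, "
              f"largest {largest / 2**20:.1f} MiB")


if __name__ == "__main__":
    main(sys.argv[1])

This one only prints summary stats; the same streaming loop could of course
feed the bucket counting from the earlier histogram sketch instead.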
Yeah, the offline method is fine too. In fact the Ceph codebase has a way to
convert this JSON file to a binary format, which might drastically
improve processing time and save disk space.
The tool is called ceph_test_alloc_replay; it's primarily intended for
dev purposes, hence it's not very user-friendly. And I'm not sure it's
included in the regular Ceph packages, so perhaps you'll need to build it
yourself.
As for the Pacific release being the culprit - likely it is. But there
were two major updates which could have had an impact. Both came in the
same PR (https://github.com/ceph/ceph/pull/34588):
1. 4K allocation unit for spinners
@Kevin: what drive types do you use in the clusters that are
suffering from this problem? Did only HDDs suffer from this after
upgrading to Pacific?
2. Switch to avl/hybrid allocator.
Honestly I'd rather bet on 1.
We have no spinners. We have had a 4K alloc size since Luminous, and bitmap
since Luminous (12.2.13?). Not sure if we are suffering (more or
less) on the 3 nodes that got provisioned / filled with the hybrid
allocator in use. We plan to do some experiments though: fill an OSD
with PGs using the bitmap allocator and dump the free extents at certain
PG counts, until all PGs are present. Then repeat this process with the
same PGs on an OSD with the hybrid allocator. My bet is on #2 ;-).
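To compare the dumps at each checkpoint, something like this side-by-side
histogram is what we have in mind (untested, same assumptions as the earlier
sketches, i.e. a top-level 'extents' array with 'length' fields):

#!/usr/bin/env python3
# Sketch: print free-extent size histograms of two allocator dumps side by
# side, e.g. a bitmap-allocator OSD vs. a hybrid-allocator OSD at the same
# fill level. Assumes a top-level "extents" array with "length" fields.
import json
import sys
from collections import Counter


def to_int(v):
    return v if isinstance(v, int) else int(str(v), 0)


def histogram(path):
    hist = Counter()
    with open(path) as f:
        dump = json.load(f)
    for ext in dump["extents"]:
        length = to_int(ext["length"])
        b = 4096
        while b < length:
            b *= 2
        hist[b] += 1
    return hist


def main(bitmap_path, hybrid_path):
    a, b = histogram(bitmap_path), histogram(hybrid_path)
    print(f"{'bucket':>13} {'bitmap':>12} {'hybrid':>12}")
    for size in sorted(set(a) | set(b)):
        print(f"<= {size // 1024:>6} KiB {a.get(size, 0):>12} {b.get(size, 0):>12}")


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])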
Looking forward for the results... ;) Knowing internal design for both
bitmap and hybrid allocator I'd be very surprised the latter one is
worse in this regard...
Related to this, I was a little surprised to learn how the hybrid
allocator works. I figured we would do something like have a coarse-grained
implementation of one data structure and implement the other at
the leaves. Adam was explaining that this isn't what we do, though; we
just switch over at some point?
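A toy illustration of what I take "switch over at some point" to mean (purely
conceptual, with a made-up cap for the demo, and definitely not the actual
AvlAllocator/BitmapAllocator code):

# Toy sketch: keep exact free ranges in a fine-grained structure until it
# grows past a cap, then record further ranges only as coarse per-AU bits.
# NOT the actual Ceph allocator code.
AU = 4096        # allocation unit (assumed for the demo)
RANGE_CAP = 4    # tiny, made-up cap so the switch-over is visible


class ToyHybrid:
    def __init__(self, capacity):
        self.ranges = {}                          # offset -> length (fine side)
        self.bitmap = [False] * (capacity // AU)  # per-AU free bits (coarse side)

    def release(self, offset, length):
        if len(self.ranges) < RANGE_CAP:
            self.ranges[offset] = length          # cheap, exact bookkeeping
        else:
            # past the cap: only track coarse per-AU bits from here on
            for au in range(offset // AU, (offset + length) // AU):
                self.bitmap[au] = True

    def free_bytes(self):
        return sum(self.ranges.values()) + sum(self.bitmap) * AU


h = ToyHybrid(capacity=64 * AU)
for i in range(8):
    h.release(i * 8 * AU, 4 * AU)   # first few land in the dict, the rest in the bitmap
print(len(h.ranges), "tracked ranges,", h.free_bytes(), "bytes free")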
>BlueFS 4K allocation unit will not be backported to Pacific [3].
Would it make sense to skip re-provisioning OSDs in Pacific altogether
and do the re-provisioning in the Quincy release, with BlueFS 4K alloc size
support [4]?
IIRC this feature doesn't require OSD redeployment - the new superblock
format is applied on the fly and 4K allocations are enabled
immediately. So there is no specific requirement to re-provision OSDs
at Quincy+. Hence you're free to go with Pacific now and enable 4K for
BlueFS later in Quincy.
Ah, that's good to know.
Gr. Stefan
--
Best Regards,
Mark Nelson
Head of R&D (USA)
Clyso GmbH
p: +49 89 21552391 12
a: Loristraße 8 | 80335 München | Germany
w: https://clyso.com | e: mark.nelson@xxxxxxxxx
We are hiring: https://www.clyso.com/jobs/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx