On 29/05/2023 22.26, Igor Fedotov wrote:
> So fragmentation score calculation was improved recently indeed, see
> https://github.com/ceph/ceph/pull/49885
>
> And yeah, one can see some fragmentation in allocations for the first two
> OSDs. It doesn't look as dramatic as the fragmentation scores suggest, though.
>
> Additionally you might want to collect a free extents dump using the
> 'ceph tell osd.N bluestore allocator dump block' command and do more
> analysis on that data.
>
> E.g. I'd recommend building something like a histogram showing the number
> of chunks in specific size ranges:
>
> [1-4K]: N1 chunks
> (4K-16K]: N2 chunks
> (16K-64K]: N3 chunks
> ...
> [16M-inf): Nn chunks
>
> This should be even more informative about the fragmentation state -
> particularly if observed as it evolves over time.
>
> Looking for volunteers to write a script for building such a histogram... ;)

I'm up for that, once I get through some other cluster maintenance I need to
deal with first :) Backfill is almost done and I was finally able to destroy
two OSDs; I will be doing a bunch of restructuring in the coming weeks. I can
probably get the script done partway through doing this, so I can see how the
distributions evolve over a bunch of data movement. (I've put a rough first
sketch of what I have in mind further down in this mail.)

> Thanks,
>
> Igor
>
> On 28/05/2023 08:31, Hector Martin wrote:
>> So chiming in, I think something is definitely wrong with at *least* the
>> frag score.
>>
>> Here's what happened so far:
>>
>> 1. I had 8 OSDs (all 8T HDDs)
>> 2. I added 2 more (osd.0,1), with Quincy defaults
>> 3. I marked 2 old ones out (the ones that seemed to be struggling the
>> most with IOPS)
>> 4. I added 2 more (osd.2,3), but this time I had previously set
>> bluestore_min_alloc_size_hdd to 16K as an experiment
>>
>> This has all happened in the space of about a week. That means there was
>> data movement into the first 2 new OSDs, and then before that completed I
>> added 2 new OSDs. So I would expect some data thrashing on the first 2,
>> but nothing extreme.
>>
>> The fragmentation scores for the 4 new OSDs are, respectively:
>>
>> 0.746, 0.835, 0.160, 0.067
>>
>> That seems ridiculous for the first two; it's only been a week. The
>> newest two seem in better shape, though those mostly would've seen only
>> data moving in, not out. The rebalance isn't done yet, but it's almost
>> done and all 4 OSDs have a similar fullness level at this time.
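(Getting slightly ahead of myself: below is the rough shape of the histogram
script I have in mind for Igor's suggestion above. It is a completely untested
sketch - I'm assuming the 'ceph tell osd.N bluestore allocator dump block' JSON
contains an "extents" list whose entries have a "length" field, possibly as a
hex string; I'll adjust the field handling once I've looked at real output from
one of my OSDs. If the ceph-bluestore-tool free-dump output is close enough,
the same bucketing should work on that too.)

#!/usr/bin/env python3
# Sketch: histogram of free extent sizes from a bluestore allocator dump.
# Usage: ceph tell osd.N bluestore allocator dump block > dump.json
#        ./free_extent_histogram.py dump.json
# Assumes the JSON has an "extents" list with "length" values (int or hex
# string) -- adjust to whatever the real dump format turns out to be.
import json
import sys

# Bucket upper bounds in bytes: [0,4K], (4K,16K], (16K,64K], ... , (16M,inf)
BOUNDS = [4 << 10, 16 << 10, 64 << 10, 256 << 10, 1 << 20, 4 << 20, 16 << 20]
LABELS = ["<=4K", "4K-16K", "16K-64K", "64K-256K", "256K-1M", "1M-4M",
          "4M-16M", ">16M"]

def to_int(value):
    # Lengths may be plain ints or hex strings like "0x10000".
    return value if isinstance(value, int) else int(str(value), 0)

def bucket(length):
    for i, bound in enumerate(BOUNDS):
        if length <= bound:
            return i
    return len(BOUNDS)

def main(path):
    with open(path) as f:
        dump = json.load(f)
    counts = [0] * len(LABELS)
    for ext in dump.get("extents", []):
        counts[bucket(to_int(ext["length"]))] += 1
    for label, count in zip(LABELS, counts):
        print(f"{label:>9}: {count} chunks")

if __name__ == "__main__":
    main(sys.argv[1])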
>> Looking at alloc stats:
>>
>> ceph-0) allocation stats probe 6: cnt: 2219302 frags: 2328003 size: 1238454677504
>> ceph-0) probe -1: 1848577, 1970325, 1022324588544
>> ceph-0) probe -2: 848301, 862622, 505329963008
>> ceph-0) probe -6: 2187448, 2187448, 1055241568256
>> ceph-0) probe -14: 0, 0, 0
>> ceph-0) probe -22: 0, 0, 0
>>
>> ceph-1) allocation stats probe 6: cnt: 1882396 frags: 1947321 size: 1054829641728
>> ceph-1) probe -1: 2212293, 2345923, 1215418728448
>> ceph-1) probe -2: 1471623, 1525498, 826984652800
>> ceph-1) probe -6: 2095298, 2095298, 1000065933312
>> ceph-1) probe -14: 0, 0, 0
>> ceph-1) probe -22: 0, 0, 0
>>
>> ceph-2) allocation stats probe 3: cnt: 2760200 frags: 2760200 size: 1554513903616
>> ceph-2) probe -1: 2584046, 2584046, 1498140393472
>> ceph-2) probe -3: 1696921, 1696921, 869424496640
>> ceph-2) probe -7: 0, 0, 0
>> ceph-2) probe -11: 0, 0, 0
>> ceph-2) probe -19: 0, 0, 0
>>
>> ceph-3) allocation stats probe 3: cnt: 2544818 frags: 2544818 size: 1432225021952
>> ceph-3) probe -1: 2688015, 2688015, 1515260739584
>> ceph-3) probe -3: 1086875, 1086875, 622025424896
>> ceph-3) probe -7: 0, 0, 0
>> ceph-3) probe -11: 0, 0, 0
>> ceph-3) probe -19: 0, 0, 0
>>
>> So OSDs 2 and 3 (the latest ones to be added; note that these 4 new OSDs
>> are 0-3 since those IDs were free) are in good shape, but 0 and 1 are
>> already suffering from at least some fragmentation of objects, which is
>> a bit worrying when they are only ~70% full right now and only a week old.
>>
>> I did delete a couple million small objects during the rebalance to try
>> to reduce load (I had some nasty directories), but that was cumulatively
>> only about 60GB of data. So while that could explain a high frag score
>> if there are now a million little holes in the free space map of the
>> OSDs (how is it calculated?), it should not actually cause new data
>> moving in to end up fragmented, since there should still be plenty of
>> unfragmented free space to go around.
>>
>> I am now restarting OSDs 0 and 1 to see whether that makes the frag
>> score go down over time. I will do further analysis later with the raw
>> bluestore free space map, since I still have a bunch of rebalancing and
>> moving data around planned (I'm moving my cluster to new machines).
>>
>> On 26/05/2023 00.29, Igor Fedotov wrote:
>>> Hi Hector,
>>>
>>> I can suggest two tools for further fragmentation analysis:
>>>
>>> 1) You might want to use ceph-bluestore-tool's free-dump command to get
>>> a list of free chunks for an OSD and analyze whether it's really highly
>>> fragmented and lacks long enough extents. free-dump just returns a list
>>> of extents in JSON format; I can take a look at the output if shared...
>>>
>>> 2) You might want to look at the allocation probes in the OSD logs and
>>> see how fragmentation in allocated chunks has evolved over time.
>>>
>>> E.g.:
>>>
>>> allocation stats probe 33: cnt: 8148921 frags: 10958186 size: 1704348508
>>> probe -1: 35168547, 46401246, 1199516209152
>>> probe -3: 27275094, 35681802, 200121712640
>>> probe -5: 34847167, 52539758, 271272230912
>>> probe -9: 44291522, 60025613, 523997483008
>>> probe -17: 10646313, 10646313, 155178434560
>>>
>>> The first probe refers to the last day, while the others match days (or
>>> rather probes) -1, -3, -5, -9 and -17.
>>>
>>> The 'cnt' column represents the number of allocations performed in the
>>> previous 24 hours and the 'frags' column shows the number of fragments
>>> in the resulting allocations.
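(Quick aside: something along these lines should do for pulling these probe
lines out of an OSD log and computing the frags/cnt ratio per probe. Untested
sketch; the regex matches the line format exactly as quoted in this thread, so
other releases may need tweaks.)

#!/usr/bin/env python3
# Sketch: report the frags/cnt ratio for every "allocation stats probe" line
# in an OSD log (a ratio of 1.0 means no extra fragments at all).
# Assumed line format, as quoted above:
#   allocation stats probe N: cnt: X frags: Y size: Z
import re
import sys

PROBE_RE = re.compile(
    r"allocation stats probe (\d+): cnt: (\d+) frags: (\d+) size: (\d+)"
)

def main(log_path):
    for line in open(log_path, errors="replace"):
        m = PROBE_RE.search(line)
        if not m:
            continue
        probe, cnt, frags, size = (int(g) for g in m.groups())
        ratio = frags / cnt if cnt else 0.0
        print(f"probe {probe}: cnt={cnt} frags={frags} "
              f"ratio={ratio:.3f} size={size}")

if __name__ == "__main__":
    main(sys.argv[1])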
>>> So a significant mismatch between frags and cnt might indeed indicate
>>> some issues with high fragmentation.
>>>
>>> Apart from retrospective analysis, you might also want to check how OSD
>>> behavior changes after a reboot - e.g. whether a rebooted OSD produces
>>> less fragmentation... which in turn might indicate some issues with the
>>> BlueStore allocator.
>>>
>>> Just FYI: the allocation probe printing interval is controlled by the
>>> bluestore_alloc_stats_dump_interval parameter.
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>> On 24/05/2023 17:18, Hector Martin wrote:
>>>> On 24/05/2023 22.07, Mark Nelson wrote:
>>>>> Yep, bluestore fragmentation is an issue. It's sort of a natural
>>>>> result of using copy-on-write and never implementing any kind of
>>>>> defragmentation scheme. Adam and I have been talking about doing it
>>>>> now, probably piggybacking on scrub or other operations that are
>>>>> already reading all of the extents for an object anyway.
>>>>>
>>>>> I wrote a very simple prototype for clone to speed up the rbd-mirror
>>>>> use case here:
>>>>>
>>>>> https://github.com/markhpc/ceph/commit/29fc1bfd4c90dd618eb9e0d4ae6474d8cfa5dfdf
>>>>>
>>>>> Adam ended up going the extra mile and completely changed how shared
>>>>> blobs work, which probably eliminates the need to do defrag on clone
>>>>> anymore from an rbd-mirror perspective, but I think we still need to
>>>>> identify any times we are doing full object reads of fragmented
>>>>> objects and consider defragmenting at that time. It might be clone, or
>>>>> scrub, or other things, but the point is that if we are already doing
>>>>> most of the work (seeks on HDD especially!) the extra cost of a large
>>>>> write to clean it up isn't that bad, especially if we are doing it
>>>>> over the course of months or years and it can help keep free space
>>>>> less fragmented.
>>>>
>>>> Note that my particular issue seemed to specifically be free space
>>>> fragmentation. I don't use RBD mirror and I would not *expect* most of
>>>> my cephfs use cases to lead to any weird CoW/fragmentation issues with
>>>> objects other than those forced by the free space becoming fragmented
>>>> (unless there is some weird pathological use case I'm hitting). Most of
>>>> my write workloads are just copying files in bulk and incrementally
>>>> writing out files.
>>>>
>>>> Would simply defragging objects during scrub/etc help with free space
>>>> fragmentation itself? Those seem like two somewhat unrelated issues...
>>>> note that if free space is already fragmented, you wouldn't even have a
>>>> place to put down a defragmented object.
>>>>
>>>> Are there any stats I can look at to figure out how bad object and free
>>>> space fragmentation is? It would be nice to have some clearer data
>>>> beyond my hunch/deduction after seeing the I/O patterns and the sole
>>>> fragmentation number :). It would also be interesting to get some kind
>>>> of trace of the bluestore ops the OSD is doing, so I can find out
>>>> whether it's doing something pathological that causes more
>>>> fragmentation for some reason.
>>>>
>>>>> Mark
>>>>>
>>>>> On 5/24/23 07:17, Hector Martin wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I've been seeing relatively large fragmentation numbers on all my
>>>>>> OSDs:
>>>>>>
>>>>>> ceph daemon osd.13 bluestore allocator score block
>>>>>> {
>>>>>>     "fragmentation_rating": 0.77251526920454427
>>>>>> }
>>>>>>
>>>>>> These aren't that old, as I recreated them all around July last year.
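(Aside, for anyone wanting to keep an eye on this across a whole cluster:
something like the wrapper below should work for polling the scores of all
OSDs. Untested sketch; it assumes the 'ceph tell osd.N bluestore allocator
score block' variant exists on your release and returns the same JSON as the
daemon command quoted above, with a "fragmentation_rating" field.)

#!/usr/bin/env python3
# Sketch: print the bluestore fragmentation score of every OSD, worst first.
# Assumes `ceph osd ls` lists the OSD ids and that the "tell" variant of the
# allocator score command behaves like the daemon command quoted above.
import json
import subprocess

def ceph_json(*args):
    # Run a ceph CLI command and parse its JSON output.
    out = subprocess.check_output(("ceph",) + args)
    return json.loads(out)

def main():
    osd_ids = ceph_json("osd", "ls", "--format=json")
    scores = {}
    for osd in osd_ids:
        reply = ceph_json("tell", f"osd.{osd}", "bluestore", "allocator",
                          "score", "block")
        scores[osd] = reply["fragmentation_rating"]
    for osd, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"osd.{osd}: {score:.3f}")

if __name__ == "__main__":
    main()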
>>>>>> They mostly hold CephFS data with erasure coding, with a mix of large
>>>>>> and small files. The OSDs are at around 80%-85% utilization right now.
>>>>>> Most of the data was written sequentially when the OSDs were created
>>>>>> (I rsynced everything from a remote backup). Since then more data has
>>>>>> been added, but not particularly quickly.
>>>>>>
>>>>>> At some point I noticed pathologically slow writes and I couldn't
>>>>>> figure out what was wrong. Eventually I did some block tracing and
>>>>>> noticed the I/Os were very small, even though on the CephFS side I was
>>>>>> just writing one large file sequentially, and that's when I stumbled
>>>>>> upon the free space fragmentation problem. Indeed, deleting some large
>>>>>> files opened up some larger free extents and resolved the problem, but
>>>>>> only until those get filled up and I'm back to fragmented tiny
>>>>>> extents. So effectively I'm stuck at the current utilization, as
>>>>>> trying to fill the OSDs up any more just slows down to an absolute
>>>>>> crawl.
>>>>>>
>>>>>> I'm adding a few more OSDs and plan on doing the dance of removing one
>>>>>> OSD at a time and replacing it with another one to hopefully improve
>>>>>> the situation, but obviously this is going to take forever.
>>>>>>
>>>>>> Is there any plan to offer a defrag tool of some sort for bluestore?
>>>>>>
>>>>>> - Hector
>>>>
>>>> - Hector
>>
>> - Hector
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx