On 29/05/2023 22.26, Igor Fedotov wrote:
> So fragmentation score calculation was improved recently indeed, see
> https://github.com/ceph/ceph/pull/49885
>
> And yeah, one can see some fragmentation in allocations for the first two
> OSDs. It doesn't look as dramatic as the fragmentation scores suggest, though.
>
> Additionally you might want to collect a free extents dump using the
> 'ceph tell osd.N bluestore allocator dump block' command and do more
> analysis on that data.
>
> E.g. I'd recommend building something like a histogram showing the number
> of chunks in specific size ranges:
>
> [1-4K]: N1 chunks
> (4K-16K]: N2 chunks
> (16K-64K]: N3 chunks
> ...
> [16M-inf): Nn chunks
>
> This should be even more informative about the fragmentation state -
> particularly if observed as it evolves over time.
>
> Looking for volunteers to write a script for building such a histogram... ;)

I'm up for that, once I get through some other cluster maintenance I need to
deal with first :) Backfill is almost done and I was finally able to destroy
two OSDs; I will be doing a bunch of restructuring in the coming weeks. I can
probably get the script done partway through doing this, so I can see how the
distributions evolve over a bunch of data movement. (I've put a rough first
sketch of what I have in mind further down in this mail.)

> Thanks,
>
> Igor
>
> On 28/05/2023 08:31, Hector Martin wrote:
>> So chiming in, I think something is definitely wrong with at *least* the
>> frag score.
>>
>> Here's what happened so far:
>>
>> 1. I had 8 OSDs (all 8T HDDs)
>> 2. I added 2 more (osd.0,1), with Quincy defaults
>> 3. I marked 2 old ones out (the ones that seemed to be struggling the
>> most with IOPS)
>> 4. I added 2 more (osd.2,3), but this time I had previously set
>> bluestore_min_alloc_size_hdd to 16K as an experiment
>>
>> This has all happened in the space of about a week. That means there was
>> data movement into the first 2 new OSDs, and then before that completed I
>> added 2 new OSDs. So I would expect some data thrashing on the first 2,
>> but nothing extreme.
>>
>> The fragmentation scores for the 4 new OSDs are, respectively:
>>
>> 0.746, 0.835, 0.160, 0.067
>>
>> That seems ridiculous for the first two; it's only been a week. The
>> newest two seem in better shape, though those mostly would've seen only
>> data moving in, not out. The rebalance isn't done yet, but it's almost
>> done and all 4 OSDs have a similar fullness level at this time.
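(Getting slightly ahead of myself: below is the rough shape of the histogram
script I have in mind for Igor's suggestion above. It is a completely untested
sketch - I'm assuming the 'ceph tell osd.N bluestore allocator dump block' JSON
contains an "extents" list whose entries have a "length" field, possibly as a
hex string; I'll adjust the field handling once I've looked at real output from
one of my OSDs. If the ceph-bluestore-tool free-dump output is close enough,
the same bucketing should work on that too.)

#!/usr/bin/env python3
# Sketch: histogram of free extent sizes from a bluestore allocator dump.
# Usage: ceph tell osd.N bluestore allocator dump block > dump.json
#        ./free_extent_histogram.py dump.json
# Assumes the JSON has an "extents" list with "length" values (int or hex
# string) -- adjust to whatever the real dump format turns out to be.
import json
import sys

# Bucket upper bounds in bytes: [0,4K], (4K,16K], (16K,64K], ... , (16M,inf)
BOUNDS = [4 << 10, 16 << 10, 64 << 10, 256 << 10, 1 << 20, 4 << 20, 16 << 20]
LABELS = ["<=4K", "4K-16K", "16K-64K", "64K-256K", "256K-1M", "1M-4M",
          "4M-16M", ">16M"]

def to_int(value):
    # Lengths may be plain ints or hex strings like "0x10000".
    return value if isinstance(value, int) else int(str(value), 0)

def bucket(length):
    for i, bound in enumerate(BOUNDS):
        if length <= bound:
            return i
    return len(BOUNDS)

def main(path):
    with open(path) as f:
        dump = json.load(f)
    counts = [0] * len(LABELS)
    for ext in dump.get("extents", []):
        counts[bucket(to_int(ext["length"]))] += 1
    for label, count in zip(LABELS, counts):
        print(f"{label:>9}: {count} chunks")

if __name__ == "__main__":
    main(sys.argv[1])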
>> Looking at alloc stats:
>>
>> ceph-0) allocation stats probe 6: cnt: 2219302 frags: 2328003 size: 1238454677504
>> ceph-0) probe -1: 1848577, 1970325, 1022324588544
>> ceph-0) probe -2: 848301, 862622, 505329963008
>> ceph-0) probe -6: 2187448, 2187448, 1055241568256
>> ceph-0) probe -14: 0, 0, 0
>> ceph-0) probe -22: 0, 0, 0
>>
>> ceph-1) allocation stats probe 6: cnt: 1882396 frags: 1947321 size: 1054829641728
>> ceph-1) probe -1: 2212293, 2345923, 1215418728448
>> ceph-1) probe -2: 1471623, 1525498, 826984652800
>> ceph-1) probe -6: 2095298, 2095298, 1000065933312
>> ceph-1) probe -14: 0, 0, 0
>> ceph-1) probe -22: 0, 0, 0
>>
>> ceph-2) allocation stats probe 3: cnt: 2760200 frags: 2760200 size: 1554513903616
>> ceph-2) probe -1: 2584046, 2584046, 1498140393472
>> ceph-2) probe -3: 1696921, 1696921, 869424496640
>> ceph-2) probe -7: 0, 0, 0
>> ceph-2) probe -11: 0, 0, 0
>> ceph-2) probe -19: 0, 0, 0
>>
>> ceph-3) allocation stats probe 3: cnt: 2544818 frags: 2544818 size: 1432225021952
>> ceph-3) probe -1: 2688015, 2688015, 1515260739584
>> ceph-3) probe -3: 1086875, 1086875, 622025424896
>> ceph-3) probe -7: 0, 0, 0
>> ceph-3) probe -11: 0, 0, 0
>> ceph-3) probe -19: 0, 0, 0
>>
>> So OSDs 2 and 3 (the latest ones to be added; note that these 4 new OSDs
>> are 0-3 since those IDs were free) are in good shape, but 0 and 1 are
>> already suffering from at least some fragmentation of objects, which is
>> a bit worrying when they are only ~70% full right now and only a week old.
>>
>> I did delete a couple million small objects during the rebalance to try
>> to reduce load (I had some nasty directories), but that was cumulatively
>> only about 60GB of data. So while that could explain a high frag score
>> if there are now a million little holes in the free space map of the
>> OSDs (how is it calculated?), it should not actually cause new data
>> moving in to end up fragmented, since there should still be plenty of
>> unfragmented free space to go around.
>>
>> I am now restarting OSDs 0 and 1 to see whether that makes the frag
>> score go down over time. I will do further analysis later with the raw
>> bluestore free space map, since I still have a bunch of rebalancing and
>> moving data around planned (I'm moving my cluster to new machines).
>>
>> On 26/05/2023 00.29, Igor Fedotov wrote:
>>> Hi Hector,
>>>
>>> I can suggest two tools for further fragmentation analysis:
>>>
>>> 1) You might want to use ceph-bluestore-tool's free-dump command to get
>>> a list of free chunks for an OSD and analyze whether it's really highly
>>> fragmented and lacks long enough extents. free-dump just returns a list
>>> of extents in JSON format; I can take a look at the output if shared...
>>>
>>> 2) You might want to look at the allocation probes in the OSD logs and
>>> see how fragmentation in allocated chunks has evolved over time.
>>>
>>> E.g.:
>>>
>>> allocation stats probe 33: cnt: 8148921 frags: 10958186 size: 1704348508
>>> probe -1: 35168547, 46401246, 1199516209152
>>> probe -3: 27275094, 35681802, 200121712640
>>> probe -5: 34847167, 52539758, 271272230912
>>> probe -9: 44291522, 60025613, 523997483008
>>> probe -17: 10646313, 10646313, 155178434560
>>>
>>> The first probe refers to the last day, while the others match days (or
>>> rather probes) -1, -3, -5, -9 and -17.
>>>
>>> The 'cnt' column represents the number of allocations performed in the
>>> previous 24 hours and the 'frags' column shows the number of fragments
>>> in the resulting allocations.
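(Quick aside: something along these lines should do for pulling these probe
lines out of an OSD log and computing the frags/cnt ratio per probe. Untested
sketch; the regex matches the line format exactly as quoted in this thread, so
other releases may need tweaks.)

#!/usr/bin/env python3
# Sketch: report the frags/cnt ratio for every "allocation stats probe" line
# in an OSD log (a ratio of 1.0 means no extra fragments at all).
# Assumed line format, as quoted above:
#   allocation stats probe N: cnt: X frags: Y size: Z
import re
import sys

PROBE_RE = re.compile(
    r"allocation stats probe (\d+): cnt: (\d+) frags: (\d+) size: (\d+)"
)

def main(log_path):
    for line in open(log_path, errors="replace"):
        m = PROBE_RE.search(line)
        if not m:
            continue
        probe, cnt, frags, size = (int(g) for g in m.groups())
        ratio = frags / cnt if cnt else 0.0
        print(f"probe {probe}: cnt={cnt} frags={frags} "
              f"ratio={ratio:.3f} size={size}")

if __name__ == "__main__":
    main(sys.argv[1])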
>>> So a significant mismatch between frags and cnt might indeed indicate
>>> some issues with high fragmentation.
>>>
>>> Apart from retrospective analysis, you might also want to check how OSD
>>> behavior changes after a reboot - e.g. whether a rebooted OSD produces
>>> less fragmentation... which in turn might indicate some issues with the
>>> BlueStore allocator.
>>>
>>> Just FYI: the allocation probe printing interval is controlled by the
>>> bluestore_alloc_stats_dump_interval parameter.
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>> On 24/05/2023 17:18, Hector Martin wrote:
>>>> On 24/05/2023 22.07, Mark Nelson wrote:
>>>>> Yep, bluestore fragmentation is an issue. It's sort of a natural
>>>>> result of using copy-on-write and never implementing any kind of
>>>>> defragmentation scheme. Adam and I have been talking about doing it
>>>>> now, probably piggybacking on scrub or other operations that are
>>>>> already reading all of the extents for an object anyway.
>>>>>
>>>>> I wrote a very simple prototype for clone to speed up the rbd-mirror
>>>>> use case here:
>>>>>
>>>>> https://github.com/markhpc/ceph/commit/29fc1bfd4c90dd618eb9e0d4ae6474d8cfa5dfdf
>>>>>
>>>>> Adam ended up going the extra mile and completely changed how shared
>>>>> blobs work, which probably eliminates the need to do defrag on clone
>>>>> anymore from an rbd-mirror perspective, but I think we still need to
>>>>> identify any times we are doing full object reads of fragmented
>>>>> objects and consider defragmenting at that time. It might be clone, or
>>>>> scrub, or other things, but the point is that if we are already doing
>>>>> most of the work (seeks on HDD especially!) the extra cost of a large
>>>>> write to clean it up isn't that bad, especially if we are doing it
>>>>> over the course of months or years and it can help keep free space
>>>>> less fragmented.
>>>>
>>>> Note that my particular issue seemed to specifically be free space
>>>> fragmentation. I don't use RBD mirror and I would not *expect* most of
>>>> my cephfs use cases to lead to any weird CoW/fragmentation issues with
>>>> objects other than those forced by the free space becoming fragmented
>>>> (unless there is some weird pathological use case I'm hitting). Most of
>>>> my write workloads are just copying files in bulk and incrementally
>>>> writing out files.
>>>>
>>>> Would simply defragging objects during scrub/etc help with free space
>>>> fragmentation itself? Those seem like two somewhat unrelated issues...
>>>> note that if free space is already fragmented, you wouldn't even have a
>>>> place to put down a defragmented object.
>>>>
>>>> Are there any stats I can look at to figure out how bad object and free
>>>> space fragmentation is? It would be nice to have some clearer data
>>>> beyond my hunch/deduction after seeing the I/O patterns and the sole
>>>> fragmentation number :). It would also be interesting to get some kind
>>>> of trace of the bluestore ops the OSD is doing, so I can find out
>>>> whether it's doing something pathological that causes more
>>>> fragmentation for some reason.
>>>>
>>>>> Mark
>>>>>
>>>>> On 5/24/23 07:17, Hector Martin wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I've been seeing relatively large fragmentation numbers on all my
>>>>>> OSDs:
>>>>>>
>>>>>> ceph daemon osd.13 bluestore allocator score block
>>>>>> {
>>>>>>     "fragmentation_rating": 0.77251526920454427
>>>>>> }
>>>>>>
>>>>>> These aren't that old, as I recreated them all around July last year.
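(Aside, for anyone wanting to keep an eye on this across a whole cluster:
something like the wrapper below should work for polling the scores of all
OSDs. Untested sketch; it assumes the 'ceph tell osd.N bluestore allocator
score block' variant exists on your release and returns the same JSON as the
daemon command quoted above, with a "fragmentation_rating" field.)

#!/usr/bin/env python3
# Sketch: print the bluestore fragmentation score of every OSD, worst first.
# Assumes `ceph osd ls` lists the OSD ids and that the "tell" variant of the
# allocator score command behaves like the daemon command quoted above.
import json
import subprocess

def ceph_json(*args):
    # Run a ceph CLI command and parse its JSON output.
    out = subprocess.check_output(("ceph",) + args)
    return json.loads(out)

def main():
    osd_ids = ceph_json("osd", "ls", "--format=json")
    scores = {}
    for osd in osd_ids:
        reply = ceph_json("tell", f"osd.{osd}", "bluestore", "allocator",
                          "score", "block")
        scores[osd] = reply["fragmentation_rating"]
    for osd, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"osd.{osd}: {score:.3f}")

if __name__ == "__main__":
    main()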
>>>>>> They mostly hold CephFS data with erasure coding, with a mix of large
>>>>>> and small files. The OSDs are at around 80%-85% utilization right now.
>>>>>> Most of the data was written sequentially when the OSDs were created
>>>>>> (I rsynced everything from a remote backup). Since then more data has
>>>>>> been added, but not particularly quickly.
>>>>>>
>>>>>> At some point I noticed pathologically slow writes and I couldn't
>>>>>> figure out what was wrong. Eventually I did some block tracing and
>>>>>> noticed the I/Os were very small, even though on the CephFS side I was
>>>>>> just writing one large file sequentially, and that's when I stumbled
>>>>>> upon the free space fragmentation problem. Indeed, deleting some large
>>>>>> files opened up some larger free extents and resolved the problem, but
>>>>>> only until those get filled up and I'm back to fragmented tiny
>>>>>> extents. So effectively I'm stuck at the current utilization, as
>>>>>> trying to fill the OSDs up any more just slows down to an absolute
>>>>>> crawl.
>>>>>>
>>>>>> I'm adding a few more OSDs and plan on doing the dance of removing one
>>>>>> OSD at a time and replacing it with another one to hopefully improve
>>>>>> the situation, but obviously this is going to take forever.
>>>>>>
>>>>>> Is there any plan to offer a defrag tool of some sort for bluestore?
>>>>>>
>>>>>> - Hector
>>>>
>>>> - Hector
>>
>> - Hector
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx