Re: BlueStore fragmentation woes

Yeah this looks fine. Please collect all of them for a given OSD.

Then restart the OSD, wait for more to come (1-2 days) and collect them too.
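For the collection, something along these lines should work on your boxes (a rough sketch only - the systemd unit name is just my guess for a cephadm deployment, so adjust it to whatever unit actually runs that OSD on the host):

# before the restart
journalctl -u ceph-4e4184f5-7733-453b-b72c-2b43422fd027@osd.183 | grep -E "allocation stats probe|probe -" > osd.183-probes-before.txt

# restart the OSD (e.g. ceph orch daemon restart osd.183), wait 1-2 days, then:
journalctl -u ceph-4e4184f5-7733-453b-b72c-2b43422fd027@osd.183 | grep -E "allocation stats probe|probe -" > osd.183-probes-after.txt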


A side note - in the attached probe I can't see any fragmentation at all: the number of allocations is equal to the number of fragments, e.g.

cnt: 27637 frags: 27637


And the average requested chunk is 63777406976 / 27637 = ~2308 bytes, i.e. on average a single allocation needed less than one alloc unit - which would tell us nothing about the fragmentation...
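If it helps, here is a small sketch that pulls cnt and frags out of each probe line and prints the fragments-per-allocation ratio (based only on the probe line format quoted below - pipe the OSD log, or just the grepped "allocation stats probe" lines, into it):

awk '/allocation stats probe/ {
    for (i = 1; i <= NF; i++) {
        if ($i == "cnt:")   cnt  = $(i + 1)
        if ($i == "frags:") frag = $(i + 1)
    }
    # a ratio close to 1.0 means allocations were not split into multiple extents
    printf "frags/cnt = %.2f\n", frag / cnt
}'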

Thanks,
Igor



On 25/05/2023 19:36, Fox, Kevin M wrote:
Ok, I'm gathering the "allocation stats probe" stuff. Not sure I follow what you mean by the historic probes. just:
| egrep "allocation stats probe|probe"   ?

That gets something like:
May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-24T18:24:34.105+0000 7f53603fc700  0 bluestore(/var/lib/ceph/osd/ceph-183)  allocation stats probe 110: cnt: 27637 frags: 27637 size: 63777406976
May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-24T18:24:34.105+0000 7f53603fc700  0 bluestore(/var/lib/ceph/osd/ceph-183)  probe -1: 24503,  24503, 58141900800
May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-24T18:24:34.105+0000 7f53603fc700  0 bluestore(/var/lib/ceph/osd/ceph-183)  probe -2: 24594,  24594, 56951898112
May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-24T18:24:34.105+0000 7f53603fc700  0 bluestore(/var/lib/ceph/osd/ceph-183)  probe -6: 19737,  19737, 37299027968
May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-24T18:24:34.105+0000 7f53603fc700  0 bluestore(/var/lib/ceph/osd/ceph-183)  probe -14: 20373,  20373, 35302801408
May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-24T18:24:34.105+0000 7f53603fc700  0 bluestore(/var/lib/ceph/osd/ceph-183)  probe -30: 19072,  19072, 33645854720

If that is the right query, then I'll gather the metrics, restart, gather some more afterwards, and let you know.

Thanks,
Kevin

________________________________________
From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: Thursday, May 25, 2023 9:29 AM
To: Fox, Kevin M; Hector Martin; ceph-users@xxxxxxx
Subject: Re:  Re: BlueStore fragmentation woes

Just run through the available logs for a specific OSD (one which you suspect suffers from high fragmentation) and collect all the allocation stats probes you can find ("allocation stats probe" is a perfect grep pattern; please append the lines with historic probes that follow the day-0 line as well. Given this is printed once per day there won't be too many).
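For example (a rough sketch assuming file-based logging; with journald just pipe the journalctl output for that OSD into the same grep):

grep -hE "allocation stats probe|probe -" /var/log/ceph/ceph-osd.<id>.log*

The "probe -N:" lines that come right after the day-0 line are the historic probes I mean.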

Then restart the OSD and wait a couple more days. Would the allocation stats
then show a much smaller disparity between the cnt and frags columns?

Is a similar pattern (gradual degradation in the stats prior to restart
and a sharp improvement afterwards) observed for other OSDs?


On 25/05/2023 19:20, Fox, Kevin M wrote:
If you can give me instructions on what you want me to gather before the restart and after the restart, I can do it. I have some OSDs running away right now.

Thanks,
Kevin

________________________________________
From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: Thursday, May 25, 2023 9:17 AM
To: Fox, Kevin M; Hector Martin; ceph-users@xxxxxxx
Subject: Re:  Re: BlueStore fragmentation woes

Perhaps...

I don't like the idea of using the fragmentation score as a real index. IMO
it's mostly a very imprecise first-pass marker to alert that
something might be wrong - not a real quantitative, high-quality estimate.

So in fact I'd like to see a series of allocation probes showing
gradual degradation without an OSD restart and an immediate, severe
improvement after the restart.

Can you try to collect something like that? Would the same behavior
persist with an alternative allocator?
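By an alternative allocator I mean something like this (a sketch - bitmap and avl are the usual alternatives to the default hybrid allocator, and the change only takes effect once the OSD is restarted):

ceph config set osd.<id> bluestore_allocator bitmap

and then watching whether the cnt/frags disparity develops the same way under it.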


Thanks,

Igor


On 25/05/2023 18:41, Fox, Kevin M wrote:
Is this related to https://tracker.ceph.com/issues/58022 ?

We still see runaway OSDs at times, somewhat randomly, which cause runaway fragmentation issues.

Thanks,
Kevin

________________________________________
From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: Thursday, May 25, 2023 8:29 AM
To: Hector Martin; ceph-users@xxxxxxx
Subject:  Re: BlueStore fragmentation woes



Hi Hector,

I can suggest two tools for further fragmentation analysis:

1) One might want to use ceph-bluestore-tool's free-dump command to get
a list of free chunks for an OSD and analyze whether it's really
highly fragmented and lacks long enough extents. free-dump just returns
a list of extents in json format; I can take a look at the output if
you share it...
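Roughly, for (1) (a sketch only - the OSD has to be stopped while ceph-bluestore-tool runs against its data dir, with containerized deployments run it from something like "cephadm shell --name osd.<id>", and double-check the exact syntax via ceph-bluestore-tool --help on your release):

ceph-bluestore-tool free-dump --path /var/lib/ceph/osd/ceph-<id> > free-extents.osd<id>.json

and share the resulting json (compressed - it can be large).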

2) You might want to look for allocation probes in the OSD logs and see how
fragmentation in the allocated chunks has evolved.

E.g.

allocation stats probe 33: cnt: 8148921 frags: 10958186 size: 1704348508>
probe -1: 35168547,  46401246, 1199516209152
probe -3: 27275094,  35681802, 200121712640
probe -5: 34847167,  52539758, 271272230912
probe -9: 44291522,  60025613, 523997483008
probe -17: 10646313,  10646313, 155178434560

The first probe refers to the last day while others match days (or
rather probes) -1, -3, -5, -9, -17

The 'cnt' column represents the number of allocations performed over the
previous 24 hours and the 'frags' column shows the number of fragments in the
resulting allocations. So a significant mismatch between frags and cnt
might indeed indicate issues with high fragmentation.
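For instance, with the numbers above: the day-0 probe gives 10958186 / 8148921 ≈ 1.34 fragments per allocation on average, while probe -17 gives 10646313 / 10646313 = 1.0, i.e. back then every allocation came out as a single contiguous extent.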

Apart from this retrospective analysis you might also want to check how OSD
behavior changes after a restart - e.g. whether a freshly restarted OSD
produces less fragmentation... which in turn might indicate some issues with
the BlueStore allocator.

Just FYI: the allocation probe printing interval is controlled by the
bluestore_alloc_stats_dump_interval parameter.
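To check it or make the history denser (a sketch; the value is in seconds and, I believe, defaults to the once-per-day printing mentioned above):

ceph config get osd bluestore_alloc_stats_dump_interval
ceph config set osd bluestore_alloc_stats_dump_interval 21600   # e.g. every 6 hours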


Thanks,

Igor



On 24/05/2023 17:18, Hector Martin wrote:
On 24/05/2023 22.07, Mark Nelson wrote:
Yep, bluestore fragmentation is an issue.  It's sort of a natural result
of using copy-on-write and never implementing any kind of
defragmentation scheme.  Adam and I have been talking about doing it
now, probably piggybacking on scrub or other operations that are
already reading all of the extents for an object anyway.


I wrote a very simple prototype for clone to speed up the rbd-mirror use
case here:

https://github.com/markhpc/ceph/commit/29fc1bfd4c90dd618eb9e0d4ae6474d8cfa5dfdf


Adam ended up going the extra mile and completely changed how shared
blobs work, which probably eliminates the need to do defrag on clone
anymore from an rbd-mirror perspective, but I think we still need to
identify any times we are doing full object reads of fragmented objects
and consider defragmenting at that time.  It might be clone, or scrub,
or other things, but the point is that if we are already doing most of
the work (seeks on HDD especially!), the extra cost of a large write to
clean it up isn't that bad, especially if we are doing it over the
course of months or years and it can help keep free space less fragmented.
Note that my particular issue seemed to specifically be free space
fragmentation. I don't use RBD mirror and I would not *expect* most of
my cephfs use cases to lead to any weird cow/fragmentation issues with
objects other than those forced by the free space becoming fragmented
(unless there is some weird pathological use case I'm hitting). Most of
my write workloads are just copying files in bulk and incrementally
writing out files.

Would simply defragging objects during scrub/etc help with free space
fragmentation itself? Those seem like two somewhat unrelated issues...
note that if free space is already fragmented, you wouldn't even have a
place to put down a defragmented object.

Are there any stats I can look at to figure out how bad object and free
space fragmentation is? It would be nice to have some clearer data
beyond my hunch/deduction after seeing the I/O patterns and the sole
fragmentation number :). Also would be interesting to get some kind of
trace of the bluestore ops the OSD is doing, so I can find out whether
it's doing something pathological that causes more fragmentation for
some reason.

Mark


On 5/24/23 07:17, Hector Martin wrote:
Hi,

I've been seeing relatively large fragmentation numbers on all my OSDs:

ceph daemon osd.13 bluestore allocator score block
{
         "fragmentation_rating": 0.77251526920454427
}
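(I see similar numbers on the other OSDs too. For reference, checking them all is just the same command in a loop run on each host, since ceph daemon talks to the local admin socket - the OSD ids here are only an example:

for id in 13 14 15; do printf "osd.%s: " "$id"; ceph daemon osd.$id bluestore allocator score block | grep fragmentation_rating; done
)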

These aren't that old, as I recreated them all around July last year.
They mostly hold CephFS data with erasure coding, with a mix of large
and small files. The OSDs are at around 80%-85% utilization right now.
Most of the data was written sequentially when the OSDs were created (I
rsynced everything from a remote backup). Since then more data has been
added, but not particularly quickly.

At some point I noticed pathologically slow writes, and I couldn't
figure out what was wrong. Eventually I did some block tracing and
noticed the I/Os were very small, even though CephFS-side I was just
writing one large file sequentially, and that's when I stumbled upon the
free space fragmentation problem. Indeed, deleting some large files
opened up some larger free extents and resolved the problem, but only
until those get filled up and I'm back to fragmented tiny extents. So
effectively I'm stuck at the current utilization, as trying to fill them
up any more just slows down to an absolute crawl.

I'm adding a few more OSDs and plan on doing the dance of removing one
OSD at a time and replacing it with another one to hopefully improve the
situation, but obviously this is going to take forever.

Is there any plan for offering a defrag tool of some sort for bluestore?

- Hector
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


