Re: BlueStore fragmentation woes

On 5/24/23 09:18, Hector Martin wrote:
On 24/05/2023 22.07, Mark Nelson wrote:
Yep, bluestore fragmentation is an issue.  It's sort of a natural result
of using copy-on-write and never implementing any kind of
defragmentation scheme.  Adam and I have been talking about doing it
now, probably piggybacking on scrub or other operations that are
already reading all of the extents for an object anyway.


I wrote a very simple prototype of defragmentation on clone to speed up
the rbd-mirror use case here:

https://github.com/markhpc/ceph/commit/29fc1bfd4c90dd618eb9e0d4ae6474d8cfa5dfdf


Adam ended up going the extra mile and completely changed how shared
blobs work, which probably eliminates the need to defrag on clone
from an rbd-mirror perspective, but I think we still need to
identify any times we are doing full object reads of fragmented objects
and consider defragmenting at that time.  It might be clone, or scrub,
or other things, but the point is that if we are already doing most of
the work (seeks on HDD especially!), the extra cost of a large write to
clean things up isn't that bad, especially if we are doing it over the
course of months or years and it can help keep free space less fragmented.
Note that my particular issue seemed to specifically be free space
fragmentation. I don't use RBD mirror, and I would not *expect* most of
my CephFS use cases to lead to any weird COW/fragmentation issues with
objects other than those forced by the free space becoming fragmented
(unless there is some weird pathological use case I'm hitting). Most of
my write workloads are just copying files in bulk and incrementally
writing out files.

Would simply defragging objects during scrub/etc help with free space
fragmentation itself? Those seem like two somewhat unrelated issues...
note that if free space is already fragmented, you wouldn't even have a
place to put down a defragmented object.


That is indeed one of the big issues.  If free space is already fragmented, it becomes much harder.  The approach I've been advocating is that when we scrub and encounter a heavily fragmented object, we do a quick search to see if we can easily find contiguous free space for the whole object, and if we can, we move it there.  If we can't, we look to see whether any of the object's extents have free space next to them that would let an adjacent range fit, improving the amount of contiguous space used (i.e. eliminating an extent is a win; splitting an extent to add to another extent is a judgement call).

The idea here is that we'd lazily repair, after the fact, the holes that get left behind when we do COW.  When we punch a hole, we'd mark it as a range that a new extent might easily fit back into to make things contiguous again.
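A rough sketch of that scrub-time decision, in hypothetical Python (the extent and free-list representations here are invented for illustration; BlueStore's actual allocator interfaces are quite different):

```python
# Sketch of the scrub-time defrag heuristic: prefer a whole-object move into
# one contiguous free extent; otherwise look for a cheap extent merge.
# All names and data structures are illustrative, not Ceph's real API.

def plan_defrag(object_extents, free_extents):
    """object_extents and free_extents are lists of (offset, length) tuples.

    Returns ("move_whole", offset) if one free extent can hold the whole
    object, ("merge_next", offset) if free space directly after one extent
    can absorb the following extent (eliminating an extent), else None."""
    total = sum(length for _, length in object_extents)

    # 1) Whole-object move: best-fit search for the smallest free extent
    #    that can hold the entire object contiguously.
    fits = [(length, offset) for offset, length in free_extents
            if length >= total]
    if fits:
        _, offset = min(fits)
        return ("move_whole", offset)

    # 2) Partial win: if a free range begins exactly where an object extent
    #    ends and is big enough for the *next* extent, rewriting that next
    #    extent there merges two extents into one.
    free_at = {offset: length for offset, length in free_extents}
    exts = sorted(object_extents)
    for (off, length), (_next_off, next_len) in zip(exts, exts[1:]):
        end = off + length
        if end in free_at and free_at[end] >= next_len:
            return ("merge_next", end)

    return None
```

The best-fit choice in step 1 is itself a judgement call; first-fit would be cheaper to search but tends to chew up large free extents that later whole-object moves would want.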

Long term I want to restructure how bluestore works.  I think we should write overwrite extents to the fast device unless they are quite large, and once we have space pressure on the fast device, write large blobs or perhaps whole objects to the slow device, trying to keep everything neatly aligned (perhaps segmenting the disk for different object sizes, or even putting some small objects on the fast device).  We'd take out the current blob-level compression, leave small extents uncompressed on the fast device, and only compress when we write large extents or whole objects to the slow device.  The goal would be to keep fragmentation on the slow device low, improve behavior on HDD and QLC flash, and make better use of the DB/WAL devices that people put in their systems.
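That placement policy can be sketched as a small decision function (thresholds and names here are invented for illustration; this is a proposal, not how BlueStore behaves today):

```python
# Illustrative sketch of the proposed tiered write placement: small
# overwrites land uncompressed on the fast device; large writes, or any
# write arriving while the fast device is under space pressure, go to
# the slow device and get compressed there. The cutoffs are made up.

FAST_WRITE_MAX = 64 * 1024  # assumed cutoff for a "quite large" overwrite

def place_write(length, fast_free, fast_capacity, pressure_ratio=0.2):
    """Return (device, compress) for a write of `length` bytes.

    fast_free / fast_capacity describe the fast (DB/WAL-class) device;
    below `pressure_ratio` free we consider it under space pressure."""
    under_pressure = fast_free < pressure_ratio * fast_capacity
    if length <= FAST_WRITE_MAX and not under_pressure:
        return ("fast", False)   # small extent: fast device, uncompressed
    return ("slow", True)        # large blob or eviction: slow device, compressed
```

In a real design the pressure path would migrate existing blobs rather than just redirect new writes, but the split between "small and uncompressed on fast" versus "large, compressed, contiguous on slow" is the core of the idea.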


Mark



Are there any stats I can look at to figure out how bad object and free
space fragmentation is? It would be nice to have some clearer data
beyond my hunch/deduction after seeing the I/O patterns and the sole
fragmentation number :). It would also be interesting to get some kind
of trace of the bluestore ops the OSD is doing, so I can find out
whether it's doing something pathological that causes more
fragmentation for some reason.
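For what it's worth, the admin socket can emit the raw free list (`ceph daemon osd.N bluestore allocator dump block`, alongside the `score` command shown below). Assuming the dump contains an "extents" array of offset/length pairs (the exact JSON layout may vary between releases), a quick histogram of free-extent sizes gives a much clearer picture than the single score:

```python
# Rough free-space fragmentation histogram from a saved allocator dump.
# Assumes the dump JSON has an "extents" list whose entries carry a
# "length" field, either as a hex string ("0x1000") or as an integer;
# treat this as a sketch, since the format may differ across releases.
import json
import sys
from collections import Counter

def size_histogram(dump):
    """Bucket free extents by power-of-two size; returns Counter
    mapping bit_length -> number of free extents in that bucket."""
    hist = Counter()
    for ext in dump["extents"]:
        length = int(str(ext["length"]), 0)  # handles "0x..." and decimal
        hist[length.bit_length()] += 1
    return hist

if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        hist = size_histogram(json.load(f))
    for bits in sorted(hist):
        lo, hi = 1 << (bits - 1), (1 << bits) - 1
        print(f"{lo}..{hi} bytes: {hist[bits]} free extents")
```

Lots of extents piled into the small buckets, with nothing large left, is exactly the "nowhere to put a defragmented object" situation.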

Mark


On 5/24/23 07:17, Hector Martin wrote:
Hi,

I've been seeing relatively large fragmentation numbers on all my OSDs:

ceph daemon osd.13 bluestore allocator score block
{
      "fragmentation_rating": 0.77251526920454427
}

These aren't that old, as I recreated them all around July last year.
They mostly hold CephFS data with erasure coding, with a mix of large
and small files. The OSDs are at around 80%-85% utilization right now.
Most of the data was written sequentially when the OSDs were created (I
rsynced everything from a remote backup). Since then more data has been
added, but not particularly quickly.

At some point I noticed pathologically slow writes, and I couldn't
figure out what was wrong. Eventually I did some block tracing and
noticed the I/Os were very small, even though CephFS-side I was just
writing one large file sequentially, and that's when I stumbled upon the
free space fragmentation problem. Indeed, deleting some large files
opened up some larger free extents and resolved the problem, but only
until those get filled up and I'm back to fragmented tiny extents. So
effectively I'm stuck at the current utilization, as trying to fill them
up any more just slows down to an absolute crawl.

I'm adding a few more OSDs and plan on doing the dance of removing one
OSD at a time and replacing it with another one to hopefully improve the
situation, but obviously this is going to take forever.

Is there any plan for offering a defrag tool of some sort for bluestore?

- Hector
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
Best Regards,
Mark Nelson
Head of R&D (USA)

Clyso GmbH
p: +49 89 21552391 12
a: Loristraße 8 | 80335 München | Germany
w: https://clyso.com | e: mark.nelson@xxxxxxxxx

We are hiring: https://www.clyso.com/jobs/



