Re: BlueStore fragmentation woes

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Yep, bluestore fragmentation is an issue.  It's sort of a natural result of using copy-on-write and never implementing any kind of defragmentation scheme.  Adam and I have been talking about doing it now, probably piggybacking on scrub or other operations that already area reading all of the extents for an object anyway.


I wrote a very simply prototype for clone to speed up the rbd mirror use case here:

https://github.com/markhpc/ceph/commit/29fc1bfd4c90dd618eb9e0d4ae6474d8cfa5dfdf


Adam ended up going the extra mile and completely changed how shared blobs works which probably eliminates the need to do defrag on clone anymore from an rbd-mirror perspective, but I think we still need to identify any times we are doing full object reads of fragmented objects and consider defragmenting at that time.  It might be clone, or scrub, or other things, but the point is that if we are already doing most of the work (seeks on HDD especially!) the extra cost of a large write to clean it up isn't that bad, especially if we are doing it over the course of months or years and can help keep freespace less fragmented.


Mark


On 5/24/23 07:17, Hector Martin wrote:
Hi,

I've been seeing relatively large fragmentation numbers on all my OSDs:

ceph daemon osd.13 bluestore allocator score block
{
     "fragmentation_rating": 0.77251526920454427
}

These aren't that old, as I recreated them all around July last year.
They mostly hold CephFS data with erasure coding, with a mix of large
and small files. The OSDs are at around 80%-85% utilization right now.
Most of the data was written sequentially when the OSDs were created (I
rsynced everything from a remote backup). Since then more data has been
added, but not particularly quickly.

At some point I noticed pathologically slow writes, and I couldn't
figure out what was wrong. Eventually I did some block tracing and
noticed the I/Os were very small, even though CephFS-side I was just
writing one large file sequentially, and that's when I stumbled upon the
free space fragmentation problem. Indeed, deleting some large files
opened up some larger free extents and resolved the problem, but only
until those get filled up and I'm back to fragmented tiny extents. So
effectively I'm stuck at the current utilization, as trying to fill them
up any more just slows down to an absolute crawl.

I'm adding a few more OSDs and plan on doing the dance of removing one
OSD at a time and replacing it with another one to hopefully improve the
situation, but obviously this is going to take forever.

Is there any plan for offering a defrag tool of some sort for bluestore?

- Hector
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
Best Regards,
Mark Nelson
Head of R&D (USA)

Clyso GmbH
p: +49 89 21552391 12
a: Loristraße 8 | 80335 München | Germany
w: https://clyso.com | e: mark.nelson@xxxxxxxxx

We are hiring: https://www.clyso.com/jobs/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx




[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Ceph Dev]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux