Re: Space leak in Bluestore

Igor Fedotov <ifedotov@xxxxxxx> · Thu, 26 Mar 2020 13:19:11 +0300

Hi Vitaliy,

just as a guess to verify:

a while ago I observed a very long removal of a (pretty large) pool. It 
took several days to complete. The DB was on a spinner, which was one of 
the drivers of this slow behavior.

Another one is the PG removal design, which enumerates up to 30 entries 
to fill a single removal batch and then executes it. Everything happens 
in a single "thread". So the process is pretty slow for millions of 
objects...
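
To get a feel for why this is slow, here is a rough Python sketch of the batching pattern described above. The batch size of 30 comes from the description; the function names and the per-batch latency are hypothetical placeholders for illustration, not BlueStore code or measured values.

```python
import math

BATCH_SIZE = 30  # max entries per removal batch, per the description above


def removal_batches(num_objects, batch_size=BATCH_SIZE):
    """Number of sequential batches needed to remove num_objects."""
    return math.ceil(num_objects / batch_size)


def estimated_removal_seconds(num_objects, seconds_per_batch,
                              batch_size=BATCH_SIZE):
    """Single-threaded estimate: batches run strictly one after another."""
    return removal_batches(num_objects, batch_size) * seconds_per_batch


if __name__ == "__main__":
    # seconds_per_batch is an assumed figure, chosen only for illustration.
    n = 10_000_000
    print(removal_batches(n), "batches")
    print(estimated_removal_seconds(n, seconds_per_batch=0.05) / 3600, "hours")
```

Because nothing runs in parallel, the total time scales linearly with the object count, and any extra per-batch latency (such as a DB on a spinner) is multiplied by the number of batches.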

During the pool (read: PG) removal the space remained in use and 
decreased slowly. Pretty high DB volume utilization was observed.

I assume rebalance performs PG removal as well - maybe that's the case?

Thanks,

Igor

On 3/26/2020 1:51 AM, Виталий Филиппов wrote:
Hi Igor,

I think so because
1) space usage increases after each rebalance. Even when the same pg 
is moved twice (!)
2) I use 4k min_alloc_size from the beginning

One crazy hypothesis is that maybe ceph allocates space for 
uncompressed objects, then compresses them and leaks 
(uncompressed-compressed) space. Really crazy idea but who knows o_O.

I already did a deep fsck, it didn't help... what else could I check?...

On 26 March 2020, 1:40:52 GMT+03:00, Igor Fedotov <ifedotov@xxxxxxx> 
wrote:

    Bluestore fsck/repair detect and fix leaks at Bluestore level but I
    doubt your issue is here.

    To be honest, I don't understand from the overview why you think that
    there are any leaks at all...

    Not sure whether this is relevant, but from my experience space "leaks"
    are sometimes caused by a 64K allocation unit combined with tons of
    small files or massive small EC overwrites.

    To verify if this is applicable you might want to inspect bluestore
    performance counters (bluestore_stored vs. bluestore_allocated) to
    estimate your losses due to high allocation units.

    Significant difference at multiple OSDs might indicate that overhead is
    caused by high allocation granularity. Compression might make this
    analysis not that simple though...
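
    The counter comparison suggested above could be sketched as follows. This is a minimal example, assuming the counters were saved as JSON (for instance from `ceph daemon osd.<id> perf dump`) and that bluestore_stored and bluestore_allocated sit under a "bluestore" section; the function name and the sample numbers are made up for illustration.

```python
import json


def allocation_overhead(perf_dump_text):
    """Return (stored, allocated, allocated/stored) from perf dump JSON.

    Assumes the two counters live under a top-level "bluestore" key.
    """
    counters = json.loads(perf_dump_text)["bluestore"]
    stored = counters["bluestore_stored"]
    allocated = counters["bluestore_allocated"]
    ratio = allocated / stored if stored else float("nan")
    return stored, allocated, ratio


if __name__ == "__main__":
    # Made-up sample values, not taken from any real OSD.
    sample = '{"bluestore": {"bluestore_stored": 100, "bluestore_allocated": 160}}'
    stored, allocated, ratio = allocation_overhead(sample)
    print(f"stored={stored} allocated={allocated} ratio={ratio:.2f}")
```

    A ratio well above 1 on several OSDs would point at allocation granularity. As noted above, compression complicates the reading of this ratio, so treat it as a first estimate only.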

    Thanks,

    Igor

    On 3/26/2020 1:19 AM, vitalif@xxxxxxxxxx wrote:

        I have a question regarding this problem - is it possible to
        rebuild bluestore allocation metadata? I could try it to test
        if it's an allocator problem...

            Hi. I'm experiencing some kind of a space leak in
            Bluestore. I use EC, compression and snapshots. First I
            thought that the leak was caused by "virtual clones"
            (issue #38184). However, then I got rid of most of the
            snapshots, but continued to experience the problem.

            I suspected something when I added a new disk to the
            cluster and free space in the cluster didn't increase (!).
            So to track down the issue I moved one PG (34.1a) using
            upmaps from osd11,6,0 to osd6,0,7 and then back to
            osd11,6,0. It ate +59 GB after the first move and +51 GB
            after the second. As I understand it, this proves that
            it's not #38184. Devirtualization of virtual clones
            couldn't eat additional space after the SECOND rebalance
            of the same PG.

            The PG has ~39000 objects, it is EC 2+1 and compression
            is enabled. Compression ratio is about ~2.7 in my setup,
            so the PG should use ~90 GB raw space.

            Before and after moving the PG I stopped osd0, mounted it
            with ceph-objectstore-tool with debug bluestore = 20/20
            and opened the 34.1a***/all directory. It seems to dump
            all object extents into the log in that case. So now I
            have two logs with all allocated extents for osd0 (I hope
            all extents are there). I parsed both logs and added all
            compressed blob sizes together ("get_ref Blob ... 0x20000
            -> 0x... compressed"). But they add up to ~39 GB before
            the first rebalance (34.1as2), ~22 GB after it (34.1as1)
            and ~41 GB again after the second move (34.1as2), which
            doesn't indicate a leak. But the raw space usage still
            exceeds the initial one by a lot. So it's clear that
            there's a leak somewhere.

            What additional details can I provide for you to identify
            the bug? I posted the same message in the issue tracker,
            https://tracker.ceph.com/issues/44731
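
The "~90 GB raw" estimate in the message above can be reproduced with a short calculation. This is a sketch under one loud assumption: the message does not state the object size, so 4 MiB per object is assumed here (a common default, but not confirmed by the source). The function name is hypothetical.

```python
def expected_raw_bytes(num_objects, object_size, k, m, compression_ratio):
    """Estimate raw usage of an EC k+m PG with compressed data."""
    logical = num_objects * object_size       # uncompressed user data
    compressed = logical / compression_ratio  # size after compression
    return compressed * (k + m) / k           # EC adds m parity chunks per k


if __name__ == "__main__":
    ASSUMED_OBJECT_SIZE = 4 * 2**20  # 4 MiB: an assumption, not from the message
    raw = expected_raw_bytes(39000, ASSUMED_OBJECT_SIZE, k=2, m=1,
                             compression_ratio=2.7)
    print(f"{raw / 1e9:.1f} GB expected raw usage")
```

Under that assumption the result lands close to the ~90 GB quoted, while the reported growth of +59 GB and +51 GB adds up to 110 GB, which is more than the whole PG should occupy.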

--
With best regards,
Vitaliy Filippov 