Here this goes beyond my expertise.
I have seen unbounded osdmap epoch growth in two completely different
cases, and I'm unable to say what is causing it this time.
But IMO you shouldn't do any osdmap trimming yourself - that would
likely result in unpredictable behavior. So I'd encourage you to find
a way for the cluster to do that gracefully by itself.
Thanks,
Igor
On 9/19/2024 5:16 PM, Александр Руденко wrote:
Igor, thanks, very helpful.
Our current osdmap weighs 1.4MB, and that changes all the calculations..
It looks like this could be our case.
I think we got into this situation due to the long backfill that is
in progress now and has been going on for the last 3 weeks.
Can we drop some of the osdmaps before the rebalance completes?
Thu, 19 Sep 2024 at 15:38, Igor Fedotov <igor.fedotov@xxxxxxxx>:
Please see my comments inline.
On 9/19/2024 1:53 PM, Александр Руденко wrote:
Igor, thanks!
> What are the numbers today?
Today we have the same "oldest_map": 2408326, and "newest_map" is now
2637838 (*+2191*).
ceph-objectstore-tool --op meta-list --data-path
/var/lib/ceph/osd/ceph-70 | grep osdmap | wc -l
458994
Can you clarify this, please:
> and then multiply by amount of OSDs to learn the minimal space
taken by this data
458994 * 4k * OSD count = "_size of osdmaps on *ONE* OSD_" or
"_total size of osdmaps on *ALL* OSDs_"?
Yes, this is a lower-bound estimate of the osdmap size across all OSDs.
Because we have about 3k OSDs, and 458994 * 4k * 3000 = ~5TB, which
can't fit on ONE OSD.
But if that is the TOTAL osdmap size, then I think it is a very small
amount per OSD.
Highly likely an osdmap for 3K OSDs takes much more than 4K on
disk. So again, that was just a lower-bound estimate.
In fact one can use 'ceph osd getmap > out.dat' to get a better
estimate of the osdmap size. So please substitute that for the 4K in
the formula above to get a better estimate of the overall space taken.
It's a bit simplified though, since only about half of the entries in
the 'meta' pool are full osdmaps (the rest are incrementals). Hence
you might want to use 458994/2 * sizeof(osdmap) + 458994/2 * 4K in
the above formula.
Which is again a sort of lower-bound estimate, but with better
accuracy.
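Something along these lines should do for measuring the full map (a
rough sketch only - the output path is arbitrary, and 'stat -c %s'
assumes GNU coreutils):

# dump the current full osdmap and check its size
ceph osd getmap -o /tmp/osdmap.bin
ls -lh /tmp/osdmap.bin
# size in bytes, to plug into the formula above
stat -c %s /tmp/osdmap.bin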
But we have a lot of OSDs with min_alloc_size=64k, which was the
default in previous Ceph versions for rotational drives (all our
SSDs are behind old RAID controllers).
ceph daemon osd.10 bluestore allocator dump block | head -10
{
"capacity": 479557844992,
"alloc_unit": 65536,
But even with min_alloc=64k it would not be a big amount of data:
458994 * 64k = *~23GB*. I think we have about *150GB+* extra per
SSD OSD.
Yeah, you should use 64K instead of 4K in the above formula if the
majority of your OSDs use a 64K alloc unit, or account for the mix in
some other way (e.g. take half at 4K and half at 64K). But I'm leaving
this as a "home exercise" for you. The main point here is that a
single object takes at least alloc_unit bytes on disk. Hence I was
trying to make the assessment without knowing the actual osdmap size,
using the alloc unit instead - just to check if we get numbers of the
same order of magnitude. And 23GB and 150GB aren't THAT far apart -
having e.g. a 1MB osdmap might easily close the gap. I.e. the osdmap
leak could indeed be a real factor here, and hence it's worth
additional investigation.
Anyway - please use the actual osdmap size you obtained. It could
change the resulting estimate dramatically.
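If it helps, a back-of-the-envelope version of that per-OSD estimate
could look like the script below (just a sketch: the object count is
the one from your meta-list output, the 4K incremental size is the
same guess as above, and the alloc unit has to match the OSD in
question):

# rough per-OSD lower-bound estimate: half full maps, half incrementals,
# each object rounded up to the allocation unit of this OSD
ENTRIES=458994                         # osdmap objects (meta-list | grep osdmap | wc -l)
FULLMAP=$(stat -c %s /tmp/osdmap.bin)  # full osdmap size from 'ceph osd getmap'
INCR=4096                              # assumed incremental osdmap size
AU=65536                               # alloc_unit of this OSD (64K here, 4K on others)

roundup() { echo $(( ( $1 + AU - 1 ) / AU * AU )); }

PER_OSD=$(( ENTRIES / 2 * $(roundup "$FULLMAP") + ENTRIES / 2 * $(roundup "$INCR") ))
echo "estimated osdmap footprint on this OSD: $(( PER_OSD / 1024 / 1024 / 1024 )) GiB"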
For example, SSD with min_alloc=4k:
ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP    META     AVAIL   %USE   VAR   PGS  STATUS
126  ssd    0.00005  1.00000   447 GiB  374 GiB  300 GiB  72 GiB  1.4 GiB  73 GiB  83.64  1.00  137  up
with min_alloc=64k:
ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP    META     AVAIL   %USE   VAR   PGS  STATUS
10   ssd    0.00005  0.75000   447 GiB  405 GiB  320 GiB  83 GiB  1.4 GiB  42 GiB  90.59  1.00  114  up
The diff is not as big as 4k vs 64k would suggest..
Right. I don't know the reason at the moment. Maybe leaking osdmaps is
not the only issue. Please do the corrected math as per the above though..
Thu, 19 Sep 2024 at 12:33, Igor Fedotov <igor.fedotov@xxxxxxxx>:
Hi Konstantin,
osd_target_transaction_size should control that.
I've heard of it being raised to 150 with no obvious issues.
Going beyond that is at your own risk, so I'd suggest applying
incremental increases if needed.
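For example (just a sketch, assuming a release with the centralized
config store; the exact values and OSD id are up to you):

# bump the trimming batch size gradually, e.g. 30 -> 60 -> 100 -> 150
ceph config set osd osd_target_transaction_size 60
# verify it took effect on a given OSD
ceph config show osd.10 | grep osd_target_transaction_size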
Thanks,
Igor
On 9/19/2024 10:44 AM, Konstantin Shalygin wrote:
Hi Igor,
On 18 Sep 2024, at 18:22, Igor Fedotov
<igor.fedotov@xxxxxxxx> wrote:
I recall a couple of cases where permanent osdmap epoch
growth ended up filling OSDs with the corresponding osdmap
data. That can be tricky to catch.
Please run 'ceph tell osd.N status' for a couple of
affected OSDs twice within e.g. a 10 min interval.
Then check the delta between the oldest_map and newest_map
fields - the delta should neither be very large (hundreds
of thousands) nor grow rapidly within the observed
interval.
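For example, something like this (a sketch - it assumes jq is
available and uses the field names from the status output):

# osdmap epoch span currently kept by an OSD
ceph tell osd.10 status | jq '.newest_map - .oldest_map'
# run it again after ~10 minutes and compare the two deltas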
A side question on the topic: which option controls how many
maps are pruned? Currently I need to trim 1M osdmaps, but when
a new map is issued, only 30 old maps are removed. Which option
controls this value of 30?
Thanks,
k
--
Igor Fedotov
Ceph Lead Developer
Looking for help with your Ceph cluster? Contact us at https://croit.io
croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx