Re: High usage (DATA column) on dedicated for OMAP only OSDs

Igor, thanks, very helpful.

Our current osdmap weighs 1.4 MB, and that changes all the calculations.
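
Plugging 1.4 MB into the formula Igor gives below is only a back-of-the-envelope
check on my side, and the split (roughly half of the 458994 entries from the
meta listing below being full maps, the rest incrementals padded to our 64K
alloc unit) is an assumption:

FULL=$((458994 / 2))                                     # ~229497 full osdmaps
echo "$((FULL * 1400000 / 1024**3)) GiB in full maps"    # at ~1.4 MB each -> ~299 GiB
echo "$((FULL * 65536 / 1024**3)) GiB in incrementals"   # padded to 64K -> ~14 GiB

This surely overshoots (not every full map is 1.4 MB), but even a fraction of it
would explain the ~150 GB of extra usage we see per SSD OSD.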

It looks like this could be our case.

I think we are in this situation because of a long backfill that has been
running for the last 3 weeks.
Can we drop some of the osdmaps before the rebalance completes?
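
If trimming is simply throttled by osd_target_transaction_size (default 30), as
Igor describes further down the thread, I assume the incremental bump would look
roughly like this (a sketch, not tried on our cluster yet):

# check the current per-map-update trim budget (default is 30)
ceph config get osd osd_target_transaction_size

# raise it gradually, watching OSD load between steps;
# 150 was mentioned as a value used in the wild with no obvious issues
ceph config set osd osd_target_transaction_size 60
# ...then 100, then 150 if everything stays healthy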






On Thu, Sep 19, 2024 at 15:38, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:

> Please see my comments inline.
>
>
> On 9/19/2024 1:53 PM, Александр Руденко wrote:
>
> Igor, thanks!
>
> > What are the numbers today?
>
> Today we have the same "oldest_map": 2408326, while "newest_map" is 2637838
> (*+2191* since the last check).
>
> ceph-objectstore-tool --op meta-list --data-path /var/lib/ceph/osd/ceph-70
> | grep osdmap | wc -l
> 458994
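>
> My understanding (an assumption on my side, please correct me) is that full
> maps are stored as "osdmap.<epoch>" objects and incrementals as
> "inc_osdmap.<epoch>", so a rough split of that count would be something like:
>
> ceph-objectstore-tool --op meta-list --data-path /var/lib/ceph/osd/ceph-70 > /tmp/meta.txt
> grep -c inc_osdmap /tmp/meta.txt   # incremental maps only
> grep -c osdmap /tmp/meta.txt       # both kinds together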
>
> Can you clarify this, please:
>
> > and then multiply by amount of OSDs to learn the minimal space taken by
> > this data
>
> Is 458994 * 4k * <OSD count> the "*size of osdmaps on ONE OSD*" or the "*total
> size of osdmaps on ALL OSDs*"?
>
> Yes, this is a lower-bound estimate of the osdmap size across all OSDs.
>
>
>
> Because we have about 3k OSDs, and 458994 * 4k * 3000 = ~5TB, which obviously
> could not fit on ONE OSD.
> But if it is the TOTAL osdmap size, I think it is a very small amount per OSD.
>
> Most likely an osdmap for 3K OSDs takes much more than 4K on disk. So again,
> that was just a lower-bound estimate.
>
> In fact, you can use 'ceph osd getmap >out.dat' to get a better estimate of
> the osdmap size. So please substitute the actual osdmap size for the 4K in the
> formula above to get a better estimate of the overall space taken.
>
> It's a bit simplified though, since only half of the entries in the 'meta'
> pool are full osdmaps. Hence you might want to use 458994/2 * sizeof(osdmap) +
> 458994/2 * 4K in the above formula.
>
> This is again a sort of lower-bound estimate, but with better accuracy.
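>
> For example (just a sketch - the output path is arbitrary):
>
> ceph osd getmap > /tmp/osdmap.bin    # grab the current full map
> stat -c %s /tmp/osdmap.bin           # its size in bytes, to use for the full-map half of the formula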
>
>
>
> But we have a lot of OSDs with min_alloc_size=64k, which was the default in
> previous Ceph versions for rotational drives (all our SSDs are behind old
> RAID controllers).
>
> ceph daemon osd.10 bluestore allocator dump block | head -10
> {
>     "capacity": 479557844992,
>     "alloc_unit": 65536,
>
> But even with min_alloc=64k it would not be a big amount of data: 458994 *
> 64k = *~30GB*. I think we have about *150GB+* extra per SSD OSD.
>
> Yeah, you should use 64K instead of 4K in the above formula if the majority of
> your OSDs use a 64K alloc unit, or take it into account some other way (e.g.
> half at 4K and half at 64K). But I'm leaving that as a "home exercise" for you.
> The main point here is that a single object takes at least alloc_unit bytes,
> and hence I was trying to make the assessment without knowing the actual
> osdmap size, using the alloc unit instead - just to check whether we get
> numbers of the same order of magnitude. And 30GB and 150GB aren't THAT
> different - having e.g. a 1M osdmap might easily do the trick. I.e. the osdmap
> leak could indeed be a real factor here, and hence it's worth additional
> investigation.
>
> Anyway - please redo the math with the actual osdmap size you obtain. It could
> change the resulting estimate dramatically.
>
>
> For example, SSD with min_alloc=4k:
> ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP    META     AVAIL   %USE   VAR   PGS  STATUS
> 126   ssd  0.00005   1.00000  447 GiB  374 GiB  300 GiB  72 GiB  1.4 GiB   73 GiB  83.64  1.00  137      up
>
> with min_alloc=64k:
> ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP    META     AVAIL   %USE   VAR   PGS  STATUS
> 10    ssd  0.00005   0.75000  447 GiB  405 GiB  320 GiB  83 GiB  1.4 GiB   42 GiB  90.59  1.00  114      up
>
> The diff is not as big as 4k vs 64k would suggest.
>
> Right, I don't know the reason atm. Maybe leaking osdmaps is not the only
> issue. Please do the corrected math as per the above, though.
>
>
>
> On Thu, Sep 19, 2024 at 12:33, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
>
>> Hi Konstantin,
>>
>> osd_target_transaction_size should control that.
>>
>> I've heard of it being raised to 150 with no obvious issues; going beyond
>> that is at your own risk. So I'd suggest applying incremental increases if
>> needed.
>>
>>
>> Thanks,
>>
>> Igor
>> On 9/19/2024 10:44 AM, Konstantin Shalygin wrote:
>>
>> Hi Igor,
>>
>> On 18 Sep 2024, at 18:22, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
>>
>> I recall a couple of cases where continuous osdmap epoch growth kept filling
>> OSDs with the corresponding osdmap data, which can be tricky to catch.
>>
>> Please run 'ceph tell osd.N status' for a couple of the affected OSDs twice,
>> with e.g. a 10-minute interval in between.
>>
>> Then check the delta between the oldest_map and newest_map fields - the delta
>> should neither be very large (hundreds of thousands) nor grow rapidly within
>> the observed interval.
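>>
>> For example (a sketch - osd.70 and the 10-minute pause are just placeholders,
>> and jq is assumed to be available):
>>
>> ceph tell osd.70 status | jq '.newest_map - .oldest_map'   # size of the osdmap window
>> sleep 600
>> ceph tell osd.70 status | jq '.newest_map - .oldest_map'   # should not keep growing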
>>
>>
>> A side question on this topic: which option controls how many maps are
>> pruned? Currently I need to trim 1M osdmaps, but when a new map is issued,
>> only 30 old maps are removed. What option controls that value of 30?
>>
>>
>> Thanks,
>> k
>>
>> --
>> Igor Fedotov
>> Ceph Lead Developer
>>
>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>
>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>> CEO: Martin Verges - VAT-ID: DE310638492
>> Com. register: Amtsgericht Munich HRB 231263
>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>>
> --
> Igor Fedotov
> Ceph Lead Developer
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



