Re: High usage (DATA column) on dedicated for OMAP only OSDs

Please see my comments inline.


On 9/19/2024 1:53 PM, Александр Руденко wrote:
Igor, thanks!

> What are the numbers today?

Today we have the same "oldest_map": 2408326, while "newest_map" is now 2637838 (*+2191*).

ceph-objectstore-tool --op meta-list --data-path /var/lib/ceph/osd/ceph-70 | grep osdmap | wc -l
458994
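
(A quick sketch for splitting that count into full and incremental maps - this assumes the usual osdmap.<epoch> / inc_osdmap.<epoch> object naming in the OSD's meta collection, and it becomes relevant for the size estimate discussed below:)

# full maps only (the leading quote in the pattern excludes inc_osdmap entries)
ceph-objectstore-tool --op meta-list --data-path /var/lib/ceph/osd/ceph-70 | grep -c '"osdmap\.'
# incremental maps only
ceph-objectstore-tool --op meta-list --data-path /var/lib/ceph/osd/ceph-70 | grep -c '"inc_osdmap\.'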

Can you clarify this, please:

> and then multiply by amount of OSDs to learn the minimal space taken by this data

458994 * 4k * OSD count = "_size of osdmaps on *ONE* OSD_" or "_total size of osdmaps on *ALL* OSDs_"?

Yes - this is a lower-bound estimate of the osdmap size on all OSDs.



Because we have about 3k OSDs, and 458994 * 4k * 3000 = ~5 TB - that could not fit on ONE OSD.
But if it is the TOTAL osdmap size, I think it is a very small size per OSD.

It is highly likely that an osdmap for 3K OSDs takes much more than 4K on disk. So again, that was just a lower-bound estimation.

In fact, one can use 'ceph osd getmap > out.dat' to get a better estimate of the osdmap size. So please substitute that size for the 4K in the formula above to get a better estimate of the overall space taken.

It's a bit simplified though, since only half of the entries in the 'meta' pool are full osdmaps (the other half are incrementals). Hence you might want to use 458994/2 * sizeof(osdmap) + 458994/2 * 4K in the above formula.

This is again a sort of lower-bound estimation, but with better accuracy.
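
As a rough illustration of that formula (a sketch only: the 458994 object count and the 3000 OSDs come from this thread, the 4K figure is the incremental-map placeholder used above, and 'stat -c %s' assumes GNU coreutils):

ceph osd getmap > osdmap.bin            # current full osdmap, as suggested above
OSDMAP_BYTES=$(stat -c %s osdmap.bin)   # use its size as sizeof(osdmap)

# half full maps at sizeof(osdmap), half incrementals at ~4K, times the OSD count
awk -v n=458994 -v m="$OSDMAP_BYTES" -v osds=3000 \
    'BEGIN { printf "%.2f TiB\n", (n/2 * m + n/2 * 4096) * osds / 2^40 }'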



But we have a lot of OSDs with min_alloc_size=64k, which was the default in previous Ceph versions for rotational drives (all our SSDs are behind old RAID controllers).

ceph daemon osd.10 bluestore allocator dump block | head -10
{
    "capacity": 479557844992,
    "alloc_unit": 65536,

But even with min_alloc=64k it will not be a big amount of data: 458994 * 64k = *~28 GiB*. I think we have about *150 GB+* extra per SSD OSD.

Yeah, you should use 64K instead of 4K in the above formula if the majority of your OSDs use a 64K alloc unit, or take that into account some other way (e.g. half at 4K and half at 64K) - but I'm leaving this as a "home exercise" for you. The main point here is that a single object takes at least alloc_unit bytes on disk, so I was trying to make the assessment without knowing the actual osdmap size, using the alloc unit instead - just to check whether we get numbers of the same order of magnitude. And ~28 GB and 150 GB aren't THAT different - a 1 MB osdmap, for example, could easily make up the gap. I.e. the osdmap leak could indeed be a real factor here, and hence it's worth additional investigation.

Anyway - please redo the math with the obtained osdmap size. It could change the resulting estimate dramatically.
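
A sketch of that adjustment, building on the previous snippet (assumptions: the same counts, a 64K alloc unit, reuse of $OSDMAP_BYTES from above, and that each on-disk object is rounded up to whole alloc units, in line with the "at least alloc_unit per object" point):

awk -v n=458994 -v m="$OSDMAP_BYTES" -v au=65536 -v osds=3000 '
    function roundup(x, a) { return int((x + a - 1) / a) * a }
    BEGIN {
        full = roundup(m, au)      # full osdmap, rounded up to whole alloc units
        inc  = roundup(4096, au)   # incremental map, likewise
        printf "%.2f TiB\n", (n/2 * full + n/2 * inc) * osds / 2^40
    }'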


For example, SSD with min_alloc=4k:
ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP    META     AVAIL   %USE   VAR   PGS  STATUS
126  ssd    0.00005  1.00000   447 GiB  374 GiB  300 GiB  72 GiB  1.4 GiB  73 GiB  83.64  1.00  137  up

with min_alloc=64k:
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP    META     AVAIL   %USE   VAR   PGS  STATUS
10  ssd    0.00005  0.75000   447 GiB  405 GiB  320 GiB  83 GiB  1.4 GiB  42 GiB  90.59  1.00  114  up

The diff is not as big as 4k vs 64k would suggest.

Right. I don't know the reason atm. Maybe leaking osdmaps is not the only issue. Please do the corrected math as per above, though.



On Thu, 19 Sep 2024 at 12:33, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:

    Hi Konstantin,

    osd_target_transaction_size should control that.

    I've heard of it being raised to 150 with no obvious issues. Going
    beyond that is at your own risk, so I'd suggest applying incremental
    increases if needed.
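
    (A sketch only - the numbers are illustrative: bump the value in
    steps and watch trimming progress and OSD load before raising it
    further.)

    ceph config set osd osd_target_transaction_size 75
    # observe for a while, then e.g.:
    ceph config set osd osd_target_transaction_size 150
    ceph config get osd osd_target_transaction_size   # verify the current value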


    Thanks,

    Igor

    On 9/19/2024 10:44 AM, Konstantin Shalygin wrote:
    Hi Igor,

    On 18 Sep 2024, at 18:22, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:

    I recall a couple of cases where continuous osdmap epoch growth
    kept filling OSDs with the corresponding osdmap data, which can be
    tricky to catch.

    Please run 'ceph tell osd.N status' for a couple of affected
    OSDs twice, with e.g. a 10 min interval between the runs.

    Then check the delta between the oldest_map and newest_map fields -
    the delta should neither be very large (hundreds of thousands)
    nor grow rapidly within the observed interval.
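
    (A sketch of that check, assuming jq is available for parsing the
    JSON output; osd.126 is just an example id from this thread:)

    for i in 1 2; do
        ceph tell osd.126 status | \
            jq '{oldest_map, newest_map, delta: (.newest_map - .oldest_map)}'
        [ "$i" -eq 1 ] && sleep 600   # wait ~10 minutes between the two samples
    done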

    A side question on this topic: which option controls how many maps
    are pruned? Currently I need to trim 1M osdmaps, but when a new map
    is issued, only 30 old maps are removed. Which option controls that
    value of 30?


    Thanks,
    k

    --
    Igor Fedotov
    Ceph Lead Developer

    Looking for help with your Ceph cluster? Contact us at https://croit.io

    croit GmbH, Freseniusstr. 31h, 81247 Munich
    CEO: Martin Verges - VAT-ID: DE310638492
    Com. register: Amtsgericht Munich HRB 231263
    Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



