Re: High usage (DATA column) on dedicated for OMAP only OSDs

Whenever we've seen osdmaps not being trimmed, we've made sure that
any down OSDs are out+destroyed, and then have rolled a restart
through the mons. On recent Pacific at least, this has reliably
gotten us out of that situation.
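
Roughly, the sequence looks like this - a sketch only, with placeholder
ids/hosts, assuming a systemd-managed (non-cephadm) deployment:

# confirm nothing down is still registered in the osdmap
ceph osd tree down
ceph osd out <osd-id>
ceph osd destroy <osd-id> --yes-i-really-mean-it

# then restart the mons one at a time, waiting for quorum in between
systemctl restart ceph-mon@<mon-host>
ceph quorum_status | jq .quorum_names

(With cephadm, 'ceph orch daemon restart mon.<host>' would replace the
systemctl call.)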

Josh

On Thu, Sep 19, 2024 at 9:14 AM Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
>
> Here it goes beyond my expertise.
>
> I've seen unbounded osdmap epoch growth in two completely different cases,
> and I can't say what's causing it this time.
>
> But IMO you shouldn't do any osdmap trimming yourself - that could
> easily result in unpredictable behavior. So I'd encourage you to find
> a way for the cluster to do it gracefully by itself.
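>
> (To watch whether the mons are actually trimming, something like the
> following should work - a sketch assuming jq is installed and that I
> remember the 'ceph report' field names correctly:
>
> ceph report 2>/dev/null | jq '.osdmap_last_committed - .osdmap_first_committed'
>
> The number should shrink over time once trimming resumes.)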
>
>
> Thanks,
>
> Igor
>
> On 9/19/2024 5:16 PM, Александр Руденко wrote:
> > Igor, thanks, very helpful.
> >
> > Our current osdmap weighs 1.4 MB, and that changes all the calculations.
> >
> > Looks like this could be our case.
> >
> > I think we're in this situation because of a long backfill which is
> > still in progress and has been running for the last 3 weeks.
> > Can we drop some of the osdmaps before the rebalance completes?
> >
> >
> >
> >
> >
> >
> > On Thu, Sep 19, 2024 at 15:38, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
> >
> >     please see my comments inline.
> >
> >
> >     On 9/19/2024 1:53 PM, Александр Руденко wrote:
> >>     Igor, thanks!
> >>
> >>     > What are the numbers today?
> >>
> >>     Today we have the same "oldest_map": 2408326, while "newest_map"
> >>     has grown to 2637838 (*+2191*).
> >>
> >>     ceph-objectstore-tool --op meta-list --data-path
> >>     /var/lib/ceph/osd/ceph-70 | grep osdmap | wc -l
> >>     458994
> >>
> >>     Can you clarify this, please:
> >>
> >>     > and then multiply by amount of OSDs to learn the minimal space
> >>     taken by this data
> >>
> >>     458994 * 4k * OSDs count = "_size of osdmaps on *ONE* OSD_" or
> >>     "_total size of osdmaps on *ALL* OSDs_" ?
> >
> >     Yes, this is a lower bound estimation for osdmap size on all OSDs.
> >
> >
> >>
> >>     Because we have about 3k OSDs, and 458994 * 4k * 3000 = ~5TB, which
> >>     could never fit on ONE OSD.
> >>     But if it is the TOTAL osdmap size, I think it is a very small
> >>     amount per OSD.
> >
> >     It's highly likely that an osdmap for 3K OSDs takes much more than
> >     4K on disk. So again, that was just a lower bound estimation.
> >
> >     In fact, you can use 'ceph osd getmap > out.dat' to get a better
> >     estimate of the osdmap size. Then substitute that size for the 4K in
> >     the formula above to get a better estimate of the overall space taken.
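> >
> >     For example (just to check the on-disk size of a single full map;
> >     the path is arbitrary):
> >
> >     ceph osd getmap -o /tmp/osdmap.bin
> >     ls -lh /tmp/osdmap.bin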
> >
> >     It's a bit simplified though, since only half of the entries in the
> >     'meta' pool are full osdmaps (the rest are incrementals). Hence you
> >     might want to use 458994/2 * sizeof(osdmap) + 458994/2 * 4K in the
> >     above formula.
> >
> >     Which is again a sort of lower bound estimation, but with better
> >     accuracy.
> >
> >
> >>
> >>     But we have a lot of OSDs with min_alloc_size=64k, which was the
> >>     default for rotational drives in previous Ceph versions (all our
> >>     SSDs sit behind old RAID controllers).
> >>
> >>     ceph daemon osd.10 bluestore allocator dump block | head -10
> >>     {
> >>         "capacity": 479557844992,
> >>         "alloc_unit": 65536,
> >>
> >>     But even with min_alloc=64k it would not be a big amount of data:
> >>     458994 * 64k = *~23GB*. I think we have about *150GB+* of extra
> >>     space used per SSD OSD.
> >>
> >     Yeah, you should use 64K instead of 4K in the above formula if the
> >     majority of your OSDs use a 64K alloc unit. Or account for the mix
> >     some other way (e.g. take half 4K and half 64K) - I'll leave that as
> >     a "home exercise" for you. The main point is that a single object
> >     takes at least alloc_unit bytes on disk, which is why I tried to make
> >     the assessment from the alloc unit alone, without knowing the actual
> >     osdmap size - just to check whether we get numbers of the same order
> >     of magnitude. And 23GB vs 150GB isn't THAT far apart - a 1M osdmap,
> >     for example, might easily close the gap. I.e. the osdmap leak could
> >     indeed be a real factor here, and hence it's worth additional
> >     investigation.
> >
> >     Anyway - please redo the math with the osdmap size you actually
> >     obtain. It could change the resulting estimate dramatically.
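> >
> >     As a sketch, plugging the measured pieces into the formula above
> >     (placeholder values - use your own entry count, map size and alloc
> >     unit):
> >
> >     N=458994                                     # osdmap entries on the OSD
> >     OSDMAP_BYTES=$(stat -c %s /tmp/osdmap.bin)   # full map size from 'ceph osd getmap'
> >     ALLOC=65536                                  # or 4096, per the OSD's alloc_unit
> >     echo $(( N/2 * OSDMAP_BYTES + N/2 * ALLOC )) | numfmt --to=iec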
> >
> >
> >>     For example, an SSD with min_alloc=4k:
> >>     ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP    META     AVAIL   %USE   VAR   PGS  STATUS
> >>     126  ssd    0.00005  1.00000   447 GiB  374 GiB  300 GiB  72 GiB  1.4 GiB  73 GiB  83.64  1.00  137  up
> >>
> >>     and with min_alloc=64k:
> >>     ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP    META     AVAIL   %USE   VAR   PGS  STATUS
> >>     10   ssd    0.00005  0.75000   447 GiB  405 GiB  320 GiB  83 GiB  1.4 GiB  42 GiB  90.59  1.00  114  up
> >>
> >>     The difference is not nearly as big as 4k vs 64k would suggest.
> >
> >     Right. I don't know the reason atm. Maybe leaking osdmaps is not the
> >     only issue. Please do the corrected math as per the above, though.
> >
> >
> >>
> >>     On Thu, Sep 19, 2024 at 12:33, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
> >>
> >>         Hi Konstantin,
> >>
> >>         osd_target_transaction_size should control that.
> >>
> >>         I've heard of it being raised to 150 with no obvious issues.
> >>         Going beyond that is at your own risk, so I'd suggest applying
> >>         incremental increases if needed.
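> >>
> >>         For instance, something along these lines (pick the value you
> >>         are comfortable with; 100 here is just an example):
> >>
> >>         ceph config set osd osd_target_transaction_size 100
> >>         # or try it on one OSD first, without persisting it:
> >>         ceph tell osd.N injectargs '--osd_target_transaction_size=100'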
> >>
> >>
> >>         Thanks,
> >>
> >>         Igor
> >>
> >>         On 9/19/2024 10:44 AM, Konstantin Shalygin wrote:
> >>>         Hi Igor,
> >>>
> >>>>         On 18 Sep 2024, at 18:22, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
> >>>>
> >>>>         I recall a couple of cases where permanent osdmap epoch
> >>>>         growth was filling OSDs with the corresponding osdmap data,
> >>>>         which can be tricky to catch.
> >>>>
> >>>>         Please run 'ceph tell osd.N status' for a couple of the
> >>>>         affected OSDs twice, within e.g. a 10 min interval.
> >>>>
> >>>>         Then check the delta between the oldest_map and newest_map
> >>>>         fields - the delta should neither be very large (hundreds of
> >>>>         thousands) nor grow rapidly within the observed interval.
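> >>>>
> >>>>         With jq the delta is easy to eyeball (osd.N is a placeholder):
> >>>>
> >>>>         ceph tell osd.N status | jq '.newest_map - .oldest_map'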
> >>>
> >>>         A side question on this topic: which option controls how many
> >>>         maps get pruned? Currently I need to trim 1M osdmaps, but when
> >>>         a new map is issued, only 30 old maps are removed. Which option
> >>>         controls that value of 30?
> >>>
> >>>
> >>>         Thanks,
> >>>         k
> >>
> >>         --
> >>         Igor Fedotov
> >>         Ceph Lead Developer
> >>
> >>         Looking for help with your Ceph cluster? Contact us at https://croit.io
> >>
> >>         croit GmbH, Freseniusstr. 31h, 81247 Munich
> >>         CEO: Martin Verges - VAT-ID: DE310638492
> >>         Com. register: Amtsgericht Munich HRB 231263
> >>         Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
> >>
> >     --
> >     Igor Fedotov
> >     Ceph Lead Developer
> >
> >     Looking for help with your Ceph cluster? Contact us at https://croit.io
> >
> >     croit GmbH, Freseniusstr. 31h, 81247 Munich
> >     CEO: Martin Verges - VAT-ID: DE310638492
> >     Com. register: Amtsgericht Munich HRB 231263
> >     Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
> >
> --
> Igor Fedotov
> Ceph Lead Developer
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



