Whenever we've seen osdmaps not being trimmed, we've made sure that any down OSDs are out+destroyed, and then have rolled a restart through the mons. As of recent Pacific, at least, this seems to have reliably gotten us out of this situation.

Josh

On Thu, Sep 19, 2024 at 9:14 AM Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
>
> Here it goes beyond my expertise.
>
> I saw unbounded osdmap epoch growth in two completely different cases,
> and I'm unable to say what's causing it this time.
>
> But IMO you shouldn't do any osdmap trimming yourself - that could
> likely result in unpredictable behavior. So I'd encourage you to find
> a way for the cluster to do that gracefully by itself.
>
> Thanks,
> Igor
>
> On 9/19/2024 5:16 PM, Александр Руденко wrote:
> > Igor, thanks, very helpful.
> >
> > Our current osdmap weighs 1.4MB, and that changes all the calculations..
> >
> > Looks like this could be our case.
> >
> > I think we got into this situation due to a long backfill which is
> > still running and has been going on for the last 3 weeks.
> > Can we drop some of the osdmaps before the rebalance completes?
> >
> > Thu, 19 Sep 2024 at 15:38, Igor Fedotov <igor.fedotov@xxxxxxxx>:
> >
> >     please see my comments inline.
> >
> >     On 9/19/2024 1:53 PM, Александр Руденко wrote:
> >> Igor, thanks!
> >>
> >> > What are the numbers today?
> >>
> >> Today we have the same "oldest_map": 2408326 and "newest_map":
> >> 2637838, *+2191*.
> >>
> >> ceph-objectstore-tool --op meta-list --data-path /var/lib/ceph/osd/ceph-70 | grep osdmap | wc -l
> >> 458994
> >>
> >> Can you clarify this, please:
> >>
> >> > and then multiply by the amount of OSDs to learn the minimal space
> >> > taken by this data
> >>
> >> Is 458994 * 4k * OSD count the "size of osdmaps on *ONE* OSD" or the
> >> "total size of osdmaps on *ALL* OSDs"?
> >
> > Yes, this is a lower-bound estimation for the osdmap size on all OSDs.
> >
> >> Because we have about 3k OSDs, and 458994 * 4k * 3000 = ~5TB, which
> >> could never fit on ONE OSD.
> >> But if it is the TOTAL osdmap size, I think it is a very small size
> >> per OSD.
> >
> > Highly likely the osdmap for 3K OSDs takes much more than 4K on
> > disk, so again that was just a lower-bound estimation.
> >
> > In fact, one can use 'ceph osd getmap > out.dat' to get a better
> > estimation of the osdmap size. So please substitute it for the 4K in the
> > formula above to get a better estimation of the overall space taken.
> >
> > That's a bit simplified though, since only half of the entries in the
> > 'meta' pool are full osdmaps. Hence you might want to use
> > 458994/2 * sizeof(osdmap) + 458994/2 * 4K in the above formula.
> >
> > Which is again a sort of lower-bound estimation, but with better
> > accuracy.
> >
> >> But we have a lot of OSDs with min_alloc_size=64k, which was the
> >> default in previous Ceph versions for rotational drives (all
> >> our SSDs are behind old RAID controllers).
> >>
> >> ceph daemon osd.10 bluestore allocator dump block | head -10
> >> {
> >>     "capacity": 479557844992,
> >>     "alloc_unit": 65536,
> >>
> >> But even with min_alloc=64k it would not be a big amount of data:
> >> 458994 * 64k = *~23GB*. I think we have about *150GB+* extra per
> >> SSD OSD.
> >>
> > Yeah, you should use 64K instead of 4K in the above formula if
> > the majority of your OSDs use a 64K alloc unit, or take this
> > into account some other way (e.g. take half 4K and half 64K). But
> > I'm leaving this as a "home exercise" for you. The main
> > point here is that a single object takes at least alloc_unit
> > size, and hence I was trying to make the assessment without
> > knowing the actual osdmap size, using the alloc unit instead - just to
> > check whether we get numbers of the same order of magnitude. And 23GB and
> > 150GB aren't THAT different - having e.g. a 1M osdmap might easily do
> > the trick. I.e. the osdmap leak could indeed be a real factor
> > here, and hence it's worth additional investigation.
> >
> > Anyway - please redo the math with the actual osdmap size; it could
> > change the resulting estimate dramatically.
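
A rough sketch of the estimation discussed above, built only from commands already shown in this thread (ceph osd getmap, ceph-objectstore-tool meta-list) plus standard coreutils. The OSD data path, the 458994 object count and the 3000-OSD figure are the examples quoted here; the variable names and /tmp path are illustrative, and ceph-objectstore-tool must be run against a stopped OSD:

# Size of one full osdmap for this cluster (as Igor suggests):
ceph osd getmap > /tmp/osdmap.bin
MAP_BYTES=$(stat -c%s /tmp/osdmap.bin)

# Number of osdmap entries stored by one (stopped) OSD:
N=$(ceph-objectstore-tool --op meta-list --data-path /var/lib/ceph/osd/ceph-70 | grep osdmap | wc -l)

# Lower-bound estimate: roughly half the entries are full maps, the other half
# are small incrementals that still occupy at least one allocation unit each.
ALLOC=4096        # use 65536 for OSDs created with min_alloc_size=64k
OSDS=3000
PER_OSD=$(( N/2 * MAP_BYTES + N/2 * ALLOC ))
echo "per OSD: $(( PER_OSD / 1024 / 1024 )) MiB, cluster-wide: $(( PER_OSD * OSDS / 1024 / 1024 / 1024 )) GiB"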
> >> For example, an SSD with min_alloc=4k:
> >> ID  CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP   META    AVAIL  %USE  VAR  PGS STATUS
> >> 126 ssd   0.00005 1.00000  447 GiB 374 GiB 300 GiB 72 GiB 1.4 GiB 73 GiB 83.64 1.00 137 up
> >>
> >> and with min_alloc=64k:
> >> ID  CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP   META    AVAIL  %USE  VAR  PGS STATUS
> >> 10  ssd   0.00005 0.75000  447 GiB 405 GiB 320 GiB 83 GiB 1.4 GiB 42 GiB 90.59 1.00 114 up
> >>
> >> The difference is not as big as 4k vs 64k would suggest..
> >
> > Right, I don't know the reason at the moment. Maybe leaking osdmaps is
> > not the only issue. Please do the corrected math as per the above though..
> >
> >> Thu, 19 Sep 2024 at 12:33, Igor Fedotov <igor.fedotov@xxxxxxxx>:
> >>
> >>     Hi Konstantin,
> >>
> >>     osd_target_transaction_size should control that.
> >>
> >>     I've heard of it being raised to 150 with no obvious issues.
> >>     Going beyond that is at your own risk, so I'd suggest applying
> >>     incremental increases if needed.
> >>
> >>     Thanks,
> >>     Igor
> >>
> >>     On 9/19/2024 10:44 AM, Konstantin Shalygin wrote:
> >>>     Hi Igor,
> >>>
> >>>>     On 18 Sep 2024, at 18:22, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
> >>>>
> >>>>     I recall a couple of cases when permanent osdmap epoch
> >>>>     growth has been filling OSDs with the corresponding osdmap
> >>>>     info, which can be tricky to catch.
> >>>>
> >>>>     Please run 'ceph tell osd.N status' for a couple of the
> >>>>     affected OSDs twice within e.g. a 10 min interval.
> >>>>
> >>>>     Then check the delta between the oldest_map and newest_map
> >>>>     fields - the delta should neither be very large (hundreds
> >>>>     of thousands) nor grow rapidly within the observed interval.
> >>>
> >>>     A side question on this topic: which option controls how many
> >>>     maps are pruned? Currently I need to trim 1M osdmaps, but when a
> >>>     new map is issued, only 30 old maps are removed. Which option
> >>>     controls that value of 30?
> >>>
> >>>     Thanks,
> >>>     k
>
> --
> Igor Fedotov
> Ceph Lead Developer
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
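
Putting the two checks from this thread together - the trim-progress check Igor describes and the pruning batch size Konstantin asked about - a minimal sketch might look like the following. osd.10 and the value 150 are just the examples mentioned above; raise osd_target_transaction_size incrementally and only after confirming trimming is otherwise proceeding:

# Sample oldest_map/newest_map twice and compare the deltas:
ceph tell osd.10 status | grep -E '"(oldest|newest)_map"'
sleep 600   # re-sample after ~10 minutes
ceph tell osd.10 status | grep -E '"(oldest|newest)_map"'

# osd_target_transaction_size (30 by default, matching the behaviour Konstantin
# observed) bounds how many old maps are removed per new epoch; per Igor, 150
# has reportedly been used without obvious issues.
ceph config set osd osd_target_transaction_size 150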
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx