Here this goes beyond my expertise.
I have seen unbounded osdmap epoch growth in two completely different
cases, and I'm unable to say what is causing it this time.
But IMO you shouldn't do any osdmap trimming yourself - that would
likely result in unpredictable behavior. So I'd encourage you to find
a way for the cluster to do that gracefully by itself.
Thanks,
Igor
On 9/19/2024 5:16 PM, Александр Руденко wrote:
Igor, thanks, very helpful.
Our current osdmap weighs 1.4MB, and that changes all the calculations..
It looks like this could be our case.
I think we got into this situation due to the long backfill that is
in progress now and has been going on for the last 3 weeks.
Can we drop some of the osdmaps before the rebalance completes?
Thu, 19 Sep 2024 at 15:38, Igor Fedotov <igor.fedotov@xxxxxxxx>:
Please see my comments inline.
On 9/19/2024 1:53 PM, Александр Руденко wrote:
Igor, thanks!
> What are the numbers today?
Today we have the same "oldest_map": 2408326, and "newest_map" is now
2637838 (*+2191*).
ceph-objectstore-tool --op meta-list --data-path
/var/lib/ceph/osd/ceph-70 | grep osdmap | wc -l
458994
Can you clarify this, please:
> and then multiply by amount of OSDs to learn the minimal space
taken by this data
458994 * 4k * OSD count = "_size of osdmaps on *ONE* OSD_" or
"_total size of osdmaps on *ALL* OSDs_"?
Yes, this is a lower-bound estimate of the osdmap size across all OSDs.
Because we have about 3k OSDs, and 458994 * 4k * 3000 = ~5TB, which
can't fit on ONE OSD.
But if that is the TOTAL osdmap size, then I think it is a very small
amount per OSD.
Highly likely an osdmap for 3K OSDs takes much more than 4K on
disk. So again, that was just a lower-bound estimate.
In fact one can use 'ceph osd getmap > out.dat' to get a better
estimate of the osdmap size. So please substitute that for the 4K in
the formula above to get a better estimate of the overall space taken.
It's a bit simplified though, since only about half of the entries in
the 'meta' pool are full osdmaps (the rest are incrementals). Hence
you might want to use 458994/2 * sizeof(osdmap) + 458994/2 * 4K in
the above formula.
Which is again a sort of lower-bound estimate, but with better
accuracy.
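Something along these lines should do for measuring the full map (a
rough sketch only - the output path is arbitrary, and 'stat -c %s'
assumes GNU coreutils):

# dump the current full osdmap and check its size
ceph osd getmap -o /tmp/osdmap.bin
ls -lh /tmp/osdmap.bin
# size in bytes, to plug into the formula above
stat -c %s /tmp/osdmap.bin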
But we have a lot of OSDs with min_alloc_size=64k, which was the
default in previous Ceph versions for rotational drives (all our
SSDs are behind old RAID controllers).
ceph daemon osd.10 bluestore allocator dump block | head -10
{
"capacity": 479557844992,
"alloc_unit": 65536,
But even with min_alloc=64k it would not be a big amount of data:
458994 * 64k = *~23GB*. I think we have about *150GB+* extra per
SSD OSD.
Yeah, you should use 64K instead of 4K in the above formula if the
majority of your OSDs use a 64K alloc unit, or account for the mix in
some other way (e.g. take half at 4K and half at 64K). But I'm leaving
this as a "home exercise" for you. The main point here is that a
single object takes at least alloc_unit bytes on disk. Hence I was
trying to make the assessment without knowing the actual osdmap size,
using the alloc unit instead - just to check if we get numbers of the
same order of magnitude. And 23GB and 150GB aren't THAT far apart -
having e.g. a 1MB osdmap might easily close the gap. I.e. the osdmap
leak could indeed be a real factor here, and hence it's worth
additional investigation.
Anyway - please use the actual osdmap size you obtained. It could
change the resulting estimate dramatically.
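If it helps, a back-of-the-envelope version of that per-OSD estimate
could look like the script below (just a sketch: the object count is
the one from your meta-list output, the 4K incremental size is the
same guess as above, and the alloc unit has to match the OSD in
question):

# rough per-OSD lower-bound estimate: half full maps, half incrementals,
# each object rounded up to the allocation unit of this OSD
ENTRIES=458994                         # osdmap objects (meta-list | grep osdmap | wc -l)
FULLMAP=$(stat -c %s /tmp/osdmap.bin)  # full osdmap size from 'ceph osd getmap'
INCR=4096                              # assumed incremental osdmap size
AU=65536                               # alloc_unit of this OSD (64K here, 4K on others)

roundup() { echo $(( ( $1 + AU - 1 ) / AU * AU )); }

PER_OSD=$(( ENTRIES / 2 * $(roundup "$FULLMAP") + ENTRIES / 2 * $(roundup "$INCR") ))
echo "estimated osdmap footprint on this OSD: $(( PER_OSD / 1024 / 1024 / 1024 )) GiB"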
For example, SSD with min_alloc=4k:
ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP    META     AVAIL   %USE   VAR   PGS  STATUS
126  ssd    0.00005  1.00000   447 GiB  374 GiB  300 GiB  72 GiB  1.4 GiB  73 GiB  83.64  1.00  137  up
with min_alloc=64k:
ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP    META     AVAIL   %USE   VAR   PGS  STATUS
10   ssd    0.00005  0.75000   447 GiB  405 GiB  320 GiB  83 GiB  1.4 GiB  42 GiB  90.59  1.00  114  up
The diff is not as big as 4k vs 64k would suggest..
Right. I don't know the reason at the moment. Maybe leaking osdmaps is
not the only issue. Please do the corrected math as per the above though..
Thu, 19 Sep 2024 at 12:33, Igor Fedotov <igor.fedotov@xxxxxxxx>:
Hi Konstantin,
osd_target_transaction_size should control that.
I've heard of it being raised to 150 with no obvious issues.
Going beyond that is at your own risk, so I'd suggest applying
incremental increases if needed.
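For example (just a sketch, assuming a release with the centralized
config store; the exact values and OSD id are up to you):

# bump the trimming batch size gradually, e.g. 30 -> 60 -> 100 -> 150
ceph config set osd osd_target_transaction_size 60
# verify it took effect on a given OSD
ceph config show osd.10 | grep osd_target_transaction_size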
Thanks,
Igor
On 9/19/2024 10:44 AM, Konstantin Shalygin wrote:
Hi Igor,
On 18 Sep 2024, at 18:22, Igor Fedotov
<igor.fedotov@xxxxxxxx> wrote:
I recall a couple of cases where permanent osdmap epoch
growth ended up filling OSDs with the corresponding osdmap
data. That can be tricky to catch.
Please run 'ceph tell osd.N status' for a couple of
affected OSDs twice within e.g. a 10 min interval.
Then check the delta between the oldest_map and newest_map
fields - the delta should neither be very large (hundreds
of thousands) nor grow rapidly within the observed
interval.
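For example, something like this (a sketch - it assumes jq is
available and uses the field names from the status output):

# osdmap epoch span currently kept by an OSD
ceph tell osd.10 status | jq '.newest_map - .oldest_map'
# run it again after ~10 minutes and compare the two deltas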
A side question on the topic: which option controls how many
maps are pruned? Currently I need to trim 1M osdmaps, but when
a new map is issued, only 30 old maps are removed. Which option
controls this value of 30?
Thanks,
k
--
Igor Fedotov
Ceph Lead Developer
Looking for help with your Ceph cluster? Contact us at https://croit.io
croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx