Please see my comments inline.
On 9/19/2024 1:53 PM, Александр Руденко wrote:
Igor, thanks!
> What are the numbers today?
Today we have the same "oldest_map": 2408326 and "newest_map":
2637838, *+2191*.
ceph-objectstore-tool --op meta-list --data-path
/var/lib/ceph/osd/ceph-70 | grep osdmap | wc -l
458994
Can you clarify this, please:
> and then multiply by amount of OSDs to learn the minimal space taken
by this data
458994 * 4k * OSDs count = "_size of osdmaps on *ONE* OSD_" or "_total
size of osdmaps on *ALL* OSDs_" ?
Yes, this is a lower bound estimation for osdmap size on all OSDs.
Because we have about 3k OSDs, and 458994 * 4k * 3000 = ~5TB, which can't all be placed on ONE OSD.
But if it is TOTAL osdmap size, I think it is a very small size per OSD.
It is highly likely that an osdmap for 3K OSDs takes much more than 4K on disk,
so again that was just a lower-bound estimate.
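To spell out the arithmetic with the numbers above (just an illustration of the lower bound, not exact on-disk sizes):

458994 entries * 4 KiB ≈ 1.75 GiB   - lower bound per OSD
1.75 GiB * 3000 OSDs   ≈ 5.3 TiB    - lower bound summed over the whole cluster

So the ~5TB figure is the total across all OSDs, not the amount on a single OSD.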
In fact one can use 'ceph osd getmap >out.dat' to get a better estimate of the
osdmap size. So please substitute the obtained size for 4K in the formula above
to get a better estimate of the overall space taken.
It's a bit simplified though, since just half of the entries in the 'meta'
pool are full osdmaps. Hence you might want to use 458994/2 *
sizeof(osdmap) + 458994/2 * 4K in the above formula.
Which is again a sort of lower-bound estimate, but with better accuracy.
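For instance, something along these lines (a rough sketch, assuming bash; the entry count is the one from your meta-list output):

ceph osd getmap > /tmp/osdmap.bin
MAPSZ=$(stat -c %s /tmp/osdmap.bin)                  # actual full osdmap size in bytes
echo $(( 458994 / 2 * MAPSZ + 458994 / 2 * 4096 ))   # rough lower bound in bytes, per OSD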
But we have a lot of OSDs with min_alloc_size=64k, which was the default in
previous Ceph versions for rotational drives (all our SSDs are behind
old RAID controllers).
ceph daemon osd.10 bluestore allocator dump block | head -10
{
"capacity": 479557844992,
"alloc_unit": 65536,
But even with min_alloc=64k it will not be a big amount of data: 458994
* 64k = *~23GB*. I think we have about *150GB+* extra per SSD OSD.
Yeah, you should use 64K instead of 4K in the above formula if the majority of
your OSDs use a 64K alloc unit. Or take this into account some other way
(e.g. take half 4K and half 64K). But I'm leaving this as a
"home exercise" for you. The main point here is that a single
object takes at least alloc_unit bytes, hence I was trying to
make the assessment without knowing the actual osdmap size, using the alloc
unit instead - just to check if we get numbers of the same order of magnitude.
And 23GB and 150GB aren't THAT different - having e.g. a 1M osdmap might
easily do the trick. I.e. the osdmap leak indeed could be a real factor
here, and hence it's worth additional investigation.
Anyway - please use the obtained osdmap size. It could change the
resulting estimate dramatically.
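E.g. a sketch of the alloc-unit-aware variant (again assuming bash and MAPSZ obtained via 'ceph osd getmap' as above; rounding each object up to the alloc unit is itself only an approximation):

AU=65536                                             # 64K alloc unit; use 4096 for 4K OSDs
FULL=$(( (MAPSZ + AU - 1) / AU * AU ))               # full osdmap rounded up to the alloc unit
echo $(( 458994 / 2 * FULL + 458994 / 2 * AU ))      # rough per-OSD bytes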
For example, an SSD with min_alloc=4k:

ID  CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP   META    AVAIL  %USE  VAR  PGS STATUS
126 ssd   0.00005 1.00000  447 GiB 374 GiB 300 GiB 72 GiB 1.4 GiB 73 GiB 83.64 1.00 137 up

and with min_alloc=64k:

ID  CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP   META    AVAIL  %USE  VAR  PGS STATUS
10  ssd   0.00005 0.75000  447 GiB 405 GiB 320 GiB 83 GiB 1.4 GiB 42 GiB 90.59 1.00 114 up
The diff is not as big as 4k vs 64k.
Right. I don't know the reason at the moment. Maybe leaking osdmaps is not the
only issue. Please do the corrected math as per above though.
Thu, 19 Sep 2024 at 12:33, Igor Fedotov <igor.fedotov@xxxxxxxx>:
Hi Konstantin,
osd_target_transaction_size should control that.
I've heard of it being raised to 150 with no obvious issues. Going
beyond that is at your own risk, so I'd suggest applying incremental
increases if needed.
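For example, a minimal sketch of such an incremental bump (the values are just illustrative):

ceph config set osd osd_target_transaction_size 60    # up from the default of 30
# watch trimming progress and OSD load for a while, then raise further if all looks good
ceph config set osd osd_target_transaction_size 150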
Thanks,
Igor
On 9/19/2024 10:44 AM, Konstantin Shalygin wrote:
Hi Igor,
On 18 Sep 2024, at 18:22, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
I recall a couple of cases where permanent osdmap epoch growth
was filling OSDs with the relevant osdmap info, which could be
tricky to catch.
Please run 'ceph tell osd.N status' for a couple of affected
OSDs twice within e.g. a 10 min interval.
Then check the delta between the oldest_map and newest_map fields -
the delta should neither be very large (hundreds of thousands)
nor grow rapidly within the observed interval.
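Something like this should do (a sketch; substitute an affected OSD id for N):

ceph tell osd.N status | grep -E 'oldest_map|newest_map'
sleep 600
ceph tell osd.N status | grep -E 'oldest_map|newest_map'
# compare newest_map - oldest_map between the two samples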
A side question on the topic: which option controls how many maps get
pruned? Currently I need to trim 1M osdmaps, but when a new map is
issued, only 30 old maps are removed. Which option controls that value of 30?
Thanks,
k
--
Igor Fedotov
Ceph Lead Developer
Looking for help with your Ceph cluster? Contact us at https://croit.io
croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx