Hi Nikola,
Just to be clear, these were the settings that you changed back to the
defaults?
Non-default settings are:
"bluestore_cache_size_hdd": {
"default": "1073741824",
"mon": "4294967296",
"final": "4294967296"
},
"bluestore_cache_size_ssd": {
"default": "3221225472",
"mon": "4294967296",
"final": "4294967296"
},
...
"osd_memory_cache_min": {
"default": "134217728",
"mon": "2147483648",
"final": "2147483648"
},
"osd_memory_target": {
"default": "4294967296",
"mon": "17179869184",
"final": "17179869184"
},
"osd_scrub_sleep": {
"default": 0,
"mon": 0.10000000000000001,
"final": 0.10000000000000001
},
"rbd_balance_parent_reads": {
"default": false,
"mon": true,
"final": true
},
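(In case it helps cross-checking on your side, here is a rough sketch of how a diff like the above can be collected from every OSD admin socket on a host; the exact output format of "config diff" may differ between releases, so treat it as an illustration only:)

#!/usr/bin/env python3
# Rough sketch: collect "config diff" (non-default settings) from every OSD
# admin socket on the local host. Assumes sockets live under /var/run/ceph;
# adjust the glob for your cluster name / container setup.
import glob
import json
import subprocess

for sock in sorted(glob.glob("/var/run/ceph/ceph-osd.*.asok")):
    osd_id = sock.split("ceph-osd.")[1].split(".asok")[0]
    out = subprocess.run(["ceph", "--admin-daemon", sock, "config", "diff"],
                         capture_output=True, text=True, check=True).stdout
    print(f"=== osd.{osd_id} ===")
    print(json.dumps(json.loads(out), indent=4))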
Thanks,
Mark
On 5/23/23 12:17, Nikola Ciprich wrote:
Hello Igor,
just reporting that since the last restart (after reverting the changed values
to their defaults) the performance hasn't decreased (and it's been over
two weeks now). So either it helped after all, or the drop is caused
by something else I have yet to figure out.. we've automated the test,
so once the performance drops below the threshold, I'll know about it and
investigate further (and report back)
cheers
with regards
nik
On Wed, May 10, 2023 at 07:36:06AM +0200, Nikola Ciprich wrote:
Hello Igor,
You didn't reset the counters every hour, did you? So having the average
subop_w_latency growing that way means the current values were much higher
than before.
bummer, I didn't.. I've updated the gather script to reset the stats, wait 10 minutes and then
gather the perf data, once each hour. It's been running since yesterday, so now we'll have to wait
about one week for the problem to appear again..
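(roughly, the loop now looks something like this - a simplified sketch rather than the actual script; the output directory is made up and "perf reset all" / "perf dump" are the admin socket commands as I understand them:)

#!/usr/bin/env python3
# Simplified sketch of the hourly gather loop: reset perf counters on all
# local OSDs, let 10 minutes of stats accumulate, then dump the counters to
# timestamped files. /var/log/ceph-perf is an arbitrary example directory.
import glob
import subprocess
import time

def admin(sock, *cmd):
    return subprocess.run(["ceph", "--admin-daemon", sock, *cmd],
                          capture_output=True, text=True, check=True).stdout

while True:
    socks = sorted(glob.glob("/var/run/ceph/ceph-osd.*.asok"))
    for sock in socks:
        admin(sock, "perf", "reset", "all")   # drop the long-term averages
    time.sleep(600)                           # collect 10 minutes of fresh stats
    stamp = time.strftime("%Y%m%d-%H%M%S")
    for sock in socks:
        osd_id = sock.split("ceph-osd.")[1].split(".asok")[0]
        with open(f"/var/log/ceph-perf/osd.{osd_id}.{stamp}.json", "w") as f:
            f.write(admin(sock, "perf", "dump"))
    time.sleep(3000)                          # roughly one hour per cycle in total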
Curious whether subop latencies were growing for every OSD or just a subset (maybe
even just a single one) of them?
since I only have long-term averages, it's not easy to say, but based on what we have:
only two OSDs got an avg sub_w_lat > 0.0006, with no clear relation between them.
19 OSDs got an avg sub_w_lat > 0.0005 - this is more interesting - 15 of them
are on the later-installed nodes (note that those nodes have almost no VMs running,
so they are much less used!) and 4 are on other nodes. but also note that not all
of the OSDs on the suspicious nodes are over the threshold, it's 6, 6 and 3 out of 7 OSDs
per node. but still it's strange..
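(for completeness, the averages were computed roughly like this - a sketch; it assumes one perf dump JSON per OSD under perf-dumps/ and reads the sum/avgcount fields of osd -> subop_w_latency:)

#!/usr/bin/env python3
# Sketch: compute the average subop_w_latency per OSD from saved perf dumps
# and list the OSDs above a threshold, worst first. The perf-dumps/ layout
# and file naming are just an example.
import glob
import json

THRESHOLD = 0.0005  # seconds

results = []
for path in sorted(glob.glob("perf-dumps/osd.*.json")):
    with open(path) as f:
        dump = json.load(f)
    lat = dump["osd"]["subop_w_latency"]
    avg = lat["sum"] / lat["avgcount"] if lat["avgcount"] else 0.0
    results.append((avg, path))

for avg, path in sorted(results, reverse=True):
    flag = "  <-- over threshold" if avg > THRESHOLD else ""
    print(f"{path}: avg sub_w_lat = {avg:.6f}s{flag}")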
Next time you reach the bad state, please do the following if possible:
- reset perf counters for every OSD
- leave the cluster running for 10 mins and collect perf counters again.
- then start restarting OSDs one-by-one, starting with the worst OSD (in terms
of subop_w_lat from the prev step).
Wouldn't it be sufficient to reset just a few OSDs before the cluster is back to normal?
will do once it slows down again.
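(for the one-by-one restarts I'm planning something along these lines - just a sketch; the latency list is a made-up example of the ranking output and the ceph-osd@<id> unit names are assumed:)

#!/usr/bin/env python3
# Sketch: restart the worst OSDs one by one, ordered by the avg subop_w_lat
# measured over the 10-minute window, pausing between restarts so the effect
# on the cluster can be observed. The list below is a hypothetical example.
import subprocess
import time

worst_first = [(112, 0.00071), (98, 0.00066), (53, 0.00058)]  # (osd id, avg latency)

for osd_id, avg in worst_first:
    print(f"restarting osd.{osd_id} (avg sub_w_lat {avg:.5f}s)")
    subprocess.run(["systemctl", "restart", f"ceph-osd@{osd_id}"], check=True)
    time.sleep(300)  # give the cluster time to settle before the next restart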
I see a very similar crash reported here: https://tracker.ceph.com/issues/56346
so I'm not reporting it..
Do you think this might somehow be the cause of the problem? Anything else I should
check in perf dumps or elsewhere?
Hmm... don't know yet. Could you please share the last 20K lines prior to the crash from
e.g. two sample OSDs?
https://storage.linuxbox.cz/index.php/s/o5bMaGMiZQxWadi
And the crash isn't permanent, OSDs are able to start after the second(?)
shot, aren't they?
yes, actually they start after issuing systemctl restart ceph-osd@xx, it just takes
a long time performing log recovery..
If I can provide more info, please let me know
BR
nik
--
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava
tel.: +420 591 166 214
fax: +420 596 621 273
mobil: +420 777 093 799
www.linuxbox.cz
mobil servis: +420 737 238 656
email servis: servis@xxxxxxxxxxx
-------------------------------------
--
Best Regards,
Mark Nelson
Head of R&D (USA)
Clyso GmbH
p: +49 89 21552391 12
a: Loristraße 8 | 80335 München | Germany
w: https://clyso.com | e: mark.nelson@xxxxxxxxx
We are hiring: https://www.clyso.com/jobs/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx