Hello Igor,

just reporting that since the last restart (after reverting the changed
values to their defaults) the performance hasn't decreased - and it's been
over two weeks now. So either it helped after all, or the drop is caused
by something else I'll yet have to figure out. We've automated the test,
so once performance drops beyond the threshold, I'll know it and
investigate further (and report).

cheers

with regards

nik

On Wed, May 10, 2023 at 07:36:06AM +0200, Nikola Ciprich wrote:
> Hello Igor,
>
> > You didn't reset the counters every hour, did you? So having average
> > subop_w_latency growing that way means the current values were much
> > higher than before.
>
> bummer, I didn't. I've updated the gather script to reset stats, wait
> 10 minutes and then gather perf data, each hour. It's been running
> since yesterday, so now we'll have to wait about one week for the
> problem to appear again.
>
> > Curious if subop latencies were growing for every OSD or just a
> > subset (maybe even just a single one) of them?
>
> since I only have the long-term average, it's not easy to say, but
> based on what we have:
>
> only two OSDs got avg subop_w_lat > 0.0006, with no clear relation
> between them. 19 OSDs got avg subop_w_lat > 0.0005 - this is more
> interesting - 15 of them are on the later-installed nodes (note that
> those nodes have almost no VMs running, so they are much less used!);
> the other 4 are on other nodes. But also note that not all OSDs on the
> suspicious nodes are over the threshold: it's 6, 6 and 3 out of 7 OSDs
> per node. Still, it's strange..
>
> > Next time you reach the bad state please do the following if
> > possible:
> >
> > - reset perf counters for every OSD
> >
> > - leave the cluster running for 10 mins and collect perf counters
> >   again.
> >
> > - Then start restarting OSDs one-by-one, starting with the worst OSD
> >   (in terms of subop_w_lat from the prev step). Wouldn't it be
> >   sufficient to restart just a few OSDs before the cluster is back
> >   to normal?
>
> will do once it slows down again.
>
> > > I see a very similar crash reported here:
> > > https://tracker.ceph.com/issues/56346 so I'm not reporting it
> > > separately..
> > >
> > > Do you think this might somehow be the cause of the problem?
> > > Anything else I should check in perf dumps or elsewhere?
> >
> > Hmm... don't know yet. Could you please share the last 20K log lines
> > prior to the crash from e.g. two sample OSDs?
>
> https://storage.linuxbox.cz/index.php/s/o5bMaGMiZQxWadi
>
> > And the crash isn't permanent, OSDs are able to start after the
> > second(?) shot, aren't they?
>
> yes, they actually start after issuing systemctl restart ceph-osd@xx,
> it just takes a long time performing log recovery..
>
> If I can provide more info, please let me know
>
> BR
>
> nik

--
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:    +420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: servis@xxxxxxxxxxx
-------------------------------------
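
PS: in case it's useful to anyone on the list, the hourly gather job is
roughly the following sketch, run on each OSD node. The output directory
and the exact filenames are specific to our setup (adjust as needed); it
only relies on the standard admin-socket commands "perf reset" and
"perf dump":

#!/bin/bash
# hourly via cron: reset the perf counters of every local OSD, wait
# 10 minutes, then dump them, so counters such as subop_w_latency
# cover only the last 10-minute window
OUTDIR=/var/log/ceph-perf          # hypothetical path, adjust
mkdir -p "$OUTDIR"
ts=$(date +%Y%m%d-%H%M)

# reset counters on every local OSD via its admin socket
for sock in /var/run/ceph/ceph-osd.*.asok; do
    ceph daemon "$sock" perf reset all
done

sleep 600    # 10 minutes

# dump counters for every local OSD
for sock in /var/run/ceph/ceph-osd.*.asok; do
    id=$(basename "$sock" .asok)   # e.g. "ceph-osd.12"
    ceph daemon "$sock" perf dump > "$OUTDIR/$id.$ts.json"
done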
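
To pick the worst OSD to restart first (per your suggestion), we then
rank the dumps by average subop write latency. Assuming the usual
{avgcount, sum} layout of the latency counters in the perf dump JSON
and the files written by the sketch above, something like this with jq:

# worst OSDs first: average subop_w_latency = sum / avgcount
for f in /var/log/ceph-perf/ceph-osd.*.json; do
    avg=$(jq '.osd.subop_w_latency
              | if .avgcount > 0 then .sum / .avgcount else 0 end' "$f")
    echo "$avg $f"
done | sort -rn | head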