Hello Igor,

> You didn't reset the counters every hour, did you? So having the average
> subop_w_latency growing that way means the current values were much higher
> than before.

Bummer, I didn't.. I've updated the gather script to reset the stats, wait
10 minutes and then gather perf data, every hour. It's been running since
yesterday, so now we'll have to wait about one week for the problem to
appear again..

> Curious if subop latencies were growing for every OSD or just a subset
> (maybe even just a single one) of them?

Since I only have long-term averages it's not easy to say, but based on
what we have:

- only two OSDs got an average subop_w_lat > 0.0006; there is no clear
  relation between them
- 19 OSDs got an average subop_w_lat > 0.0005 - this is more interesting -
  15 of them are on those later-installed nodes (note that those nodes have
  almost no VMs running, so they are much less used!), 4 are on other nodes

But also note that not all of the OSDs on the suspicious nodes are over the
threshold; it's 6, 6 and 3 out of 7 OSDs per node. Still, it's strange..

> Next time you reach the bad state please do the following if possible:
>
> - reset perf counters for every OSD
> - leave the cluster running for 10 mins and collect perf counters again.
> - Then start restarting OSDs one by one, starting with the worst OSD (in
>   terms of subop_w_lat from the previous step). Wouldn't it be sufficient
>   to reset just a few OSDs before the cluster is back to normal?

Will do once it slows down again.

> > I see a very similar crash reported here:
> > https://tracker.ceph.com/issues/56346
> > so I'm not reporting..
> >
> > Do you think this might somehow be the cause of the problem? Anything
> > else I should check in perf dumps or elsewhere?
>
> Hmm... don't know yet. Could you please share the last 20K lines prior to
> the crash from e.g. two sample OSDs?

https://storage.linuxbox.cz/index.php/s/o5bMaGMiZQxWadi

> And the crash isn't permanent, OSDs are able to start after the second(?)
> shot, aren't they?

Yes, actually they start after issuing systemctl restart ceph-osd@xx, it
just takes a long time performing log recovery..

If I can provide more info, please let me know.

BR

nik

--
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:    +420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: servis@xxxxxxxxxxx
-------------------------------------

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
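For reference, a minimal sketch (in Python, not the actual gather script) of the
hourly reset / wait 10 minutes / dump loop described in the message above. The
OSD ids are illustrative, it assumes it runs on each OSD host with access to the
OSD admin sockets via "ceph daemon osd.N ...", and the exact "perf reset"
invocation may differ between Ceph releases. The point of the reset is that each
window's subop_w_latency average then reflects only the last 10 minutes rather
than the whole OSD uptime.

#!/usr/bin/env python3
# Sketch of an hourly gather loop: reset per-OSD perf counters, wait 10
# minutes, dump the counters and log the average subop_w_latency per OSD.
# Assumes local admin-socket access ("ceph daemon osd.N ..."); the OSD ids
# below are illustrative.
import json
import subprocess
import time

OSD_IDS = [0, 1, 2]      # illustrative: the OSDs local to this host
WINDOW = 10 * 60         # collect over a 10 minute window
PERIOD = 60 * 60         # repeat roughly once per hour

def ceph_daemon(osd_id, *args):
    """Run 'ceph daemon osd.<id> <args...>' and return its stdout."""
    cmd = ["ceph", "daemon", f"osd.{osd_id}", *args]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

while True:
    # 1. reset the counters so the averages cover only the next window
    for osd in OSD_IDS:
        ceph_daemon(osd, "perf", "reset", "all")

    # 2. let the cluster run for the collection window
    time.sleep(WINDOW)

    # 3. dump the counters; average latency (seconds) = sum / avgcount
    ts = time.strftime("%Y-%m-%d %H:%M:%S")
    for osd in OSD_IDS:
        dump = json.loads(ceph_daemon(osd, "perf", "dump"))
        lat = dump["osd"]["subop_w_latency"]
        avg = lat["sum"] / lat["avgcount"] if lat["avgcount"] else 0.0
        print(f"{ts} osd.{osd} subop_w_latency avg {avg:.6f}s "
              f"over {lat['avgcount']} ops")

    # 4. sleep out the rest of the hour
    time.sleep(PERIOD - WINDOW)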