Hello Igor,

> You didn't reset the counters every hour, did you? So having the average
> subop_w_latency growing that way means the current values were much higher
> than before.

Bummer, I didn't.. I've updated the gather script to reset the stats, wait
10 minutes and then gather perf data, every hour. It's been running since
yesterday, so now we'll have to wait about one week for the problem to
appear again..

> Curious if subop latencies were growing for every OSD or just a subset
> (maybe even just a single one) of them?

Since I only have long-term averages it's not easy to say, but based on
what we have:

- only two OSDs got an average subop_w_lat > 0.0006; there is no clear
  relation between them
- 19 OSDs got an average subop_w_lat > 0.0005 - this is more interesting -
  15 of them are on those later-installed nodes (note that those nodes have
  almost no VMs running, so they are much less used!), 4 are on other nodes

But also note that not all of the OSDs on the suspicious nodes are over the
threshold; it's 6, 6 and 3 out of 7 OSDs per node. Still, it's strange..

> Next time you reach the bad state please do the following if possible:
>
> - reset perf counters for every OSD
> - leave the cluster running for 10 mins and collect perf counters again.
> - Then start restarting OSDs one by one, starting with the worst OSD (in
>   terms of subop_w_lat from the previous step). Wouldn't it be sufficient
>   to reset just a few OSDs before the cluster is back to normal?

Will do once it slows down again.

> > I see a very similar crash reported here:
> > https://tracker.ceph.com/issues/56346
> > so I'm not reporting..
> >
> > Do you think this might somehow be the cause of the problem? Anything
> > else I should check in perf dumps or elsewhere?
>
> Hmm... don't know yet. Could you please share the last 20K lines prior to
> the crash from e.g. two sample OSDs?

https://storage.linuxbox.cz/index.php/s/o5bMaGMiZQxWadi

> And the crash isn't permanent, OSDs are able to start after the second(?)
> shot, aren't they?

Yes, actually they start after issuing systemctl restart ceph-osd@xx, it
just takes a long time performing log recovery..

If I can provide more info, please let me know.

BR

nik

--
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:    +420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: servis@xxxxxxxxxxx
-------------------------------------

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
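For reference, a minimal sketch (in Python, not the actual gather script) of the
hourly reset / wait 10 minutes / dump loop described in the message above. The
OSD ids are illustrative, it assumes it runs on each OSD host with access to the
OSD admin sockets via "ceph daemon osd.N ...", and the exact "perf reset"
invocation may differ between Ceph releases. The point of the reset is that each
window's subop_w_latency average then reflects only the last 10 minutes rather
than the whole OSD uptime.

#!/usr/bin/env python3
# Sketch of an hourly gather loop: reset per-OSD perf counters, wait 10
# minutes, dump the counters and log the average subop_w_latency per OSD.
# Assumes local admin-socket access ("ceph daemon osd.N ..."); the OSD ids
# below are illustrative.
import json
import subprocess
import time

OSD_IDS = [0, 1, 2]      # illustrative: the OSDs local to this host
WINDOW = 10 * 60         # collect over a 10 minute window
PERIOD = 60 * 60         # repeat roughly once per hour

def ceph_daemon(osd_id, *args):
    """Run 'ceph daemon osd.<id> <args...>' and return its stdout."""
    cmd = ["ceph", "daemon", f"osd.{osd_id}", *args]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

while True:
    # 1. reset the counters so the averages cover only the next window
    for osd in OSD_IDS:
        ceph_daemon(osd, "perf", "reset", "all")

    # 2. let the cluster run for the collection window
    time.sleep(WINDOW)

    # 3. dump the counters; average latency (seconds) = sum / avgcount
    ts = time.strftime("%Y-%m-%d %H:%M:%S")
    for osd in OSD_IDS:
        dump = json.loads(ceph_daemon(osd, "perf", "dump"))
        lat = dump["osd"]["subop_w_latency"]
        avg = lat["sum"] / lat["avgcount"] if lat["avgcount"] else 0.0
        print(f"{ts} osd.{osd} subop_w_latency avg {avg:.6f}s "
              f"over {lat['avgcount']} ops")

    # 4. sleep out the rest of the hour
    time.sleep(PERIOD - WINDOW)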