Hi Paul and others,

while digging deeper, I noticed that when the cluster gets into this
state, osd_map_cache_miss on the OSDs starts growing rapidly. Even when
I increased the OSD map cache size to 500 (which was the default, at
least in Luminous), it behaves the same. I think this could be related,
so I'll keep experimenting with the cache settings.
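For reference, something along these lines should show the counter and
let you bump the cache size without touching ceph.conf (osd.0 is just an
example id and has to be run on the host with that OSD's admin socket;
500 is only the old Luminous default, so adjust both as needed):

    # read the per-OSD perf counters via the admin socket
    ceph daemon osd.0 perf dump | grep osd_map_cache_miss

    # show the currently active cache size on that OSD
    ceph daemon osd.0 config get osd_map_cache_size

    # try to raise it on all OSDs at runtime (may still need an OSD
    # restart to take full effect)
    ceph tell osd.* injectargs '--osd_map_cache_size=500'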
BR

nik

On Wed, Mar 11, 2020 at 03:40:04PM +0100, Paul Emmerich wrote:
> Encountered this one again today, I've updated the issue with new
> information: https://tracker.ceph.com/issues/44184
>
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
> On Sat, Feb 29, 2020 at 10:21 PM Nikola Ciprich
> <nikola.ciprich@xxxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > I just wanted to report that we've hit a very similar problem on Mimic
> > (13.2.6). Any manipulation with an OSD (e.g. a restart) causes a lot of
> > slow ops waiting for a new map. It seems these are slowed down by the
> > SATA OSDs, which stay 100% busy reading for a long time until all the
> > ops are gone, blocking ops on unrelated NVMe pools - the SATA pools are
> > completely unused right now.
> >
> > Is it possible that those maps are being requested from the slow SATA
> > OSDs and that is why they take so long? Why would it take so long at
> > all? The cluster is very small and under very light load.
> >
> > BR
> >
> > nik
> >
> >
> >
> > On Wed, Feb 19, 2020 at 10:03:35AM +0100, Wido den Hollander wrote:
> > >
> > >
> > > On 2/19/20 9:34 AM, Paul Emmerich wrote:
> > > > On Wed, Feb 19, 2020 at 7:26 AM Wido den Hollander <wido@xxxxxxxx> wrote:
> > > >>
> > > >>
> > > >>
> > > >> On 2/18/20 6:54 PM, Paul Emmerich wrote:
> > > >>> I've also seen this problem on Nautilus with no obvious reason for the
> > > >>> slowness once.
> > > >>
> > > >> Did this resolve itself? Or did you remove the pool?
> > > >
> > > > I've seen this twice on the same cluster, it fixed itself the first
> > > > time (maybe with some OSD restarts?) and the other time I removed the
> > > > pool after a few minutes because the OSDs were running into heartbeat
> > > > timeouts. There unfortunately seems to be no way to reproduce this :(
> > >
> > > Yes, that's the problem. I've been trying to reproduce it, but I can't.
> > > It works on all my Nautilus systems except for this one.
> > >
> > > As you saw it, Bryan saw it, I expect others to encounter this at some
> > > point as well.
> > >
> > > I don't have any extensive logging as this cluster is in production and
> > > I can't simply crank up the logging and try again.
> > >
> > > > In this case it wasn't a new pool that caused problems but a very old one.
> > > >
> > > >
> > > > Paul
> > > >
> > > >>
> > > >>> In my case it was a rather old cluster that was upgraded all the way
> > > >>> from Firefly
> > > >>>
> > > >>>
> > > >>
> > > >> This cluster has also been installed with Firefly. It was installed in
> > > >> 2015, so a while ago.
> > > >>
> > > >> Wido
> > > _______________________________________________
> > > ceph-users mailing list -- ceph-users@xxxxxxx
> > > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >
> > --
> > -------------------------------------
> > Ing. Nikola CIPRICH
> > LinuxBox.cz, s.r.o.
> > 28.rijna 168, 709 00 Ostrava
> >
> > tel.: +420 591 166 214
> > fax: +420 596 621 273
> > mobil: +420 777 093 799
> > www.linuxbox.cz
> >
> > mobil servis: +420 737 238 656
> > email servis: servis@xxxxxxxxxxx
> > -------------------------------------
>

--
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.: +420 591 166 214
fax: +420 596 621 273
mobil: +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: servis@xxxxxxxxxxx
-------------------------------------
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx