Re: osd_pg_create causing slow requests in Nautilus

Paul Emmerich <paul.emmerich@xxxxxxxx> · Wed, 11 Mar 2020 15:40:04 +0100

Encountered this one again today; I've updated the issue with new
information: https://tracker.ceph.com/issues/44184

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Sat, Feb 29, 2020 at 10:21 PM Nikola Ciprich
<nikola.ciprich@xxxxxxxxxxx> wrote:
>
> Hi,
>
> I just wanted to report that we've hit a very similar problem on Mimic
> (13.2.6). Any manipulation of an OSD (e.g. a restart) causes a lot of slow
> ops that are waiting for a new map. They seem to be held up by the SATA
> OSDs, which stay 100% busy reading for a long time until all the ops are
> gone, blocking ops on unrelated NVMe pools. The SATA pools are completely
> unused at the moment.
>
> Is it possible that those maps are being requested from the slow SATA OSDs
> and that this takes such a long time for some reason? Why would it take so
> long? The cluster is very small and under very light load.
>
> BR
>
> nik
>
>
>
> On Wed, Feb 19, 2020 at 10:03:35AM +0100, Wido den Hollander wrote:
> >
> >
> > On 2/19/20 9:34 AM, Paul Emmerich wrote:
> > > On Wed, Feb 19, 2020 at 7:26 AM Wido den Hollander <wido@xxxxxxxx> wrote:
> > >>
> > >>
> > >>
> > >> On 2/18/20 6:54 PM, Paul Emmerich wrote:
> > >>> I've also seen this problem once on Nautilus, with no obvious reason for
> > >>> the slowness.
> > >>
> > >> Did this resolve itself? Or did you remove the pool?
> > >
> > > I've seen this twice on the same cluster. It fixed itself the first
> > > time (maybe with some OSD restarts?), and the other time I removed the
> > > pool after a few minutes because the OSDs were running into heartbeat
> > > timeouts. Unfortunately, there seems to be no way to reproduce this :(
> > >
> >
> > Yes, that's the problem. I've been trying to reproduce it, but I can't.
> > It works on all my Nautilus systems except for this one.
> >
> > Since you saw it and Bryan saw it, I expect others to encounter this at
> > some point as well.
> >
> > I don't have any extensive logging as this cluster is in production and
> > I can't simply crank up the logging and try again.
> >
> > > In this case it wasn't a new pool that caused problems but a very old one.
> > >
> > >
> > > Paul
> > >
> > >>
> > >>> In my case it was a rather old cluster that was upgraded all the way
> > >>> from Firefly.
> > >>>
> > >>>
> > >>
> > >> This cluster was also originally installed with Firefly, back in 2015,
> > >> so a while ago.
> > >>
> > >> Wido
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >
>
> --
> -------------------------------------
> Ing. Nikola CIPRICH
> LinuxBox.cz, s.r.o.
> 28.rijna 168, 709 00 Ostrava
>
> tel.:   +420 591 166 214
> fax:    +420 596 621 273
> mobil:  +420 777 093 799
> www.linuxbox.cz
>
> mobil servis: +420 737 238 656
> email servis: servis@xxxxxxxxxxx
> -------------------------------------