We are using high, and the people on the list who have also changed it have
not seen the improvements that I would expect. (For anyone who wants to check
or change the relevant settings on their own cluster, see the examples after
the quoted thread below.)

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1


On Wed, May 20, 2020 at 1:38 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>
> Hi Robert,
>
> Since you didn't mention it -- are you using osd_op_queue_cut_off low or
> high? I know you usually advocate high, but the default is still low and
> most users don't change this setting.
>
> Cheers, Dan
>
>
> On Wed, May 20, 2020 at 9:41 AM Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
> >
> > We upgraded our Jewel cluster to Nautilus a few months ago, and I've
> > noticed that op behavior has changed. This is an HDD cluster (NVMe
> > journals and an NVMe CephFS metadata pool) with about 800 OSDs. On
> > Jewel, running WPQ with the high cut-off, it was rock solid. When
> > recoveries were going on, they barely dented the client ops, and when
> > client ops on the cluster went down, the backfills would run as fast as
> > the cluster could go. I could have max_backfills set to 10 and the
> > cluster performed admirably.
> >
> > After upgrading to Nautilus, the cluster struggles with any kind of
> > recovery, and if there is any significant client write load, the
> > cluster can get into a death spiral. Even heavy client write bandwidth
> > (3-4 GB/s) can cause heartbeat check warnings, blocked IO, and even
> > OSDs becoming unresponsive.
> >
> > As the person who originally wrote the WPQ code, I know that it was
> > fair and proportional to the op priority, and in Jewel it worked. It is
> > not working in Nautilus. I've tweaked a lot of things trying to
> > troubleshoot the issue, and setting the recovery priority to 1 or 0
> > barely makes any difference. My best estimation is that the op priority
> > is getting lost before reaching the WPQ scheduler, so ops are not being
> > prioritized and dispatched correctly. It's almost as if all ops are
> > being treated the same and there is no priority at all.
> >
> > Unfortunately, I do not have the time to set up the dev/testing
> > environment to track this down, and we will be moving away from Ceph.
> > But I really like Ceph and want to see it succeed. I strongly suggest
> > that someone look into this, because I think it will resolve a lot of
> > the problems people have reported on the mailing list. I'm not sure
> > whether a bug was introduced with the other queues that touches more of
> > the op path, or whether something in the op path restructuring changed
> > how things work (I know that was being discussed around the time Jewel
> > was released). My guess is that the problem lies somewhere between the
> > op being created and the op being received into the queue.
> >
> > I really hope that this helps in the search for this regression. I
> > spent a lot of time studying the issue to come up with WPQ and saw it
> > work great when I switched this cluster from PRIO to WPQ. I've also
> > spent countless hours studying how it has changed in Nautilus.
> >
> > Thank you,
> > Robert LeBlanc
> > ----------------
> > Robert LeBlanc
> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
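
For context on Dan's question: osd_op_queue_cut_off controls which op
priorities are sent through the strict priority queue rather than the
weighted (WPQ) queue, so with "low" more background work can crowd out
client ops. A minimal way to check and change it -- assuming a Nautilus
cluster with the centralized config database, and noting that the option
is read at OSD startup, so OSDs must be restarted for it to take effect:

    # See what a running OSD is actually using (via its admin socket):
    ceph daemon osd.0 config get osd_op_queue_cut_off

    # Set it for all OSDs in the config database, then restart them:
    ceph config set osd osd_op_queue_cut_off high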
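
The backfill and recovery knobs mentioned in the thread can be adjusted the
same way. This is a sketch under the assumption that the "recovery priority"
Robert tuned is osd_recovery_op_priority, the priority assigned to individual
recovery ops (client ops default to priority 63 via osd_client_op_priority):

    # Allow up to 10 concurrent backfills per OSD (default is 1):
    ceph config set osd osd_max_backfills 10

    # Lower the priority of recovery ops relative to client ops:
    ceph config set osd osd_recovery_op_priority 1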
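
For readers who haven't looked at the scheduler, here is a toy sketch of the
WPQ idea in Python -- not the actual Ceph C++ implementation, and simplified
to strictly positive integer priorities. Each priority class gets dequeue
opportunities in proportion to its priority, so low-priority recovery ops
keep draining instead of starving while high-priority client ops get most of
the bandwidth:

    import random
    from collections import defaultdict, deque

    class ToyWPQ:
        """Toy weighted priority queue: dequeue picks a priority class
        with probability proportional to its priority, so lower-priority
        ops still make progress instead of starving."""

        def __init__(self):
            self.queues = defaultdict(deque)  # priority -> FIFO of ops

        def enqueue(self, priority, op):
            # priority must be a positive integer in this simplified model
            self.queues[priority].append(op)

        def dequeue(self):
            if not self.queues:
                return None
            prios = list(self.queues)
            # Weighted random choice: a priority-63 client class is picked
            # about 21x as often as a priority-3 recovery class.
            chosen = random.choices(prios, weights=prios)[0]
            op = self.queues[chosen].popleft()
            if not self.queues[chosen]:
                del self.queues[chosen]
            return op

    q = ToyWPQ()
    q.enqueue(63, "client write")
    q.enqueue(3, "recovery op")

Note that the fairness property depends entirely on ops arriving at the queue
with distinct priorities: if every op reaches the queue with the same
priority -- which is what Robert suspects is happening in Nautilus -- the
weighting degenerates into a single FIFO.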