Fwd: Possible bug in op path?

De-HTMLified
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


---------- Forwarded message ---------
From: Robert LeBlanc <robert@xxxxxxxxxxxxx>
Date: Wed, May 20, 2020 at 12:40 AM
Subject: Possible bug in op path?
To: ceph-users <ceph-users@xxxxxxx>, ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>


We upgraded our Jewel cluster to Nautilus a few months ago and I've
noticed that op behavior has changed. This is an HDD cluster (NVMe
journals and an NVMe CephFS metadata pool) with about 800 OSDs. On
Jewel, running WPQ with the high cut-off, it was rock solid. Recoveries
barely dented client ops, and when client load on the cluster dropped,
the backfills would run as fast as the cluster could go. I could have
max_backfills set to 10 and the cluster still performed admirably.
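
For reference, these are the settings I'm talking about, in ceph.conf
[osd] terms (on Nautilus the same values can be applied with
"ceph config set osd ..."; the two queue options only take effect
after an OSD restart, as far as I know):

    osd op queue = wpq
    osd op queue cut off = high
    osd max backfills = 10
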
After upgrading to Nautilus the cluster struggles with any kind of
recovery, and if there is any significant client write load the
cluster can get into a death spiral. Even heavy client write bandwidth
(3-4 GB/s) can cause heartbeat check failures, blocked IO, and even
OSDs becoming unresponsive.

As the person who originally wrote the WPQ code, I know that it was
fair and proportional to the op priority, and in Jewel it worked. It
is not working in Nautilus. I've tweaked a lot of things trying to
troubleshoot the issue, and setting the recovery priority to 1 or even
zero barely makes any difference. My best estimate is that the op
priority is getting lost before it reaches the WPQ scheduler, so ops
are not being prioritized and dispatched correctly. It's almost as if
all ops are being treated the same and there is no priority at all.
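
To make concrete what I mean by "fair and proportional": the idea
behind WPQ is that each priority bucket is dequeued roughly in
proportion to its priority, so client ops (high priority) dominate
while recovery and backfill ops (low priority) still make steady
progress without starving. Here is a toy sketch of that idea in
Python. It is not the Ceph implementation (that is C++ and differs in
the details, including the strict queue for ops above the cut-off);
it just illustrates the behavior I would expect if priorities were
reaching the queue intact:

    import random
    from collections import defaultdict, deque

    class ToyWPQ:
        """Toy weighted priority queue: a bucket is chosen with
        probability proportional to its priority, so high-priority
        (client) ops dominate but low-priority (recovery/backfill)
        ops are never starved."""

        def __init__(self):
            self.buckets = defaultdict(deque)  # priority -> FIFO of ops

        def enqueue(self, priority, op):
            self.buckets[priority].append(op)

        def dequeue(self):
            if not self.buckets:
                return None
            # Pick a bucket with probability proportional to its priority.
            total = sum(self.buckets)
            pick = random.uniform(0, total)
            for prio in sorted(self.buckets, reverse=True):
                pick -= prio
                if pick <= 0:
                    op = self.buckets[prio].popleft()
                    if not self.buckets[prio]:
                        del self.buckets[prio]
                    return op

    q = ToyWPQ()
    for i in range(1000):
        q.enqueue(63, ("client", i))   # client op priority (63 by default, IIRC)
        q.enqueue(3, ("recovery", i))  # recovery op priority (3 by default, IIRC)
    drained = [q.dequeue()[0] for _ in range(200)]
    print(drained.count("client"), "client vs", drained.count("recovery"), "recovery")

With intact priorities the drain is overwhelmingly client ops; what we
are seeing behaves as if everything had been enqueued at the same
priority, so recovery and client ops fight on equal terms.
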
Unfortunately, I do not have the time to set up the dev/testing
environment to track this down, and we will be moving away from Ceph.
But I really like Ceph and want to see it succeed, so I strongly
suggest that someone look into this, because I think it will resolve a
lot of the problems people have reported on the mailing list. I'm not
sure whether a bug was introduced with the other queues that touches
more of the op path, or whether something in the op path restructuring
changed how things work (I know that was being discussed around the
time Jewel was released). My guess is that the problem lies somewhere
between the op being created and it being received into the queue.

I really hope this helps in the search for this regression. I spent a
lot of time studying the issue to come up with WPQ, and I saw it work
great when I switched this cluster from PRIO to WPQ. I've also spent
countless hours studying how the behavior has changed in Nautilus.

Thank you,
Robert LeBlanc
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


