Hi Robert,

Since you didn't mention -- are you using osd_op_queue_cut_off low or
high? I know you are usually advocating high, but the default is still
low and most users don't change this setting.

Cheers, Dan

On Wed, May 20, 2020 at 9:41 AM Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
>
> We upgraded our Jewel cluster to Nautilus a few months ago and I've noticed
> that op behavior has changed. This is an HDD cluster (NVMe journals and
> NVMe CephFS metadata pool) with about 800 OSDs. When on Jewel and running
> WPQ with the high cut-off, it was rock solid. When we had recoveries going
> on it barely dented the client ops and when the client ops on the cluster
> went down the backfills would run as fast as the cluster could go. I could
> have max_backfills set to 10 and the cluster performed admirably.
>
> After upgrading to Nautilus the cluster struggles with any kind of recovery
> and if there is any significant client write load the cluster can get into
> a death spiral. Even heavy client write bandwidth (3-4 GB/s) can cause the
> heartbeat checks to raise, blocked IO and even OSDs becoming unresponsive.
>
> As the person who wrote the WPQ code initially, I know that it was fair and
> proportional to the op priority and in Jewel it worked. It's not working in
> Nautilus. I've tweaked a lot of things trying to troubleshoot the issue and
> setting the recovery priority to 1 or zero barely makes any difference. My
> best estimation is that the op priority is getting lost before reaching the
> WPQ scheduler and is thus not prioritizing and dispatching ops correctly.
> It's almost as if all ops are being treated the same and there is no
> priority at all.
>
> Unfortunately, I do not have the time to set up the dev/testing environment
> to track this down and we will be moving away from Ceph. But I really like
> Ceph and want to see it succeed. I strongly suggest that someone look into
> this because I think it will resolve a lot of problems people have had on
> the mailing list. I'm not sure if a bug was introduced with the other
> queues that touches more of the op path or if something in the op path
> restructuring changed how things work (I know that was being discussed
> around the time that Jewel was released). But my guess is that it is
> somewhere between the op being created and being received into the queue.
>
> I really hope that this helps in the search for this regression. I spent a
> lot of time studying the issue to come up with WPQ and saw it work great
> when I switched this cluster from PRIO to WPQ. I've also spent countless
> hours studying how it's changed in Nautilus.
>
> Thank you,
> Robert LeBlanc
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1