On Fri, 15 Dec 2017, Piotr Dałek wrote:
> On 17-12-14 05:31 PM, David Turner wrote:
> > I've tracked this in a much more manual way. I would grab a random subset
> > [..]
> >
> > This was all on a Hammer cluster. The changes to the snap trimming queues
> > going into the main osd thread made it so that our use case was not viable
> > on Jewel until changes to Jewel that happened after I left. It's exciting
> > that this will actually be a reportable value from the cluster.
> >
> > Sorry that this story doesn't really answer your question, except to say
> > that people aware of this problem likely have a workaround for it. However,
> > I'm certain that a lot more clusters are impacted by this than are aware of
> > it, and being able to quickly see that would help in troubleshooting
> > problems. Backporting would be nice. I run a few Jewel clusters that host
> > some VMs, and it would be nice to see how well the cluster handles snap
> > trimming. But they are much less critical about how many snapshots they do.
>
> Thanks for your response, it pretty much confirms what I thought:
> - users aware of the issue have their own hacks that don't need to be efficient or
> convenient.
> - users unaware of the issue are, well, unaware and at risk of serious service
> disruption once disk space is all used up.
>
> Hopefully it'll be convincing enough for devs. ;)

Your PR looks great!  I commented with a nit on the format of the warning
itself.

I expect this is trivial to backport to luminous; it will need to be
partially reimplemented for jewel (with some care around the pg_stat_t and
a different check for the jewel-style health checks).

sage
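
[Editor's note: for readers who want a stopgap along the lines of the manual
tracking David describes above, here is a minimal sketch that sums per-PG snap
trim queue lengths from `ceph pg dump`. It assumes the PR's new per-PG field
ends up exposed as "snaptrimq_len" in the JSON pg stats; adjust the key name
to match whatever the merged change actually reports.]

#!/usr/bin/env python3
# Sketch: total up the snap trim queue length across all PGs.
# Assumes the cluster exposes a "snaptrimq_len" field per PG (hypothetical
# name here) in `ceph pg dump --format json`.
import json
import subprocess

out = subprocess.check_output(["ceph", "pg", "dump", "--format", "json"])
dump = json.loads(out)

# Different releases nest pg stats differently; handle both layouts.
pg_stats = dump.get("pg_stats") or dump.get("pg_map", {}).get("pg_stats", [])

total = sum(pg.get("snaptrimq_len", 0) for pg in pg_stats)
print("total snap trim queue length across %d PGs: %d" % (len(pg_stats), total))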