On Fri, 15 Dec 2017, Piotr Dałek wrote:
> On 17-12-14 05:31 PM, David Turner wrote:
> > I've tracked this in a much more manual way. I would grab a random subset
> > [..]
> >
> > This was all on a Hammer cluster. The changes to the snap trimming queues
> > going into the main osd thread made it so that our use case was not viable
> > on Jewel until changes to Jewel that happened after I left. It's exciting
> > that this will actually be a reportable value from the cluster.
> >
> > Sorry that this story doesn't really answer your question, except to say
> > that people aware of this problem likely have a workaround for it. However,
> > I'm certain that a lot more clusters are impacted by this than are aware of
> > it, and being able to quickly see that would help in troubleshooting
> > problems. Backporting would be nice. I run a few Jewel clusters that host
> > some VMs, and it would be nice to see how well the cluster handles snap
> > trimming. But they are much less critical about how many snapshots they do.
>
> Thanks for your response, it pretty much confirms what I thought:
> - users aware of the issue have their own hacks that don't need to be efficient or
> convenient.
> - users unaware of the issue are, well, unaware and at risk of serious service
> disruption once disk space is all used up.
>
> Hopefully it'll be convincing enough for devs. ;)

Your PR looks great!  I commented with a nit on the format of the warning
itself.

I expect this is trivial to backport to luminous; it will need to be
partially reimplemented for jewel (with some care around the pg_stat_t and
a different check for the jewel-style health checks).

sage
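
[Editor's note: for readers who want a stopgap along the lines of the manual
tracking David describes above, here is a minimal sketch that sums per-PG snap
trim queue lengths from `ceph pg dump`. It assumes the PR's new per-PG field
ends up exposed as "snaptrimq_len" in the JSON pg stats; adjust the key name
to match whatever the merged change actually reports.]

#!/usr/bin/env python3
# Sketch: total up the snap trim queue length across all PGs.
# Assumes the cluster exposes a "snaptrimq_len" field per PG (hypothetical
# name here) in `ceph pg dump --format json`.
import json
import subprocess

out = subprocess.check_output(["ceph", "pg", "dump", "--format", "json"])
dump = json.loads(out)

# Different releases nest pg stats differently; handle both layouts.
pg_stats = dump.get("pg_stats") or dump.get("pg_map", {}).get("pg_stats", [])

total = sum(pg.get("snaptrimq_len", 0) for pg in pg_stats)
print("total snap trim queue length across %d PGs: %d" % (len(pg_stats), total))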