On Fri, 15 Dec 2017, Piotr Dałek wrote:
> On 17-12-14 05:31 PM, David Turner wrote:
> > I've tracked this in a much more manual way. I would grab a random subset
> > [..]
> >
> > This was all on a Hammer cluster. The change that moved the snap trimming
> > queues into the main OSD thread made our use case unviable on Jewel until
> > fixes that landed after I left. It's exciting that this will actually be
> > a reportable value from the cluster.
> >
> > Sorry that this story doesn't really answer your question, except to say
> > that people aware of this problem likely have a workaround for it.
> > However, I'm certain that far more clusters are impacted by this than are
> > aware of it, and being able to see it quickly would help when
> > troubleshooting problems. Backporting would be nice. I run a few Jewel
> > clusters that host some VMs, and it would be nice to see how well those
> > clusters handle snap trimming, though snapshots are much less critical
> > to their workloads.
>
> Thanks for your response; it pretty much confirms what I thought:
> - users aware of the issue have their own hacks, which don't need to be
> efficient or convenient.
> - users unaware of the issue are, well, unaware, and at risk of serious
> service disruption once disk space is completely used up.
>
> Hopefully it'll be convincing enough for devs. ;)
Your PR looks great! I commented with a nit on the format of the warning
itself.
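For anyone stuck on a release without this in the meantime, the sort of
manual hack David describes might look roughly like the sketch below. It's
only an illustration: the sample size is arbitrary, and whether (and where)
`ceph pg <pgid> query` exposes a snap_trimq field varies by release, so
treat the field lookup as an assumption to verify against your own output.

#!/usr/bin/env python
# Rough sketch of the manual check: sample a few random PGs and report
# the size of each one's snap trim queue. Field names and JSON layout
# are assumptions (they vary by Ceph release); verify against your own
# `ceph pg ... query` output before relying on this.
import json
import random
import subprocess

def pg_ids():
    # `ceph pg dump pgs_brief -f json` lists every PG with its pgid;
    # some releases nest the list under "pg_stats", others return a
    # bare list.
    out = subprocess.check_output(
        ["ceph", "pg", "dump", "pgs_brief", "-f", "json"])
    data = json.loads(out)
    pgs = data.get("pg_stats", data) if isinstance(data, dict) else data
    return [pg["pgid"] for pg in pgs]

def snap_trimq_size(pgid):
    out = subprocess.check_output(["ceph", "pg", pgid, "query"])
    info = json.loads(out)
    # snap_trimq is printed as an interval set like "[5~3,c~1]";
    # counting intervals ("~") is a crude but serviceable proxy.
    trimq = info.get("snap_trimq") or \
        info.get("info", {}).get("snap_trimq", "[]")
    return trimq.count("~")

if __name__ == "__main__":
    ids = pg_ids()
    for pgid in random.sample(ids, min(10, len(ids))):
        print("%s snap_trimq intervals: %d" % (pgid, snap_trimq_size(pgid)))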
I expect this is trivial to backport to luminous; it will need to be
partially reimplemented for jewel (with some care around pg_stat_t
encoding, and a different check to fit jewel's older style of health
checks).
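Once it lands, reading the value back out should be straightforward. Here
is a minimal sketch of what operator-side consumption might look like,
assuming the per-PG stat ends up in the `ceph pg dump` JSON under a name
like snaptrimq_len (hypothetical here; check the merged PR for the final
name and layout):

#!/usr/bin/env python
# Minimal sketch: flag PGs whose snap trim queue has grown past some
# threshold, using the per-PG stat this thread is about. The field name
# "snaptrimq_len" and the JSON layout are assumptions based on the PR
# under review; verify both against the release you actually run.
import json
import subprocess

THRESHOLD = 32768  # example value only; tune for your cluster

def pg_stats():
    out = subprocess.check_output(
        ["ceph", "pg", "dump", "pgs", "-f", "json"])
    data = json.loads(out)
    # Some releases return a bare list; others nest under "pg_stats".
    return data.get("pg_stats", data) if isinstance(data, dict) else data

if __name__ == "__main__":
    backlog = [(pg["pgid"], pg["snaptrimq_len"]) for pg in pg_stats()
               if pg.get("snaptrimq_len", 0) > THRESHOLD]
    for pgid, qlen in sorted(backlog, key=lambda b: -b[1]):
        print("pg %s snap trim queue length: %d" % (pgid, qlen))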
sage