Re: RBD: How many snapshots is too many?

Gregory Farnum <gfarnum@xxxxxxxxxx> · Fri, 15 Sep 2017 23:36:52 +0000

On Mon, Sep 11, 2017 at 1:10 PM Florian Haas <florian@xxxxxxxxxxx> wrote:
On Mon, Sep 11, 2017 at 8:27 PM, Mclean, Patrick <Patrick.Mclean@xxxxxxxx> wrote:

>

> On 2017-09-08 06:06 PM, Gregory Farnum wrote:

> > On Fri, Sep 8, 2017 at 5:47 PM, Mclean, Patrick <Patrick.Mclean@xxxxxxxx> wrote:

> >

> >> On a related note, we are very curious why the snapshot id is

> >> incremented when a snapshot is deleted; this creates lots of

> >> phantom entries in the deleted snapshots set. Interleaved

> >> deletions and creations will cause massive fragmentation in

> >> the interval set. The only reason we can come up with for this

> >> is to track if anything changed, but I suspect a different

> >> value that doesn't inject entries into the interval set might

> >> be better for this purpose.

> > Yes, it's because having a sequence number tied in with the snapshots

> > is convenient for doing comparisons. Those aren't leaked snapids that

> > will make holes; when we increment the snapid to delete something we

> > also stick it in the removed_snaps set. (I suppose if you alternate

> > deleting a snapshot with adding one that does increase the size until

> > you delete those snapshots; hrmmm. Another thing to avoid doing I

> > guess.)

> >

>

>
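
To make the fragmentation effect discussed above concrete, here is a minimal toy model in Python. It is a sketch, not Ceph's code: `IntervalSet`, `ToyPool`, and `simulate` are hypothetical names, and the only behaviour assumed is what the thread describes (deleting a snapshot bumps the snapid sequence, and both the deleted id and the new sequence value go into the removed set). It counts how many separate intervals the removed set holds when creations and deletions alternate.

```python
# Toy model of snapid allocation and a removed-snaps interval set.
# Illustrative only: names and structure are assumptions, not Ceph code.

class IntervalSet:
    """Sorted, non-overlapping, non-adjacent half-open ranges [start, end)."""

    def __init__(self):
        self.ranges = []

    def insert(self, value):
        new_start, new_end = value, value + 1
        kept = []
        for start, end in self.ranges:
            if end < new_start or start > new_end:
                kept.append((start, end))          # disjoint and not touching
            else:
                new_start = min(new_start, start)  # overlaps or touches: merge
                new_end = max(new_end, end)
        kept.append((new_start, new_end))
        kept.sort()
        self.ranges = kept

    def num_intervals(self):
        return len(self.ranges)


class ToyPool:
    def __init__(self):
        self.snap_seq = 0
        self.removed_snaps = IntervalSet()

    def create_snap(self):
        self.snap_seq += 1
        return self.snap_seq

    def remove_snap(self, snapid):
        # The deleted id and the freshly bumped sequence number are both
        # recorded as removed, as described in the thread.
        self.removed_snaps.insert(snapid)
        self.snap_seq += 1
        self.removed_snaps.insert(self.snap_seq)


def simulate(keep, rounds, drain=False):
    """Keep `keep` live snapshots; each round creates one and deletes the oldest."""
    pool = ToyPool()
    live = [pool.create_snap() for _ in range(keep)]
    for _ in range(rounds):
        live.append(pool.create_snap())
        pool.remove_snap(live.pop(0))
    if drain:                      # finally delete every remaining snapshot
        while live:
            pool.remove_snap(live.pop(0))
    return pool.removed_snaps.num_intervals()


print(simulate(keep=10, rounds=50))              # 11
print(simulate(keep=10, rounds=50, drain=True))  # 1
```

In this toy model the interval count settles at one more than the number of live snapshots whose ids are interleaved with deletions, and collapses back to a single interval once those snapshots are deleted too.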

> Fair enough, though it seems like these limitations of the

> snapshot system should be documented.

This is why I was so insistent on numbers, formulae or even

rules-of-thumb to predict what works and what does not. Greg's "one

snapshot per RBD per day is probably OK" from a few months ago seemed

promising, but looking at your situation it's probably not that useful

a rule.

> We most likely would

> have used a completely different strategy if it was documented

> that certain snapshot creation and removal patterns could

> cause the cluster to fall over over time.

I think right now there are probably very few people, if any, who

could *describe* the pattern that causes this. That complicates

matters of documentation. :)

> >>> It might really just be the osdmap update processing -- that would

> >>> make me happy as it's a much easier problem to resolve. But I'm also

> >>> surprised it's *that* expensive, even at the scales you've described.

^^ This is what I mean. It's kind of tough to document things if we're

still in "surprised that this is causing harm" territory.

> >> That would be nice, but unfortunately all the data is pointing

> >> to PGPool::Update(),

> > Yes, that's the OSDMap update processing I referred to. This is good

> > in terms of our ability to remove it without changing client

> > interfaces and things.

>

> That is good to hear, hopefully this stuff can be improved soon

> then.

Greg, can you comment on just how much potential improvement you see

here? Is it more like "oh we know we're doing this one thing horribly

inefficiently, but we never thought this would be an issue so we shied

away from premature optimization, but we can easily reduce 70% CPU

utilization to 1%" or rather like "we might be able to improve this by

perhaps 5%, but 100,000 RBDs is too many if you want to be using

snapshotting at all, for the foreseeable future"?

I got the chance to discuss this a bit with Patrick at the Open Source Summit on Wednesday (good to see you!).

So the idea in the previously referenced CDM talk is essentially to change how we distribute snap deletion instructions: instead of a "deleted_snaps" member in the OSDMap, we'd have a "deleting_snaps" member that gets trimmed once the OSDs report to the manager that they've finished removing that snapid. This should entirely resolve the CPU burn they're seeing during OSDMap processing on the nodes, as it shrinks the intersection operation down from "all the snaps" to merely "the snaps not-done-deleting".
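
As a rough sketch of that bookkeeping (assumptions throughout: the `DeletingSnapsTracker` class, its method names, and the per-OSD acknowledgement granularity are all hypothetical, and nothing here reflects an actual Ceph interface), the set carried in the map could shrink like this:

```python
# Hypothetical sketch of a "deleting_snaps" set that is trimmed once every
# OSD has reported that it finished removing a snapid. Not Ceph code.

class DeletingSnapsTracker:
    def __init__(self, osd_ids):
        self.osd_ids = set(osd_ids)
        # snapid -> OSDs that have not yet reported completion
        self.pending = {}

    def request_delete(self, snapid):
        self.pending[snapid] = set(self.osd_ids)

    def report_done(self, osd_id, snapid):
        waiting = self.pending.get(snapid)
        if waiting is None:
            return                      # already trimmed, or never requested
        waiting.discard(osd_id)
        if not waiting:
            del self.pending[snapid]    # every OSD is done: trim the entry

    def deleting_snaps(self):
        """What would be distributed in the map: only unfinished deletions."""
        return sorted(self.pending)


tracker = DeletingSnapsTracker(osd_ids=[0, 1, 2])
for snapid in range(1, 6):
    tracker.request_delete(snapid)

for snapid in (1, 2, 3):                # fully acknowledged deletions
    for osd in (0, 1, 2):
        tracker.report_done(osd, snapid)
tracker.report_done(0, 4)               # snap 4 is only partly acknowledged

print(tracker.deleting_snaps())         # [4, 5]
```

The point of the sketch is only that the distributed set is bounded by the deletions still in flight, not by the number of snapshots ever deleted.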

The other reason we maintain the full set of deleted snaps is to prevent client operations from re-creating deleted snapshots: we filter all client IO which includes snaps against the deleted_snaps set in the PG. Apparently this set is also big enough in RAM to be a real (but much smaller) problem.
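
A minimal sketch of that kind of filtering, assuming the deleted set is held as sorted half-open `(start, end)` ranges and a client op carries a plain list of snapids (both are assumptions for illustration; `is_deleted` and `filter_client_snaps` are hypothetical names):

```python
import bisect

def is_deleted(deleted_ranges, snapid):
    """deleted_ranges: sorted, non-overlapping half-open (start, end) tuples."""
    # Index of the last range whose start is <= snapid.
    i = bisect.bisect_right(deleted_ranges, (snapid, float("inf"))) - 1
    return i >= 0 and deleted_ranges[i][0] <= snapid < deleted_ranges[i][1]

def filter_client_snaps(client_snaps, deleted_ranges):
    """Drop any snapid the client still lists that has already been deleted."""
    return [s for s in client_snaps if not is_deleted(deleted_ranges, s)]

deleted = [(1, 8), (9, 10), (11, 12)]
print(filter_client_snaps([12, 10, 9, 7], deleted))   # [12, 10]
```

Each lookup is a binary search, so the per-op cost is small; the memory held by `deleted_ranges` is what grows with fragmentation.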

Unfortunately eliminating that is a lot harder, and a permanent fix will involve changing the client protocol in ways nobody has quite figured out yet. But Patrick did suggest storing the full set of deleted snaps on-disk and only keeping in-memory the set which covers snapids in the range we've actually *seen* from clients. I haven't gone through the code but that seems broadly feasible; the hard part will be working out the rules for when you have to go to disk to read a larger part of the deleted_snaps set. (Perfectly feasible.)
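
One possible shape for that, purely as a sketch: the class below keeps only the part of the deleted set that overlaps the window of snapids seen so far, and "goes to disk" (here just a list standing in for on-disk storage) whenever a lookup falls outside the window. The widening rule, the `WindowedRemovedSnaps` name, and the `disk_reads` counter are all assumptions made for illustration, not a worked-out design.

```python
# Hypothetical sketch: full deleted-snaps set "on disk", with only the window
# of snapids actually seen from clients cached in memory. Not Ceph code.

class WindowedRemovedSnaps:
    def __init__(self, disk_ranges):
        self._disk = sorted(disk_ranges)  # stand-in for the on-disk full set
        self.disk_reads = 0
        self.lo = None                    # in-memory window is [lo, hi)
        self.hi = None
        self.cached = []                  # deleted ranges clipped to the window

    def _load(self, lo, hi):
        self.disk_reads += 1
        self.lo, self.hi = lo, hi
        self.cached = [(max(s, lo), min(e, hi))
                       for s, e in self._disk if s < hi and e > lo]

    def is_removed(self, snapid):
        if self.lo is None or not (self.lo <= snapid < self.hi):
            # Assumed rule: widen the window just enough to cover the new id.
            lo = snapid if self.lo is None else min(self.lo, snapid)
            hi = snapid + 1 if self.hi is None else max(self.hi, snapid + 1)
            self._load(lo, hi)
        return any(s <= snapid < e for s, e in self.cached)


snaps = WindowedRemovedSnaps([(1, 100), (150, 160), (500, 1000)])
print(snaps.is_removed(155), snaps.disk_reads)   # True 1
print(snaps.is_removed(152), snaps.disk_reads)   # True 2
print(snaps.is_removed(154), snaps.disk_reads)   # True 2
```

Lookups inside the window never touch the backing store; how aggressively to widen (or shrink) the window is exactly the open question raised above.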

PRs are of course welcome! ;)
-Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com