On 09/16/2017 01:36 AM, Gregory Farnum wrote:
> On Mon, Sep 11, 2017 at 1:10 PM Florian Haas <florian@xxxxxxxxxxx> wrote:
>> On Mon, Sep 11, 2017 at 8:27 PM, Mclean, Patrick
>> <Patrick.Mclean@xxxxxxxx> wrote:
>>>
>>> On 2017-09-08 06:06 PM, Gregory Farnum wrote:
>>>> On Fri, Sep 8, 2017 at 5:47 PM, Mclean, Patrick
>>>> <Patrick.Mclean@xxxxxxxx> wrote:
>>>>>
>>>>> On a related note, we are very curious why the snapshot id is
>>>>> incremented when a snapshot is deleted; this creates lots of
>>>>> phantom entries in the deleted snapshots set. Interleaved
>>>>> deletions and creations will cause massive fragmentation in
>>>>> the interval set. The only reason we can come up with for this
>>>>> is to track whether anything changed, but I suspect a different
>>>>> value that doesn't inject entries into the interval set might
>>>>> be better for this purpose.
>>>>
>>>> Yes, it's because having a sequence number tied in with the
>>>> snapshots is convenient for doing comparisons. Those aren't leaked
>>>> snapids that will make holes; when we increment the snapid to
>>>> delete something we also stick it in the removed_snaps set. (I
>>>> suppose if you alternate deleting a snapshot with adding one, that
>>>> does increase the size until you delete those snapshots; hrmmm.
>>>> Another thing to avoid doing, I guess.)
>>>
>>> Fair enough, though it seems like these limitations of the
>>> snapshot system should be documented.
>>
>> This is why I was so insistent on numbers, formulae or even
>> rules-of-thumb to predict what works and what does not. Greg's "one
>> snapshot per RBD per day is probably OK" from a few months ago seemed
>> promising, but looking at your situation it's probably not that
>> useful a rule.
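To make the fragmentation concern above concrete, here is a toy model in Python. This is not Ceph source; it only mirrors Greg's description that deleting a snapshot bumps the sequence number and inserts both the deleted id and the new seq into removed_snaps, and the interval-merging behaviour is my assumption about how an interval_set works:

```python
def insert(intervals, snapid):
    """Insert snapid into a sorted list of inclusive [lo, hi] ranges,
    merging adjacent/overlapping ranges (toy stand-in for Ceph's
    interval_set<snapid_t>)."""
    ranges = sorted(intervals + [[snapid, snapid]])
    merged = [ranges[0]]
    for lo, hi in ranges[1:]:
        if lo <= merged[-1][1] + 1:   # adjacent or overlapping: merge
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])
    return merged

class Pool:
    """Toy pool tracking snap_seq and removed_snaps, per Greg's description."""
    def __init__(self):
        self.snap_seq = 0
        self.removed_snaps = []

    def create_snap(self):
        self.snap_seq += 1
        return self.snap_seq              # id of the new snapshot

    def delete_snap(self, snapid):
        self.snap_seq += 1                # deletion consumes a seq number too
        self.removed_snaps = insert(self.removed_snaps, snapid)
        self.removed_snaps = insert(self.removed_snaps, self.snap_seq)

# Interleaved pattern: every other snapshot survives, punching a hole
# in removed_snaps on each round.
p = Pool()
for _ in range(100):
    p.create_snap()                       # long-lived snapshot, kept
    p.delete_snap(p.create_snap())        # short-lived snapshot, deleted
print(len(p.removed_snaps))               # 100 separate fragments

# Contiguous pattern: create 100, then delete oldest-first.
q = Pool()
for s in [q.create_snap() for _ in range(100)]:
    q.delete_snap(s)
print(len(q.removed_snaps))               # everything merges into 1 interval
```

Under this model the interval count tracks the number of live snapshots interleaved with deletions, which is consistent with Greg's "another thing to avoid doing" remark.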
>>> We most likely would have used a completely different strategy if
>>> it was documented that certain snapshot creation and removal
>>> patterns could cause the cluster to fall over over time.
>>
>> I think right now there are probably very few people, if any, who
>> could *describe* the pattern that causes this. That complicates
>> matters of documentation. :)
>>
>>>>>> It might really just be the osdmap update processing -- that
>>>>>> would make me happy as it's a much easier problem to resolve.
>>>>>> But I'm also surprised it's *that* expensive, even at the
>>>>>> scales you've described.
>>
>> ^^ This is what I mean. It's kind of tough to document things if
>> we're still in "surprised that this is causing harm" territory.
>>
>>>>> That would be nice, but unfortunately all the data is pointing
>>>>> to PGPool::update(),
>>>>
>>>> Yes, that's the OSDMap update processing I referred to. This is
>>>> good in terms of our ability to remove it without changing client
>>>> interfaces and things.
>>>
>>> That is good to hear; hopefully this stuff can be improved soon,
>>> then.
>>
>> Greg, can you comment on just how much potential improvement you see
>> here? Is it more like "oh we know we're doing this one thing horribly
>> inefficiently, but we never thought this would be an issue so we
>> shied away from premature optimization, but we can easily reduce 70%
>> CPU utilization to 1%" or rather like "we might be able to improve
>> this by perhaps 5%, but 100,000 RBDs is too many if you want to be
>> using snapshotting at all, for the foreseeable future"?
>
> I got the chance to discuss this a bit with Patrick at the Open Source
> Summit Wednesday (good to see you!).
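For readers following along: the intersection work being discussed during OSDMap processing is, conceptually, a linear walk over interval sets. This is an illustrative sketch, not the actual interval_set implementation, but it shows why the cost scales with the total number of fragments rather than with the number of snaps that actually changed:

```python
def intersect(a, b):
    """Intersection of two sorted lists of inclusive [lo, hi] intervals.
    The walk visits every interval of both inputs, so a heavily
    fragmented removed_snaps set makes this expensive even when the
    result is tiny."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo <= hi:
            out.append([lo, hi])
        if a[i][1] < b[j][1]:         # advance whichever interval ends first
            i += 1
        else:
            j += 1
    return out

print(intersect([[1, 5], [10, 20]], [[3, 12]]))   # [[3, 5], [10, 12]]

# A set fragmented by years of interleaved create/delete churn: every PG
# pays for every fragment on every map epoch, changed or not.
fragmented = [[i * 3, i * 3 + 1] for i in range(100_000)]
```

If each OSD hosts a couple of hundred PGs and each PG repeats a walk like this per epoch, it is at least plausible that this shows up as the CPU burn described earlier in the thread.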
> So the idea in the previously-referenced CDM talk essentially involves
> changing the way we distribute snap deletion instructions from a
> "deleted_snaps" member in the OSDMap to a "deleting_snaps" member that
> gets trimmed once the OSDs report to the manager that they've finished
> removing that snapid. This should entirely resolve the CPU burn
> they're seeing during OSDMap processing on the nodes, as it shrinks
> the intersection operation down from "all the snaps" to merely "the
> snaps not-done-deleting".
>
> The other reason we maintain the full set of deleted snaps is to
> prevent client operations from re-creating deleted snapshots — we
> filter all client IO which includes snaps against the deleted_snaps
> set in the PG. Apparently this is also big enough in RAM to be a real
> (but much smaller) problem.
>
> Unfortunately eliminating that is a lot harder

Just checking here, for clarification: what is "that" here? Are you
saying that eliminating the full set of deleted snaps is harder than
introducing a deleting_snaps member, or that both are harder than the
potential mitigation strategies that were previously discussed in this
thread?

> and a permanent fix will involve changing the client protocol in ways
> nobody has quite figured out how to do. But Patrick did suggest
> storing the full set of deleted snaps on-disk and only keeping
> in-memory the set which covers snapids in the range we've actually
> *seen* from clients. I haven't gone through the code but that seems
> broadly feasible — the hard part will be working out the rules for
> when you have to go to disk to read a larger part of the
> deleted_snaps set. (Perfectly feasible.)
>
> PRs are of course welcome!
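If I'm reading the proposal right, the deleting_snaps idea could be sketched like this. All names and structure here are hypothetical, purely my interpretation of the CDM discussion rather than merged Ceph code; the point is only that entries get trimmed once every OSD reports completion, so the set the OSDs must intersect stays small regardless of lifetime snapshot churn:

```python
class Manager:
    """Toy model of the proposed scheme: track only snaps still being
    deleted, and trim each entry once all OSDs have reported done."""
    def __init__(self, osd_ids):
        self.osd_ids = set(osd_ids)
        self.deleting_snaps = {}          # snapid -> OSDs not yet finished

    def start_delete(self, snapid):
        # Distributed in the map until every OSD confirms removal.
        self.deleting_snaps[snapid] = set(self.osd_ids)

    def osd_reports_done(self, osd_id, snapid):
        pending = self.deleting_snaps.get(snapid)
        if pending is None:
            return                        # already trimmed
        pending.discard(osd_id)
        if not pending:                   # all OSDs finished: trim it
            del self.deleting_snaps[snapid]

m = Manager(osd_ids=[0, 1, 2])
m.start_delete(snapid=42)
m.osd_reports_done(0, 42)
m.osd_reports_done(1, 42)
print(42 in m.deleting_snaps)             # True: OSD 2 hasn't reported
m.osd_reports_done(2, 42)
print(42 in m.deleting_snaps)             # False: trimmed from the map
```

Note this only addresses the OSDMap-processing cost; as Greg says above, the full deleted-snaps set would still be needed somewhere (disk, plus a hot in-memory subset) to filter client IO against re-created snapshots.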
;)

Right, so all of the above is about how this can be permanently fixed
by what looks to be a fairly invasive rewrite of some core
functionality — which is of course a good discussion to have, but it
would be good to also have a suggestion for users who want to avoid
running into the situation that Patrick and team are in, right now. So
at the risk of sounding obnoxiously repetitive, can I reiterate this
earlier question of mine?

> This is why I was so insistent on numbers, formulae or even
> rules-of-thumb to predict what works and what does not. Greg's "one
> snapshot per RBD per day is probably OK" from a few months ago seemed
> promising, but looking at your situation it's probably not that
> useful a rule.

Is there something that you can suggest here, perhaps taking into
account the discussion you had with Patrick last week?

Cheers,
Florian
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com