Re: RBD: How many snapshots is too many?

Gregory Farnum <gfarnum@xxxxxxxxxx> · Thu, 21 Sep 2017 07:53:03 +0000

On Mon, Sep 18, 2017 at 4:11 AM Florian Haas <florian@xxxxxxxxxxx> wrote:
On 09/16/2017 01:36 AM, Gregory Farnum wrote:

> On Mon, Sep 11, 2017 at 1:10 PM Florian Haas <florian@xxxxxxxxxxx> wrote:

>

>     On Mon, Sep 11, 2017 at 8:27 PM, Mclean, Patrick <Patrick.Mclean@xxxxxxxx> wrote:

>     >

>     > On 2017-09-08 06:06 PM, Gregory Farnum wrote:

>     > > On Fri, Sep 8, 2017 at 5:47 PM, Mclean, Patrick <Patrick.Mclean@xxxxxxxx> wrote:

>     > >

>     > >> On a related note, we are very curious why the snapshot id is

>     > >> incremented when a snapshot is deleted, this creates lots of

>     > >> phantom entries in the deleted snapshots set. Interleaved

>     > >> deletions and creations will cause massive fragmentation in

>     > >> the interval set. The only reason we can come up with for this

>     > >> is to track if anything changed, but I suspect a different

>     > >> value that doesn't inject entries into the interval set might

>     > >> be better for this purpose.

>     > > Yes, it's because having a sequence number tied in with the snapshots

>     > > is convenient for doing comparisons. Those aren't leaked snapids that

>     > > will make holes; when we increment the snapid to delete something we

>     > > also stick it in the removed_snaps set. (I suppose if you alternate

>     > > deleting a snapshot with adding one that does increase the size until

>     > > you delete those snapshots; hrmmm. Another thing to avoid doing I

>     > > guess.)

>     > >

>     >

>     >

>     > Fair enough, though it seems like these limitations of the

>     > snapshot system should be documented.

>

>     This is why I was so insistent on numbers, formulae or even

>     rules-of-thumb to predict what works and what does not. Greg's "one

>     snapshot per RBD per day is probably OK" from a few months ago seemed

>     promising, but looking at your situation it's probably not that useful

>     a rule.

>

>

>     > We most likely would

>     > have used a completely different strategy if it was documented

>     > that certain snapshot creation and removal patterns could

>     > cause the cluster to fall over over time.

>

>     I think right now there are probably very few people, if any, who

>     could *describe* the pattern that causes this. That complicates

>     matters of documentation. :)

>

>

>     > >>> It might really just be the osdmap update processing -- that would

>     > >>> make me happy as it's a much easier problem to resolve. But I'm also

>     > >>> surprised it's *that* expensive, even at the scales you've described.

>

>     ^^ This is what I mean. It's kind of tough to document things if we're

>     still in "surprised that this is causing harm" territory.

>

>

>     > >> That would be nice, but unfortunately all the data is pointing

>     > >> to PGPool::Update(),

>     > > Yes, that's the OSDMap update processing I referred to. This is good

>     > > in terms of our ability to remove it without changing client

>     > > interfaces and things.

>     >

>     > That is good to hear, hopefully this stuff can be improved soon

>     > then.

>

>     Greg, can you comment on just how much potential improvement you see

>     here? Is it more like "oh we know we're doing this one thing horribly

>     inefficiently, but we never thought this would be an issue so we shied

>     away from premature optimization, but we can easily reduce 70% CPU

>     utilization to 1%" or rather like "we might be able to improve this by

>     perhaps 5%, but 100,000 RBDs is too many if you want to be using

>     snapshotting at all, for the foreseeable future"?

>

>

> I got the chance to discuss this a bit with Patrick at the Open Source

> Summit Wednesday (good to see you!).

>

> So the idea in the previously-referenced CDM talk essentially involves

> changing the way we distribute snap deletion instructions from a

> "deleted_snaps" member in the OSDMap to a "deleting_snaps" member that

> gets trimmed once the OSDs report to the manager that they've finished

> removing that snapid. This should entirely resolve the CPU burn they're

> seeing during OSDMap processing on the nodes, as it shrinks the

> intersection operation down from "all the snaps" to merely "the snaps

> not-done-deleting".

>

> The other reason we maintain the full set of deleted snaps is to prevent

> client operations from re-creating deleted snapshots — we filter all

> client IO which includes snaps against the deleted_snaps set in the PG.

> Apparently this is also big enough in RAM to be a real (but much

> smaller) problem.

>

> Unfortunately eliminating that is a lot harder

Just checking here, for clarification: what is "that" here? Are you

saying that eliminating the full set of deleted snaps is harder than

introducing a deleting_snaps member, or that both are harder than

potential mitigation strategies that were previously discussed in this

thread?

Eliminating the full set we store on the OSD node is much harder than converting the OSDMap to specify deleting_snaps rather than deleted_snaps. The former at minimum requires changes to the client protocol, and we’re not actually sure how to do it; the latter can be done internally to the cluster and has a well-understood algorithm to implement.
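To make the difference concrete, here is a rough Python sketch of how much snap state the map would carry under each scheme. It is a toy model only, not Ceph code: the function name, the fixed trim_lag (standing in for "every OSD has reported that it finished removing the snapid"), and the choice to count individual ids rather than intervals are all assumptions made for illustration.

```python
def carried_snap_counts(epochs, deletes_per_epoch, trim_lag, trim):
    """Return the number of snapids carried in the map at each epoch.

    trim=False models a cumulative "deleted_snaps" member that only grows.
    trim=True models a "deleting_snaps" member: an id is dropped once it is
    trim_lag epochs old (hypothetical stand-in for "all OSDs reported done").
    """
    carried = []   # (snapid, epoch in which it was deleted)
    sizes = []
    next_id = 1
    for epoch in range(epochs):
        # New deletions are announced in this epoch's map.
        for _ in range(deletes_per_epoch):
            carried.append((next_id, epoch))
            next_id += 1
        if trim:
            # Keep only the ids that are still being removed.
            carried = [(s, e) for s, e in carried if epoch - e < trim_lag]
        sizes.append(len(carried))
    return sizes


if __name__ == "__main__":
    cumulative = carried_snap_counts(1000, 10, 3, trim=False)
    trimmed = carried_snap_counts(1000, 10, 3, trim=True)
    print("deleted_snaps  at last epoch:", cumulative[-1])
    print("deleting_snaps at last epoch:", trimmed[-1])
```

In this model the cumulative set grows with every deletion ever made, while the trimmed set stays bounded by the deletion rate times the lag, which is the "all the snaps" versus "the snaps not-done-deleting" distinction described above.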

> and a permanent fix will

> involve changing the client protocol in ways nobody has quite figured

> out how to do. But Patrick did suggest storing the full set of deleted

> snaps on-disk and only keeping in-memory the set which covers snapids in

> the range we've actually *seen* from clients. I haven't gone through the

> code but that seems broadly feasible — the hard part will be working out

> the rules when you have to go to disk to read a larger part of the

> deleted_snaps set. (Perfectly feasible.)

>

> PRs are of course welcome! ;)

Right, so all of the above is about how this can be permanently fixed by

what looks to be a fairly invasive rewrite of some core functionality —

which is of course a good discussion to have, but it would be good to

also have a suggestion for users who want to avoid running into the

situation that Patrick and team are in, right now. So at the risk of

sounding obnoxiously repetitive, can I reiterate this earlier question

of mine?

> This is why I was so insistent on numbers, formulae or even

> rules-of-thumb to predict what works and what does not. Greg's "one

> snapshot per RBD per day is probably OK" from a few months ago seemed

> promising, but looking at your situation it's probably not that useful

> a rule.

Is there something that you can suggest here, perhaps taking into

account the discussion you had with Patrick last week?

I think I’ve already shared everything I have on this. Try to treat sequential snaps the same way (keep or delete snapshots with adjacent snapids together), and don’t create a bunch of holes in the interval set.
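As a rough illustration of what "holes" means here, the following Python sketch models snapids as described earlier in this thread: creating a snapshot takes the next id, and deleting one puts the deleted id into the removed set, bumps the sequence number, and puts that new id into the removed set too. It is a toy model built on that description, not the actual Ceph implementation; the class and function names are made up for the example.

```python
def count_intervals(ids):
    """Number of maximal runs of consecutive integers in ids."""
    ids = set(ids)
    return sum(1 for i in ids if i - 1 not in ids)


class SnapModel:
    """Toy model of one snapid sequence and its removed-snaps set."""

    def __init__(self):
        self.seq = 0
        self.live = []        # live snapids, oldest first
        self.removed = set()  # ids that would sit in the removed set

    def create(self):
        self.seq += 1
        self.live.append(self.seq)

    def delete_oldest(self):
        # The deleted id goes into the removed set...
        self.removed.add(self.live.pop(0))
        # ...and the deletion itself consumes one more id, also removed.
        self.seq += 1
        self.removed.add(self.seq)


def batch_pattern(n):
    """Create n snapshots, then delete all n, oldest first."""
    m = SnapModel()
    for _ in range(n):
        m.create()
    for _ in range(n):
        m.delete_oldest()
    return count_intervals(m.removed)


def interleaved_pattern(n):
    """Start with n snapshots, then alternate delete-oldest and create."""
    m = SnapModel()
    for _ in range(n):
        m.create()
    for _ in range(n):
        m.delete_oldest()
        m.create()
    return count_intervals(m.removed)


if __name__ == "__main__":
    print("batch:      ", batch_pattern(100), "interval(s)")
    print("interleaved:", interleaved_pattern(100), "interval(s)")
```

In this model the batch pattern collapses to a single interval, while the interleaved pattern leaves roughly one interval per retained snapshot, because each live snapid sits between two removed ones until it is deleted in turn.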

Cheers,

Florian

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com