On 09/16/2017 01:36 AM, Gregory Farnum wrote:
> On Mon, Sep 11, 2017 at 1:10 PM Florian Haas <florian@xxxxxxxxxxx> wrote:
>> On Mon, Sep 11, 2017 at 8:27 PM, Mclean, Patrick
>> <Patrick.Mclean@xxxxxxxx> wrote:
>>>
>>> On 2017-09-08 06:06 PM, Gregory Farnum wrote:
>>>> On Fri, Sep 8, 2017 at 5:47 PM, Mclean, Patrick
>>>> <Patrick.Mclean@xxxxxxxx> wrote:
>>>>>
>>>>> On a related note, we are very curious why the snapshot id is
>>>>> incremented when a snapshot is deleted; this creates lots of
>>>>> phantom entries in the deleted snapshots set. Interleaved
>>>>> deletions and creations will cause massive fragmentation in
>>>>> the interval set. The only reason we can come up with for this
>>>>> is to track whether anything changed, but I suspect a different
>>>>> value that doesn't inject entries into the interval set might
>>>>> be better for this purpose.
>>>>
>>>> Yes, it's because having a sequence number tied in with the
>>>> snapshots is convenient for doing comparisons. Those aren't leaked
>>>> snapids that will make holes; when we increment the snapid to
>>>> delete something we also stick it in the removed_snaps set. (I
>>>> suppose if you alternate deleting a snapshot with adding one, that
>>>> does increase the size until you delete those snapshots; hrmmm.
>>>> Another thing to avoid doing, I guess.)
>>>
>>> Fair enough, though it seems like these limitations of the
>>> snapshot system should be documented.
>>
>> This is why I was so insistent on numbers, formulae or even
>> rules-of-thumb to predict what works and what does not. Greg's "one
>> snapshot per RBD per day is probably OK" from a few months ago seemed
>> promising, but looking at your situation it's probably not that
>> useful a rule.
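To make the fragmentation concern above concrete, here is a toy model in Python. This is not Ceph source; it only mirrors Greg's description that deleting a snapshot bumps the sequence number and inserts both the deleted id and the new seq into removed_snaps, and the interval-merging behaviour is my assumption about how an interval_set works:

```python
def insert(intervals, snapid):
    """Insert snapid into a sorted list of inclusive [lo, hi] ranges,
    merging adjacent/overlapping ranges (toy stand-in for Ceph's
    interval_set<snapid_t>)."""
    ranges = sorted(intervals + [[snapid, snapid]])
    merged = [ranges[0]]
    for lo, hi in ranges[1:]:
        if lo <= merged[-1][1] + 1:   # adjacent or overlapping: merge
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])
    return merged

class Pool:
    """Toy pool tracking snap_seq and removed_snaps, per Greg's description."""
    def __init__(self):
        self.snap_seq = 0
        self.removed_snaps = []

    def create_snap(self):
        self.snap_seq += 1
        return self.snap_seq              # id of the new snapshot

    def delete_snap(self, snapid):
        self.snap_seq += 1                # deletion consumes a seq number too
        self.removed_snaps = insert(self.removed_snaps, snapid)
        self.removed_snaps = insert(self.removed_snaps, self.snap_seq)

# Interleaved pattern: every other snapshot survives, punching a hole
# in removed_snaps on each round.
p = Pool()
for _ in range(100):
    p.create_snap()                       # long-lived snapshot, kept
    p.delete_snap(p.create_snap())        # short-lived snapshot, deleted
print(len(p.removed_snaps))               # 100 separate fragments

# Contiguous pattern: create 100, then delete oldest-first.
q = Pool()
for s in [q.create_snap() for _ in range(100)]:
    q.delete_snap(s)
print(len(q.removed_snaps))               # everything merges into 1 interval
```

Under this model the interval count tracks the number of live snapshots interleaved with deletions, which is consistent with Greg's "another thing to avoid doing" remark.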
>>> We most likely would have used a completely different strategy if
>>> it was documented that certain snapshot creation and removal
>>> patterns could cause the cluster to fall over over time.
>>
>> I think right now there are probably very few people, if any, who
>> could *describe* the pattern that causes this. That complicates
>> matters of documentation. :)
>>
>>>>>> It might really just be the osdmap update processing -- that
>>>>>> would make me happy as it's a much easier problem to resolve.
>>>>>> But I'm also surprised it's *that* expensive, even at the
>>>>>> scales you've described.
>>
>> ^^ This is what I mean. It's kind of tough to document things if
>> we're still in "surprised that this is causing harm" territory.
>>
>>>>> That would be nice, but unfortunately all the data is pointing
>>>>> to PGPool::update(),
>>>>
>>>> Yes, that's the OSDMap update processing I referred to. This is
>>>> good in terms of our ability to remove it without changing client
>>>> interfaces and things.
>>>
>>> That is good to hear; hopefully this stuff can be improved soon,
>>> then.
>>
>> Greg, can you comment on just how much potential improvement you see
>> here? Is it more like "oh we know we're doing this one thing horribly
>> inefficiently, but we never thought this would be an issue so we
>> shied away from premature optimization, but we can easily reduce 70%
>> CPU utilization to 1%" or rather like "we might be able to improve
>> this by perhaps 5%, but 100,000 RBDs is too many if you want to be
>> using snapshotting at all, for the foreseeable future"?
>
> I got the chance to discuss this a bit with Patrick at the Open Source
> Summit Wednesday (good to see you!).
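For readers following along: the intersection work being discussed during OSDMap processing is, conceptually, a linear walk over interval sets. This is an illustrative sketch, not the actual interval_set implementation, but it shows why the cost scales with the total number of fragments rather than with the number of snaps that actually changed:

```python
def intersect(a, b):
    """Intersection of two sorted lists of inclusive [lo, hi] intervals.
    The walk visits every interval of both inputs, so a heavily
    fragmented removed_snaps set makes this expensive even when the
    result is tiny."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo <= hi:
            out.append([lo, hi])
        if a[i][1] < b[j][1]:         # advance whichever interval ends first
            i += 1
        else:
            j += 1
    return out

print(intersect([[1, 5], [10, 20]], [[3, 12]]))   # [[3, 5], [10, 12]]

# A set fragmented by years of interleaved create/delete churn: every PG
# pays for every fragment on every map epoch, changed or not.
fragmented = [[i * 3, i * 3 + 1] for i in range(100_000)]
```

If each OSD hosts a couple of hundred PGs and each PG repeats a walk like this per epoch, it is at least plausible that this shows up as the CPU burn described earlier in the thread.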
> So the idea in the previously-referenced CDM talk essentially involves
> changing the way we distribute snap deletion instructions from a
> "deleted_snaps" member in the OSDMap to a "deleting_snaps" member that
> gets trimmed once the OSDs report to the manager that they've finished
> removing that snapid. This should entirely resolve the CPU burn
> they're seeing during OSDMap processing on the nodes, as it shrinks
> the intersection operation down from "all the snaps" to merely "the
> snaps not-done-deleting".
>
> The other reason we maintain the full set of deleted snaps is to
> prevent client operations from re-creating deleted snapshots — we
> filter all client IO which includes snaps against the deleted_snaps
> set in the PG. Apparently this is also big enough in RAM to be a real
> (but much smaller) problem.
>
> Unfortunately eliminating that is a lot harder

Just checking here, for clarification: what is "that" here? Are you
saying that eliminating the full set of deleted snaps is harder than
introducing a deleting_snaps member, or that both are harder than the
potential mitigation strategies that were previously discussed in this
thread?

> and a permanent fix will involve changing the client protocol in ways
> nobody has quite figured out how to do. But Patrick did suggest
> storing the full set of deleted snaps on-disk and only keeping
> in-memory the set which covers snapids in the range we've actually
> *seen* from clients. I haven't gone through the code but that seems
> broadly feasible — the hard part will be working out the rules for
> when you have to go to disk to read a larger part of the
> deleted_snaps set. (Perfectly feasible.)
>
> PRs are of course welcome!
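If I'm reading the proposal right, the deleting_snaps idea could be sketched like this. All names and structure here are hypothetical, purely my interpretation of the CDM discussion rather than merged Ceph code; the point is only that entries get trimmed once every OSD reports completion, so the set the OSDs must intersect stays small regardless of lifetime snapshot churn:

```python
class Manager:
    """Toy model of the proposed scheme: track only snaps still being
    deleted, and trim each entry once all OSDs have reported done."""
    def __init__(self, osd_ids):
        self.osd_ids = set(osd_ids)
        self.deleting_snaps = {}          # snapid -> OSDs not yet finished

    def start_delete(self, snapid):
        # Distributed in the map until every OSD confirms removal.
        self.deleting_snaps[snapid] = set(self.osd_ids)

    def osd_reports_done(self, osd_id, snapid):
        pending = self.deleting_snaps.get(snapid)
        if pending is None:
            return                        # already trimmed
        pending.discard(osd_id)
        if not pending:                   # all OSDs finished: trim it
            del self.deleting_snaps[snapid]

m = Manager(osd_ids=[0, 1, 2])
m.start_delete(snapid=42)
m.osd_reports_done(0, 42)
m.osd_reports_done(1, 42)
print(42 in m.deleting_snaps)             # True: OSD 2 hasn't reported
m.osd_reports_done(2, 42)
print(42 in m.deleting_snaps)             # False: trimmed from the map
```

Note this only addresses the OSDMap-processing cost; as Greg says above, the full deleted-snaps set would still be needed somewhere (disk, plus a hot in-memory subset) to filter client IO against re-created snapshots.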
;)

Right, so all of the above is about how this can be permanently fixed
by what looks to be a fairly invasive rewrite of some core
functionality — which is of course a good discussion to have, but it
would be good to also have a suggestion for users who want to avoid
running into the situation that Patrick and team are in, right now. So
at the risk of sounding obnoxiously repetitive, can I reiterate this
earlier question of mine?

> This is why I was so insistent on numbers, formulae or even
> rules-of-thumb to predict what works and what does not. Greg's "one
> snapshot per RBD per day is probably OK" from a few months ago seemed
> promising, but looking at your situation it's probably not that
> useful a rule.

Is there something that you can suggest here, perhaps taking into
account the discussion you had with Patrick last week?

Cheers,
Florian
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com