On 17-09-16 01:36 AM, Gregory Farnum wrote:
I got the chance to discuss this a bit with Patrick at the Open Source
Summit Wednesday (good to see you!).
So the idea in the previously-referenced CDM talk essentially involves
changing the way we distribute snap deletion instructions from a
"deleted_snaps" member in the OSDMap to a "deleting_snaps" member that gets
trimmed once the OSDs report to the manager that they've finished removing
that snapid. This should entirely resolve the CPU burn they're seeing during
OSDMap processing on the nodes, as it shrinks the intersection operation
down from "all the snaps" to merely "the snaps not-done-deleting".
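To make the shape of that change concrete, here is a deliberately simplified
sketch; it is not the actual Ceph code (plain std::set stands in for
interval_set<snapid_t>, and the function names are invented), it just shows
how the intersection cost changes:

// Deliberately simplified sketch, not Ceph code: plain std::set stands in
// for interval_set<snapid_t>, and these function names are invented.
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <set>

using snapid_t = uint64_t;
using snap_set = std::set<snapid_t>;

// Today: every OSDMap carries the full history of deleted snaps, so each
// PG intersects its own snap state against that ever-growing set.
snap_set snaps_to_trim_old(const snap_set& pg_snaps,
                           const snap_set& deleted_snaps) {
  snap_set out;
  std::set_intersection(pg_snaps.begin(), pg_snaps.end(),
                        deleted_snaps.begin(), deleted_snaps.end(),
                        std::inserter(out, out.begin()));
  return out;  // cost scales with the whole deletion history
}

// Proposed: the map only carries snaps that are still being deleted; the
// manager trims each snapid once every OSD reports it purged.
snap_set snaps_to_trim_new(const snap_set& pg_snaps,
                           const snap_set& deleting_snaps) {
  snap_set out;
  std::set_intersection(pg_snaps.begin(), pg_snaps.end(),
                        deleting_snaps.begin(), deleting_snaps.end(),
                        std::inserter(out, out.begin()));
  return out;  // cost scales only with in-flight deletions
}

The two intersections are identical; the whole win is that deleting_snaps
stays small because it is trimmed as OSDs report completion.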
The other reason we maintain the full set of deleted snaps is to prevent
client operations from re-creating deleted snapshots — we filter all client
IO which includes snaps against the deleted_snaps set in the PG. Apparently
this is also big enough in RAM to be a real (but much smaller) problem.
Unfortunately eliminating that is a lot harder and a permanent fix will
involve changing the client protocol in ways nobody has quite figured out
how to do. But Patrick did suggest storing the full set of deleted snaps
on-disk and only keeping in-memory the set which covers snapids in the range
we've actually *seen* from clients. I haven't gone through the code but that
seems broadly feasible — the hard part will be working out the rules when
you have to go to disk to read a larger part of the deleted_snaps set.
(Perfectly feasible.)
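Roughly, that split might look something like the following. This is only an
illustration under stated assumptions: a plain std::map stands in for
whatever structure the PG actually uses, and every name below is made up.

// Purely illustrative sketch of that suggestion, not Ceph code; all of the
// names below are invented.
#include <cstdint>
#include <map>

using snapid_t = uint64_t;

struct DeletedSnapCache {
  // In memory: only the deleted-snap intervals covering snapids that
  // clients have actually sent us, stored as start -> end (inclusive).
  std::map<snapid_t, snapid_t> cached;
  snapid_t cached_lo = 0, cached_hi = 0;   // range the cache answers for

  bool covers(snapid_t s) const {
    return s >= cached_lo && s <= cached_hi;
  }

  bool is_deleted(snapid_t s) {
    if (!covers(s)) {
      // This is the hard part: deciding when and how much of the on-disk
      // deleted_snaps set to page in.
      load_from_disk_around(s);
    }
    auto it = cached.upper_bound(s);
    if (it == cached.begin())
      return false;
    --it;
    return s >= it->first && s <= it->second;
  }

  void load_from_disk_around(snapid_t s) {
    // Placeholder: read a window of intervals around s from the PG's
    // on-disk metadata and widen [cached_lo, cached_hi] to match.
    (void)s;
  }
};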
PRs are of course welcome! ;)
There you go: https://github.com/ceph/ceph/pull/17493
We are hitting the limitations of the current implementation - we have over
9,000 removed snap intervals, with the snap counter over 650,000. In our
particular case, this shows up as a bad CPU usage spike every few minutes,
and it's only going to get worse as we accumulate more snapshots over time.
My PR halves that spike, and it's a change small enough to be backported to
both Jewel and Luminous without breaking too much at once - not a final
solution, but it should make life a bit more tolerable until an actual,
working solution is in place.
--
Piotr Dałek
piotr.dalek@xxxxxxxxxxxx
https://www.ovh.com/us/