Re: [ceph-users] RBD: How many snapshots is too many?

On 17-09-16 01:36 AM, Gregory Farnum wrote:
I got the chance to discuss this a bit with Patrick at the Open Source Summit Wednesday (good to see you!).

So the idea in the previously-referenced CDM talk essentially involves changing the way we distribute snap deletion instructions from a "deleted_snaps" member in the OSDMap to a "deleting_snaps" member that gets trimmed once the OSDs report to the manager that they've finished removing that snapid. This should entirely resolve the CPU burn they're seeing during OSDMap processing on the nodes, as it shrinks the intersection operation down from "all the snaps" to merely "the snaps not-done-deleting".
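
To make the difference concrete, here is a rough standalone sketch (my own illustration, not the actual OSDMap code; all names are made up) of why trimming the set down to "still deleting" shrinks the per-map intersection work:

// Rough sketch, not Ceph's actual code: contrasts intersecting a PG's
// snap set against the full history of deleted snaps versus only the
// snaps whose deletion is still in flight.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <set>

using snapid = uint64_t;

// Intersection work is proportional to the size of the sets involved.
std::set<snapid> intersect(const std::set<snapid>& a, const std::set<snapid>& b) {
    std::set<snapid> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(out, out.end()));
    return out;
}

int main() {
    // Snaps this PG still references.
    std::set<snapid> pg_snaps = {100, 250, 900000};

    // Old scheme: every snap ever deleted stays in the OSDMap.
    std::set<snapid> deleted_snaps;
    for (snapid s = 1; s < 650000; ++s) deleted_snaps.insert(s);

    // Proposed scheme: only snaps still being deleted, trimmed once
    // the OSDs report completion to the manager.
    std::set<snapid> deleting_snaps = {899990, 900000, 900010};

    // The old intersection walks hundreds of thousands of entries on
    // every map update...
    auto hit_old = intersect(pg_snaps, deleted_snaps);
    // ...while the new one only touches the handful still in flight.
    auto hit_new = intersect(pg_snaps, deleting_snaps);

    std::cout << "old: " << hit_old.size() << " new: " << hit_new.size() << "\n";
}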

The other reason we maintain the full set of deleted snaps is to prevent client operations from re-creating deleted snapshots — we filter all client IO which includes snaps against the deleted_snaps set in the PG. Apparently this is also big enough in RAM to be a real (but much smaller) problem.
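
For readers following along, a minimal sketch of the filtering idea (hypothetical names, not the OSD's real types): incoming client writes carry a snap context that has to be checked against the deleted set so a straggling client can't resurrect a snapshot that was already removed.

#include <cstdint>
#include <iostream>
#include <set>
#include <vector>

using snapid = uint64_t;

std::vector<snapid> filter_snapc(const std::vector<snapid>& client_snaps,
                                 const std::set<snapid>& deleted_snaps) {
    std::vector<snapid> kept;
    for (snapid s : client_snaps) {
        // Any snap the cluster already deleted is dropped from the
        // write's snap context instead of being re-created.
        if (!deleted_snaps.count(s))
            kept.push_back(s);
    }
    return kept;
}

int main() {
    std::set<snapid> deleted_snaps = {40, 41, 42};
    std::vector<snapid> client_snaps = {38, 41, 45};   // 41 was deleted

    for (snapid s : filter_snapc(client_snaps, deleted_snaps))
        std::cout << s << " ";                          // prints: 38 45
    std::cout << "\n";
}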

Unfortunately eliminating that is a lot harder and a permanent fix will involve changing the client protocol in ways nobody has quite figured out how to do. But Patrick did suggest storing the full set of deleted snaps on-disk and only keeping in-memory the set which covers snapids in the range we've actually *seen* from clients. I haven't gone through the code but that seems broadly feasible — the hard part will be working out the rules when you have to go to disk to read a larger part of the deleted_snaps set. (Perfectly feasible.)
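
A loose sketch of that suggestion as I understand it (all names here are invented): keep the full deleted_snaps set on disk, cache in memory only the portion covering snapids we've actually seen from clients, and fall back to a disk read when a lookup lands outside that cached range.

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <set>

using snapid = uint64_t;

struct DeletedSnapsCache {
    std::set<snapid> cached;              // in-memory slice of the on-disk set
    snapid cached_lo = 0, cached_hi = 0;  // range the cache currently covers

    // Stub standing in for reading the on-disk copy of the set.
    void load_from_disk(snapid lo, snapid hi) {
        std::cout << "disk read for [" << lo << "," << hi << "]\n";
        // ...fill `cached` from disk and widen the covered range...
        cached_lo = std::min(cached_lo, lo);
        cached_hi = std::max(cached_hi, hi);
    }

    bool is_deleted(snapid s) {
        // The hard part mentioned above: deciding when the in-memory
        // slice is authoritative and when we must go to disk.
        if (s < cached_lo || s > cached_hi)
            load_from_disk(s, s);
        return cached.count(s) > 0;
    }
};

int main() {
    DeletedSnapsCache cache;
    cache.cached = {10, 11, 12};
    cache.cached_lo = 10;
    cache.cached_hi = 20;

    std::cout << cache.is_deleted(11) << "\n";   // served from memory
    std::cout << cache.is_deleted(500) << "\n";  // triggers a disk read
}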

PRs are of course welcome! ;)

There you go: https://github.com/ceph/ceph/pull/17493

We are hitting the limitations of the current implementation - we have over 9,000 removed snap intervals, with the snap counter over 650,000. In our particular case this shows up as a bad CPU usage spike every few minutes, and it's only going to get worse as we accumulate more snapshots over time. My PR halves that spike, and the change is small enough to be backported to both Jewel and Luminous without breaking too much at once - not a final solution, but it should make life a bit more tolerable until an actual, working solution is in place.
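
For scale, a back-of-the-envelope illustration (my own, not code from the PR): removed snaps are tracked as intervals rather than individual ids, which is how roughly 650,000 deleted snapids collapse into about 9,000 intervals - but every interval still has to be walked whenever the sets are intersected on an OSDMap update, hence the periodic spike.

#include <cstdint>
#include <iostream>
#include <map>

using snapid = uint64_t;

int main() {
    // start -> length, a simplified stand-in for an interval set of
    // removed snaps.
    std::map<snapid, snapid> removed_intervals;

    // Pretend 9,000 snapshot "runs" were deleted over time, with gaps
    // (still-live snapshots) between them so the intervals don't merge.
    snapid next = 1, total_ids = 0;
    for (int i = 0; i < 9000; ++i) {
        snapid len = 72;              // ~72 ids per run -> ~650k ids total
        removed_intervals[next] = len;
        total_ids += len;
        next += len + 1;
    }

    std::cout << removed_intervals.size() << " intervals covering "
              << total_ids << " snapids\n";
}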

--
Piotr Dałek
piotr.dalek@xxxxxxxxxxxx
https://www.ovh.com/us/
--


