> In our use case, we are severely hampered by the size of removed_snaps
> (50k+) in the OSDMap, to the point where ~80% of ALL CPU time is spent
> in PGPool::update and its interval calculation code. We have a cluster
> of around 100k RBDs, with each RBD having up to 25 snapshots and only a
> small portion of our RBDs mapped at a time (~500-1000). For
> size/performance reasons we try to keep the number of snapshots low
> (<25) and need to prune snapshots. Since in our use case RBDs 'age' at
> different rates, snapshot pruning creates holes, to the point where the
> size of the removed_snaps interval set in the OSDMap is 50k-100k in
> many of our Ceph clusters. I think in general around 2 snapshot removal
> operations currently happen per minute, just because of the volume of
> snapshots and users we have.

Right. Greg, this is what I was getting at: 25 snapshots per RBD is
firmly in "one snapshot per day per RBD" territory; this is something
that a cloud operator might do, for example, offering daily snapshots
going back one month. But it still wrecks the cluster simply by having
lots of images, even though only a fraction of them (less than 1%) are
ever in use.

That's rather counter-intuitive: it doesn't hit you until you have lots
of images, and once you're affected by it there's no practical way out,
where "out" is defined as "restoring overall cluster performance to
something acceptable".

> We found PGPool::update and the interval calculation code to be quite
> inefficient. Some small changes made it a lot faster, giving us more
> breathing room. We shared these and most have already been applied:
>
> https://github.com/ceph/ceph/pull/17088
> https://github.com/ceph/ceph/pull/17121
> https://github.com/ceph/ceph/pull/17239
> https://github.com/ceph/ceph/pull/17265
> https://github.com/ceph/ceph/pull/17410 (not yet merged, needs more fixes)
>
> These patches helped our use case, but overall CPU usage in this area
> is still high (>70% or so), making the Ceph cluster slow, causing
> blocked requests, and making many operations (e.g. rbd map) take a
> long time.

I think this makes it very much a practical issue, not a hypothetical or
theoretical one.

> We are trying to work around these issues by changing our snapshot
> strategy. In the short term we are manually defragmenting the interval
> set: we scan for holes and delete the snapshots whose snapids lie
> between holes, so that adjacent holes coalesce. This is not so nice to
> do. In some cases we employ strategies to 'recreate' old snapshots (as
> we need to keep them) at higher snapids. For our use case a 'snapid
> rename' feature would have been quite helpful.
>
> I hope this shines some light on practical Ceph clusters in which
> performance is bottlenecked not by I/O but by snapshot removal.

For others following this thread, or retrieving it from the list archive
some time down the road, I'd rephrase that as "bottlenecked not by I/O
but by CPU utilization associated with snapshot removal". Is that fair
to say, Patrick? Please correct me if I'm misrepresenting.

Greg (or Josh/Jason/Sage/anyone really :) ), can you provide additional
insight as to how these issues can be worked around or mitigated,
besides the PRs that Patrick and his colleagues have already sent?

Cheers,
Florian
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
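
[Editorial note for archive readers: to help picture the "scan for holes
and coalesce" workaround Patrick describes above, here is a rough, purely
illustrative sketch. It is not Ceph code; the interval data in it is made
up, and a real script would instead read the pool's removed_snaps from
`ceph osd dump`. It only shows the idea: rank the gaps that separate the
removed ranges, since deleting the few snapshots inside a narrow gap lets
the two removed ranges around it merge, shrinking the interval set.]

    #!/usr/bin/env python3
    # Illustrative sketch only: NOT Ceph code. Model a pool's removed_snaps
    # as a sorted list of half-open [start, end) intervals and rank the
    # gaps (still-live snapid ranges) between them. The sample data below
    # is made up.

    removed_snaps = [(1, 40), (45, 60), (61, 100), (130, 200)]  # [start, end)

    def gaps(intervals):
        """Yield (gap_start, gap_end, width) for each hole between intervals."""
        for (_, end_a), (start_b, _) in zip(intervals, intervals[1:]):
            yield (end_a, start_b, start_b - end_a)

    # Smallest gaps first: removing the single snapshot in a width-1 gap
    # merges two removed ranges at the lowest cost.
    for gap_start, gap_end, width in sorted(gaps(removed_snaps), key=lambda g: g[2]):
        snapids = list(range(gap_start, gap_end))
        print("gap of width %d: deleting snapid(s) %s would coalesce "
              "the removed ranges on either side" % (width, snapids))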