On Thu, Sep 7, 2017 at 1:46 PM, Mclean, Patrick <Patrick.Mclean@xxxxxxxx> wrote:

> On 2017-09-05 02:41 PM, Gregory Farnum wrote:

>> On Tue, Sep 5, 2017 at 1:44 PM, Florian Haas <florian@xxxxxxxxxxx> wrote:

>>> Hi everyone,

>>> with the Luminous release out the door and the Labor Day weekend over, I hope I can kick off a discussion on another issue that has irked me a bit for quite a while. There doesn't seem to be a good documented answer to this: what are Ceph's real limits when it comes to RBD snapshots?

>>> For most people, any RBD image will have perhaps a single-digit number of snapshots. For example, in an OpenStack environment we typically have one snapshot per Glance image, a few snapshots per Cinder volume, and perhaps a few snapshots per ephemeral Nova disk (unless clones are configured to flatten immediately). Ceph generally performs well under those circumstances.

>>> However, things sometimes start getting problematic when RBD snapshots are generated frequently, and in an automated fashion. I've seen Ceph operators configure snapshots on a daily or even hourly basis, typically when using snapshots as a backup strategy (where they promise to allow for very short RTO and RPO). In combination with thousands or maybe tens of thousands of RBDs, that's a lot of snapshots. And in such scenarios (and only in those), users have been bitten by a few nasty bugs in the past — here's an example where the OSD snap trim queue went berserk in the event of lots of snapshots being deleted:

>>> http://tracker.ceph.com/issues/9487
>>> https://www.spinics.net/lists/ceph-devel/msg20470.html

>>> It seems to me that there still isn't a good recommendation along the lines of "try not to have more than X snapshots per RBD image" or "try not to have more than Y snapshots in the cluster overall". Or is the "correct" recommendation actually "create as many snapshots as you might possibly want, none of that is allowed to create any instability nor performance degradation and if it does, that's a bug"?

>> I think we're closer to "as many snapshots as you want", but there are some known shortcomings there.

>> First of all, if you haven't seen my talk from the last OpenStack summit on snapshots and you want a bunch of details, go watch that. :p
>> https://www.openstack.org/videos/boston-2017/ceph-snapshots-for-fun-and-profit-1

>> There are a few dimensions where there can be failures with snapshots:

>> 1) right now the way we mark snapshots as deleted is suboptimal — when deleted they go into an interval_set in the OSDMap. So if you have a bunch of holes in your deleted snapshots, it is possible to inflate the osdmap to a size which causes trouble. But I'm not sure if we've actually seen this be an issue yet — it requires both a large cluster, and a large map, and probably some other failure causing osdmaps to be generated very rapidly.

> In our use case, we are severely hampered by the size of removed_snaps (50k+) in the OSDMap, to the point where ~80% of ALL CPU time is spent in PGPool::update and its interval calculation code. We have a cluster of around 100k RBDs, with each RBD having up to 25 snapshots and only a small portion of our RBDs mapped at a time (~500-1000). For size / performance reasons we try to keep the number of snapshots low (<25) and need to prune snapshots.
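(To illustrate the point about holes: below is a minimal, self-contained sketch, not Ceph's actual interval_set implementation, of why pruning that leaves gaps inflates the removed-snaps set that the OSDMap has to carry and that PGPool::update has to intersect. The IntervalSet type and the numbers are made up for this example.)

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <map>

using snapid_t = uint64_t;

// Stand-in for an interval set of removed snapids: maps interval start
// to interval length, the same idea as interval_set<snapid_t>.
struct IntervalSet {
  std::map<snapid_t, snapid_t> m;  // start -> length

  // Naive single-snapid insert with merging of adjacent intervals;
  // just enough for the demo (assumes s is not already present).
  void insert(snapid_t s) {
    auto next = m.lower_bound(s);
    if (next != m.begin()) {
      auto prev = std::prev(next);
      if (prev->first + prev->second == s) {            // s extends the previous interval
        prev->second++;
        if (next != m.end() && next->first == s + 1) {  // and closes a gap between two intervals
          prev->second += next->second;
          m.erase(next);
        }
        return;
      }
    }
    if (next != m.end() && next->first == s + 1) {      // s prepends the next interval
      snapid_t len = next->second + 1;
      m.erase(next);
      m[s] = len;
    } else {
      m[s] = 1;                                         // isolated removal, i.e. a new interval
    }
  }

  std::size_t num_intervals() const { return m.size(); }
};

int main() {
  IntervalSet trimmed_in_order, pruned_with_holes;
  // Deleting snapshots oldest-first keeps the removed set at one interval...
  for (snapid_t s = 1; s <= 100000; ++s)
    trimmed_in_order.insert(s);
  // ...while pruning every other snapshot leaves one interval per hole,
  // and it is the number of intervals that has to be encoded in the map
  // and walked by the intersection code on every map update.
  for (snapid_t s = 1; s <= 100000; s += 2)
    pruned_with_holes.insert(s);
  std::cout << "in-order trim:     " << trimmed_in_order.num_intervals() << " interval(s)\n";   // 1
  std::cout << "pruned with holes: " << pruned_with_holes.num_intervals() << " interval(s)\n";  // 50000
}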
> Since in our use case RBDs 'age' at different rates, snapshot pruning creates holes, to the point where the size of the removed_snaps interval set in the osdmap is 50k-100k in many of our Ceph clusters. I think in general around 2 snapshot removal operations currently happen a minute, just because of the volume of snapshots and users we have.

> We found PGPool::update and the interval calculation code to be quite inefficient. Some small changes made it a lot faster, giving more breathing room; we shared these and most already got applied:
> https://github.com/ceph/ceph/pull/17088
> https://github.com/ceph/ceph/pull/17121
> https://github.com/ceph/ceph/pull/17239
> https://github.com/ceph/ceph/pull/17265
> https://github.com/ceph/ceph/pull/17410 (not yet merged, needs more fixes)

> However, while these patches helped our use case, overall CPU usage in this area is still high (>70% or so), making the Ceph cluster slow and causing blocked requests and many operations (e.g. rbd map) to take a long time.

> We are trying to work around these issues by changing our snapshot strategy. In the short term we are manually defragmenting the interval set by scanning for holes and trying to delete snapids in between holes to coalesce more holes (the sketch at the end of this thread illustrates the idea). This is not so nice to do. In some cases we employ strategies to 'recreate' old snapshots (as we need to keep them) at higher snapids. For our use case a 'snapid rename' feature would have been quite helpful.

> I hope this shines some light on practical Ceph clusters in which performance is bottlenecked not by I/O but by snapshot removal.

There's one thing that confuses me about this. Is all your CPU usage really coming from handling osdmap updates and the interval_set calculations there? Or is some of it coming out of PG::filter_snapc() and its use of the contains() function?

We discussed improvements to distributing the deleted-snapshots set in CDM a few days ago (http://tracker.ceph.com/projects/ceph/wiki/CDM_06-SEP-2017), and there's a good path forward for keeping the amount of data in the OSDMap down, which will certainly improve life for those intersection_of operations. But we don't yet have a good solution for the per-operation filtering that we do (though that only runs "contains" operations on what is, comparatively, a very small set of IDs).

It might really just be the osdmap update processing -- that would make me happy, as it's a much easier problem to resolve. But I'm also surprised it's *that* expensive, even at the scales you've described.
-Greg

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
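(And to illustrate the hole-coalescing workaround Patrick describes: a minimal sketch, using a hypothetical snapids_to_prune() helper written for this example rather than any actual tooling, of how one can scan the removed-snaps intervals for the live snapids sitting in the gaps. Deleting exactly those snapshots merges the neighbouring intervals and shrinks the set.)

#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

using snapid_t = uint64_t;

// removed_snaps as (start, length) intervals, as in the earlier sketch:
// e.g. {1,10}, {13,5}, {20,4} covers 1-10, 13-17 and 20-23, leaving the
// live snapids 11-12 and 18-19 as the "holes" between intervals.
std::vector<snapid_t> snapids_to_prune(const std::map<snapid_t, snapid_t>& removed) {
  std::vector<snapid_t> candidates;
  auto it = removed.begin();
  if (it == removed.end())
    return candidates;
  snapid_t prev_end = it->first + it->second;  // one past the end of the first interval
  for (++it; it != removed.end(); ++it) {
    for (snapid_t s = prev_end; s < it->first; ++s)
      candidates.push_back(s);                 // deleting these merges two intervals into one
    prev_end = it->first + it->second;
  }
  return candidates;
}

int main() {
  std::map<snapid_t, snapid_t> removed = {{1, 10}, {13, 5}, {20, 4}};
  for (snapid_t s : snapids_to_prune(removed))
    std::cout << "candidate snapid to recreate/delete: " << s << "\n";
  // prints 11, 12, 18 and 19
}

In practice the candidate list would be filtered against snapshots that still need to be retained, which is why Patrick's team recreates some of them at higher snapids before deleting the originals.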