On 2017-09-05 02:41 PM, Gregory Farnum wrote:
> On Tue, Sep 5, 2017 at 1:44 PM, Florian Haas <florian@xxxxxxxxxxx>
> wrote:
>> Hi everyone,
>>
>> with the Luminous release out the door and the Labor Day weekend
>> over, I hope I can kick off a discussion on another issue that has
>> irked me a bit for quite a while. There doesn't seem to be a good
>> documented answer to this: what are Ceph's real limits when it
>> comes to RBD snapshots?
>>
>> For most people, any RBD image will have perhaps a single-digit
>> number of snapshots. For example, in an OpenStack environment we
>> typically have one snapshot per Glance image, a few snapshots per
>> Cinder volume, and perhaps a few snapshots per ephemeral Nova disk
>> (unless clones are configured to flatten immediately). Ceph
>> generally performs well under those circumstances.
>>
>> However, things sometimes start getting problematic when RBD
>> snapshots are generated frequently, and in an automated fashion.
>> I've seen Ceph operators configure snapshots on a daily or even
>> hourly basis, typically when using snapshots as a backup strategy
>> (where they promise to allow for very short RTO and RPO). In
>> combination with thousands or maybe tens of thousands of RBDs,
>> that's a lot of snapshots. And in such scenarios (and only in
>> those), users have been bitten by a few nasty bugs in the past --
>> here's an example where the OSD snap trim queue went berserk in the
>> event of lots of snapshots being deleted:
>>
>> http://tracker.ceph.com/issues/9487
>> https://www.spinics.net/lists/ceph-devel/msg20470.html
>>
>> It seems to me that there still isn't a good recommendation along
>> the lines of "try not to have more than X snapshots per RBD image"
>> or "try not to have more than Y snapshots in the cluster overall".
>> Or is the "correct" recommendation actually "create as many
>> snapshots as you might possibly want, none of that is allowed to
>> create any instability nor performance degradation and if it does,
>> that's a bug"?
>
> I think we're closer to "as many snapshots as you want", but there
> are some known shortages there.
>
> First of all, if you haven't seen my talk from the last OpenStack
> summit on snapshots and you want a bunch of details, go watch that.
> :p
> https://www.openstack.org/videos/boston-2017/ceph-snapshots-for-fun-and-profit-1
>
> There are a few dimensions there can be failures with snapshots:
> 1) right now the way we mark snapshots as deleted is suboptimal --
> when deleted they go into an interval_set in the OSDMap. So if you
> have a bunch of holes in your deleted snapshots, it is possible to
> inflate the osdmap to a size which causes trouble. But I'm not sure
> if we've actually seen this be an issue yet -- it requires both a
> large cluster, and a large map, and probably some other failure
> causing osdmaps to be generated very rapidly.

In our use case, we are severely hampered by the size of removed_snaps
(50k+) in the OSDMap, to the point where ~80% of ALL CPU time is spent
in PGPool::update and its interval calculation code. We have a cluster
of around 100k RBDs, with each RBD having up to 25 snapshots and only a
small portion of our RBDs mapped at a time (~500-1000). For size /
performance reasons we try to keep the number of snapshots low (<25)
and need to prune snapshots. Since in our use case RBDs 'age' at
different rates, snapshot pruning creates holes, to the point where the
size of the removed_snaps interval set in the osdmap is 50k-100k in
many of our Ceph clusters. I think in general around 2 snapshot removal
operations currently happen a minute just because of the volume of
snapshots and users we have.

We found PGPool::update and the interval calculation code to be quite
inefficient.
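
To make the fragmentation effect concrete, here is a minimal sketch in
Python. It is only a toy model, not Ceph's C++ interval_set or the
PGPool::update code; the helper names to_intervals and gaps_between are
hypothetical and exist just for this illustration. It shows that the
number of intervals depends on how scattered the removed snapids are,
not on how many there are, and it lists the narrow runs of live snapids
that keep two removed intervals apart.

```python
def to_intervals(removed_ids):
    """Collapse snap ids into sorted half-open [start, end) intervals.

    Toy stand-in for an interval set; not Ceph's implementation.
    """
    intervals = []
    for snapid in sorted(set(removed_ids)):
        if intervals and intervals[-1][1] == snapid:
            # snapid directly extends the previous interval
            intervals[-1][1] = snapid + 1
        else:
            # snapid starts a new interval (there is a hole before it)
            intervals.append([snapid, snapid + 1])
    return [tuple(iv) for iv in intervals]


def gaps_between(intervals, max_gap=1):
    """Return runs of live snap ids, at most max_gap wide, that separate
    two neighbouring removed intervals (hypothetical helper)."""
    gaps = []
    for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:]):
        if next_start - prev_end <= max_gap:
            gaps.append((prev_end, next_start))
    return gaps


# 100 removed snapids in one contiguous run: a single interval.
contiguous = to_intervals(range(1, 101))

# 100 removed snapids with a live snapid between each pair: 100 intervals.
fragmented = to_intervals(range(1, 200, 2))

print(len(contiguous), len(fragmented))
print(gaps_between(to_intervals([1, 2, 3, 7, 8, 10])))
```

In this model, removing the single live snapid in a reported gap would
merge the two neighbouring intervals into one, which is the kind of
coalescing that keeps the interval count down.
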
Some small changes to PGPool::update and the interval code made them a
lot faster, giving us more breathing room. We shared these, and most
have already been applied:

https://github.com/ceph/ceph/pull/17088
https://github.com/ceph/ceph/pull/17121
https://github.com/ceph/ceph/pull/17239
https://github.com/ceph/ceph/pull/17265
https://github.com/ceph/ceph/pull/17410 (not yet merged, needs more fixes)

These patches helped for our use case, but overall CPU usage in this
area is still high (>70% or so), which makes the Ceph cluster slow,
causes blocked requests, and makes many operations (e.g. rbd map) take
a long time.

We are trying to work around these issues by changing our snapshot
strategy. In the short term we are manually defragmenting the interval
set by scanning for holes and trying to delete snapids in between holes
so that more holes coalesce. This is not so nice to do. In some cases
we employ strategies to 'recreate' old snapshots (as we need to keep
them) at higher snapids. For our use case a 'snapid rename' feature
would have been quite helpful.

I hope this sheds some light on practical Ceph clusters in which
performance is bottlenecked not by I/O but by snapshot removal.

> 2) There may be issues with how rbd records what snapshots it is
> associated with? No idea about this; haven't heard of any.
>
> 3) Trimming snapshots requires IO. This is where most (all?) of the
> issues I've seen have come from; either in it being unscheduled IO
> that the rest of the system doesn't account for or throttle (as in
> the links you highlighted) or in admins overwhelming the IO capacity
> of their clusters. At this point I think we've got everything being
> properly scheduled so it shouldn't break your cluster, but you can
> build up large queues of deferred work.

As mentioned above, we have been seeing that trimming is much more CPU
bound than IO bound. Our disks are mostly sitting idle while the OSD
daemons are completely pegging all of the CPUs in the cluster.
We are not in any way IO bound at this point, and we are certainly not
overwhelming the IO capacity of our clusters.

>
> -Greg
>
>>
>> Looking forward to your thoughts. Thanks in advance!
>>
>> Cheers, Florian
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com