On Thu, Sep 7, 2017 at 1:46 PM, Mclean, Patrick <Patrick.Mclean@xxxxxxxx> wrote:

> On 2017-09-05 02:41 PM, Gregory Farnum wrote:

>> On Tue, Sep 5, 2017 at 1:44 PM, Florian Haas <florian@xxxxxxxxxxx> wrote:

>>> Hi everyone,

>>> with the Luminous release out the door and the Labor Day weekend over, I hope I can kick off a discussion on another issue that has irked me a bit for quite a while. There doesn't seem to be a good documented answer to this: what are Ceph's real limits when it comes to RBD snapshots?

>>> For most people, any RBD image will have perhaps a single-digit number of snapshots. For example, in an OpenStack environment we typically have one snapshot per Glance image, a few snapshots per Cinder volume, and perhaps a few snapshots per ephemeral Nova disk (unless clones are configured to flatten immediately). Ceph generally performs well under those circumstances.

>>> However, things sometimes start getting problematic when RBD snapshots are generated frequently, and in an automated fashion. I've seen Ceph operators configure snapshots on a daily or even hourly basis, typically when using snapshots as a backup strategy (where they promise to allow for very short RTO and RPO). In combination with thousands or maybe tens of thousands of RBDs, that's a lot of snapshots. And in such scenarios (and only in those), users have been bitten by a few nasty bugs in the past — here's an example where the OSD snap trim queue went berserk in the event of lots of snapshots being deleted:

>>> http://tracker.ceph.com/issues/9487
>>> https://www.spinics.net/lists/ceph-devel/msg20470.html

>>> It seems to me that there still isn't a good recommendation along the lines of "try not to have more than X snapshots per RBD image" or "try not to have more than Y snapshots in the cluster overall". Or is the "correct" recommendation actually "create as many snapshots as you might possibly want, none of that is allowed to create any instability nor performance degradation and if it does, that's a bug"?

>> I think we're closer to "as many snapshots as you want", but there are some known shortcomings there.

>> First of all, if you haven't seen my talk from the last OpenStack summit on snapshots and you want a bunch of details, go watch that. :p
>> https://www.openstack.org/videos/boston-2017/ceph-snapshots-for-fun-and-profit-1

>> There are a few dimensions where there can be failures with snapshots:

>> 1) right now the way we mark snapshots as deleted is suboptimal — when deleted they go into an interval_set in the OSDMap. So if you have a bunch of holes in your deleted snapshots, it is possible to inflate the osdmap to a size which causes trouble. But I'm not sure if we've actually seen this be an issue yet — it requires both a large cluster, and a large map, and probably some other failure causing osdmaps to be generated very rapidly.

> In our use case, we are severely hampered by the size of removed_snaps (50k+) in the OSDMap, to the point where ~80% of ALL CPU time is spent in PGPool::update and its interval calculation code. We have a cluster of around 100k RBDs, with each RBD having up to 25 snapshots and only a small portion of our RBDs mapped at a time (~500-1000). For size / performance reasons we try to keep the number of snapshots low (<25) and need to prune snapshots.
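(To illustrate the point about holes: below is a minimal, self-contained sketch, not Ceph's actual interval_set implementation, of why pruning that leaves gaps inflates the removed-snaps set that the OSDMap has to carry and that PGPool::update has to intersect. The IntervalSet type and the numbers are made up for this example.)

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <map>

using snapid_t = uint64_t;

// Stand-in for an interval set of removed snapids: maps interval start
// to interval length, the same idea as interval_set<snapid_t>.
struct IntervalSet {
  std::map<snapid_t, snapid_t> m;  // start -> length

  // Naive single-snapid insert with merging of adjacent intervals;
  // just enough for the demo (assumes s is not already present).
  void insert(snapid_t s) {
    auto next = m.lower_bound(s);
    if (next != m.begin()) {
      auto prev = std::prev(next);
      if (prev->first + prev->second == s) {            // s extends the previous interval
        prev->second++;
        if (next != m.end() && next->first == s + 1) {  // and closes a gap between two intervals
          prev->second += next->second;
          m.erase(next);
        }
        return;
      }
    }
    if (next != m.end() && next->first == s + 1) {      // s prepends the next interval
      snapid_t len = next->second + 1;
      m.erase(next);
      m[s] = len;
    } else {
      m[s] = 1;                                         // isolated removal, i.e. a new interval
    }
  }

  std::size_t num_intervals() const { return m.size(); }
};

int main() {
  IntervalSet trimmed_in_order, pruned_with_holes;
  // Deleting snapshots oldest-first keeps the removed set at one interval...
  for (snapid_t s = 1; s <= 100000; ++s)
    trimmed_in_order.insert(s);
  // ...while pruning every other snapshot leaves one interval per hole,
  // and it is the number of intervals that has to be encoded in the map
  // and walked by the intersection code on every map update.
  for (snapid_t s = 1; s <= 100000; s += 2)
    pruned_with_holes.insert(s);
  std::cout << "in-order trim:     " << trimmed_in_order.num_intervals() << " interval(s)\n";   // 1
  std::cout << "pruned with holes: " << pruned_with_holes.num_intervals() << " interval(s)\n";  // 50000
}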
> Since in our use case RBDs 'age' at different rates, snapshot pruning creates holes, to the point where the size of the removed_snaps interval set in the osdmap is 50k-100k in many of our Ceph clusters. I think in general around 2 snapshot removal operations currently happen a minute, just because of the volume of snapshots and users we have.

> We found PGPool::update and the interval calculation code to be quite inefficient. Some small changes made it a lot faster, giving more breathing room; we shared these and most already got applied:
> https://github.com/ceph/ceph/pull/17088
> https://github.com/ceph/ceph/pull/17121
> https://github.com/ceph/ceph/pull/17239
> https://github.com/ceph/ceph/pull/17265
> https://github.com/ceph/ceph/pull/17410 (not yet merged, needs more fixes)

> However, while these patches helped our use case, overall CPU usage in this area is still high (>70% or so), making the Ceph cluster slow and causing blocked requests and many operations (e.g. rbd map) to take a long time.

> We are trying to work around these issues by changing our snapshot strategy. In the short term we are manually defragmenting the interval set by scanning for holes and trying to delete snapids in between holes to coalesce more holes (the sketch at the end of this thread illustrates the idea). This is not so nice to do. In some cases we employ strategies to 'recreate' old snapshots (as we need to keep them) at higher snapids. For our use case a 'snapid rename' feature would have been quite helpful.

> I hope this shines some light on practical Ceph clusters in which performance is bottlenecked not by I/O but by snapshot removal.

There's one thing that confuses me about this. Is all your CPU usage really coming from handling osdmap updates and the interval_set calculations there? Or is some of it coming out of PG::filter_snapc() and its use of the contains() function?

We discussed improvements to distributing the deleted-snapshots set in CDM a few days ago (http://tracker.ceph.com/projects/ceph/wiki/CDM_06-SEP-2017), and there's a good path forward for keeping the amount of data in the OSDMap down, which will certainly improve life for those intersection_of operations. But we don't yet have a good solution for the per-operation filtering that we do (though that only runs "contains" operations on what is, comparatively, a very small set of IDs).

It might really just be the osdmap update processing -- that would make me happy, as it's a much easier problem to resolve. But I'm also surprised it's *that* expensive, even at the scales you've described.
-Greg

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
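(And to illustrate the hole-coalescing workaround Patrick describes: a minimal sketch, using a hypothetical snapids_to_prune() helper written for this example rather than any actual tooling, of how one can scan the removed-snaps intervals for the live snapids sitting in the gaps. Deleting exactly those snapshots merges the neighbouring intervals and shrinks the set.)

#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

using snapid_t = uint64_t;

// removed_snaps as (start, length) intervals, as in the earlier sketch:
// e.g. {1,10}, {13,5}, {20,4} covers 1-10, 13-17 and 20-23, leaving the
// live snapids 11-12 and 18-19 as the "holes" between intervals.
std::vector<snapid_t> snapids_to_prune(const std::map<snapid_t, snapid_t>& removed) {
  std::vector<snapid_t> candidates;
  auto it = removed.begin();
  if (it == removed.end())
    return candidates;
  snapid_t prev_end = it->first + it->second;  // one past the end of the first interval
  for (++it; it != removed.end(); ++it) {
    for (snapid_t s = prev_end; s < it->first; ++s)
      candidates.push_back(s);                 // deleting these merges two intervals into one
    prev_end = it->first + it->second;
  }
  return candidates;
}

int main() {
  std::map<snapid_t, snapid_t> removed = {{1, 10}, {13, 5}, {20, 4}};
  for (snapid_t s : snapids_to_prune(removed))
    std::cout << "candidate snapid to recreate/delete: " << s << "\n";
  // prints 11, 12, 18 and 19
}

In practice the candidate list would be filtered against snapshots that still need to be retained, which is why Patrick's team recreates some of them at higher snapids before deleting the originals.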