Re: RBD: How many snapshots is too many?

On 2017-09-08 01:36 PM, Gregory Farnum wrote:
> On Thu, Sep 7, 2017 at 1:46 PM, Mclean, Patrick <Patrick.Mclean@xxxxxxxx> wrote:
>> On 2017-09-05 02:41 PM, Gregory Farnum wrote:
>>> On Tue, Sep 5, 2017 at 1:44 PM, Florian Haas <florian@xxxxxxxxxxx> wrote:
>>>> Hi everyone,
>>>>
>>>> with the Luminous release out the door and the Labor Day weekend
>>>> over, I hope I can kick off a discussion on another issue that has
>>>> irked me a bit for quite a while. There doesn't seem to be a good
>>>> documented answer to this: what are Ceph's real limits when it
>>>> comes to RBD snapshots?
>>>>
>>>> For most people, any RBD image will have perhaps a single-digit
>>>> number of snapshots. For example, in an OpenStack environment we
>>>> typically have one snapshot per Glance image, a few snapshots per
>>>> Cinder volume, and perhaps a few snapshots per ephemeral Nova disk
>>>> (unless clones are configured to flatten immediately). Ceph
>>>> generally performs well under those circumstances.
>>>>
>>>> However, things sometimes start getting problematic when RBD
>>>> snapshots are generated frequently, and in an automated fashion.
>>>> I've seen Ceph operators configure snapshots on a daily or even
>>>> hourly basis, typically when using snapshots as a backup strategy
>>>> (where they promise to allow for very short RTO and RPO). In
>>>> combination with thousands or maybe tens of thousands of RBDs,
>>>> that's a lot of snapshots. And in such scenarios (and only in
>>>> those), users have been bitten by a few nasty bugs in the past —
>>>> here's an example where the OSD snap trim queue went berserk in
>>>> the event of lots of snapshots being deleted:
>>>>
>>>> http://tracker.ceph.com/issues/9487
>>>> https://www.spinics.net/lists/ceph-devel/msg20470.html
>>>>
>>>> It seems to me that there still isn't a good recommendation along
>>>> the lines of "try not to have more than X snapshots per RBD image"
>>>> or "try not to have more than Y snapshots in the cluster overall".
>>>> Or is the "correct" recommendation actually "create as many
>>>> snapshots as you might possibly want, none of that is allowed to
>>>> create any instability nor performance degradation and if it does,
>>>> that's a bug"?
>>>
>>> I think we're closer to "as many snapshots as you want", but there
>>> are some known shortcomings there.
>>>
>>> First of all, if you haven't seen my talk from the last OpenStack
>>> summit on snapshots and you want a bunch of details, go watch that. :p
>>> https://www.openstack.org/videos/boston-2017/ceph-snapshots-for-fun-and-profit-1
>>>
>>> There are a few dimensions where there can be failures with snapshots:
>>>
>>> 1) right now the way we mark snapshots as deleted is suboptimal —
>>> when deleted they go into an interval_set in the OSDMap. So if you
>>> have a bunch of holes in your deleted snapshots, it is possible to
>>> inflate the osdmap to a size which causes trouble. But I'm not sure
>>> if we've actually seen this be an issue yet — it requires both a
>>> large cluster, and a large map, and probably some other failure
>>> causing osdmaps to be generated very rapidly.
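
(Aside, for anyone following along who hasn't looked at this structure:
below is a minimal, self-contained sketch of the idea -- a toy stand-in
for the removed-snaps interval set, not Ceph's actual interval_set
implementation -- showing how pruning snapshot ids out of order leaves
holes, so the number of intervals the OSDMap has to carry, and every
update has to walk, grows with the churn pattern instead of staying
small.)

// Toy model of a removed-snapshots interval set: maps interval start -> end
// (half-open ranges). Illustration only, not Ceph's interval_set.
#include <cstdint>
#include <iostream>
#include <iterator>
#include <map>

// Mark one snapid as removed, merging with neighbouring intervals when
// possible. Assumes the id has not been removed already.
static void remove_snap(std::map<uint64_t, uint64_t>& s, uint64_t id) {
    auto next = s.lower_bound(id);
    if (next != s.begin()) {
        auto prev = std::prev(next);
        if (prev->second == id) {                        // extends the previous interval
            prev->second = id + 1;
            if (next != s.end() && next->first == id + 1) {  // bridges a gap
                prev->second = next->second;
                s.erase(next);
            }
            return;
        }
    }
    if (next != s.end() && next->first == id + 1) {      // extends the next interval
        uint64_t end = next->second;
        s.erase(next);
        s[id] = end;
        return;
    }
    s[id] = id + 1;                                      // isolated hole: new interval
}

int main() {
    std::map<uint64_t, uint64_t> removed;
    // Prune every other snapid, as happens when images age at different
    // rates: every deletion is an isolated hole.
    for (uint64_t id = 1; id <= 100000; id += 2)
        remove_snap(removed, id);
    std::cout << "intervals: " << removed.size() << "\n";   // 50000
    // Removing the ids in between coalesces everything into one run.
    for (uint64_t id = 2; id <= 100000; id += 2)
        remove_snap(removed, id);
    std::cout << "intervals: " << removed.size() << "\n";   // 1
}

The interval count, not the raw number of deleted snapshots, is what
every intersection over this set has to walk.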
>> In our use case, we are severely hampered by the size of removed_snaps
>> (50k+) in the OSDMap, to the point where ~80% of ALL cpu time is spent
>> in PGPool::update and its interval calculation code. We have a cluster
>> of around 100k RBDs with each RBD having up to 25 snapshots, and only a
>> small portion of our RBDs mapped at a time (~500-1000). For size /
>> performance reasons we try to keep the number of snapshots low (<25)
>> and need to prune snapshots. Since in our use case RBDs 'age' at
>> different rates, snapshot pruning creates holes, to the point where
>> the size of the removed_snaps interval set in the osdmap is 50k-100k
>> in many of our Ceph clusters. In general around two snapshot removal
>> operations currently happen per minute, just because of the volume of
>> snapshots and users we have.
>>
>> We found PGPool::update and the interval calculation code to be quite
>> inefficient. Some small changes made it a lot faster, giving us more
>> breathing room; we shared these and most have already been applied:
>> https://github.com/ceph/ceph/pull/17088
>> https://github.com/ceph/ceph/pull/17121
>> https://github.com/ceph/ceph/pull/17239
>> https://github.com/ceph/ceph/pull/17265
>> https://github.com/ceph/ceph/pull/17410 (not yet merged, needs more fixes)
>>
>> These patches helped our use case, but overall CPU usage in this area
>> is still high (>70% or so), making the Ceph cluster slow, causing
>> blocked requests, and making many operations (e.g. rbd map) take a
>> long time.
>>
>> We are trying to work around these issues by changing our snapshot
>> strategy. In the short term we are manually defragmenting the interval
>> set by scanning for holes and deleting the snapids between them so
>> that the holes coalesce. This is not a pleasant thing to have to do.
>> In some cases we employ strategies to 'recreate' old snapshots (as we
>> need to keep them) at higher snapids. For our use case a 'snapid
>> rename' feature would have been quite helpful.
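
(To make the workaround above slightly more concrete, here is a rough,
standalone sketch of the kind of scan involved -- simplified, not our
production tooling: walk the removed-snaps intervals, list the gaps of
still-live snapids between adjacent intervals, and look at the
narrowest gaps first, since deleting or recreating those few snapshots
merges two intervals into one.)

// Sketch of the hole-coalescing scan: removed-snap intervals are kept as
// start -> end (half-open). Each gap between two adjacent intervals is a run
// of still-live snapids; deleting that run merges the two intervals into one.
// Simplified illustration, not production code.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <map>
#include <vector>

struct Gap {
    uint64_t first_id;   // first live snapid in the gap
    uint64_t width;      // how many snapids have to go to merge the neighbours
};

static std::vector<Gap> find_gaps(const std::map<uint64_t, uint64_t>& removed) {
    std::vector<Gap> gaps;
    for (auto it = removed.begin(); it != removed.end(); ++it) {
        auto next = std::next(it);
        if (next == removed.end())
            break;
        gaps.push_back({it->second, next->first - it->second});
    }
    // Cheapest merges first: fewest snapshots to touch per interval removed.
    std::sort(gaps.begin(), gaps.end(),
              [](const Gap& a, const Gap& b) { return a.width < b.width; });
    return gaps;
}

int main() {
    // A small fragmented example: four intervals with gaps of width 1, 3 and 2.
    std::map<uint64_t, uint64_t> removed{{10, 20}, {21, 30}, {33, 40}, {42, 50}};
    for (const Gap& g : find_gaps(removed))
        std::cout << "deleting " << g.width << " snapid(s) starting at "
                  << g.first_id << " merges two intervals\n";
}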
>>
>> I hope this shines some light on practical Ceph clusters in which
>> performance is bottlenecked not by I/O but by snapshot removal.
> There's one thing that confuses me about this. Is all your cpu usage
> really coming from handling osdmap updates and the interval_set
> calculations there? Or is some of it coming out of PG::filter_snapc()
> and its use of the contains() function?
We have not seen PG::filter_snapc() or its use of contains() turn up
anywhere near the top of any of our perf dumps (it registers at 0%);
it has always been PGPool::update() that dominates the top of every
dump we have created so far. We have also noticed that OSD boot times
are often extremely long, once again with most of the time spent in
PGPool::update().
> We discussed improvements to distributing the deleted snapshots set in
> CDM a few days ago
> (http://tracker.ceph.com/projects/ceph/wiki/CDM_06-SEP-2017) and
> there's a good path forward for keeping the amount of data in the
> OSDMap down, which will certainly improve life for those
> intersection_of operations. But we don't yet have a good solution for
> the per-operation filtering that we do (but it only runs "contains"
> operations on what is comparatively a very small set of IDs).
We are currently using some patches that simply use std::vector
rather than std::map, which makes a noticeable difference. Before
we started looking at the code, we noticed that using jemalloc
rather than tcmalloc made a noticeable difference; we have since
figured out this is due to all the memory allocation std::map does
when you iterate over it. My colleague is working on a more generic
iterator that we can send as a PR, rather than just switching some
std::map to std::vector. We are also using some heuristics to try
to avoid the intersection_of operations, such as not regenerating
the intersection when nothing has changed in the cluster (using
whether the highest snapshot id value has changed).
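
(Roughly what the vector-based variant looks like -- a simplified,
standalone sketch, not the actual patches: intervals kept as a sorted
vector of (start, end) pairs and intersected with a linear two-pointer
walk, so iteration touches contiguous memory instead of chasing
std::map nodes through the allocator, plus the cheap "has anything
changed" check used to skip the recomputation entirely.)

// Sketch: removed-snap intervals as a sorted vector of half-open (start, end)
// pairs, intersected with a two-pointer walk. Illustration of the idea only.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

using Intervals = std::vector<std::pair<uint64_t, uint64_t>>;

static Intervals intersect(const Intervals& a, const Intervals& b) {
    Intervals out;
    size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        uint64_t lo = std::max(a[i].first, b[j].first);
        uint64_t hi = std::min(a[i].second, b[j].second);
        if (lo < hi)
            out.emplace_back(lo, hi);            // overlapping piece
        (a[i].second < b[j].second) ? ++i : ++j; // advance the one that ends first
    }
    return out;
}

int main() {
    Intervals pool_removed{{1, 5}, {8, 12}, {20, 30}};
    Intervals cached_removed{{3, 10}, {25, 40}};

    // Heuristic from the paragraph above: remember the pool's highest snapid
    // and only redo the intersection when it has moved since the last osdmap.
    uint64_t cached_max_snapid = 0;
    uint64_t max_snapid = pool_removed.back().second;
    if (max_snapid != cached_max_snapid) {
        cached_max_snapid = max_snapid;   // in real code this persists across epochs
        for (const auto& [lo, hi] : intersect(pool_removed, cached_removed))
            std::cout << "[" << lo << ", " << hi << ")\n";   // prints [3, 5) [8, 10) [25, 30)
    }
}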

On a related note, we are very curious why the snapshot id is
incremented when a snapshot is deleted; this creates lots of
phantom entries in the deleted snapshots set, and interleaved
deletions and creations cause massive fragmentation of the
interval set. The only reason we can come up with for this is
to track whether anything changed, but I suspect a different
value, one that doesn't inject entries into the interval set,
might be better for that purpose.
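
(A toy model of that effect, purely to illustrate the point and resting
on the assumption just described -- i.e. that each deletion also
consumes a fresh snapid which then shows up in the removed set even
though no snapshot ever used it:)

// Toy model of the behaviour described above (assumption, not verified against
// the OSD code): every snapshot deletion also consumes a fresh snapid, and that
// never-used id ends up in the removed-snapshots set as a "phantom" entry.
#include <cstdint>
#include <iostream>
#include <set>

int main() {
    uint64_t snap_seq = 0;                 // pool snapshot id counter
    std::set<uint64_t> live, removed, phantom;

    auto create = [&] { live.insert(++snap_seq); };
    auto destroy = [&](uint64_t id) {
        live.erase(id);
        removed.insert(id);
        // Assumed behaviour: deletion bumps the counter too, and the skipped
        // id shows up as removed even though no snapshot ever used it.
        removed.insert(++snap_seq);
        phantom.insert(snap_seq);
    };

    // Rolling "keep the last 25 snapshots" schedule for 10000 cycles.
    for (int i = 0; i < 10000; ++i) {
        create();
        if (live.size() > 25)
            destroy(*live.begin());
    }
    std::cout << "removed entries: " << removed.size()
              << ", of which phantom: " << phantom.size() << "\n";
}

Under that assumption, roughly half of the removed-set entries after a
long create/delete churn are ids that never corresponded to a snapshot
at all.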
> It might really just be the osdmap update processing -- that would
> make me happy as it's a much easier problem to resolve. But I'm also
> surprised it's *that* expensive, even at the scales you've described.
That would be nice, but unfortunately all the data is pointing
to PGPool::update().
> -Greg
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



