Re: slow requests and short OSD failures in small cluster

On Tue, Apr 18, 2017 at 11:34 AM, Peter Maloney
<peter.maloney@xxxxxxxxxxxxxxxxxxxx> wrote:
> On 04/18/17 11:44, Jogi Hofmüller wrote:
>> Hi,
>>
>> On Tuesday, 18.04.2017 at 13:02 +0200, mj wrote:
>>> On 04/18/2017 11:24 AM, Jogi Hofmüller wrote:
>>>> This might have been true for hammer and older versions of ceph.
>>>> From what I can tell now, every snapshot taken reduces performance
>>>> of the entire cluster :(
>>>
>>> Really? Can others confirm this? Is this a 'well-known fact'?
>>> (unknown only to us, perhaps...)
>>
>> I have to add some more/new details now. We started removing snapshots
>> for VMs today. We did this VM by VM and waited some time in between
>> while monitoring the cluster.
>>
>> After having removed all snapshots for the third VM, the cluster went
>> back to a 'normal' state: no more slow requests. I/O waits for
>> VMs are down to acceptable numbers again (<10% peaks, <5% average).
>>
>> So, either there is one VM/image that irritates the entire cluster, or
>> we reached some kind of threshold, or it's something completely
>> different.
>>
>> As for the well-known fact: Peter Maloney pointed that out in this
>> thread (mail from last Thursday).
>
> The well-known fact part was CoW, which I guess applies to all versions.
>
> The 'slower with every snapshot even after CoW totally flattens it'
> issue I just find easy to test. I didn't test it on hammer or earlier,
> and others confirmed it but didn't keep track of the versions. Just
> make an rbd image, map it (probably... my tests were with qemu
> librbd), run fio randwrite tests with sync and direct on the device
> (no need for a filesystem or anything), then make a few snaps and
> watch it go way slower.
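
(For reference, Peter's test boils down to something like the sketch
below. Pool/image names and sizes are placeholders; this assumes a
pool named 'rbd' and a client with the rbd CLI and fio installed.)

    # create and map a scratch image
    rbd create rbd/snaptest --size 10G
    rbd map rbd/snaptest                  # maps to e.g. /dev/rbd0

    # baseline: small synchronous random writes straight to the device
    fio --name=baseline --filename=/dev/rbd0 --rw=randwrite --bs=4k \
        --direct=1 --sync=1 --iodepth=1 --runtime=60 --time_based

    # take a few snapshots, then rerun the same job and compare IOPS
    rbd snap create rbd/snaptest@s1
    rbd snap create rbd/snaptest@s2
    fio --name=after-snaps --filename=/dev/rbd0 --rw=randwrite --bs=4k \
        --direct=1 --sync=1 --iodepth=1 --runtime=60 --time_based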

I'm not sure this is a correct diagnosis or assessment.

In general, snapshots incur costs in two places:
1) the first write to an object after it is logically snapshotted,
2) when removing snapshots.

There should be no long-term performance degradation, especially on
XFS: filestore creates a full new copy of an object the first time it
is written after each snapshot. (btrfs and bluestore use block-based
CoW, so they can suffer from fragmentation if things go badly.)
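
The object-level copying is easy to observe with the rados tool. A
sketch: it uses pool snapshots for simplicity (RBD uses self-managed
snapshots, but the clone mechanics at the object level are the same)
and assumes an empty scratch pool named 'scratch', since a pool can
only use one snapshot style:

    # write an object, snapshot the pool, then overwrite the object
    rados -p scratch put demo-obj /etc/hosts
    rados -p scratch mksnap before-write
    rados -p scratch put demo-obj /etc/services  # first write after the snap forces a clone

    # listsnaps now shows the head object plus a clone held by the snapshot
    rados -p scratch listsnaps demo-obj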
However, the costs of snapshot trimming (especially in Jewel) have
been much discussed recently. (I'll have some announcements about
improvements there soon!) So if you've got live trims happening, yes,
there's an incremental load on the cluster.
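
(If live trims are hurting client IO, the classic knob is the OSD snap
trim sleep. Treat this as a sketch rather than a recommendation: its
behavior varies by release, and in some Jewel builds the sleep happens
in the op thread and can itself block client IO:

    # inject a small delay between snap trim operations on all OSDs
    ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.05'

The equivalent persistent setting is "osd snap trim sleep" in the
[osd] section of ceph.conf.)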

Similarly, the first write to an object after it has been snapshotted
requires copying the whole object to a new location before the write
is applied. Generally, that should amortize into nothingness, but it
sounds like in this case you were doing roughly a single IO per object
for every snapshot you created, which, yes, would be impressively slow
overall.
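
To put rough numbers on that (assuming the default 4 MB RBD object
size): a 4 KB synchronous random write that lands on a not-yet-copied
object first forces a ~4 MB object copy, roughly a 1000x write
amplification for that single IO. Under sustained load every object
gets copied once per snapshot and the overhead fades, but a
snapshot-per-few-writes pattern pays that cost almost every time.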

The reports I've seen of slow snapshots have come down to one of the
two issues above. Sometimes it's compounded by people not having
enough spare IOPS to support their client workload while snapshots are
being taken or trimmed, but that doesn't mean snapshots are inherently
expensive or inefficient[1], just that they have a non-zero cost which
your cluster needs to be able to absorb.
-Greg

[1]: Although, yes, snap trimming is more expensive than in many
similar systems. There are reasons for that which I discussed at Vault
and will present on again at the upcoming OpenStack Boston Ceph day.
:)
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



