Hi,

On 29/06/2016 18:33, Stefan Priebe - Profihost AG wrote:
>> On 28.06.2016 at 09:43, Lionel Bouton <lionel-subscription@xxxxxxxxxxx> wrote:
>>
>> Hi,
>>
>> On 28/06/2016 08:34, Stefan Priebe - Profihost AG wrote:
>>> [...]
>>> Yes but at least BTRFS is still not working for ceph due to
>>> fragmentation. I've even tested a 4.6 kernel a few weeks ago. But it
>>> doubles it's I/O after a few days.
>> BTRFS autodefrag is not working over the long term. That said BTRFS
>> itself is working far better than XFS on our cluster (noticeably better
>> latencies). As not having checksums wasn't an option we coded and are
>> using this:
>>
>> https://github.com/jtek/ceph-utils/blob/master/btrfs-defrag-scheduler.rb
>>
>> This actually saved us from 2 faulty disk controllers which were
>> infrequently corrupting data in our cluster.
>>
>> Mandatory too for performance :
>> filestore btrfs snap = false

> This sounds interesting. For how long you use this method?

More than a year now. Since the beginning, almost two years ago, we have always had at least one or two BTRFS OSDs to test and compare with the XFS ones. At the very beginning we had to recycle them regularly because their performance degraded over time. This was not a problem, as Ceph makes it easy to move data around safely. We only switched after finding out both that "filestore btrfs snap = false" was mandatory (when true, it creates large write spikes every filestore sync interval; a minimal ceph.conf excerpt is at the end of this mail) and that a custom defragmentation process was needed to maintain performance over the long run.

> What kind of workload do you have?

A dozen VMs using rbd through KVM's built-in support. There are different kinds of access patterns: a large PostgreSQL instance (75+ GB on disk, 300+ tx/s with peaks of ~2000, a mean of 50+ IO/s with peaks to 1000, mostly writes), a small MySQL instance (hard to say: it was very large, but we moved most of its content to PostgreSQL, which left only a small database for a proprietary tool and large ibdata* files with mostly holes), a very large NFS server (~10 TB), and lots of Ruby on Rails applications and background workers. On the whole storage system Ceph reports an average of 170 op/s, with peaks that can reach 3000.

> How did you measure the performance and latency?

Every useful metric we can get is fed to a Zabbix server. Latency is measured both by the kernel on each disk, as the average time a request stays in queue (computed from the number of IOs and the accumulated wait time over a given period: you can find these values in /sys/block/<dev>/stat, and a small sketch of the computation is at the end of this mail), and at the Ceph level by monitoring the apply latency (we now have journals on SSD, so our commit latency is mostly limited by the available CPU). The most interesting metric is the apply latency; block-device latency is useful to monitor to see how much the device itself is pushed and how well reads perform (apply latency only gives us the write side of the story). The behavior during backfills confirmed the latency benefits too: BTRFS OSDs were less frequently involved in slow requests than the XFS ones.

> What kernel do you use with btrfs?

4.4.6 currently (we just finished migrating all servers last weekend), but the switch from XFS to BTRFS occurred with late 3.9 kernels IIRC. I don't have measurements for this, but when we switched from 4.1.15-r1 ("-r1" is for Gentoo patches) to 4.4.6 we saw faster OSD startups (including the initial filesystem mount). The only drawback with BTRFS (if you don't count having to develop and run a custom defragmentation scheduler) was the OSD startup time vs XFS: it was very slow when starting from an unmounted filesystem, at least until 4.1.x. This was not really a problem, as we don't restart OSDs often.
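For reference, here is a minimal ceph.conf excerpt for the snap setting mentioned above. The option itself is the one we quoted; putting it in the [osd] section is simply how we would write it, adjust to your own layout:

[osd]
    # disable filestore's BTRFS snapshot-based sync, which caused large
    # write spikes at every filestore sync interval on our cluster
    filestore btrfs snap = false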
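And to make the block-device metric above concrete, a minimal Python sketch (not the script we actually feed Zabbix with; the device name and sampling interval are arbitrary placeholders) that derives the average per-request wait time from two samples of /sys/block/<dev>/stat:

#!/usr/bin/env python
# Minimal sketch of the block-device latency metric described above:
# average time a request spends waiting, derived from two samples of
# /sys/block/<dev>/stat.
#
# Relevant fields (see Documentation/block/stat.txt in the kernel tree):
#   0 read I/Os   3 read ticks (ms)   4 write I/Os   7 write ticks (ms)
import sys
import time

def read_stat(dev):
    """Return (completed I/Os, accumulated wait time in ms) for a device."""
    with open("/sys/block/%s/stat" % dev) as f:
        fields = [int(x) for x in f.read().split()]
    ios = fields[0] + fields[4]       # read I/Os + write I/Os
    wait_ms = fields[3] + fields[7]   # read ticks + write ticks
    return ios, wait_ms

def avg_queue_time_ms(dev, interval=10.0):
    """Average wait per request over `interval` seconds (None if idle)."""
    ios1, wait1 = read_stat(dev)
    time.sleep(interval)
    ios2, wait2 = read_stat(dev)
    dios = ios2 - ios1
    if dios <= 0:
        return None
    return (wait2 - wait1) / float(dios)

if __name__ == "__main__":
    dev = sys.argv[1] if len(sys.argv) > 1 else "sda"
    print("%s: %s ms per request" % (dev, avg_queue_time_ms(dev)))

A Zabbix item (or any other poller) would do the same delta-based computation at its own polling interval.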
Best regards,

Lionel