Hi,

On 29/06/2016 18:33, Stefan Priebe - Profihost AG wrote:
>> On 28.06.2016 at 09:43, Lionel Bouton <lionel-subscription@xxxxxxxxxxx> wrote:
>>
>> Hi,
>>
>> On 28/06/2016 08:34, Stefan Priebe - Profihost AG wrote:
>>> [...]
>>> Yes but at least BTRFS is still not working for ceph due to
>>> fragmentation. I've even tested a 4.6 kernel a few weeks ago. But it
>>> doubles it's I/O after a few days.
>> BTRFS autodefrag is not working over the long term. That said BTRFS
>> itself is working far better than XFS on our cluster (noticeably better
>> latencies). As not having checksums wasn't an option we coded and are
>> using this:
>>
>> https://github.com/jtek/ceph-utils/blob/master/btrfs-defrag-scheduler.rb
>>
>> This actually saved us from 2 faulty disk controllers which were
>> infrequently corrupting data in our cluster.
>>
>> Mandatory too for performance :
>> filestore btrfs snap = false

> This sounds interesting. For how long you use this method?

More than a year now. Since the beginning, almost two years ago, we have always had at least one or two BTRFS OSDs to test and compare with the XFS ones. At the very beginning we had to recycle them regularly because their performance degraded over time. This was not a problem, as Ceph makes it easy to move data around safely. We only switched after finding out both that "filestore btrfs snap = false" was mandatory (when true, it creates large write spikes every filestore sync interval; a minimal ceph.conf excerpt is at the end of this mail) and that a custom defragmentation process was needed to maintain performance over the long run.

> What kind of workload do you have?

A dozen VMs using rbd through KVM's built-in support. There are different kinds of access patterns: a large PostgreSQL instance (75+ GB on disk, 300+ tx/s with peaks of ~2000, a mean of 50+ IO/s with peaks to 1000, mostly writes), a small MySQL instance (hard to say: it was very large, but we moved most of its content to PostgreSQL, which left only a small database for a proprietary tool and large ibdata* files with mostly holes), a very large NFS server (~10 TB), and lots of Ruby on Rails applications and background workers. On the whole storage system Ceph reports an average of 170 op/s, with peaks that can reach 3000.

> How did you measure the performance and latency?

Every useful metric we can get is fed to a Zabbix server. Latency is measured both by the kernel on each disk, as the average time a request stays in queue (computed from the number of IOs and the accumulated wait time over a given period: you can find these values in /sys/block/<dev>/stat, and a small sketch of the computation is at the end of this mail), and at the Ceph level by monitoring the apply latency (we now have journals on SSD, so our commit latency is mostly limited by the available CPU). The most interesting metric is the apply latency; block-device latency is useful to monitor to see how much the device itself is pushed and how well reads perform (apply latency only gives us the write side of the story). The behavior during backfills confirmed the latency benefits too: BTRFS OSDs were less frequently involved in slow requests than the XFS ones.

> What kernel do you use with btrfs?

4.4.6 currently (we just finished migrating all servers last weekend), but the switch from XFS to BTRFS occurred with late 3.9 kernels IIRC. I don't have measurements for this, but when we switched from 4.1.15-r1 ("-r1" is for Gentoo patches) to 4.4.6 we saw faster OSD startups (including the initial filesystem mount). The only drawback with BTRFS (if you don't count having to develop and run a custom defragmentation scheduler) was the OSD startup time vs XFS: it was very slow when starting from an unmounted filesystem, at least until 4.1.x. This was not really a problem, as we don't restart OSDs often.
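For reference, here is a minimal ceph.conf excerpt for the snap setting mentioned above. The option itself is the one we quoted; putting it in the [osd] section is simply how we would write it, adjust to your own layout:

[osd]
    # disable filestore's BTRFS snapshot-based sync, which caused large
    # write spikes at every filestore sync interval on our cluster
    filestore btrfs snap = false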
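And to make the block-device metric above concrete, a minimal Python sketch (not the script we actually feed Zabbix with; the device name and sampling interval are arbitrary placeholders) that derives the average per-request wait time from two samples of /sys/block/<dev>/stat:

#!/usr/bin/env python
# Minimal sketch of the block-device latency metric described above:
# average time a request spends waiting, derived from two samples of
# /sys/block/<dev>/stat.
#
# Relevant fields (see Documentation/block/stat.txt in the kernel tree):
#   0 read I/Os   3 read ticks (ms)   4 write I/Os   7 write ticks (ms)
import sys
import time

def read_stat(dev):
    """Return (completed I/Os, accumulated wait time in ms) for a device."""
    with open("/sys/block/%s/stat" % dev) as f:
        fields = [int(x) for x in f.read().split()]
    ios = fields[0] + fields[4]       # read I/Os + write I/Os
    wait_ms = fields[3] + fields[7]   # read ticks + write ticks
    return ios, wait_ms

def avg_queue_time_ms(dev, interval=10.0):
    """Average wait per request over `interval` seconds (None if idle)."""
    ios1, wait1 = read_stat(dev)
    time.sleep(interval)
    ios2, wait2 = read_stat(dev)
    dios = ios2 - ios1
    if dios <= 0:
        return None
    return (wait2 - wait1) / float(dios)

if __name__ == "__main__":
    dev = sys.argv[1] if len(sys.argv) > 1 else "sda"
    print("%s: %s ms per request" % (dev, avg_queue_time_ms(dev)))

A Zabbix item (or any other poller) would do the same delta-based computation at its own polling interval.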
Best regards,

Lionel