On 21-10-15 15:30, Mark Nelson wrote:
>
>
> On 10/21/2015 01:59 AM, Wido den Hollander wrote:
>> On 10/20/2015 07:44 PM, Mark Nelson wrote:
>>> On 10/20/2015 09:00 AM, Wido den Hollander wrote:
>>>> Hi,
>>>>
>>>> In the "newstore direction" thread on ceph-devel I wrote that I'm using
>>>> bcache in production and Mark Nelson asked me to share some details.
>>>>
>>>> Bcache is now running in two clusters that I manage, but I'll keep this
>>>> information to one of them (the one at PCextreme behind CloudStack).
>>>>
>>>> This cluster has been running for over 2 years now:
>>>>
>>>> epoch 284353
>>>> fsid 0d56dd8f-7ae0-4447-b51b-f8b818749307
>>>> created 2013-09-23 11:06:11.819520
>>>> modified 2015-10-20 15:27:48.734213
>>>>
>>>> The system consists of 39 hosts:
>>>>
>>>> 2U SuperMicro chassis:
>>>> * 80GB Intel SSD for OS
>>>> * 240GB Intel S3700 SSD for journaling + bcache
>>>> * 6x 3TB disk
>>>>
>>>> This isn't the newest hardware. The next batch of hardware will have
>>>> more disks per chassis, but this is it for now.
>>>>
>>>> All systems were installed with Ubuntu 12.04, but they are all running
>>>> 14.04 now with bcache.
>>>>
>>>> The Intel S3700 SSD is partitioned with a GPT label:
>>>> - 5GB journal for each OSD
>>>> - 200GB partition for bcache
>>>>
>>>> root@ceph11:~# df -h | grep osd
>>>> /dev/bcache0  2.8T  1.1T  1.8T  38%  /var/lib/ceph/osd/ceph-60
>>>> /dev/bcache1  2.8T  1.2T  1.7T  41%  /var/lib/ceph/osd/ceph-61
>>>> /dev/bcache2  2.8T  930G  1.9T  34%  /var/lib/ceph/osd/ceph-62
>>>> /dev/bcache3  2.8T  970G  1.8T  35%  /var/lib/ceph/osd/ceph-63
>>>> /dev/bcache4  2.8T  814G  2.0T  30%  /var/lib/ceph/osd/ceph-64
>>>> /dev/bcache5  2.8T  915G  1.9T  33%  /var/lib/ceph/osd/ceph-65
>>>> root@ceph11:~#
>>>>
>>>> root@ceph11:~# lsb_release -a
>>>> No LSB modules are available.
>>>> Distributor ID: Ubuntu
>>>> Description:    Ubuntu 14.04.3 LTS
>>>> Release:        14.04
>>>> Codename:       trusty
>>>> root@ceph11:~# uname -r
>>>> 3.19.0-30-generic
>>>> root@ceph11:~#
>>>>
>>>> "apply_latency": {
>>>>     "avgcount": 2985023,
>>>>     "sum": 226219.891559000
>>>> }
>>>>
>>>> What did we notice?
>>>> - Fewer spikes on the disks
>>>> - Lower commit latencies on the OSDs
>>>> - Almost no 'slow requests' during backfills
>>>> - Cache-hit ratio of about 60%
>>>>
>>>> Max backfills and recovery active are both set to 1 on all OSDs.
>>>>
>>>> For the next generation of hardware we are looking into 3U chassis
>>>> with 16x 4TB SATA drives and a 1.2TB NVMe SSD for bcache, but we
>>>> haven't tested those yet, so there is nothing to say about them.
>>>>
>>>> The current setup is 200GB of cache for 18TB of disks. The new setup
>>>> will be 1200GB for 64TB; we are curious to see what that does.
>>>>
>>>> Our main conclusion, however, is that it smooths the I/O pattern
>>>> towards the disks, and that gives an overall better response from the
>>>> disks.
>>>
>>> Hi Wido, thanks for the big writeup! Did you guys happen to do any
>>> benchmarking? I think Xiaoxi looked at flashcache a while back but had
>>> mixed results if I remember right. It would be interesting to know how
>>> bcache is affecting performance in different scenarios.
>>>
>>
>> No, we didn't do any benchmarking. Initially this cluster was built for
>> just the RADOS Gateway, so we went for 2Gbit (2x 1Gbit) per machine. 90%
>> is still Gbit networking and we are in the process of upgrading it all
>> to 10Gbit.
>>
>> Since the 1Gbit network latency is about 4 times higher than 10Gbit, we
>> aren't really benchmarking the cluster.
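
For context, the apply_latency counter quoted above works out to
226219.89 s / 2,985,023 ops, or roughly 76 ms per apply averaged over
the lifetime of that counter.

For anyone who wants to reproduce a similar layout, below is a rough
sketch of how one cache partition on the S3700 can front several
spinning disks with bcache. The device names, the single backing disk
shown and the writeback cache mode are illustrative assumptions, not
details taken from the setup above:

    # Format one OSD disk as a backing device and the SSD partition as a cache
    make-bcache -B /dev/sdc
    make-bcache -C /dev/sda3

    # Look up the cache set UUID and attach the backing device to it.
    # Each additional OSD disk is attached to the same cache set the same way.
    bcache-super-show /dev/sda3 | grep cset.uuid
    echo <cset-uuid> > /sys/block/bcache0/bcache/attach

    # Optionally switch from the default writethrough mode to writeback
    echo writeback > /sys/block/bcache0/bcache/cache_mode

The resulting /dev/bcache0 is then used like any other block device for
the OSD data filesystem, which is what the df output above shows.
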
>>
>> What counts for us most is that we can do recovery operations without
>> any slow requests.
>>
>> Before bcache we saw disks spike to 100% busy while a backfill was
>> running. Now bcache smooths this out and we see peaks of maybe 70%,
>> but that's it.
>
> In the testing I was doing to figure out our new lab hardware, I was
> seeing SSDs handle recovery dramatically better than spinning disks as
> well during ceph_test_rados runs. It might be worth digging in to see
> what the IO patterns look like. In the meantime, though, it's very
> interesting that bcache helps so much in this case. Good to know!

To add to this: we still had to enable hashpspool on a few pools, so we
did. The degradation on the cluster went to 39% and it has been
recovering for over 48 hours now. Not a single slow request while we had
the OSD complaint time set to 5 seconds. After setting this to 0.5
seconds we saw some slow requests, but nothing dramatic.

For us bcache works really great with spinning disks.

Wido
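
For reference, the tunables mentioned in this thread correspond to the
following settings. The values mirror what is described above, while the
[osd] section placement and the pool name placeholder are only an
illustrative sketch:

    [osd]
    osd max backfills = 1
    osd recovery max active = 1
    # ops slower than this many seconds are reported as slow requests;
    # the thread above used 5 seconds first, then 0.5
    osd op complaint time = 0.5

    # enabling the hashpspool flag on an existing pool
    ceph osd pool set <poolname> hashpspool true

The OSD options can also be changed at runtime on a live cluster with
ceph tell osd.* injectargs instead of editing ceph.conf and restarting.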