Re: Ceph OSDs with bcache experience

Wido den Hollander <wido@xxxxxxxx> · Fri, 6 Nov 2015 09:30:09 +0100

On 11/05/2015 11:03 PM, Michal Kozanecki wrote:
> Why did you guys go with partitioning the SSD for Ceph journals, instead of just using the whole SSD for bcache and leaving the journal on the filesystem (which itself is on top of bcache)? Was there really a benefit to separating the journals from the bcache-fronted HDDs?
> 
> I ask because it has been shown in the past that separating the journal on SSD-based pools doesn't really do much.
> 

Well, the I/O for the journal bypasses bcache completely in this case.
We figured that the less code the I/O travels through, the better.

We didn't try with the journal on bcache. This works for us, so we didn't
bother testing anything different.

Wido

> Michal Kozanecki | Linux Administrator | mkozanecki@xxxxxxxxxx
> 
> 
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Wido den Hollander
> Sent: October-28-15 5:49 AM
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  Ceph OSDs with bcache experience
> 
> 
> 
> On 21-10-15 15:30, Mark Nelson wrote:
>>
>>
>> On 10/21/2015 01:59 AM, Wido den Hollander wrote:
>>> On 10/20/2015 07:44 PM, Mark Nelson wrote:
>>>> On 10/20/2015 09:00 AM, Wido den Hollander wrote:
>>>>> Hi,
>>>>>
>>>>> In the "newstore direction" thread on ceph-devel I wrote that I'm 
>>>>> using bcache in production and Mark Nelson asked me to share some details.
>>>>>
>>>>> Bcache is running in two clusters now that I manage, but I'll keep 
>>>>> this information to one of them (the one at PCextreme behind CloudStack).
>>>>>
>>>>> This cluster has been running for over 2 years now:
>>>>>
>>>>> epoch 284353
>>>>> fsid 0d56dd8f-7ae0-4447-b51b-f8b818749307
>>>>> created 2013-09-23 11:06:11.819520
>>>>> modified 2015-10-20 15:27:48.734213
>>>>>
>>>>> The system consists of 39 hosts:
>>>>>
>>>>> 2U SuperMicro chassis:
>>>>> * 80GB Intel SSD for OS
>>>>> * 240GB Intel S3700 SSD for Journaling + Bcache
>>>>> * 6x 3TB disk
>>>>>
>>>>> This isn't the newest hardware. The next batch of hardware will be 
>>>>> more disks per chassis, but this is it for now.
>>>>>
>>>>> All systems were installed with Ubuntu 12.04, but they are all 
>>>>> running 14.04 now with bcache.
>>>>>
>>>>> The Intel S3700 SSD is partitioned with a GPT label:
>>>>> - 5GB Journal for each OSD
>>>>> - 200GB Partition for bcache
>>>>>
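As a quick sanity check on the partition layout above, here is a minimal sketch of the arithmetic. It assumes one 5GB journal per OSD for the six data disks in the chassis, uses the nominal sizes from the post, and ignores GPT overhead; the function name is only illustrative.

```python
# Nominal figures from the post: 240GB S3700, 6 OSDs per host,
# a 5GB journal partition per OSD and one 200GB bcache partition.
def ssd_layout(ssd_gb, osd_count, journal_gb, bcache_gb):
    """Return how much of the SSD the journals and the cache claim (in GB)."""
    journals_gb = osd_count * journal_gb
    used_gb = journals_gb + bcache_gb
    return {
        "journals_gb": journals_gb,
        "used_gb": used_gb,
        "free_gb": ssd_gb - used_gb,
    }

layout = ssd_layout(ssd_gb=240, osd_count=6, journal_gb=5, bcache_gb=200)
print(layout)  # {'journals_gb': 30, 'used_gb': 230, 'free_gb': 10}
```

Under those assumptions the journals take 30GB and the cache 200GB, leaving about 10GB of the SSD unallocated.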
>>>>> root@ceph11:~# df -h|grep osd
>>>>> /dev/bcache0    2.8T  1.1T  1.8T  38% /var/lib/ceph/osd/ceph-60
>>>>> /dev/bcache1    2.8T  1.2T  1.7T  41% /var/lib/ceph/osd/ceph-61
>>>>> /dev/bcache2    2.8T  930G  1.9T  34% /var/lib/ceph/osd/ceph-62
>>>>> /dev/bcache3    2.8T  970G  1.8T  35% /var/lib/ceph/osd/ceph-63
>>>>> /dev/bcache4    2.8T  814G  2.0T  30% /var/lib/ceph/osd/ceph-64
>>>>> /dev/bcache5    2.8T  915G  1.9T  33% /var/lib/ceph/osd/ceph-65
>>>>> root@ceph11:~#
>>>>>
>>>>> root@ceph11:~# lsb_release -a
>>>>> No LSB modules are available.
>>>>> Distributor ID:    Ubuntu
>>>>> Description:    Ubuntu 14.04.3 LTS
>>>>> Release:    14.04
>>>>> Codename:    trusty
>>>>> root@ceph11:~# uname -r
>>>>> 3.19.0-30-generic
>>>>> root@ceph11:~#
>>>>>
>>>>> "apply_latency": {
>>>>>       "avgcount": 2985023,
>>>>>       "sum": 226219.891559000
>>>>> }
>>>>>
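The apply_latency counter above is easier to read as an average. A minimal sketch, assuming the usual Ceph perf counter convention that "sum" is the accumulated time in seconds and "avgcount" is the number of samples (the post itself does not state the units):

```python
import json

# The counter exactly as quoted in the post, wrapped in braces to make valid JSON.
perf_dump = json.loads("""
{
  "apply_latency": {
    "avgcount": 2985023,
    "sum": 226219.891559000
  }
}
""")

def average_latency(counter):
    """Mean latency per operation: accumulated time divided by sample count."""
    if counter["avgcount"] == 0:
        return 0.0
    return counter["sum"] / counter["avgcount"]

avg_seconds = average_latency(perf_dump["apply_latency"])
print(f"average apply latency: {avg_seconds * 1000:.1f} ms")
```

If the unit assumption holds, this works out to roughly 76 ms per operation, averaged over the lifetime of the counter rather than over a recent window.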
>>>>> What did we notice?
>>>>> - Fewer spikes on the disk
>>>>> - Lower commit latencies on the OSDs
>>>>> - Almost no 'slow requests' during backfills
>>>>> - Cache-hit ratio of about 60%
>>>>>
>>>>> Max backfills and recovery active are both set to 1 on all OSDs.
>>>>>
>>>>> For the next generation hardware we are looking into using 3U 
>>>>> chassis with 16 4TB SATA drives and a 1.2TB NVMe SSD for bcache, 
>>>>> but we haven't tested those yet, so nothing to say about it.
>>>>>
>>>>> The current setup is 200GB of cache for 18TB of disks. The new 
>>>>> setup will be 1200GB for 64TB, curious to see what that does.
>>>>>
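To put the two cache sizes side by side, here is a small sketch of the cache-to-disk ratio for both setups. It uses the nominal figures from the post (6x 3TB with 200GB of cache, 16x 4TB with 1200GB) and decimal units (1TB = 1000GB); the helper name is only illustrative.

```python
def cache_ratio(cache_gb, disk_count, disk_tb):
    """Cache size relative to the backing disks, in percent and per disk."""
    backing_tb = disk_count * disk_tb
    return {
        "backing_tb": backing_tb,
        "cache_pct": 100 * cache_gb / (backing_tb * 1000),
        "cache_gb_per_disk": cache_gb / disk_count,
    }

current = cache_ratio(cache_gb=200, disk_count=6, disk_tb=3)
planned = cache_ratio(cache_gb=1200, disk_count=16, disk_tb=4)

for name, setup in (("current", current), ("planned", planned)):
    print(f"{name}: {setup['backing_tb']}TB backing, "
          f"{setup['cache_pct']:.2f}% cache, "
          f"{setup['cache_gb_per_disk']:.1f}GB per disk")
```

On these numbers the planned setup has a noticeably larger cache relative to its disks (about 1.9% versus about 1.1%), which says nothing yet about the hit ratio it would achieve.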
>>>>> Our main conclusion however is that it does smooth out the 
>>>>> I/O pattern towards the disks, and that gives an overall better 
>>>>> response from the disks.
>>>>
>>>> Hi Wido, thanks for the big writeup!  Did you guys happen to do any 
>>>> benchmarking?  I think Xiaoxi looked at flashcache a while back but 
>>>> had mixed results if I remember right.  It would be interesting to 
>>>> know how bcache is affecting performance in different scenarios.
>>>>
>>>
>>> No, we didn't do any benchmarking. Initially this cluster was built 
>>> for just the RADOS Gateway, so we went for 2Gbit (2x 1Gbit) per 
>>> machine. 90% is still Gbit networking and we are in the process of 
>>> upgrading it all to 10Gbit.
>>>
>>> Since the 1Gbit network latency is about 4 times higher than 10Gbit 
>>> we aren't really benchmarking the cluster.
>>>
>>> What counts for us most is that we can do recovery operations without 
>>> any slow requests.
>>>
>>> Before bcache we saw disks spike to 100% busy while a backfill was running.
>>> Now bcache smooths this out and we see peaks of maybe 70%, but that's it.
>>
>> In the testing I was doing to figure out our new lab hardware, I was 
>> seeing SSDs handle recovery dramatically better than spinning disks as 
>> well during cephtestrados runs.  It might be worth digging in to see 
>> what the IO patterns look like.  In the meantime though, it's very 
>> interesting that bcache helps in this case so much.  Good to know!
>>
> 
> To add to this: we still had to enable hashpspool on a few pools, so we did. The degradation went to 39% on the cluster and it has been recovering for over 48 hours now.
> 
> Not a single slow request while we had the OSD complaint time set to 5 seconds. After setting this to 0.5 seconds we saw some slow requests, but nothing dramatic.
> 
> For us bcache works really great with spinning disks.
> 
> Wido
> 
>>>
>>>> Thanks,
>>>> Mark
>>>>
>>>>>
>>>>> Wido
>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>
>>>
>>>
> 

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on