Re: Ceph OSDs with bcache experience

> On 21 Oct 2015, at 09:11, Wido den Hollander <wido@xxxxxxxx> wrote:
> 
> On 10/20/2015 09:45 PM, Martin Millnert wrote:
>> The thing that worries me with your next-gen design (and actually your current design as well) is SSD wear. If you use an Intel SSD at 10 DWPD, that's 12TB/day per 64TB total. I guess it's use-case dependent, and perhaps a 1:4 write:read ratio is quite high in terms of writes as-is.
>> You're also limiting your throughput to the PCIe bandwidth of the NVMe device (regardless of NVRAM/SSD). Compared to a traditional interface that may be OK in relative terms, of course. NVRAM vs SSD here is simply a choice between wear (NVRAM as journal at minimum) and cache-hit probability (size).
>> Interesting thought experiment anyway for me, thanks for sharing Wido.
>> /M
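
(Presumably that 12TB/day figure is just the 1.2TB NVMe drive times its 10 DWPD rating; a quick check of my own, not Martin's actual calculation:)

awk 'BEGIN { printf "%.0f TB/day of rated writes\n", 1.2 * 10 }'
# -> 12 TB/day of allowed writes sitting in front of 64TB of raw disk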
> 
> We are looking at the PC 3600DC 1.2TB, according to the specs from
> Intel: 10.95PBW
> 
> Like I mentioned in my reply to Mark, we are still running on 1Gbit and
> heading towards 10Gbit.
> 
> Bandwidth isn't really an issue in our cluster. During peak moments we
> average about 30k IOps through the cluster, but the TOTAL client I/O is
> just 1Gbit Read and Write. Sometimes a bit higher, but mainly small I/O.
> 
> Bandwidth-wise there is no need for 10Gbit, but we are doing it for the
> lower latency and thus more IOps.
> 
> Currently our S3700 SSDs are peaking at 50% utilization according to iostat.
> 
> After 2 years of operation the lowest Media_Wearout_Indicator we see is
> 33. On Intel SSDs this starts at 100 and counts down to 0, with 0
> indicating that the SSD is worn out.
> 
> So in 24 months we have worn through 67% of the SSD. A quick calculation
> tells me we still have about 12 months left on that SSD before it dies.
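
(A quick sanity check of that arithmetic - my own back-of-the-envelope version, not necessarily how Wido calculated it:)

# 67 wear points consumed in 24 months; at the same rate the remaining 33 points last:
awk 'BEGIN { printf "%.1f months left\n", 24 * 33 / 67 }'
# -> ~11.8 months, i.e. roughly the 12 months quoted above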

Could you maybe run isdct and compare what it says about expected lifetime? I think isdct will report a much longer lifetime than you expect.

For comparison, here's one of my drives (S3610, 1.2TB) - this drive has a 3 DWPD rating (~6.5PB of rated writes):

241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       1487714 <-- units of 32MB, that translates to ~47TB
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0 (maybe my smartdb needs updating, but this is what it says)
9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1008
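
(For reference, these readings come from something along these lines - the exact isdct option syntax may differ between tool versions:)

smartctl -A /dev/sdX | egrep 'Total_LBAs_Written|Media_Wearout_Indicator|Power_On_Hours'
isdct show -a -intelssd 0 | grep EnduranceAnalyzer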

If I extrapolate this blindly I would expect the SSD to reach its TBW of 6.5PB in about 15 years.
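
Roughly the arithmetic behind that (my own back-of-the-envelope, nothing isdct does):

awk 'BEGIN {
  tb  = 1487714 * 32 / 1e6      # Total_LBAs_Written in 32MB units -> ~47.6 TB
  h   = 1008                    # Power_On_Hours
  tbw = 6500                    # rated endurance, ~6.5PB expressed in TB
  printf "%.1f years to rated TBW at this write rate\n", tbw / (tb / h) / 8760
}'
# -> ~15.7 years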

But isdct says:
EnduranceAnalyzer: 46.02 Years

If I reverse it and calculate the endurance from the SMART values, those 46 years would correspond to an expected lifetime of over 18PB written (which is not impossible at all), but isdct is a bit smarter and looks at what the current usage pattern is. It's clearly not just discarding the initial burst from filling the drive during backfilling, because that wasn't all that much data, and all my S3610 drives indicate a similar endurance of 40 years (±10).
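
The reverse direction, same kind of rough arithmetic:

awk 'BEGIN {
  rate = 1487714 * 32 / 1e6 / 1008          # ~0.047 TB written per power-on hour
  printf "%.1f PB written over 46.02 years\n", rate * 46.02 * 8760 / 1000
}'
# -> ~19 PB, which is where the "over 18PB" figure comes from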

I'd trust isdct over extrapolated SMART values - I think the SSD will actually switch to a different calculation scheme when it reaches a certain point in its life (when all reserve blocks are used, or when the first cells start to die...), which is why there's a discrepancy.

Jan


> 
> But this is the lowest; other SSDs which were taken into production at
> the same moment range between 36 and 61.
> 
> Also, when buying the 1.2TB SSDs we'll probably allocate only 1TB of the
> SSD and leave 200GB unprovisioned so the wear-leveling inside the SSD
> has some spare cells to work with.
> 
> Wido
> 
>> 
>> -------- Original message --------
>> From: Wido den Hollander <wido@xxxxxxxx> 
>> Date: 20/10/2015  16:00  (GMT+01:00) 
>> To: ceph-users <ceph-users@xxxxxxxx> 
>> Subject:  Ceph OSDs with bcache experience 
>> 
>> Hi,
>> 
>> In the "newstore direction" thread on ceph-devel I wrote that I'm using
>> bcache in production and Mark Nelson asked me to share some details.
>> 
>> Bcache is running in two clusters now that I manage, but I'll keep this
>> information to one of them (the one at PCextreme behind CloudStack).
>> 
>> This cluster has been running for over 2 years now:
>> 
>> epoch 284353
>> fsid 0d56dd8f-7ae0-4447-b51b-f8b818749307
>> created 2013-09-23 11:06:11.819520
>> modified 2015-10-20 15:27:48.734213
>> 
>> The system consists of 39 hosts:
>> 
>> 2U SuperMicro chassis:
>> * 80GB Intel SSD for OS
>> * 240GB Intel S3700 SSD for Journaling + Bcache
>> * 6x 3TB disk
>> 
>> This isn't the newest hardware. The next batch of hardware will have
>> more disks per chassis, but this is it for now.
>> 
>> All systems were installed with Ubuntu 12.04, but they are all running
>> 14.04 now with bcache.
>> 
>> The Intel S3700 SSD is partitioned with a GPT label:
>> - 5GB Journal for each OSD
>> - 200GB Partition for bcache
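
(Side note for anyone wanting to reproduce a setup like this: a minimal sketch with bcache-tools would look roughly like the below. Device names are placeholders, and this isn't necessarily the exact procedure used on these hosts.)

make-bcache -C /dev/sdg7       # the ~200GB SSD partition becomes the cache set
make-bcache -B /dev/sda        # each 3TB spinner becomes a backing device
# attach the backing device to the cache set (cset UUID from 'bcache-super-show /dev/sdg7'):
echo <cset-uuid> > /sys/block/bcache0/bcache/attach
mkfs.xfs /dev/bcache0          # the OSD filesystem then goes on the bcache device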
>> 
>> root@ceph11:~# df -h|grep osd
>> /dev/bcache0    2.8T  1.1T  1.8T  38% /var/lib/ceph/osd/ceph-60
>> /dev/bcache1    2.8T  1.2T  1.7T  41% /var/lib/ceph/osd/ceph-61
>> /dev/bcache2    2.8T  930G  1.9T  34% /var/lib/ceph/osd/ceph-62
>> /dev/bcache3    2.8T  970G  1.8T  35% /var/lib/ceph/osd/ceph-63
>> /dev/bcache4    2.8T  814G  2.0T  30% /var/lib/ceph/osd/ceph-64
>> /dev/bcache5    2.8T  915G  1.9T  33% /var/lib/ceph/osd/ceph-65
>> root@ceph11:~#
>> 
>> root@ceph11:~# lsb_release -a
>> No LSB modules are available.
>> Distributor ID:	Ubuntu
>> Description:	Ubuntu 14.04.3 LTS
>> Release:	14.04
>> Codename:	trusty
>> root@ceph11:~# uname -r
>> 3.19.0-30-generic
>> root@ceph11:~#
>> 
>> "apply_latency": {
>>    "avgcount": 2985023,
>>    "sum": 226219.891559000
>> }
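
(That counter presumably comes from 'ceph daemon osd.<id> perf dump'; assuming the usual seconds-based units for this counter, sum/avgcount gives the running average:)

awk 'BEGIN { printf "%.1f ms average apply latency\n", 226219.891559 / 2985023 * 1000 }'
# -> ~75.8 ms averaged over the OSD's whole uptime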
>> 
>> What did we notice?
>> - Less spikes on the disk
>> - Lower commit latencies on the OSDs
>> - Almost no 'slow requests' during backfills
>> - Cache-hit ratio of about 60%
>> 
>> Max backfills and recovery active are both set to 1 on all OSDs.
>> 
>> For the next generation of hardware we are looking into using 3U chassis
>> with 16x 4TB SATA drives and a 1.2TB NVMe SSD for bcache, but we haven't
>> tested those yet, so there's nothing to say about them.
>> 
>> The current setup is 200GB of cache for 18TB of disks. The new setup
>> will be 1200GB for 64TB; curious to see what that does.
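
(Rough cache-to-data ratios for the two designs, just my own arithmetic:)

awk 'BEGIN { printf "old: %.2f%%  new: %.2f%%\n", 200/18000*100, 1200/64000*100 }'
# -> old: ~1.1%, new: ~1.9%, so roughly 70% more cache per TB of disk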
>> 
>> Our main conclusion, however, is that it does smooth out the I/O pattern
>> towards the disks, and that gives an overall better response from the disks.
>> 
>> Wido
>> 
> 
> 
> -- 
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
> 
> Phone: +31 (0)20 700 9902
> Skype: contact42on

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


