SSD journal deployment experiences

Hi Martin,

September 4 2014 10:07 PM, "Martin B Nielsen" <martin at unity3d.com> wrote: 
> Hi Dan,
> 
> We took a different approach (and our cluster is tiny compared to many others) - we have two pools;
> normal and ssd.
> 
> We use 14 disks in each osd-server; 8 platter and 4 ssd for ceph, and 2 ssd for OS/journals. We
> partitioned the two OS ssd as raid1 using about half the space for the OS and leaving the rest on
> each for 2x journals and unprovisioned. We've partitioned the OS disks to each hold 2x platter
> journals. On top of that our ssd pooled disks also hold 2x journals; their own + an additional from
> a platter disk. We have 8 osd-nodes.
> 
> So whenever an ssd fails we lose 2 osds (but never more).

Interesting ... you have quite a few SSDs there per box. I suppose my closest config would be 5 platter journals per SSD, plus a FileStore OSD (with its own journal) on the remaining SSD space to squeeze every last IOP out of our SSDs. A single SSD failure would then take out 6 OSDs.
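For the record, the layout I have in mind looks roughly like this. Device names and sizes are only examples and I haven't tested this exact sequence; the idea is sgdisk to carve the journal partitions, then ceph-disk to pair each platter OSD with one of them:

  # /dev/sdb = 200GB DC S3700, /dev/sdc../dev/sdg = five platter disks
  for i in 1 2 3 4 5; do
      sgdisk --new=${i}:0:+10G /dev/sdb       # one ~10G journal partition per platter OSD
  done
  sgdisk --largest-new=6 /dev/sdb             # leftover space for the SSD-based OSD

  ceph-disk prepare /dev/sdc /dev/sdb1        # platter data disk + its journal partition
  ceph-disk prepare /dev/sdd /dev/sdb2
  # ...and so on for the remaining platters; /dev/sdb6 would then become the
  # FileStore+journal OSD living on the SSD itself.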

> We've had this system in production for ~1½ years now and so far we've had 1 ssd and 2 platter disks
> fail. We run a couple of hundred vm-guests on it and use ~60TB.

Which SSD was that? 

> On a daily basis we avg. 30MB/sec r/w and ~600 iops so not very high usage. The times we lost disks
> we hardly noticed. All SSD (OS included) have a general utilization of <5%, platter disks near 10%.

We have peaks of up to 7000 iops, but mostly sit between 4000 and 5000. When we hit 7000 iops the small-write latency inches up to around 70ms :(
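In case anyone wants to compare, something like the following should show the same numbers on any cluster (osd.0 is just an example daemon, and the journal counter assumes FileStore):

  ceph -s | grep client                                     # current client io and op/s
  ceph osd perf                                             # per-OSD fs commit/apply latency (ms)
  ceph daemon osd.0 perf dump | grep -A 3 journal_latency   # journal latency counters on one OSD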

> We did a lot of initial testing about putting journals on the OS-ssd as well as extra ones on the ssd-osd,
> but we didn't find much difference, nor the high latencies others have experienced. When/if we notice
> otherwise we'll prob. switch to pure ssd as journal holders.
> 
> We originally deployed using saltstack and even though we have automated replacing disks we still
> do it manually 'just to be sure'. It takes 5-10min to replace an old disk and get it backfilling,
> so I don't expect us to spend any time automating this.
> 
> Recovering 2 disks at once for us takes a long time but we've intentionally set backfilling low and
> it is not noticeable on the cluster when it happens.

Yeah, 2 wouldn't be noticeable in our cluster even now. 24 _was_ noticeable, so maybe 5 is doable.
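For completeness, these are the knobs I have in mind for keeping backfill gentle, plus the down/out grace period I asked about in my original mail below (the values are only illustrations, not recommendations):

  # throttle backfill/recovery at runtime (persist the same values in ceph.conf):
  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

  # give ourselves e.g. an hour to swap a dead journal SSD before the OSDs
  # are marked out and backfilling starts:
  ceph tell mon.* injectargs '--mon-osd-down-out-interval 3600'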

Thanks for the input,

Dan
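P.S. To make the "failover to a local journal" idea from my original mail (quoted below) concrete, the manual steps I have in mind look roughly like this. Completely untested; osd.12 and the paths are just examples, and it assumes FileStore with the usual journal symlink:

  # osd.12's journal SSD has died; the daemon is down but the FileStore is intact.
  rm /var/lib/ceph/osd/ceph-12/journal    # drop the dangling symlink to the dead partition
  ceph-osd -i 12 --mkjournal              # re-create the journal as a file at the default
                                          # path ($osd_data/journal), sized by osd_journal_size
  service ceph start osd.12               # restart before the down/out timer expires
  # Caveat: anything that was only in the dead journal is gone, which is
  # what makes this a "crazy" idea in the first place.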


> Anyways, we have pretty low cluster usage but in our experience the ssds seem to handle the constant
> load very well.
> 
> Cheers,
> Martin
> 
> On Thu, Sep 4, 2014 at 6:21 PM, Dan Van Der Ster <daniel.vanderster at cern.ch> wrote:
> 
>> Dear Cephalopods,
>> 
>> In a few weeks we will receive a batch of 200GB Intel DC S3700's to augment our cluster, and I'd
>> like to hear your practical experience and discuss options for how best to deploy them.
>> 
>> We'll be able to equip each of our 24-disk OSD servers with 4 SSDs, so they will become 20 OSDs + 4
>> SSDs per server. Until recently I've been planning to use the traditional deployment: 5 journal
>> partitions per SSD. But as SSD-day approaches, I'm growing less comfortable with the idea of 5 OSDs
>> going down every time an SSD fails, so perhaps there are better options out there.
>> 
>> Before getting into options, I'm curious about the real-world reliability of these drives:
>> 
>> 1) How often are DC S3700's failing in your deployments?
>> 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful is the backfilling which results
>> from an SSD failure? Have you considered tricks like increasing the down out interval so
>> backfilling doesn't happen in this case (leaving time for the SSD to be replaced)?
>> 
>> Beyond the usual 5-partition deployment, is anyone running a RAID1 or RAID10 for the journals?
>> If so, are you using the raw block devices, or formatting them and storing the journals as files on
>> the SSD array(s)? Recent discussions seem to indicate that XFS is just as fast as the block dev,
>> since these drives are so fast.
>> 
>> Next, I wonder how people with puppet/chef/... are handling the creation/re-creation of the SSD
>> devices. Are you just wiping and rebuilding all the dependent OSDs completely when the journal dev
>> fails? I'm not keen on puppetizing the re-creation of journals for OSDs...
>> 
>> We also have this crazy idea of failing over to a local journal file in case an SSD fails. In this
>> model, when an SSD fails we'd quickly create a new journal either on another SSD or on the local
>> OSD filesystem, then restart the OSDs before backfilling started. Thoughts?
>> 
>> Lastly, I would also consider using 2 of the SSDs in a data pool (with the other 2 SSDs to hold 20
>> journals, probably in a RAID1 to avoid backfilling 10 OSDs when an SSD fails). If the 10-to-1 ratio
>> of SSDs would perform adequately, that'd give us quite a few SSDs to build a dedicated high-IOPS
>> pool.
>> 
>> I'd also appreciate any other suggestions/experiences which might be relevant.
>> 
>> Thanks!
>> Dan
>> 
>> -- Dan van der Ster || Data & Storage Services || CERN IT Department --
>> 
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users at lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

