Hi Martin,

September 4 2014 10:07 PM, "Martin B Nielsen" <martin at unity3d.com> wrote:
> Hi Dan,
>
> We took a different approach (and our cluster is tiny compared to many others) -
> we have two pools; normal and ssd.
>
> We use 14 disks in each osd-server; 8 platter and 4 ssd for ceph, and 2 ssd for
> OS/journals. We partitioned the two OS ssd as raid1 using about half the space
> for the OS and leaving the rest on each for 2x journals and unprovisioned. We've
> partitioned the OS disks to each hold 2x platter journals. On top of that our
> ssd pooled disks also hold 2x journals; their own + an additional from a platter
> disk. We have 8 osd-nodes.
>
> So whenever an ssd fails we lose 2 osd (but never more).

Interesting ... you have quite a few SSDs there per box. I suppose my closest
config would be 5 platter journals per SSD, plus a FileStore+journal to squeeze
out every last IOP from our SSDs. That would take out 6 OSDs with a failure.

> We've had this system in production for ~1½ years now and so far we've had 1 ssd
> and 2 platter disks fail. We run a couple of hundred vm-guests on it and use
> ~60TB.

Which SSD was that?

> On a daily basis we avg. 30MB/sec r/w and ~600 iops so not very high usage. The
> times we lost disks we hardly noticed. All SSD (OS included) have a general
> utilization of <5%, platter disks near 10%.

We have peaks up to 7000 iops, but mostly between 4-5000. When we have 7000 iops
the small write latency inches up to around 70ms :(

> We did a lot of initial testing about putting journals on the OS-ssd as well as
> extra on the ssd-osd, but we didn't find much difference or high latencies as
> others have experienced. When/if we notice otherwise we'll prob. switch to pure
> ssd as journal holders.
>
> We originally deployed using saltstack and even though we have automated
> replacing disks we still do it manually 'just to be sure'. It takes 5-10min to
> replace an old disk and get it backfilling, so I don't expect us to spend any
> time automating this.
>
> Recovering 2 disks at once for us takes a long time but we've intentionally set
> backfilling low and it is not noticeable on the cluster when it happens.

Yeah, 2 wouldn't be noticeable in our cluster even now. 24 _was_ noticeable, so
maybe 5 is doable.

Thanks for the input,
Dan

> Anyways, we have pretty low cluster usage but in our experience ssd seem to
> handle the constant load very well.
>
> Cheers,
> Martin
>
> On Thu, Sep 4, 2014 at 6:21 PM, Dan Van Der Ster <daniel.vanderster at cern.ch> wrote:
>
>> Dear Cephalopods,
>>
>> In a few weeks we will receive a batch of 200GB Intel DC S3700's to augment
>> our cluster, and I'd like to hear your practical experience and discuss
>> options for how best to deploy these.
>>
>> We'll be able to equip each of our 24-disk OSD servers with 4 SSDs, so they
>> will become 20 OSDs + 4 SSDs per server. Until recently I've been planning to
>> use the traditional deployment: 5 journal partitions per SSD. But as SSD-day
>> approaches, I'm growing less comfortable with the idea of 5 OSDs going down
>> every time an SSD fails, so perhaps there are better options out there.
>>
>> Before getting into options, I'm curious about the real reliability of these
>> drives:
>>
>> 1) How often are DC S3700's failing in your deployments?
>> 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful is the
>> backfilling which results from an SSD failure? Have you considered tricks
>> like increasing the down out interval so backfilling doesn't happen in this
>> case (leaving time for the SSD to be replaced)?
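
[To make question 2 concrete: by "increasing the down out interval" I mean
something along the lines of the untested Python sketch below. The "noout" flag
and the "mon osd down out interval" option are the standard Ceph knobs; the
wrapper itself is just illustration, and it assumes the ceph CLI plus an admin
keyring on the box doing the swap.]

    import subprocess

    def ceph(*args):
        # Assumes the 'ceph' CLI and a usable admin keyring on this host.
        cmd = ('ceph',) + args
        print('+ ' + ' '.join(cmd))
        subprocess.check_call(cmd)

    # Stop the cluster from marking the dead OSDs "out" (and hence backfilling)
    # while the journal SSD is being swapped. The gentler variant is raising
    # "mon osd down out interval" in ceph.conf ahead of time.
    ceph('osd', 'set', 'noout')

    # ... replace the SSD, recreate the journals, restart the OSDs ...

    # noout is cluster-wide, so clear it as soon as the OSDs are back up.
    ceph('osd', 'unset', 'noout')
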
>> Beyond the usual 5-partition deployment, is anyone running a RAID1 or RAID10
>> for the journals? If so, are you using the raw block devices or formatting
>> them and storing the journals as files on the SSD array(s)? Recent discussions
>> seem to indicate that XFS is just as fast as the block dev, since these drives
>> are so fast.
>>
>> Next, I wonder how people with puppet/chef/... are handling the
>> creation/re-creation of the SSD devices. Are you just wiping and rebuilding
>> all the dependent OSDs completely when the journal dev fails? I'm not keen on
>> puppetizing the re-creation of journals for OSDs...
>>
>> We also have this crazy idea of failing over to a local journal file in case
>> an SSD fails. In this model, when an SSD fails we'd quickly create a new
>> journal either on another SSD or on the local OSD filesystem, then restart the
>> OSDs before backfilling starts. Thoughts? (A rough sketch of what I mean is at
>> the bottom of this mail.)
>>
>> Lastly, I would also consider using 2 of the SSDs in a data pool (with the
>> other 2 SSDs to hold 20 journals - probably in a RAID1 to avoid backfilling 10
>> OSDs when an SSD fails). If the 10-1 ratio of SSDs would perform adequately,
>> that'd give us quite a few SSDs to build a dedicated high-IOPS pool.
>>
>> I'd also appreciate any other suggestions/experiences which might be relevant.
>>
>> Thanks!
>> Dan
>>
>> -- Dan van der Ster || Data & Storage Services || CERN IT Department --
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users at lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
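
P.S. To make the "local journal file" failover idea above a bit more concrete,
here is roughly what I have in mind per OSD - an untested Python sketch that
assumes the standard /var/lib/ceph/osd/ceph-<id> layout with a "journal"
symlink, and that glosses over the big caveat that a journal lost before it was
flushed may leave the OSD's filestore inconsistent:

    import os
    import subprocess

    def fail_over_to_local_journal(osd_id):
        # Untested sketch: re-point a stopped FileStore OSD at a journal file on
        # its own data filesystem, then initialize the journal and restart the OSD.
        osd_dir = '/var/lib/ceph/osd/ceph-%d' % osd_id
        link = os.path.join(osd_dir, 'journal')
        local_journal = os.path.join(osd_dir, 'journal.local')

        # The old symlink points at the dead SSD partition; replace it.
        if os.path.islink(link) or os.path.exists(link):
            os.remove(link)
        os.symlink(local_journal, link)

        # ceph-osd --mkjournal should create and size the file journal according
        # to "osd journal size" in ceph.conf. (We cannot --flush-journal first,
        # since the old journal is gone.)
        subprocess.check_call(['ceph-osd', '-i', str(osd_id), '--mkjournal'])

        # sysvinit-style start; the upstart/systemd equivalent differs.
        subprocess.check_call(['service', 'ceph', 'start', 'osd.%d' % osd_id])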