Re: Ceph OSD on Hardware RAID

In addition to the points that others made so well:

- When using parity RAID, e.g. RAID5, to create OSD devices, one reduces aggregate write speed, especially with HDDs, due to write amplification.
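To put a number on that amplification: each small (sub-stripe) RAID5 write becomes four back-end I/Os — read old data, read old parity, write new data, write new parity. A rough back-of-envelope sketch (the disk counts and per-disk IOPS figures below are illustrative, not measurements):

```python
# Rough model of RAID5 small-write amplification; numbers are illustrative.
def raid5_small_write_iops(disks: int, iops_per_disk: float) -> float:
    """Effective small-write IOPS of a RAID5 set.

    Each sub-stripe write costs 4 back-end I/Os (read data, read parity,
    write data, write parity), spread across all member disks.
    """
    return disks * iops_per_disk / 4

# Five 7.2k-RPM HDDs at roughly 150 IOPS each:
print(raid5_small_write_iops(5, 150))  # -> 187.5 effective write IOPS
```

So a 5-disk RAID5 set of HDDs delivers only about a quarter of its raw aggregate IOPS for small writes, which is exactly the pattern an OSD's journaling and metadata traffic generates.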

- If using parity or replicated RAID, one might semi-reasonably get away with reducing Ceph’s replication count from the default of 3 to 2, as losing an OSD will be fairly uncommon.  Your raw:usable ratio is generally still going to be bad, though.
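To see why the ratio stays bad, multiply the RAID layer's own raw:usable overhead by Ceph's replica count. A quick sketch with illustrative geometries:

```python
def raw_to_usable(raid_overhead: float, ceph_replicas: int) -> float:
    """Overall raw:usable ratio when Ceph replication sits on top of RAID.

    raid_overhead is the RAID layer's own raw:usable ratio,
    e.g. 5/4 for a 4+1 RAID5 set, 2.0 for RAID1.
    """
    return raid_overhead * ceph_replicas

print(raw_to_usable(1.0, 3))   # plain JBOD, replica 3    -> 3.0
print(raw_to_usable(5 / 4, 2)) # 4+1 RAID5 under replica 2 -> 2.5
print(raw_to_usable(2.0, 2))   # RAID1 under replica 2     -> 4.0
```

RAID1 under replica 2 is strictly worse than plain replica 3, and RAID5 under replica 2 is only marginally better while paying RAID5's write-amplification penalty.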

- If you do Ceph EC on top of RAID5/6, you’re computing parity *twice*.  (ugh)

- I’ve had to run clusters built on HBA RAID0 volumes for several reasons, including that those who came before me didn’t understand that LSI added JBOD passthrough to the 9266/9271 sometime around 2013 or early 2014, and/or they thought the FBWC would improve performance.  Each drive was wrapped in a one-disk RAID0 volume.  Maybe the HBA’s onboard cache helped with reads; unlike HP’s controllers, I couldn’t tell whether fractions of the cache were allocated to reads vs. writes.  There were multiple hassles with this approach:

— Hardware issues that broke write cache flushing across reset / power events.  I had to swap out hundreds of affected cards via a field notice.

— LSI’s firmware and storcli utility had a bug that caused the HDDs’ volatile on-disk cache to be used for writes, despite what the VD’s “Disk Default” cache policy was supposed to do.  This caused corruption across power events.
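If you’re stuck on firmware like this, one mitigation is to force the drives’ volatile cache off explicitly rather than trusting “Disk Default”.  A sketch with storcli (the controller and VD numbers are examples, and whether the setting actually sticks depends on the firmware):

```shell
# Show the current cache policy of virtual drive 0 on controller 0
storcli /c0/v0 show all

# Force the physical drives' on-disk write cache off for that VD,
# instead of leaving it at the ambiguous "Disk Default"
storcli /c0/v0 set pdcache=off
```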

— The interposing RAID layer confounded the data reported by iostat, and I’m told that it introduces latency.

— The interposing RAID layer confounded direct inspection of drives by smartctl, the ‘-d megaraid’ approach notwithstanding; the latter worked on some models but not others.
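For reference, smartctl can sometimes reach drives behind a MegaRAID controller by passing the device ID the HBA assigns; the device names and IDs below are examples, and as noted it only works on some models:

```shell
# Query the drive with MegaRAID device ID 4 behind /dev/sda
smartctl -a -d megaraid,4 /dev/sda

# The IDs don't map neatly to slots, so you may have to probe for each disk
for i in $(seq 0 23); do smartctl -i -d megaraid,$i /dev/sda; done
```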

— The RAID-capable HBAs were almost as unreliable as the HDDs themselves.  Heck, maybe even worse given the relative population sizes.  The supercap connectors were prone to seating issues, and there was the quirky gas gauge firmware issue.

— An LSI firmware bug that would not allow discarding a failed drive’s preserved cache at reboot.

— The entire HBA locking up for multiple seconds if one ran “storcli /c0 show all”.  Oh, and PCIe Gen 2 vs. Gen 3 negotiation failures.

— The VD layer complicated all aspects of management: finding the proper slot when replacing a failed drive, having to create a new VD when redeploying, etc.
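For anyone stuck with this setup, the drive-replacement dance looks roughly like the following storcli sketch (the controller, enclosure, and slot numbers are examples):

```shell
# Locate the replaced drive's enclosure:slot
storcli /c0 show

# Wrap the new drive in a one-disk RAID0 VD, as each OSD device requires here
storcli /c0 add vd type=raid0 drives=252:3

# On firmware recent enough to support passthrough, skip the VD layer entirely
storcli /c0 set jbod=on
```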

In short, for most people the complexity just isn’t worth it.  All the extra $ the RAID-capable HBAs cost in CapEx, the service impact when they degraded or failed, and the engineer time spent futzing with them would have been more effectively spent deploying SFF chassis instead of LFF, especially with SSDs instead of HDDs.

Maybe Areca units would have resulted in a better experience, but I didn’t have a choice.  Don’t let dogfood undermine your whole operation :-x

— aad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



