Re: Bluestore with SSD-backed DBs; what if the SSD fails?

Wido den Hollander <wido@xxxxxxxx> · Wed, 25 Oct 2017 10:21:08 +0200 (CEST)

> On 25 October 2017 at 5:58, Christian Sarrasin <c.nntp@xxxxxxxxxxxxxxxxxx> wrote:
> 
> 
> I'm planning to migrate an existing Filestore cluster with (SATA)
> SSD-based journals fronting multiple HDD-hosted OSDs - should be a
> common enough setup.  So I've been trying to parse various contributions
> here and Ceph devs' blog posts (for which, thanks!)
> 
> Seems the best way to repurpose that hardware would basically be to use
> those SSDs as DB partitions for Bluestore.
> 
> The one thing I'm still wondering about is failure domains.  With
> Filestore and SSD-backed journals, an SSD failure would kill writes but
> OSDs were otherwise still whole.  Replacing the failed SSD quickly would
> get you back on your feet with relatively little data movement.
> 

Not true. If you lose an OSD's journal with FileStore without a clean shutdown of the OSD, you lose the OSD. You'd have to rebalance the complete OSD.

> Hence the question: what happens if a SSD that contains several
> partitions hosting DBs for multiple OSDs fails?  Is the OSDs' data still
> recoverable upon replacing the SSD or is the entire lot basically toast?
> 

It's lost. A BlueStore OSD needs both its WAL and its DB. So if the SSD they reside on dies, you have to wipe the OSDs and rebuild them.

> If so, might this warrant revisiting the old debate about RAID-1'ing
> SSDs in such a setup?  Or I suppose at least not being too ambitious
> with the number of DBs hosted on a single SSD?
> 

I would not use RAID-1. Let's say you have 8 OSDs in a machine. Put the DBs of 4 OSDs on each of two SSDs. If you lose an SSD, you lose 4 OSDs.
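
To put numbers on that, here is a rough sketch in Python. The function names and the host count of 10 are made up for illustration; the only figures taken from this thread are 8 OSDs and 2 SSDs per machine. It assumes the OSDs are spread evenly over the SSDs.

```python
def osds_lost_per_ssd_failure(osds_per_host: int, ssds_per_host: int) -> int:
    """OSDs that must be wiped and rebuilt when one DB/WAL SSD dies.

    Assumes an even split of OSDs over SSDs; rounds up for the worst case.
    """
    return -(-osds_per_host // ssds_per_host)  # ceiling division


def cluster_fraction_lost(osds_per_host: int, ssds_per_host: int, hosts: int) -> float:
    """Share of all OSDs in the cluster taken out by a single SSD failure."""
    lost = osds_lost_per_ssd_failure(osds_per_host, ssds_per_host)
    return lost / (osds_per_host * hosts)


# 8 OSDs per machine, 4 per SSD, as in the example above.
print(osds_lost_per_ssd_failure(8, 2))
# Hypothetical cluster of 10 such machines.
print(cluster_fraction_lost(8, 2, hosts=10))
```

The smaller that fraction, the less a single SSD failure matters, which is the argument against bothering with RAID-1.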

Don't make too big a deal of this. Make sure your failure domains are small enough that your system can handle losing an OSD.

If your system can't handle an OSD rebuild, you already have a problem in your Ceph cluster.

Instead of using 8TB drives, consider using 4TB or even 2TB drives and having more spindles. That way the impact of a single disk rebuild is smaller.
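
A back-of-the-envelope sketch of the drive-size trade-off, again in Python. The 640 TB raw capacity and the 70% fill ratio are assumptions picked purely for illustration, and the helper names are hypothetical; only the 8TB/4TB/2TB drive sizes come from this mail.

```python
import math


def data_to_recover_tb(drive_size_tb: float, fill_ratio: float) -> float:
    """Approximate data the cluster has to re-replicate when one drive dies."""
    return drive_size_tb * fill_ratio


def spindles_needed(raw_capacity_tb: float, drive_size_tb: float) -> int:
    """Number of drives needed to reach a given raw capacity."""
    return math.ceil(raw_capacity_tb / drive_size_tb)


RAW_CAPACITY_TB = 640  # assumed cluster size, not from the thread
FILL_RATIO = 0.7       # assumed average utilisation, not from the thread

for size_tb in (8, 4, 2):
    spindles = spindles_needed(RAW_CAPACITY_TB, size_tb)
    recover = data_to_recover_tb(size_tb, FILL_RATIO)
    print(f"{size_tb} TB drives: {spindles} spindles, "
          f"~{recover:.1f} TB to recover per failed drive")
```

For the same raw capacity, halving the drive size doubles the number of spindles and halves the data moved per failed disk, and the recovery load is shared by more surviving OSDs.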

Wido

> Thoughts much appreciated!
> 
> PS: It's not fully clear whether a separate WAL partition is useful in
> that setup?  Sage posted about a month back: "[WAL] will always just
> spill over onto the next fastest device (wal -> db -> main)".  I'll take
> that as meaning that a separate WAL partition would be
> counter-productive if hosted on the same SSD.  Please correct me if I'm
> wrong?
> 
> Cheers
> Christian
> 
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com