Re: SSD recommendations for OSD journals

On Mon, Jul 22, 2013 at 08:45:07AM +1100, Mikaël Cluseau wrote:
> On 22/07/2013 08:03, Charles 'Boyo wrote:
> >Counting on the kernel's cache, it appears I will be best served
> >purchasing write-optimized SSDs?
> >Can you share any information on the SSD you are using, is it PCIe
> >connected?
> 
> We are on a standard SAS bus, so any SSD that reaches 500MB/s and is
> stable over the long run will do (we use 60G Intel 520s). You do not
> need a lot of space for the journal (5G per drive is more than enough
> on commodity hardware).
> 
> >Another question: since the intention of this storage cluster is
> >relatively cheap storage on commodity hardware, what's the balance
> >between cheap SSDs and reliability? Might a journal failure result
> >in data loss, or will such an event just 'down' the affected OSDs?
> 

When you do a write to Ceph, one OSD (I believe this is the primary for
a certain part of the data, an object) receives the write and distributes
the copies to the other OSDs (as many as configured, e.g. min_size=2,
size=3). Only when the writes are done on all those OSDs does it confirm
the write to the client. So if one OSD fails, other OSDs will still have
that data, and the primary will make sure another copy is created
somewhere else.

So I don't see a reason for data loss if you lose one journal. There
will be a lot of copying of data though, which will slow things down.
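
For reference, those replication factors are per-pool settings. A quick
sketch of how they are set (the pool name "rbd" is just an example
here):

# ceph osd pool set rbd size 3
# ceph osd pool set rbd min_size 2

With size=3 a write is replicated to three OSDs before it is
acknowledged; min_size=2 means the pool keeps serving I/O as long as at
least two copies are available.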

> A journal failure will fail your OSDs (from what I've understood,
> you'll have to rebuild them). But SSDs are very deterministic, so
> monitor them:
> 
> # smartctl -A /dev/sdd
> [..]
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE     UPDATED  WHEN_FAILED RAW_VALUE
> [..]
> 232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail Always       -       0
> 233 Media_Wearout_Indicator 0x0032   093   093   000    Old_age  Always       -       0
> 
> And don't put too many OSDs on one SSD (I set a rule not to go over
> 4 journals per SSD).
> 
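
In case you want to automate that check, something along these lines
could go into cron (a rough sketch: attribute 233 is Intel-specific,
other vendors use different IDs, and /dev/sdd is just the device from
the example above):

# smartctl -A /dev/sdd \
    | awk '$1 == 233 && $4+0 < 20 { print "journal SSD wear margin below 20%" }'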

When the SSD is large enough and the journals don't take up all the
space, you can also leave part of the SSD unpartitioned. This
over-provisioning will allow the SSD to fail much later.
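
To combine that with the "4 journals per SSD" rule above, you could for
example cut four 5G journal partitions on the 60G SSD and leave the rest
untouched (a sketch; the device name and partition labels are made up):

# sgdisk -n 1:0:+5G -c 1:journal-osd0 /dev/sdd
# sgdisk -n 2:0:+5G -c 2:journal-osd1 /dev/sdd
# sgdisk -n 3:0:+5G -c 3:journal-osd2 /dev/sdd
# sgdisk -n 4:0:+5G -c 4:journal-osd3 /dev/sdd

That leaves roughly 40G unpartitioned as extra spare area, and each OSD
gets its "osd journal" in ceph.conf pointed at one of those partitions.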

> >On a similar note, I am using XFS on the OSDs which also journals,
> >does this affect performance in any way?
> 
> You want this journal for consistency ;) I don't know the exact
> impact, but since we use spinning drives, the most important factor
> is that Ceph, with its journal on an SSD, does a lot of sequential
> writes, avoiding most seeks.
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




