Re: What happens if all replica OSDs' journals are broken?

2016-12-14 2:37 GMT+01:00 Christian Balzer <chibi@xxxxxxx>:

Hello,

Hi!
 

On Wed, 14 Dec 2016 00:06:14 +0100 Kevin Olbrich wrote:

> Ok, thanks for your explanation!
> I read those warnings about size 2 + min_size 1 (we are using ZFS RAID6,
> i.e. raidz2, as OSDs).
>
This is similar to my RAID6- or RAID10-backed OSDs with regard to having
very resilient, extremely unlikely-to-fail OSDs.

This was our intention (unlikely to fail, data security > performance).
We use Ceph for OpenStack (Cinder RBD).
 
As such, a Ceph replication of 2 with min_size 1 is a calculated risk,
acceptable for me and others in certain use cases.
This is also with very few (2-3) journals per SSD.

We are running a 14x 500 GB ZFS RAID6 (raidz2) pool per host (1x journal, 1x OSD, 32 GB RAM).
The ZFS pools use an L2ARC cache on Samsung 850 PRO 128 GB SSDs.
Hint: that was a bad idea; it would have been better to split the ZFS pools. (ZFS performance on its own was very good, but double parity combined with Ceph's 4k random sync writes takes very long, resulting in "XXX requests are blocked > 32 seconds" health warnings.)
Currently I am waiting for a lab cluster to test "osd op threads" for these single-OSD hosts.
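
For reference, this is the kind of ceph.conf tweak I mean - just a sketch with example values to be verified in the lab first, not a recommendation:

    [osd]
    # threads servicing client ops per OSD daemon (default 2);
    # a single large OSD per host may benefit from more
    osd op threads = 4
    # threads for background disk operations such as scrubbing (default 1)
    osd disk threads = 1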
 
If:

1. Your journal SSDs are well trusted and monitored (Intel DC S36xx, 37xx)
 
Indeed, Intel DC P3700 400 GB for Ceph. We had Samsung 850 PROs before I learned that 4k random writes with D_SYNC are a very bad idea... ;-)
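
(For anyone who wants to check a journal SSD before trusting it: the usual fio sync-write test tells you quickly. The device path below is only a placeholder - the test overwrites data, so point it at a spare device or a test file:)

    # synchronous 4k writes at queue depth 1, roughly what a filestore
    # journal does; a good journal SSD sustains tens of thousands of IOPS
    # here, consumer SSDs often only a few hundred
    fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
        --rw=write --bs=4k --numjobs=1 --iodepth=1 \
        --runtime=60 --time_based --group_reporting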

2. Your failure domain represented by a journal SSD is small enough
(meaning that replicating the lost OSDs can be done quickly) 

OSDs are rather large, but we are "just" using 8 TB (size 2) in the whole cluster (each OSD is 24% full).
Before we moved from Infernalis to Jewel, recovering an OSD that had been offline for 8 hours took approx. one hour until it was back in sync.

it may be an acceptable risk for you as well.

We have had reliable backups in the past, but downtime is the greater problem.
 
 
> Time to raise replication!
>
If you can afford that (money, space, latency), definitely go for it.
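
(For completeness, raising replication on an existing pool is just the following - "volumes" is only an example name for the Cinder pool, and expect heavy backfill traffic while the extra copies are created:)

    ceph osd pool set volumes size 3
    ceph osd pool set volumes min_size 2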

It's more the double journal failure that scares me than the OSD itself (ZFS has been very reliable for us in the past).
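
The ceph.com post I linked in my first mail (the single-journal-failure case, i.e. at least one replica still intact) boils down to something like this sketch - OSD id 12 and the paths are only placeholders:

    ceph osd set noout                 # keep CRUSH from rebalancing
    systemctl stop ceph-osd@12         # stop the OSD using the dead journal
    # replace the SSD, recreate the journal partition and point the
    # /var/lib/ceph/osd/ceph-12/journal symlink at it, then:
    ceph-osd -i 12 --mkjournal         # create a fresh, empty journal
    systemctl start ceph-osd@12
    ceph osd unset noout

Flushing the old journal is obviously not possible when the SSD is dead, so whatever in-flight writes it held are gone - which is exactly why losing the journals of all replicas at once is so much worse.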


Kevin
 
Christian
> Kevin
>
> 2016-12-13 0:00 GMT+01:00 Christian Balzer <chibi@xxxxxxx>:
>
> > On Mon, 12 Dec 2016 22:41:41 +0100 Kevin Olbrich wrote:
> >
> > > Hi,
> > >
> > > just in case: What happens when all replica journal SSDs are broken at
> > > once?
> > >
> > That would be bad, as in BAD.
> >
> > In theory you just "lost" all the associated OSDs and their data.
> >
> > In practice everything but the in-flight data at the time is still on
> > the actual OSDs (HDDs), but it's inconsistent and inaccessible as far as
> > Ceph is concerned.
> >
> > So with some trickery and an experienced data-recovery Ceph consultant you
> > _may_ get things running with limited data loss/corruption, but that's
> > speculation and may be wishful thinking on my part.
> >
> > Another data point in favor of deploying only well known/monitored/trusted
> > SSDs and having 3x replication.
> >
> > > The PGs most likely will be stuck inactive but as I read, the journals
> > > just need to be replaced
> > > (http://ceph.com/planet/ceph-recover-osds-after-ssd-journal-failure/).
> > >
> > > Does this also work in this case?
> > >
> > Not really, no.
> >
> > The above works by still having a valid state and operational OSDs from
> > which the "broken" one can recover.
> >
> > Christian
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> >


--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
