Re: What happens if all replica OSDs journals are broken?

Hi,

Recently I lost the journals of 5 of my 12 OSDs (2x SSD failure at the same time). The pools were size=2, min_size=1. I know it should rather be 3/2; I plan to switch to that as soon as possible.

Ceph started to throw many failures, so I removed these two SSDs and recreated the journals from scratch. In my case, all the data on the main OSDs was still there, and Ceph did the best it could: it blocked writes to the affected OSDs to keep the data consistent.
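
For reference, re-creating a lost FileStore journal boils down to something like this (just a sketch; the OSD id 3 and /dev/sdX1 are examples, not my exact setup):

    # keep the cluster from rebalancing while the OSD is down
    ceph osd set noout
    systemctl stop ceph-osd@3

    # point the OSD at a partition on the replacement device
    ln -sf /dev/sdX1 /var/lib/ceph/osd/ceph-3/journal

    # write a fresh, empty journal and bring the OSD back
    ceph-osd -i 3 --mkjournal
    systemctl start ceph-osd@3
    ceph osd unset noout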
After re-creating all 5 journals on another HDD, recovery+backfill started to work. After a couple of hours it reported 7 "unfound" objects (6 in the data OSDs and 1 hitset in the cache tier). I found out which files were affected and hoped not to lose important data. I tried to revert these 6 unfound objects to their previous version, but that was unsuccessful, so I just deleted them.

The biggest problem was the single hitset object, which we couldn't simply delete. Instead we took another hitset object and copied it over the missing one. The cache tier then recognized this hitset and invalidated it, which allowed the backfill+recovery to finish, and finally the entire cluster went back to HEALTH_OK. At the end I ran fsck wherever these 6 unfound objects could have had an effect, and fortunately the lost blocks were not important and contained empty data, so fsck recovery was successful in all cases. That was a very stressful time :)
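
For anyone who hits the same thing, the unfound-object handling is basically this (the PG id 2.4 is only an example):

    # show which PGs have unfound objects, and list them
    ceph health detail
    ceph pg 2.4 list_missing

    # try to roll back to the previous version first...
    ceph pg 2.4 mark_unfound_lost revert
    # ...and only if that fails, give the objects up
    ceph pg 2.4 mark_unfound_lost delete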

-- 
Wojtek

On Tue, 13 Dec 2016 at 00:01, Christian Balzer <chibi@xxxxxxx> wrote:
On Mon, 12 Dec 2016 22:41:41 +0100 Kevin Olbrich wrote:

> Hi,
>
> just in case: What happens when all replica journal SSDs are broken at once?
>
That would be bad, as in BAD.

In theory you just "lost" all the associated OSDs and their data.

In practice everything but the in-flight data at the time is still on
the actual OSDs (HDDs), but it's inconsistent and inaccessible as far as
Ceph is concerned.

So with some trickery and an experienced data-recovery Ceph consultant you
_may_ get things running with limited data loss/corruption, but that's
speculation and may be wishful thinking on my part.

Another data point in favor of deploying only well-known/monitored/trusted
SSDs and having 3x replication.
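
(For completeness, bumping an existing pool to 3x is just the usual, with
"rbd" as an example pool name:

    ceph osd pool set rbd size 3
    ceph osd pool set rbd min_size 2
)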

> The PGs most likely will be stuck inactive but as I read, the journals just
> need to be replaced
> (http://ceph.com/planet/ceph-recover-osds-after-ssd-journal-failure/).
>
> Does this also work in this case?
>
Not really, no.

The above works by still having a valid state and operational OSDs from
which the "broken" one can recover.
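
A quick way to see whether you still have that (PG id is an example only):

    # any PGs stuck inactive?
    ceph pg dump_stuck inactive
    # peering/recovery details for a single PG
    ceph pg 2.4 query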

Christian
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
