Hi Lionel,

we have a Ceph cluster with about 1 PB in total, 12 OSDs with 60 disks,
divided into 4 racks in 2 rooms, all connected through a dedicated 10G
cluster network, and of course with a replication level of 3. We did about
9 months of intensive testing. Just like you, we had never experienced this
kind of problem before, and an incomplete PG would recover as soon as at
least one OSD holding a copy of it came back up.

We still don't know what caused this specific error, but at no point were
more than two hosts down at the same time. Our pool has a min_size of 1.
After everything was up again, we had completely LOST 2 of the 3 PG copies
(the directories on the OSDs were empty), and the third copy was obviously
broken, because even manually injecting this PG into the other OSDs didn't
change anything.

My main problem here is that even a single incomplete PG renders your pool
unusable, and there is currently no way to make Ceph forget about the data
of this PG and recreate it as an empty one. So the only way to make the
pool usable again is to lose all the data in it, which for me is just not
acceptable.

Regards,
Christian

On 07.01.2015 21:10, Lionel Bouton wrote:
> On 12/30/14 16:36, Nico Schottelius wrote:
>> Good evening,
>>
>> we also tried to rescue data *from* our old / broken pool by mapping
>> the rbd devices, mounting them on a host and rsync'ing away as much as
>> possible.
>>
>> However, after some time rsync got completely stuck and eventually the
>> host which mounted the rbd mapped devices decided to kernel panic, at
>> which time we decided to drop the pool and go with a backup.
>>
>> This story and the one of Christian makes me wonder:
>>
>> Is anyone using Ceph as a backend for qemu VM images in production?
>
> Yes, with Ceph 0.80.5 since September, after extensive testing over
> several months (including an earlier version IIRC) and some hardware
> failure simulations.
> We plan to upgrade one storage host and one monitor to 0.80.7 to
> validate this version over several months too before migrating the
> others.
>
>> And:
>>
>> Has anyone on the list been able to recover from a pg incomplete /
>> stuck situation like ours?
>
> Only by adding back an OSD with the data needed to reach min_size for
> said pg, which is expected behavior. Even with some experimentation
> with isolated unstable OSDs I've not yet witnessed a case where Ceph
> lost multiple replicas simultaneously (we lost one OSD to disk failure
> and another to a BTRFS bug, but without trying to recover the
> filesystem, so we might have been able to recover this OSD).
>
> If your setup is susceptible to situations where you can lose all
> replicas, you will lose data, and there's not much that can be done
> about that. Ceph actually begins to generate new replicas to replace
> the missing ones after "mon osd down out interval", so the actual loss
> should not happen unless you lose (and can't recover) <size> OSDs on
> separate hosts (with the default crush map) simultaneously. Before
> going into production you should know how long Ceph will take to fully
> recover from a disk or host failure by testing it under load. Your
> setup might not be robust if it doesn't have the disk space or the
> speed needed to recover quickly from such a failure.
>
> Lionel
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Christian Eichelmann
Systemadministrator

1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
Brauerstraße 48 · DE-76135 Karlsruhe
Telefon: +49 721 91374-8026
christian.eichelmann@xxxxxxxx

Amtsgericht Montabaur / HRB 6484
Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert
Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr.
Oliver Mauss, Jan Oetjen
Aufsichtsratsvorsitzender: Michael Scheeren
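For reference, the settings discussed in this thread (replication size,
min_size, and "mon osd down out interval") can be pinned cluster-wide in
ceph.conf. This is an illustrative sketch, not the posters' actual
configuration; defaults and section placement may differ between Ceph
releases:

```ini
[global]
# Keep 3 copies of each object in newly created pools ...
osd pool default size = 3
# ... and stop serving I/O once fewer than 2 copies are available.
# A min_size of 1 (as in the setup described above) keeps accepting
# writes with only a single surviving copy, which risks losing the
# last good replica.
osd pool default min size = 2

[mon]
# Seconds an OSD may stay "down" before it is marked "out" and Ceph
# starts re-replicating its data onto other OSDs (value illustrative).
mon osd down out interval = 600
```

For an already existing pool, min_size can be raised at runtime with
`ceph osd pool set <pool> min_size 2`.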