Hi Lionel,

we have a Ceph cluster with about 1 PB in total, 12 OSDs with 60 disks,
divided into 4 racks in 2 rooms, all connected through a dedicated 10G
cluster network, and of course with a replication level of 3. We did about
9 months of intensive testing. Just like you, we had never experienced this
kind of problem before, and an incomplete PG would recover as soon as at
least one OSD holding a copy of it came back up.

We still don't know what caused this specific error, but at no point were
more than two hosts down at the same time. Our pool has a min_size of 1.
After everything was up again, we had completely LOST 2 of the 3 PG copies
(the directories on the OSDs were empty), and the third copy was obviously
broken, because even manually injecting this PG into the other OSDs didn't
change anything.

My main problem here is that even a single incomplete PG renders your pool
unusable, and there is currently no way to make Ceph forget about the data
of this PG and recreate it as an empty one. So the only way to make the
pool usable again is to lose all the data in it, which for me is just not
acceptable.

Regards,
Christian

On 07.01.2015 21:10, Lionel Bouton wrote:
> On 12/30/14 16:36, Nico Schottelius wrote:
>> Good evening,
>>
>> we also tried to rescue data *from* our old / broken pool by mapping
>> the rbd devices, mounting them on a host and rsync'ing away as much as
>> possible.
>>
>> However, after some time rsync got completely stuck and eventually the
>> host which mounted the rbd mapped devices decided to kernel panic, at
>> which time we decided to drop the pool and go with a backup.
>>
>> This story and the one of Christian makes me wonder:
>>
>> Is anyone using Ceph as a backend for qemu VM images in production?
>
> Yes, with Ceph 0.80.5 since September, after extensive testing over
> several months (including an earlier version IIRC) and some hardware
> failure simulations.
> We plan to upgrade one storage host and one monitor to 0.80.7 to
> validate this version over several months too before migrating the
> others.
>
>> And:
>>
>> Has anyone on the list been able to recover from a pg incomplete /
>> stuck situation like ours?
>
> Only by adding back an OSD with the data needed to reach min_size for
> said pg, which is expected behavior. Even with some experimentation
> with isolated unstable OSDs I've not yet witnessed a case where Ceph
> lost multiple replicas simultaneously (we lost one OSD to disk failure
> and another to a BTRFS bug, but without trying to recover the
> filesystem, so we might have been able to recover this OSD).
>
> If your setup is susceptible to situations where you can lose all
> replicas, you will lose data, and there's not much that can be done
> about that. Ceph actually begins to generate new replicas to replace
> the missing ones after "mon osd down out interval", so the actual loss
> should not happen unless you lose (and can't recover) <size> OSDs on
> separate hosts (with the default crush map) simultaneously. Before
> going into production you should know how long Ceph will take to fully
> recover from a disk or host failure by testing it under load. Your
> setup might not be robust if it doesn't have the disk space or the
> speed needed to recover quickly from such a failure.
>
> Lionel
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Christian Eichelmann
Systemadministrator

1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
Brauerstraße 48 · DE-76135 Karlsruhe
Telefon: +49 721 91374-8026
christian.eichelmann@xxxxxxxx

Amtsgericht Montabaur / HRB 6484
Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert
Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr.
Oliver Mauss, Jan Oetjen
Aufsichtsratsvorsitzender: Michael Scheeren
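For reference, the settings discussed in this thread (replication size,
min_size, and "mon osd down out interval") can be pinned cluster-wide in
ceph.conf. This is an illustrative sketch, not the posters' actual
configuration; defaults and section placement may differ between Ceph
releases:

```ini
[global]
# Keep 3 copies of each object in newly created pools ...
osd pool default size = 3
# ... and stop serving I/O once fewer than 2 copies are available.
# A min_size of 1 (as in the setup described above) keeps accepting
# writes with only a single surviving copy, which risks losing the
# last good replica.
osd pool default min size = 2

[mon]
# Seconds an OSD may stay "down" before it is marked "out" and Ceph
# starts re-replicating its data onto other OSDs (value illustrative).
mon osd down out interval = 600
```

For an already existing pool, min_size can be raised at runtime with
`ceph osd pool set <pool> min_size 2`.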