Re: Replacing a failed disk/OSD: unfound object

Thanks Tommi. I rebuilt the Ceph cluster a few times just to reproduce the situation. The results are mixed; most likely btrfs failed (after a power reset). But the problem does happen regardless.

The big question is: however rare, an unfound-object situation makes the *entire* Ceph file system unmountable, which amounts to a total loss of all data. That is quite a risk to take for a production system. Is there a way to recover from such a situation? (e.g. remove the file associated with the missing object)
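Something along these lines is what I am hoping for (the pg id 0.6 below is made up for illustration, and I have not verified these commands against this release):

    # Show which pgs are reporting unfound objects
    ceph health detail

    # List the unfound objects in the affected pg (pg id is hypothetical)
    ceph pg 0.6 list_missing

    # Give up on the unfound objects and revert to prior versions,
    # accepting the loss of those objects
    ceph pg 0.6 mark_unfound_lost revert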


On Fri, 8 Jul 2011 10:08:46 -0700, Tommi Virtanen wrote:
[It seems I dropped the Cc: to ceph-devel, added it back.. Please
reply to this message instead, and sorry about that. I'm starting to
dislike Google Apps for mailing list traffic :( ]

On Fri, Jul 8, 2011 at 10:07, Tommi Virtanen
<tommi.virtanen@xxxxxxxxxxxxx> wrote:
On Fri, Jul 8, 2011 at 01:23, Meng Zhao <mzhao@xxxxxxxxxxxx> wrote:
I was trying to replace a disk for an OSD by following the instructions at:
http://ceph.newdream.net/wiki/Replacing_a_failed_disk/OSD

Now, ceph -w is showing
2011-07-08 15:52:39.702881    pg v1602: 602 pgs: 49 active+degraded, 553 active+clean+degraded; 349 MB data, 333 MB used, 566 MB / 1023 MB avail;
167/224 degraded (74.554%); 55/112 unfound (49.107%)

and a copy operation hangs on the Ceph client forever. I cannot kill (-9) the
cp process. Is there any hope of recovering my Ceph filesystem?

I'm pretty sure that the cp is hanging because Ceph chooses to wait in case the unfound objects do come back (e.g. an OSD comes back online).
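You can usually confirm that from the pg itself; a sketch, with a hypothetical pg id:

    # The recovery state in the output shows the unfound count and
    # which OSDs might still hold the objects ("might_have_unfound")
    ceph pg 0.6 query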

Now, in the default configuration, losing a single OSD should not have
caused unfound objects in the first place. Can you provide any more
information on how you got to that point?

My general question is: how are objects distributed among OSDs? Does 2x replication guarantee that the failure of a single OSD will not lose data? It appears to me that objects are distributed statistically, with no guarantee that replicas end up in physically separate locations.

The CRUSH logic has a special case for that: when picking replicas, if
it would pick an already used bucket, it tries again.

My understanding is that with the default crushmap, replicas will go
to different OSDs, and this is what I see in practice. If you
constructed your own crushmap, it's possible that you made one that
allows multiple replicas on the same drive, or even configured Ceph to
maintain no replicas.
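For illustration, the usual way a crushmap enforces that separation is a chooseleaf step over a failure domain such as host. A sketch of such a rule, using the common default names rather than anything taken from your map:

    rule data {
            ruleset 0
            type replicated
            min_size 1
            max_size 10
            step take default
            # pick N distinct hosts, then one osd under each,
            # so two replicas never share a drive
            step chooseleaf firstn 0 type host
            step emit
    }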

It is also possible you got hit by a bug, most likely one not in the
placement rules but in the OSD code.

Can you provide more information on your setup and the steps you took?

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


