Re: Replacing a failed disk/OSD: unfound object

On Tue, 12 Jul 2011, Meng Zhao wrote:
> Thanks Tommi.  I rebuilt the ceph cluster a few times just to reproduce the
> situation. The results were mixed; more often it was btrfs that failed
> (after a power reset). But it does happen regardless.
> 
> The big question is: however rare, an unfound-object situation makes the
> *entire* ceph file system unmountable and leads to a total loss of all data.
> That is quite a risk to take for a production system.  Is there a way to
> recover from such a situation? (e.g. remove the file associated with the
> missing object)

There isn't an easy way to do this currently, but it is very easy to add.  
I've added an issue to the tracker.

The real question is whether the lost object is file data or namespace 
metadata.  If it's namespace metadata, you may lose part of the directory 
hierarchy (until the fsck/namespace rebuild work is done).  For file data, 
it's easy enough to just mark the unfound object as deleted.
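
For the file-data case, the operator-facing side of that could look roughly
like the sketch below. The mark_unfound_lost subcommand is the hypothetical
piece the tracker issue covers; it does not exist yet as of this writing, so
take the names and arguments as a sketch only:

    # sketch: give up on a PG's unfound objects, either reverting to an
    # older copy if one exists or treating them as deleted
    ceph pg 2.5 mark_unfound_lost revert
    ceph pg 2.5 mark_unfound_lost delete

The pg id (2.5) is only an example; once the objects are no longer counted
as unfound, clients blocked on them should unblock.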

sage


> 
> 
> On Fri, 8 Jul 2011 10:08:46 -0700, Tommi Virtanen wrote:
> > [It seems I dropped the Cc: to ceph-devel, added it back.. Please
> > reply to this message instead, and sorry about that. I'm starting to
> > dislike Google Apps for mailing list traffic :( ]
> > 
> > On Fri, Jul 8, 2011 at 10:07, Tommi Virtanen
> > <tommi.virtanen@xxxxxxxxxxxxx> wrote:
> > > On Fri, Jul 8, 2011 at 01:23, Meng Zhao <mzhao@xxxxxxxxxxxx> wrote:
> > > > I was trying to replace a disk for an osd by following instruction at:
> > > > http://ceph.newdream.net/wiki/Replacing_a_failed_disk/OSD
> > > > 
> > > > Now, ceph -w is reporting
> > > > 2011-07-08 15:52:39.702881    pg v1602: 602 pgs: 49 active+degraded, 553
> > > > active+clean+degraded; 349 MB data, 333 MB used, 566 MB / 1023 MB avail;
> > > > 167/224 degraded (74.554%); 55/112 unfound (49.107%)
> > > > 
> > > > and a copy operation hangs on the ceph client forever. I cannot kill
> > > > (-9) the cp process. Is there any hope of recovering my ceph
> > > > filesystem?
> > > 
> > > I'm pretty sure that the cp is hanging because Ceph chooses to wait in
> > > case the unfound objects do come back (e.g. an OSD comes back online).
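
A quick way to confirm that is to ask the cluster which placement groups are
still missing objects; roughly as follows (a sketch against the ceph tool of
this era, so command names may differ slightly between versions):

    # sketch: overall state, including the degraded/unfound counters
    ceph -s

    # sketch: per-PG detail; look for PGs reporting unfound objects, then
    # query one of them to see which OSDs the cluster is still waiting on
    ceph pg dump
    ceph pg 2.5 query

If the missing OSD really is gone for good, the client keeps blocking until
the unfound objects are either recovered or explicitly given up on.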
> > > 
> > > Now, in the default configuration, losing a single OSD should not have
> > > caused unfound objects in the first place. Can you provide any more
> > > information on how you got to that point?
> > > 
> > > > My general question is: How are objects distributed among OSDs? Does
> > > > duplication (2x) guarantee that a failure of a single OSD would not lose
> > > > data?  It appears to me that the objects are distributed statistically,
> > > > which does not guarantee that replicas end up in physically separate
> > > > locations.
> > > 
> > > The CRUSH logic has a special case for that: when picking replicas, if
> > > it would pick an already used bucket, it tries again.
> > > 
> > > My understanding is that with the default crushmap, replicas will go
> > > to different OSDs, and this is what I see in practice. If you constructed
> > > your own crushmap, it's possible you made one that allows multiple
> > > replicas on the same drive, or even configured Ceph to maintain no
> > > replicas.
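
For reference, the part of a crushmap that enforces this looks roughly like
the rule below. This is a sketch of the usual decompiled-crushmap syntax; the
bucket names and the default separation level vary by setup. The chooseleaf
step is what pushes each replica into a different host bucket (choosing a
finer-grained type would only separate replicas by OSD):

    rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        # "default" is the root bucket's name; it depends on your map
        step take default
        step chooseleaf firstn 0 type host
        step emit
    }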
> > > 
> > > It is also possible you got hit by a bug, most likely one not in the
> > > placement rules but in the OSD code.
> > > 
> > > Can you provide more information on your setup and the steps you took?
> > > 
