On Fri, 5 Jul 2013, Mark Kirkwood wrote:
> Retesting with 0.61.4:
>
> Immediately after stopping 2 osds in rack1:
>
> 2013-07-05 16:23:02.852386 mon.0 [INF] pgmap v450: 1160 pgs: 1160
> active+degraded; 2000 MB data, 12991 MB used, 6135 MB / 20150 MB avail;
> 100/200 degraded (50.000%)
>
> ... time passes:
>
> 2013-07-05 16:51:03.248198 mon.0 [INF] pgmap v465: 1160 pgs: 1160
> active+degraded; 2000 MB data, 12993 MB used, 6133 MB / 20150 MB avail;
> 100/200 degraded (50.000%)
>
> So it looks like Cuttlefish is behaving as expected. Is this due to tweaks
> in the 'choose' algorithm in the later code?

Yes. Glad to hear it's working!

Just keep in mind that when moving from one map/distribution to another, if
we find that the old distribution provided more locations than the new one
(e.g., because a rack is down), rados will keep the old copy around. I
didn't follow your procedure closely, but that may explain part of what you
saw.

Cheers-
sage

> Cheers
>
> Mark
>
> On 05/07/13 16:32, Mark Kirkwood wrote:
> > Hi Sage,
> >
> > I don't believe so; I'm loading the objects directly from another host
> > (which is running 0.64 built from src) with:
> >
> > $ rados -m 192.168.122.21 -p obj put smallnode$n.dat smallnode.dat  # $n=0->99
> >
> > and the osds are all running 0.56.6, so I don't think there is any
> > kernel rbd or librbd involved.
> >
> > I did try:
> >
> > $ ceph osd crush tunables optimal
> >
> > in one run - no difference.
> >
> > I have updated to 0.61.4 and am running the test again; I will update
> > with the results!
> >
> > Cheers
> >
> > Mark
> >
> > On 05/07/13 16:01, Sage Weil wrote:
> > > Hi Mark,
> > >
> > > If you're not using a kernel cephfs or rbd client older than ~3.9, or
> > > ceph-fuse/librbd/librados older than bobtail, then you should
> > >
> > >   ceph osd crush tunables optimal
> > >
> > > and I suspect that this will suddenly work perfectly. The defaults are
> > > still using semi-broken legacy values because client support is pretty
> > > new. Trees like yours, with sparsely populated leaves, tend to be most
> > > affected.
> > >
> > > (I bet you're seeing the rack separation rule violated because the
> > > previous copy of the PG was already there and ceph won't throw out old
> > > copies before creating new ones.)
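
For anyone replaying this thread, a minimal sketch of the procedure under
discussion, using the pool name, monitor address, and file names from Mark's
mail. The loop spelling and the 'ceph osd crush show-tunables' check are
assumptions on my part (show-tunables may not exist on older releases); the
other commands are as quoted above:

  # Load 100 copies of the test object (Mark's $n=0->99 written as a loop):
  for n in $(seq 0 99); do
      rados -m 192.168.122.21 -p obj put smallnode$n.dat smallnode.dat
  done

  # Switch to the newer CRUSH tunables. Per Sage's caveat, only do this if
  # all clients are bobtail+ userspace or kernel ~3.9+:
  ceph osd crush tunables optimal

  # Inspect the tunables currently in effect (assumed subcommand; may not
  # be available pre-cuttlefish):
  ceph osd crush show-tunables

  # After stopping the two osds in rack1, watch whether PGs remap to new
  # locations or sit at active+degraded, as in the pgmap lines quoted above:
  ceph -w

The thing to watch is whether the degraded count drops as CRUSH finds new
locations, or stays flat at 50% as it did before the tunables change.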