On Mon, 22 Jul 2013, Gaylord Holder wrote:
> If I understand what the #tunables page is saying, changing the tunables
> kicks the OSD re-balancing mechanism a bit and resets it to try again.
>
> I'll see about getting a 3.9 kernel in for my RBD machines, and reset
> everything to optimal.

Keep in mind this is only needed if you are using the kernel rbd client
(rbd map ...), not librbd + qemu or similar.

sage

> Thanks again.
>
> -Gaylord
>
> On 07/22/2013 04:51 PM, Sage Weil wrote:
> > On Mon, 22 Jul 2013, Gaylord Holder wrote:
> > > Sage,
> > >
> > > The crush tunables did the trick.
> > >
> > > Why? Could you explain what was causing the problem?
> >
> > This has a good explanation, I think:
> >
> >     http://ceph.com/docs/master/rados/operations/crush-map/#tunables
> >
> > > I haven't installed 3.9 on my RBD servers yet. Will setting crush
> > > tunables back to default or legacy cause me similar problems in the
> > > future?
> >
> > Yeah. For 3.6+ kernels, you can set slightly different tunables and it
> > will be very close to optimal...
> >
> > sage
> >
> > > Thank you again Sage!
> > >
> > > -Gaylord
> > >
> > > On 07/22/2013 02:27 PM, Sage Weil wrote:
> > > > On Mon, 22 Jul 2013, Gaylord Holder wrote:
> > > > >
> > > > > I have a 12 OSD / 3 host setup, and am stuck with a bunch of
> > > > > stuck pgs.
> > > > >
> > > > > I've verified the OSDs are all up and in. The crushmap looks fine.
> > > > > I've tried restarting all the daemons.
> > > > >
> > > > > root@never:/var/lib/ceph/mon# ceph status
> > > > >    health HEALTH_WARN 139 pgs degraded; 461 pgs stuck unclean;
> > > > >      recovery 216/6213 degraded (3.477%)
> > > > >    monmap e4: 2 mons at {a=192.168.225.9:6789/0,b=192.168.225.10:6789/0},
> > > > >      election epoch 14, quorum 0,1 a,b
> > > >
> > > > Add another monitor; right now if 1 fails the cluster is unavailable.
> > > >
> > > > >    osdmap e238: 12 osds: 12 up, 12 in
> > > > >     pgmap v7396: 2528 pgs: 2067 active+clean, 322 active+remapped,
> > > > >      139 active+degraded; 8218 MB data, 103 GB used,
> > > > >      22241 GB / 22345 GB avail; 216/6213 degraded (3.477%)
> > > > >    mdsmap e1: 0/0/1 up
> > > >
> > > > My guess is crush tunables. Try
> > > >
> > > >     ceph osd crush tunables optimal
> > > >
> > > > unless you are using a pre-3.8(ish) kernel or other very old
> > > > (pre-bobtail) clients.
> > > >
> > > > sage
> > > >
> > > > > I have one non-default pool with 3x replication. Fewer than half
> > > > > of the pgs have expanded to 3x (278/400 pgs still have acting 2x
> > > > > sets).
> > > > >
> > > > > Where can I go look for the trouble?
> > > > >
> > > > > Thank you for any light someone can shed on this.
> > > > >
> > > > > Cheers,
> > > > > -Gaylord

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
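
For anyone following the same troubleshooting path, the commands discussed
in the thread look roughly like the sketch below. This is a minimal example,
not taken verbatim from the thread: the "bobtail" profile name (as the
"slightly different tunables" for 3.6+ kernels), the third monitor's name
"c" and its address are assumptions/placeholders, and exact subcommand
availability depends on the Ceph release in use.

    # Show which CRUSH tunables the cluster is currently running with
    ceph osd crush show-tunables

    # Switch to the optimal tunables profile (what resolved the stuck PGs
    # above); only safe if all clients, including kernel rbd clients
    # (rbd map ...), are new enough to support it
    ceph osd crush tunables optimal

    # For 3.6+ kernel clients, a slightly older profile is very close to
    # optimal (profile name assumed here)
    ceph osd crush tunables bobtail

    # List the PGs that are stuck unclean to see what recovery is waiting on
    ceph pg dump_stuck unclean

    # Add a third monitor so losing a single mon does not cost quorum
    # (mon name "c" and address are placeholders; the new mon's data
    # directory still needs to be prepared and its daemon started)
    ceph mon add c 192.168.225.11:6789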