Re: pgs stuck unclean after reweight

Hello,

On Sun, 24 Jul 2016 00:54:09 +0000 Goncalo Borges wrote:

> Hi Christian
> Thanks for the tips.
> We do have monitoring in place, but we are currently at a peak and the occupancy increased tremendously within a couple of days.
> 
> I solved the problem of the stuck pgs by reweighting (decreasing the weights of) the new osds which were preventing the backfilling. Once those 4 pgs recovered I applied your suggestion of increasing the weight of the less used osds. The cluster is much more balanced now and we will add more osds soon. It is still a mystery to me why, in my initial procedure which triggered the problem, heavily used osds were chosen for the remapping.
>

Ceph (as in CRUSH) knows nothing and cares even less about how full
specific OSDs are; the algorithm places/distributes PGs (and thus data)
guided only by the assigned weights.

That's why the smart move is to:
a) keep things smooth (as in evenly distributed) from the start by
adjusting weights, so an outlier doesn't stop your whole cluster.

b) when evening things out, move data towards the less full OSDs by raising
their weight. 
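
For example, a rough sketch (the OSD IDs and target weights below are
placeholders; pick the actual least-full OSDs from "ceph osd df" and choose
weights that fit your cluster):

     ceph osd crush reweight osd.7  2.80000
     ceph osd crush reweight osd.60 2.80000

Small steps keep the resulting data movement manageable; you can always
nudge the weights again once the first round of backfilling has settled.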

Christian

> Thanks for the help
> Goncalo
> 
> 
> ________________________________________
> From: Christian Balzer [chibi@xxxxxxx]
> Sent: 20 July 2016 19:36
> To: ceph-users@xxxxxxxx
> Cc: Goncalo Borges
> Subject: Re:  pgs stuck unclean after reweight
> 
> Hello,
> 
> On Wed, 20 Jul 2016 13:42:20 +1000 Goncalo Borges wrote:
> 
> > Hi All...
> >
> > Today we had a warning regarding 8 near-full osds. Looking at the osds'
> > occupancy, 3 of them were above 90%.
> 
> One would hope that this would have been picked up earlier, as in before
> it even reaches near-full, either by monitoring (nagios, etc.) disk usage
> checks and/or by graphing the usage and looking at it at least daily.
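> Even a crude daily cron job that mails you the per-OSD utilisation would
> make the trend visible long before the near-full ratio trips; a minimal
> sketch (the address is a placeholder):
> 
>      ceph osd df | mail -s "ceph osd utilisation" ceph-admin@example.com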
> 
> Since you seem to have at least 60 OSDs, going from below 85% to 90% must
> have involved a substantial amount of data, not something that should go
> unnoticed.
> 
> > In order to solve the situation, I've decided to reweight those first using
> >
> >      ceph osd crush reweight osd.1 2.67719
> >
> >      ceph osd crush reweight osd.26 2.67719
> >
> >      ceph osd crush reweight osd.53 2.67719
> >
> What I'd do is to find the least utilized OSDs and give them higher
> weights, so data will (hopefully) move there instead of potentially
> pushing another OSD to near-full as with the approach above.
> 
> You might consider doing that aside from what I'm writing below.
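> 
> As a sketch, the per-OSD utilisation (and its spread around the cluster
> average) is easiest to read from:
> 
>      ceph osd df
>      ceph osd df tree    # same data, grouped by the CRUSH hierarchy
> 
> The %USE and VAR columns show which OSDs still have headroom.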
> 
> > Please note that I've started with a very conservative step since the
> > original weight for all osds was 2.72710.
> >
> > After some rebalancing (which has now stopped) I've seen that the
> > cluster is currently in the following state
> >
> >     # ceph health detail
> >     HEALTH_WARN 4 pgs backfill_toofull; 4 pgs stuck unclean; recovery
> >     20/39433323 objects degraded (0.000%); recovery 77898/39433323
> >     objects misplaced (0.198%); 8 near full osd(s); crush map has legacy
> >     tunables (require bobtail, min is firefly)
> >
> So there are all your woes in one fell swoop.
> 
> Unless you changed the defaults, your mon_osd_nearfull_ratio and
> osd_backfill_full_ratio are the same at 0.85.
> So any data movement towards those 8 near full OSDs will not go anywhere.
> 
> Thus aside from the tip above, consider upping your
> osd_backfill_full_ratio for those OSDs to something like .92 for the time
> being until things are good again.
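> 
> A sketch of both checking and bumping it; osd.1 and osd.53 are two of the
> near-full OSDs from your output below, and note that injected values do
> not survive an OSD restart:
> 
>      # current value, via the admin socket on the OSD's own host
>      ceph daemon osd.1 config get osd_backfill_full_ratio
>      # raise it temporarily on the affected OSDs
>      ceph tell osd.1 injectargs '--osd-backfill-full-ratio 0.92'
>      ceph tell osd.53 injectargs '--osd-backfill-full-ratio 0.92'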
> 
> Going forward, you will want to:
> a) add more OSDs
> b) re-weight things so that your OSDs are within a few % of each other
> rather than the often encountered 20%+ variance.
> 
> Christian
> 
> >     pg 6.e2 is stuck unclean for 9578.920997, current state
> >     active+remapped+backfill_toofull, last acting [49,38,11]
> >     pg 6.4 is stuck unclean for 9562.054680, current state
> >     active+remapped+backfill_toofull, last acting [53,6,26]
> >     pg 5.24 is stuck unclean for 10292.469037, current state
> >     active+remapped+backfill_toofull, last acting [32,13,51]
> >     pg 5.306 is stuck unclean for 10292.448364, current state
> >     active+remapped+backfill_toofull, last acting [44,7,59]
> >     pg 5.306 is active+remapped+backfill_toofull, acting [44,7,59]
> >     pg 5.24 is active+remapped+backfill_toofull, acting [32,13,51]
> >     pg 6.4 is active+remapped+backfill_toofull, acting [53,6,26]
> >     pg 6.e2 is active+remapped+backfill_toofull, acting [49,38,11]
> >     recovery 20/39433323 objects degraded (0.000%)
> >     recovery 77898/39433323 objects misplaced (0.198%)
> >     osd.1 is near full at 88%
> >     osd.14 is near full at 87%
> >     osd.24 is near full at 86%
> >     osd.26 is near full at 87%
> >     osd.37 is near full at 87%
> >     osd.53 is near full at 88%
> >     osd.56 is near full at 85%
> >     osd.62 is near full at 87%
> >
> >         crush map has legacy tunables (require bobtail, min is firefly);
> > see http://ceph.com/docs/master/rados/operations/crush-map/#tunables
> >
> > Not sure if it is worthwhile to mention, but after upgrading to Jewel,
> > our cluster shows the warnings regarding tunables. We still have not
> > migrated to the optimal tunables because the cluster will be very
> > actively used during the next 3 weeks (due to one of the main
> > conferences in our area) and we prefer to do that migration after this
> > peak period.
> >
> >
> > I am unsure what happened during the rebalancing, but the mapping of these
> > 4 stuck pgs seems strange, namely that the up and acting osds are different.
> >
> >     # ceph pg dump_stuck unclean
> >     ok
> >     pg_stat  state                             up          up_primary  acting      acting_primary
> >     6.e2     active+remapped+backfill_toofull  [8,53,38]   8           [49,38,11]  49
> >     6.4      active+remapped+backfill_toofull  [53,24,6]   53          [53,6,26]   53
> >     5.24     active+remapped+backfill_toofull  [32,13,56]  32          [32,13,51]  32
> >     5.306    active+remapped+backfill_toofull  [44,60,26]  44          [44,7,59]   44
> >
> >     # ceph pg map 6.e2
> >     osdmap e1054 pg 6.e2 (6.e2) -> up [8,53,38] acting [49,38,11]
> >
> >     # ceph pg map 6.4
> >     osdmap e1054 pg 6.4 (6.4) -> up [53,24,6] acting [53,6,26]
> >
> >     # ceph pg map 5.24
> >     osdmap e1054 pg 5.24 (5.24) -> up [32,13,56] acting [32,13,51]
> >
> >     # ceph pg map 5.306
> >     osdmap e1054 pg 5.306 (5.306) -> up [44,60,26] acting [44,7,59]
> >
> >
> > To complete this information, I am also sending the output of pg query
> > for one of these problematic pgs (ceph pg  5.306 query) after this email.
> >
> > What should be the procedure to try to recover those PGs before
> > continuing with the reweighting?
> >
> > Thank you in advance
> > Goncalo
> >
> 
> 
> 
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


