stubborn/sticky scrub errors

Ronny Aasen <ronny+ceph-users@xxxxxxxx> · Sat, 3 Sep 2016 13:52:55 +0200

hello

I am running Ceph Hammer on Debian Jessie, using 6 old, used, 
underwhelming servers.

The cluster is an "in-migration" bastard mix of 3TB SATA drives with 
on-disk journal partitions, being migrated to 5-disk RAID5 MD arrays 
with SSD journals, for RAM limitation reasons. There are about 18 RAID5 
sets at the moment, and the rest are 3TB spinners.

I have some challenges with scrub errors that I am trying to sort out 
using the method at http://ceph.com/planet/ceph-manually-repair-object/, 
but they are quite stubborn/sticky.

I do see that osd.8 is often represented in these inconsistencies, but 
the broken objects are not always on osd.8 itself.
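
To put a number on "often", here is a minimal sketch that tallies how 
many inconsistent PGs each OSD appears in. It only parses the 
"pg ... is ..., acting [...]" lines as printed by ceph health detail 
further down; the function names and the SAMPLE variable are made up 
for illustration, and the line format is assumed to match what this 
Hammer cluster prints.

```python
import re
from collections import Counter

# Matches lines such as:
#   pg 1.356 is active+clean+inconsistent, acting [8,84,39]
PG_LINE = re.compile(r"^pg (\S+) is (\S+), acting \[([\d,]+)\]")


def inconsistent_pgs(health_detail):
    """Return {pgid: [osd ids]} for PGs whose state includes 'inconsistent'."""
    result = {}
    for line in health_detail.splitlines():
        m = PG_LINE.match(line.strip())
        if m and "inconsistent" in m.group(2).split("+"):
            result[m.group(1)] = [int(osd) for osd in m.group(3).split(",")]
    return result


def osd_tally(pgs):
    """Count in how many of the given PGs each OSD is part of the acting set."""
    return Counter(osd for acting in pgs.values() for osd in acting)


# Lines copied from the 'ceph health detail' output in this mail.
SAMPLE = """\
pg 1.356 is active+clean+inconsistent, acting [8,84,39]
pg 1.1a7 is active+clean+inconsistent, acting [8,36,34]
pg 1.11e is active+clean+inconsistent, acting [8,12,6]
pg 6.da is active+recovering+degraded, acting [6,110], 25 unfound
pg 1.de4 is active+clean+inconsistent, acting [41,8,108]
pg 1.c90 is active+clean+inconsistent, acting [12,71,8]
pg 1.ae6 is active+clean+inconsistent, acting [8,36,49]
pg 1.8bc is active+clean+inconsistent, acting [59,8,107]
pg 1.806 is active+clean+inconsistent, acting [60,3,106]
pg 1.675 is active+clean+inconsistent, acting [37,106,62]
"""

pgs = inconsistent_pgs(SAMPLE)
for osd, count in osd_tally(pgs).most_common(3):
    print("osd.%d appears in %d of %d inconsistent pgs" % (osd, count, len(pgs)))
```

With the output below, osd.8 is in the acting set of 7 of the 9 
inconsistent PGs. That only shows where the PGs live, not which replica 
holds the bad copy.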

In the instructions at 
http://ceph.com/planet/ceph-manually-repair-object/, one finds the 
object name by grepping the logs. 
But some of these errors have been here a while, so how can I identify 
the broken object if the log file has been rotated away?
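
One log-free way to look for candidates is to scan the PG directory on 
each OSD in the acting set for zero-size files. The sketch below is 
only that, under stated assumptions: it assumes a FileStore layout 
where a PG lives in a directory such as 
/var/lib/ceph/osd/ceph-<id>/current/<pgid>_head, and zero_size_files is 
a hypothetical helper name. A zero-size file is not necessarily the 
broken copy (empty objects can be legitimate), so any hit would still 
need to be compared against the other replicas.

```python
import os


def zero_size_files(pg_dir):
    """Yield paths of regular, non-symlink files of size 0 under pg_dir."""
    for root, _dirs, files in os.walk(pg_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            if os.path.getsize(path) == 0:
                yield path


def report(pg_dir):
    """Print every zero-size file found below pg_dir, one per line."""
    for path in sorted(zero_size_files(pg_dir)):
        print(path)
```

Usage would be something like report("/var/lib/ceph/osd/ceph-8/current/1.356_head"), 
run on the host that carries that OSD; the path here is an assumed 
example, not taken from the cluster.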

In the end I move the broken object with size 0 away and run pg repair, 
but the error is not removed. 
Does the PG need to be scrubbed after the repair for the error to clear?

any advice is appreciated

kind regards
Ronny Aasen

# ceph -s
    cluster 3c229f54-bd12-4b4e-a143-1ec73dd0f12a
     health HEALTH_ERR
            3 pgs degraded
            9 pgs inconsistent
            3 pgs recovering
            3 pgs stuck degraded
            3 pgs stuck unclean
            recovery 88/125583766 objects degraded (0.000%)
            recovery 666778/125583766 objects misplaced (0.531%)
            recovery 88/45311043 unfound (0.000%)
            9 scrub errors
            noout,noscrub,nodeep-scrub flag(s) set
     monmap e1: 3 mons at 
{mon1=10.24.11.11:6789/0,mon2=10.24.11.12:6789/0,mon3=10.24.11.13:6789/0}
            election epoch 60, quorum 0,1,2 mon1,mon2,mon3
     osdmap e105977: 92 osds: 92 up, 92 in; 2 remapped pgs
            flags noout,noscrub,nodeep-scrub
      pgmap v12896186: 4608 pgs, 3 pools, 117 TB data, 44249 kobjects
            308 TB used, 107 TB / 416 TB avail
            88/125583766 objects degraded (0.000%)
            666778/125583766 objects misplaced (0.531%)
            88/45311043 unfound (0.000%)
                4593 active+clean
                   9 active+clean+inconsistent
                   3 active+clean+scrubbing
                   2 active+recovering+degraded+remapped
                   1 active+recovering+degraded
  client io 4572 kB/s rd, 1141 op/s

# ceph health detail
HEALTH_ERR 3 pgs degraded; 9 pgs inconsistent; 3 pgs recovering; 3 pgs 
stuck degraded; 3 pgs stuck unclean; recovery 88/125583766 objects 
degraded (0.000%); recovery 666778/125583766 objects misplaced (0.531%); 
recovery 88/45311043 unfound (0.000%); 9 scrub errors; 
noout,noscrub,nodeep-scrub flag(s) set
pg 6.d4 is stuck unclean for 3770820.461291, current state 
active+recovering+degraded+remapped, last acting [62,8]
pg 6.da is stuck unclean for 2420102.778679, current state 
active+recovering+degraded, last acting [6,110]
pg 6.ab is stuck unclean for 3774233.330685, current state 
active+recovering+degraded+remapped, last acting [12,8]
pg 6.d4 is stuck degraded for 304239.715211, current state 
active+recovering+degraded+remapped, last acting [62,8]
pg 6.da is stuck degraded for 416210.309539, current state 
active+recovering+degraded, last acting [6,110]
pg 6.ab is stuck degraded for 304239.779541, current state 
active+recovering+degraded+remapped, last acting [12,8]
pg 1.356 is active+clean+inconsistent, acting [8,84,39]
pg 1.1a7 is active+clean+inconsistent, acting [8,36,34]
pg 1.11e is active+clean+inconsistent, acting [8,12,6]
pg 6.da is active+recovering+degraded, acting [6,110], 25 unfound
pg 6.d4 is active+recovering+degraded+remapped, acting [62,8], 25 unfound
pg 6.ab is active+recovering+degraded+remapped, acting [12,8], 38 unfound
pg 1.de4 is active+clean+inconsistent, acting [41,8,108]
pg 1.c90 is active+clean+inconsistent, acting [12,71,8]
pg 1.ae6 is active+clean+inconsistent, acting [8,36,49]
pg 1.8bc is active+clean+inconsistent, acting [59,8,107]
pg 1.806 is active+clean+inconsistent, acting [60,3,106]
pg 1.675 is active+clean+inconsistent, acting [37,106,62]
recovery 88/125583766 objects degraded (0.000%)
recovery 666778/125583766 objects misplaced (0.531%)
recovery 88/45311043 unfound (0.000%)
9 scrub errors
noout,noscrub,nodeep-scrub flag(s) set

NB: the 88 unfound objects are in a pool I experimented with at size 2, 
so they are not important in this context.
