Re: stubborn/sticky scrub errors

Ronny Aasen <ronny+ceph-users@xxxxxxxx> · Mon, 5 Sep 2016 12:39:02 +0200

there are.
and I did find the broken object by triggering a manual scrub and 
grepping the log file. I have had scrubbing disabled since it reduced 
client performance too much during recovery.

once i found the objects in question it was just a matter of following 
the example at http://ceph.com/planet/ceph-manually-repair-object

so the scrub errors are gone now.

kind regards
Ronny Aasen

On 04. sep. 2016 00:04, Brad Hubbard wrote:
There should actually be "[ERR]" messages in the osd logs some time after
"deep-scrub starts". Can we see those and a pg query for one of the affected
pgs?

-- Cheers, Brad On Sat, Sep 3, 2016 at 9:52 PM, Ronny Aasen
<ronny+ceph-users@xxxxxxxx> wrote:
> hello
>
> I am running ceph hammer on debian jessie.  using 6 old used underwhelming
> servers
>
> the cluster is a "in-migration" bastard mix of 3TB sata drives with on disk
> journal partition,  beeing migrated to 5 disk raid5 MD arrays with ssd
> journals, for ram limitation reasons. There are about 18 raid5 sets atm and
> the rest is 3TB spinners.
>
> I have some challenges with scrub errors, that i am trying to sort out using
> this http://ceph.com/planet/ceph-manually-repair-object/ method. but they
> are quite stubborn/sticky
>
> i do see that osd.8 is often represented in these inconsistencies. but the
> broken objects are not allways on osd.8 itself
>
> in the instructions at http://ceph.com/planet/ceph-manually-repair-object/,
> one finds the object name by grepping in the logs.
> but some of these haven been here a while. so how can i identify the broken
> object if the log file have been rotated away ?
>
> in the end i move away the broken object with size 0 and run pg repair, but
> the error is not removed.
> does the pg  need to scrub after the repair for it to clear the error. ?
>
> any advice is appreciated
>
> kind regards
> Ronny Aasen
>
>
> #ceph -s
>     cluster 3c229f54-bd12-4b4e-a143-1ec73dd0f12a
>      health HEALTH_ERR
>             3 pgs degraded
>             9 pgs inconsistent
>             3 pgs recovering
>             3 pgs stuck degraded
>             3 pgs stuck unclean
>             recovery 88/125583766 objects degraded (0.000%)
>             recovery 666778/125583766 objects misplaced (0.531%)
>             recovery 88/45311043 unfound (0.000%)
>             9 scrub errors
>             noout,noscrub,nodeep-scrub flag(s) set
>      monmap e1: 3 mons at
> {mon1=10.24.11.11:6789/0,mon2=10.24.11.12:6789/0,mon3=10.24.11.13:6789/0}
>             election epoch 60, quorum 0,1,2 mon1,mon2,mon3
>      osdmap e105977: 92 osds: 92 up, 92 in; 2 remapped pgs
>             flags noout,noscrub,nodeep-scrub
>       pgmap v12896186: 4608 pgs, 3 pools, 117 TB data, 44249 kobjects
>             308 TB used, 107 TB / 416 TB avail
>             88/125583766 objects degraded (0.000%)
>             666778/125583766 objects misplaced (0.531%)
>             88/45311043 unfound (0.000%)
>                 4593 active+clean
>                    9 active+clean+inconsistent
>                    3 active+clean+scrubbing
>                    2 active+recovering+degraded+remapped
>                    1 active+recovering+degraded
>   client io 4572 kB/s rd, 1141 op/s
>
> # ceph health detail
> HEALTH_ERR 3 pgs degraded; 9 pgs inconsistent; 3 pgs recovering; 3 pgs stuck
> degraded; 3 pgs stuck unclean; recovery 88/125583766 objects degraded
> (0.000%); recovery 666778/125583766 objects misplaced (0.531%); recovery
> 88/45311043 unfound (0.000%); 9 scrub errors; noout,noscrub,nodeep-scrub
> flag(s) set
> pg 6.d4 is stuck unclean for 3770820.461291, current state
> active+recovering+degraded+remapped, last acting [62,8]
> pg 6.da is stuck unclean for 2420102.778679, current state
> active+recovering+degraded, last acting [6,110]
> pg 6.ab is stuck unclean for 3774233.330685, current state
> active+recovering+degraded+remapped, last acting [12,8]
> pg 6.d4 is stuck degraded for 304239.715211, current state
> active+recovering+degraded+remapped, last acting [62,8]
> pg 6.da is stuck degraded for 416210.309539, current state
> active+recovering+degraded, last acting [6,110]
> pg 6.ab is stuck degraded for 304239.779541, current state
> active+recovering+degraded+remapped, last acting [12,8]
> pg 1.356 is active+clean+inconsistent, acting [8,84,39]
> pg 1.1a7 is active+clean+inconsistent, acting [8,36,34]
> pg 1.11e is active+clean+inconsistent, acting [8,12,6]
> pg 6.da is active+recovering+degraded, acting [6,110], 25 unfound
> pg 6.d4 is active+recovering+degraded+remapped, acting [62,8], 25 unfound
> pg 6.ab is active+recovering+degraded+remapped, acting [12,8], 38 unfound
> pg 1.de4 is active+clean+inconsistent, acting [41,8,108]
> pg 1.c90 is active+clean+inconsistent, acting [12,71,8]
> pg 1.ae6 is active+clean+inconsistent, acting [8,36,49]
> pg 1.8bc is active+clean+inconsistent, acting [59,8,107]
> pg 1.806 is active+clean+inconsistent, acting [60,3,106]
> pg 1.675 is active+clean+inconsistent, acting [37,106,62]
> recovery 88/125583766 objects degraded (0.000%)
> recovery 666778/125583766 objects misplaced (0.531%)
> recovery 88/45311043 unfound (0.000%)
> 9 scrub errors
> noout,noscrub,nodeep-scrub flag(s) set
>
> NB: the 88 unfound objects are in a pool i experimented with size 2, so not
> important in this context.
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com