Hi Blade,

if you don't see anything in the logs, then you should raise the debug
level / frequency. At the very least you should see that the repair
command has been issued (started).

Also I am wondering about the [6] from your output. That means there is
only one copy of that pg (on osd.6). What is your setting for the
minimum number of required copies (osd_pool_default_min_size = ??),
and what is your setting for the number of copies to create
(osd_pool_default_size = ???)?

Please give us the output of

$ ceph osd pool ls detail
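To raise the debug level on osd.6 on the fly, something like this
should work (just a sketch; revert it afterwards, the logs grow
quickly):

$ ceph tell osd.6 injectargs '--debug_osd 20 --debug_ms 1'

Then re-issue the repair and watch /var/log/ceph/ceph-osd.6.log; at
debug_osd 20 you should see the scrub/repair request being queued.
To go back to the defaults:

$ ceph tell osd.6 injectargs '--debug_osd 0/5 --debug_ms 0/5'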
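You can also query the pool settings directly, independent of the
defaults (replace <poolname> with yours):

$ ceph osd pool get <poolname> size
$ ceph osd pool get <poolname> min_size

ceph osd pool ls detail will show size, min_size and the crush ruleset
for all pools in one go.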
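And to see where CRUSH wants to place pg 1.32 versus where it
currently is:

$ ceph pg map 1.32
$ ceph osd tree

If the "up" set in the pg map is empty or smaller than the pool's
size, CRUSH cannot find enough OSDs for that pg. A pg that is only
peered, not active, will also not run a scrub or repair, which could
explain why nothing shows up in the osd.6 log.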
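Regarding the 4.97 stat mismatch errors from your first mail below: a
pure stat mismatch (the pg's bookkeeping disagreeing with what
deep-scrub actually counted) can usually be fixed with a repair once
the pg is active again, because the primary then rewrites the stats:

$ ceph pg repair 4.97

But I would sort out the undersized / peered pgs first.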
--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:info@xxxxxxxxxxxxxxxxx

Address:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402, registered at Amtsgericht Hanau
Managing director: Oliver Dzombic

Tax no.: 35 236 3622 1
VAT ID: DE274086107

On 03.05.2016 at 19:11, Blade Doyle wrote:
> Hi Oliver,
>
> Thanks for your reply.
>
> The problem could have been caused by crashing/flapping OSDs. The
> cluster is stable now, but lots of pg problems remain.
>
> $ ceph health
> HEALTH_ERR 4 pgs degraded; 158 pgs inconsistent; 4 pgs stuck degraded;
> 1 pgs stuck inactive; 10 pgs stuck unclean; 4 pgs stuck undersized;
> 4 pgs undersized; recovery 1489/523934 objects degraded (0.284%);
> recovery 2620/523934 objects misplaced (0.500%); 158 scrub errors
>
> Example: for pg 1.32:
>
> $ ceph health detail | grep "pg 1.32"
> pg 1.32 is stuck inactive for 13260.118985, current state
> undersized+degraded+peered, last acting [6]
> pg 1.32 is stuck unclean for 945560.550800, current state
> undersized+degraded+peered, last acting [6]
> pg 1.32 is stuck undersized for 12855.304944, current state
> undersized+degraded+peered, last acting [6]
> pg 1.32 is stuck degraded for 12855.305305, current state
> undersized+degraded+peered, last acting [6]
> pg 1.32 is undersized+degraded+peered, acting [6]
>
> I tried various things like:
>
> $ ceph pg repair 1.32
> instructing pg 1.32 on osd.6 to repair
>
> $ ceph pg deep-scrub 1.32
> instructing pg 1.32 on osd.6 to deep-scrub
>
> It's odd that I never see any log on osd.6 about scrubbing or
> repairing that pg (after waiting many hours). I attached "ceph pg
> query" and a grep of the osd logs for that pg. If there is a better
> way to provide large logs please let me know.
>
> For reference, the last mention of that pg in the logs is:
>
> 2016-04-30 09:24:44.703785 975b9350 20 osd.6 349418 kicking pg 1.32
> 2016-04-30 09:24:44.703880 975b9350 30 osd.6 pg_epoch: 349418 pg[1.32( v
> 338815'7745 (20981'4727,338815'7745] local-les=349347 n=435 ec=17 les/c
> 349347/349347 349418/349418/349418) [] r=-1 lpr=349418
> pi=349346-349417/1 crt=338815'7743 lcod 0'0 inactive NOTIFY] lock
>
> Suggestions appreciated,
> Blade.
>
> On Sat, Apr 30, 2016 at 9:31 AM, Blade Doyle <blade.doyle@xxxxxxxxx> wrote:
>
>     Hi Ceph-Users,
>
>     Help with how to resolve these would be appreciated.
>
>     2016-04-30 09:25:58.399634 9b809350  0 log_channel(cluster) log
>     [INF] : 4.97 deep-scrub starts
>     2016-04-30 09:26:00.041962 93009350  0 -- 192.168.2.52:6800/6640 >>
>     192.168.2.32:0/3983425916 pipe(0x27406000 sd=111 :6800 s=0 pgs=0
>     cs=0 l=0 c=0x272da0a0).accept peer addr is really
>     192.168.2.32:0/3983425916 (socket is 192.168.2.32:38514/0)
>     2016-04-30 09:26:15.415883 9b809350 -1 log_channel(cluster) log
>     [ERR] : 4.97 deep-scrub stat mismatch, got 284/282 objects, 0/0
>     clones, 145/145 dirty, 0/0 omap, 4/2 hit_set_archive, 137/137
>     whiteouts, 365855441/365855441 bytes,340/340 hit_set_archive bytes.
>     2016-04-30 09:26:15.415953 9b809350 -1 log_channel(cluster) log
>     [ERR] : 4.97 deep-scrub 1 errors
>     2016-04-30 09:26:15.416425 9b809350  0 log_channel(cluster) log
>     [INF] : 4.97 scrub starts
>     2016-04-30 09:26:15.682311 9b809350 -1 log_channel(cluster) log
>     [ERR] : 4.97 scrub stat mismatch, got 284/282 objects, 0/0 clones,
>     145/145 dirty, 0/0 omap, 4/2 hit_set_archive, 137/137 whiteouts,
>     365855441/365855441 bytes,340/340 hit_set_archive bytes.
>     2016-04-30 09:26:15.682392 9b809350 -1 log_channel(cluster) log
>     [ERR] : 4.97 scrub 1 errors
>
>     Thanks Much,
>     Blade.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com