Hi Blade,

if you don't see anything in the logs, then you should raise the debug
level / frequency. At the very least you should see that the repair
command has been issued (started).

Also I am wondering about the [6] from your output. That means there is
only one copy of that pg (on osd.6). What is your setting for the
minimum number of required copies (osd_pool_default_min_size = ??),
and what is your setting for the number of copies to create
(osd_pool_default_size = ???)?

Please give us the output of

$ ceph osd pool ls detail
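To raise the debug level on osd.6 on the fly, something like this
should work (just a sketch; revert it afterwards, the logs grow
quickly):

$ ceph tell osd.6 injectargs '--debug_osd 20 --debug_ms 1'

Then re-issue the repair and watch /var/log/ceph/ceph-osd.6.log; at
debug_osd 20 you should see the scrub/repair request being queued.
To go back to the defaults:

$ ceph tell osd.6 injectargs '--debug_osd 0/5 --debug_ms 0/5'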
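You can also query the pool settings directly, independent of the
defaults (replace <poolname> with yours):

$ ceph osd pool get <poolname> size
$ ceph osd pool get <poolname> min_size

ceph osd pool ls detail will show size, min_size and the crush ruleset
for all pools in one go.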
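And to see where CRUSH wants to place pg 1.32 versus where it
currently is:

$ ceph pg map 1.32
$ ceph osd tree

If the "up" set in the pg map is empty or smaller than the pool's
size, CRUSH cannot find enough OSDs for that pg. A pg that is only
peered, not active, will also not run a scrub or repair, which could
explain why nothing shows up in the osd.6 log.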
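Regarding the 4.97 stat mismatch errors from your first mail below: a
pure stat mismatch (the pg's bookkeeping disagreeing with what
deep-scrub actually counted) can usually be fixed with a repair once
the pg is active again, because the primary then rewrites the stats:

$ ceph pg repair 4.97

But I would sort out the undersized / peered pgs first.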
--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:info@xxxxxxxxxxxxxxxxx

Address:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402, registered at Amtsgericht Hanau
Managing director: Oliver Dzombic

Tax no.: 35 236 3622 1
VAT ID: DE274086107

On 03.05.2016 at 19:11, Blade Doyle wrote:
> Hi Oliver,
>
> Thanks for your reply.
>
> The problem could have been caused by crashing/flapping OSDs. The
> cluster is stable now, but lots of pg problems remain.
>
> $ ceph health
> HEALTH_ERR 4 pgs degraded; 158 pgs inconsistent; 4 pgs stuck degraded;
> 1 pgs stuck inactive; 10 pgs stuck unclean; 4 pgs stuck undersized;
> 4 pgs undersized; recovery 1489/523934 objects degraded (0.284%);
> recovery 2620/523934 objects misplaced (0.500%); 158 scrub errors
>
> Example: for pg 1.32:
>
> $ ceph health detail | grep "pg 1.32"
> pg 1.32 is stuck inactive for 13260.118985, current state
> undersized+degraded+peered, last acting [6]
> pg 1.32 is stuck unclean for 945560.550800, current state
> undersized+degraded+peered, last acting [6]
> pg 1.32 is stuck undersized for 12855.304944, current state
> undersized+degraded+peered, last acting [6]
> pg 1.32 is stuck degraded for 12855.305305, current state
> undersized+degraded+peered, last acting [6]
> pg 1.32 is undersized+degraded+peered, acting [6]
>
> I tried various things like:
>
> $ ceph pg repair 1.32
> instructing pg 1.32 on osd.6 to repair
>
> $ ceph pg deep-scrub 1.32
> instructing pg 1.32 on osd.6 to deep-scrub
>
> It's odd that I never see any log on osd.6 about scrubbing or
> repairing that pg (after waiting many hours). I attached "ceph pg
> query" and a grep of the osd logs for that pg. If there is a better
> way to provide large logs please let me know.
>
> For reference, the last mention of that pg in the logs is:
>
> 2016-04-30 09:24:44.703785 975b9350 20 osd.6 349418 kicking pg 1.32
> 2016-04-30 09:24:44.703880 975b9350 30 osd.6 pg_epoch: 349418 pg[1.32( v
> 338815'7745 (20981'4727,338815'7745] local-les=349347 n=435 ec=17 les/c
> 349347/349347 349418/349418/349418) [] r=-1 lpr=349418
> pi=349346-349417/1 crt=338815'7743 lcod 0'0 inactive NOTIFY] lock
>
> Suggestions appreciated,
> Blade.
>
> On Sat, Apr 30, 2016 at 9:31 AM, Blade Doyle <blade.doyle@xxxxxxxxx> wrote:
>
>     Hi Ceph-Users,
>
>     Help with how to resolve these would be appreciated.
>
>     2016-04-30 09:25:58.399634 9b809350  0 log_channel(cluster) log
>     [INF] : 4.97 deep-scrub starts
>     2016-04-30 09:26:00.041962 93009350  0 -- 192.168.2.52:6800/6640 >>
>     192.168.2.32:0/3983425916 pipe(0x27406000 sd=111 :6800 s=0 pgs=0
>     cs=0 l=0 c=0x272da0a0).accept peer addr is really
>     192.168.2.32:0/3983425916 (socket is 192.168.2.32:38514/0)
>     2016-04-30 09:26:15.415883 9b809350 -1 log_channel(cluster) log
>     [ERR] : 4.97 deep-scrub stat mismatch, got 284/282 objects, 0/0
>     clones, 145/145 dirty, 0/0 omap, 4/2 hit_set_archive, 137/137
>     whiteouts, 365855441/365855441 bytes,340/340 hit_set_archive bytes.
>     2016-04-30 09:26:15.415953 9b809350 -1 log_channel(cluster) log
>     [ERR] : 4.97 deep-scrub 1 errors
>     2016-04-30 09:26:15.416425 9b809350  0 log_channel(cluster) log
>     [INF] : 4.97 scrub starts
>     2016-04-30 09:26:15.682311 9b809350 -1 log_channel(cluster) log
>     [ERR] : 4.97 scrub stat mismatch, got 284/282 objects, 0/0 clones,
>     145/145 dirty, 0/0 omap, 4/2 hit_set_archive, 137/137 whiteouts,
>     365855441/365855441 bytes,340/340 hit_set_archive bytes.
>     2016-04-30 09:26:15.682392 9b809350 -1 log_channel(cluster) log
>     [ERR] : 4.97 scrub 1 errors
>
>     Thanks Much,
>     Blade.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com