# ceph pg repair 1.32
instructing pg 1.32 on osd.6 to repair
2016-05-04 11:19:50.528355 mon.0 [INF] from='client.? 192.168.2.224:0/1341169978' entity='client.admin' cmd=[{"prefix": "pg repair", "pgid": "1.32"}]: dispatch
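As far as I can tell, scrub/repair only gets scheduled for a PG that is active, and 1.32 is peered but not active, so the repair may simply never start. A quick way to check what the PG itself reports (just a sketch, output omitted here):
# ceph pg 1.32 query | grep '"state"'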
# ceph osd pool ls detail
pool 0 'rbd' replicated size 2 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 349499 flags hashpspool stripe_width 0
removed_snaps [1~d]
pool 1 'cephfs_data' replicated size 2 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 300 pgp_num 300 last_change 349490 lfor 25902 flags hashpspool crash_replay_interval 45 tiers 4 read_tier 4 write_tier 4 stripe_width 0
pool 2 'cephfs_metadata' replicated size 2 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 300 pgp_num 300 last_change 349503 flags hashpspool stripe_width 0
pool 4 'ssd_cache' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 349490 flags hashpspool,incomplete_clones tier_of 1 cache_mode writeback target_bytes 126701535232 target_objects 1000000 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 3600s x2 min_read_recency_for_promote 1 stripe_width 0
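One thing that stands out here: pool 1 (cephfs_data) has size 2 and min_size 2, so a PG with only osd.6 in its acting set cannot go active. If that is acceptable on this cluster (a sketch only, not something verified here), min_size could be dropped to 1 temporarily so the PG can go active and backfill the missing copy, then raised again afterwards:
# ceph osd pool get cephfs_data min_size
# ceph osd pool set cephfs_data min_size 1
(wait for recovery to finish)
# ceph osd pool set cephfs_data min_size 2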
# ceph osd tree
ID  WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-12 0.39999 root ssd_cache
 -4 0.20000     host node11
  8 0.20000         osd.8       up  1.00000          1.00000
-11 0.20000     host node13
  0 0.20000         osd.0       up  1.00000          1.00000
 -1 2.99997 root default
 -7 0.29999     host node6
  7 0.29999         osd.7       up  0.72400          1.00000
 -8 0.23000     host node5
  5 0.23000         osd.5       up  0.67996          1.00000
 -6 0.45999     host node12
  9 0.45999         osd.9       up  0.72157          1.00000
-10 0.67000     host node14
 10 0.67000         osd.10      up  0.70659          1.00000
-13 0.67000     host node22
  6 0.67000         osd.6       up  0.69070          1.00000
-15 0.67000     host node21
 11 0.67000         osd.11      up  0.69788          1.00000
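To compare where CRUSH wants to place this PG (the up set) with where it actually sits (the acting set), a sketch:
# ceph pg map 1.32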
--
I tried some commands to "accept the missing objects as lost" but it tells me:
# ceph pg 1.32 mark_unfound_lost delete
pg has no unfound objects
2016-05-04 11:31:03.742453 9b088350 0 osd.6 350327 do_command r=0
2016-05-04 11:31:03.763017 9b088350 0 osd.6 350327 do_command r=0 pg has no unfound objects
2016-05-04 11:31:03.763066 9b088350 0 log_channel(cluster) log [INF] : pg has no unfound objects
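To double-check that "no unfound objects" answer, the unfound counters can also be read directly (a sketch):
# ceph pg 1.32 query | grep -i unfound
# ceph health detail | grep unfound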
I also tried to force-create the PG:
# ceph pg force_create_pg 1.32
pg 1.32 now creating, ok
In that case, I do see a dispatch:
2016-05-04 11:32:42.073625 mon.4 [INF] from='client.? 192.168.2.224:0/208882728' entity='client.admin' cmd=[{"prefix": "pg force_create_pg", "pgid": "1.32"}]: dispatch
2016-05-04 11:32:42.075024 mon.0 [INF] from='client.17514719 :/0' entity='client.admin' cmd=[{"prefix": "pg force_create_pg", "pgid": "1.32"}]: dispatch
2016-05-04 11:32:42.183389 mon.0 [INF] from='client.17514719 :/0' entity='client.admin' cmd='[{"prefix": "pg force_create_pg", "pgid": "1.32"}]': finished
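When the PG falls back out of "creating", the peering section of a PG query usually says what it is waiting on; a sketch (field names from memory, so treat them as approximate):
# ceph pg 1.32 query | grep -A 3 blocked_by
# ceph pg 1.32 query | grep -A 20 recovery_state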
That puts the PG into the "creating" state for a while:
# ceph health detail | grep 1.32
pg 1.32 is stuck inactive since forever, current state creating, last acting []
pg 1.32 is stuck unclean since forever, current state creating, last acting []
But after a few minutes it returns to the previous state:
# ceph health detail | grep 1.32
pg 1.32 is stuck inactive for 160741.831891, current state undersized+degraded+peered, last acting [6]
pg 1.32 is stuck unclean for 1093042.263678, current state undersized+degraded+peered, last acting [6]
pg 1.32 is stuck undersized for 57229.481051, current state undersized+degraded+peered, last acting [6]
pg 1.32 is stuck degraded for 57229.481382, current state undersized+degraded+peered, last acting [6]
pg 1.32 is undersized+degraded+peered, acting [6]
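Since the acting set keeps coming back as [6] only, it may be worth checking whether the CRUSH rule for pool 1 (ruleset 0) can actually find a second OSD at all. A sketch using the decompiled CRUSH map (the /tmp paths are just placeholders, and the crushtool flags are from memory):
# ceph osd getcrushmap -o /tmp/crushmap
# crushtool -d /tmp/crushmap -o /tmp/crushmap.txt
# crushtool -i /tmp/crushmap --test --rule 0 --num-rep 2 --show-bad-mappings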
Blade.
On Tue, May 3, 2016 at 10:45 AM, Oliver Dzombic <info@xxxxxxxxxxxxxxxxx> wrote:
Hi Blade,
if you don't see anything in the logs, then you should raise the debug
level/verbosity.
You should at least see that the repair command has been issued (started).
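One way to do that at runtime on the primary OSD (a sketch; remember to put the values back to what they were afterwards):
# ceph tell osd.6 injectargs '--debug_osd 20 --debug_ms 1'
# ceph pg repair 1.32
(then watch /var/log/ceph/ceph-osd.6.log for the scrub/repair lines)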
I am also wondering about the [6] in your output.
That means there is only one copy of it (on osd.6).
What is your setting for the minimum number of required copies?
osd_pool_default_min_size = ??
And what is the setting for the number of copies to create?
osd_pool_default_size = ???
Please give us the output of
ceph osd pool ls detail
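The per-pool size/min_size shown there override the defaults, so that output is the authoritative check; the running defaults themselves could also be read from an OSD's admin socket (a sketch, assuming access to the node that runs osd.6):
# ceph daemon osd.6 config get osd_pool_default_size
# ceph daemon osd.6 config get osd_pool_default_min_size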
--
Mit freundlichen Gruessen / Best regards
Oliver Dzombic
IP-Interactive
mailto:info@xxxxxxxxxxxxxxxxx
Address:
IP Interactive UG (haftungsbeschraenkt)
Zum Sonnenberg 1-3
63571 Gelnhausen
Commercial register: HRB 93402, Amtsgericht Hanau
Managing director: Oliver Dzombic
Tax no.: 35 236 3622 1
VAT ID: DE274086107
On 03.05.2016 at 19:11, Blade Doyle wrote:
> Hi Oliver,
>
> Thanks for your reply.
>
> The problem could have been caused by crashing/flapping OSDs. The
> cluster is stable now, but lots of pg problems remain.
>
> $ ceph health
> HEALTH_ERR 4 pgs degraded; 158 pgs inconsistent; 4 pgs stuck degraded; 1
> pgs stuck inactive; 10 pgs stuck unclean; 4 pgs stuck undersized; 4 pgs
> undersized; recovery 1489/523934 objects degraded (0.284%); recovery
> 2620/523934 objects misplaced (0.500%); 158 scrub errors
>
> Example: for pg 1.32 :
>
> $ ceph health detail | grep "pg 1.32"
> pg 1.32 is stuck inactive for 13260.118985, current state
> undersized+degraded+peered, last acting [6]
> pg 1.32 is stuck unclean for 945560.550800, current state
> undersized+degraded+peered, last acting [6]
> pg 1.32 is stuck undersized for 12855.304944, current state
> undersized+degraded+peered, last acting [6]
> pg 1.32 is stuck degraded for 12855.305305, current state
> undersized+degraded+peered, last acting [6]
> pg 1.32 is undersized+degraded+peered, acting [6]
>
> I tried various things like:
>
> $ ceph pg repair 1.32
> instructing pg 1.32 on osd.6 to repair
>
> $ ceph pg deep-scrub 1.32
> instructing pg 1.32 on osd.6 to deep-scrub
>
> It's odd that I never see any log on osd.6 about scrubbing or
> repairing that pg (after waiting many hours). I attached "ceph pg
> query" and a grep of the osd logs for that pg. If there is a better way
> to provide large logs please let me know.
>
> For reference the last mention of that pg in the logs is:
>
> 2016-04-30 09:24:44.703785 975b9350 20 osd.6 349418 kicking pg 1.32
> 2016-04-30 09:24:44.703880 975b9350 30 osd.6 pg_epoch: 349418 pg[1.32( v
> 338815'7745 (20981'4727,338815'7745] local-les=349347 n=435 ec=17 les/c
> 349347/349347 349418/349418/349418) [] r=-1 lpr=349418
> pi=349346-349417/1 crt=338815'7743 lcod 0'0 inactive NOTIFY] lock
>
>
> Suggestions appreciated,
> Blade.
>
>
>
>
> On Sat, Apr 30, 2016 at 9:31 AM, Blade Doyle <blade.doyle@xxxxxxxxx> wrote:
>
> Hi Ceph-Users,
>
> Help with how to resolve these would be appreciated.
>
> 2016-04-30 09:25:58.399634 9b809350 0 log_channel(cluster) log
> [INF] : 4.97 deep-scrub starts
> 2016-04-30 09:26:00.041962 93009350 0 -- 192.168.2.52:6800/6640 >>
> 192.168.2.32:0/3983425916 pipe(0x27406000 sd=111 :6800 s=0
> pgs=0 cs=0 l=0 c=0x272da0a0).accept peer addr is really
> 192.168.2.32:0/3983425916 (socket is 192.168.2.32:38514/0)
> 2016-04-30 09:26:15.415883 9b809350 -1 log_channel(cluster) log
> [ERR] : 4.97 deep-scrub stat mismatch, got 284/282 objects, 0/0
> clones, 145/145 dirty, 0/0 omap, 4/2 hit_set_archive, 137/137
> whiteouts, 365855441/365855441 bytes,340/340 hit_set_archive bytes.
> 2016-04-30 09:26:15.415953 9b809350 -1 log_channel(cluster) log
> [ERR] : 4.97 deep-scrub 1 errors
> 2016-04-30 09:26:15.416425 9b809350 0 log_channel(cluster) log
> [INF] : 4.97 scrub starts
> 2016-04-30 09:26:15.682311 9b809350 -1 log_channel(cluster) log
> [ERR] : 4.97 scrub stat mismatch, got 284/282 objects, 0/0 clones,
> 145/145 dirty, 0/0 omap, 4/2 hit_set_archive, 137/137 whiteouts,
> 365855441/365855441 bytes,340/340 hit_set_archive bytes.
> 2016-04-30 09:26:15.682392 9b809350 -1 log_channel(cluster) log
> [ERR] : 4.97 scrub 1 errors
>
> Thanks Much,
> Blade.
>
>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>