Hi Blade,

You can try setting min_size to 1 to get it back online, and if/when the error
vanishes (maybe after another repair command) you can set min_size back to 2.

Alternatively, you can try to simply mark the osd the pg is on out/down (or maybe
remove it).

--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:info@xxxxxxxxxxxxxxxxx

Address:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402, District Court of Hanau
Managing director: Oliver Dzombic

Tax No.: 35 236 3622 1
VAT ID: DE274086107


On 04.05.2016 at 22:46, Blade Doyle wrote:
>
> When I issue the "ceph pg repair 1.32" command I *do* see it reported in
> the "ceph -w" output, but I *do not* see any new messages about pg 1.32
> in the log of osd.6 - even if I turn debug messages way up.
>
> # ceph pg repair 1.32
> instructing pg 1.32 on osd.6 to repair
>
> (ceph -w shows)
> 2016-05-04 11:19:50.528355 mon.0 [INF] from='client.?
> 192.168.2.224:0/1341169978' entity='client.admin'
> cmd=[{"prefix": "pg repair", "pgid": "1.32"}]: dispatch
>
> ---
>
> Yes, I also noticed that there is only one copy of that pg. I have no
> idea how it happened, but my pools (all of them) got set to replication
> size=1. I re-set them back to the intended values as soon as I noticed
> it. Currently the pools are configured like this:
>
> # ceph osd pool ls detail
> pool 0 'rbd' replicated size 2 min_size 2 crush_ruleset 0 object_hash
> rjenkins pg_num 64 pgp_num 64 last_change 349499 flags hashpspool
> stripe_width 0
>         removed_snaps [1~d]
> pool 1 'cephfs_data' replicated size 2 min_size 2 crush_ruleset 0
> object_hash rjenkins pg_num 300 pgp_num 300 last_change 349490 lfor
> 25902 flags hashpspool crash_replay_interval 45 tiers 4 read_tier 4
> write_tier 4 stripe_width 0
> pool 2 'cephfs_metadata' replicated size 2 min_size 2 crush_ruleset 0
> object_hash rjenkins pg_num 300 pgp_num 300 last_change 349503 flags
> hashpspool stripe_width 0
> pool 4 'ssd_cache' replicated size 2 min_size 1 crush_ruleset 0
> object_hash rjenkins pg_num 256 pgp_num 256 last_change 349490 flags
> hashpspool,incomplete_clones tier_of 1 cache_mode writeback target_bytes
> 126701535232 target_objects 1000000 hit_set
> bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 3600s
> x2 min_read_recency_for_promote 1 stripe_width 0
>
> # ceph osd tree
> ID  WEIGHT  TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -12 0.39999 root ssd_cache
>  -4 0.20000     host node11
>   8 0.20000         osd.8           up  1.00000          1.00000
> -11 0.20000     host node13
>   0 0.20000         osd.0           up  1.00000          1.00000
>  -1 2.99997 root default
>  -7 0.29999     host node6
>   7 0.29999         osd.7           up  0.72400          1.00000
>  -8 0.23000     host node5
>   5 0.23000         osd.5           up  0.67996          1.00000
>  -6 0.45999     host node12
>   9 0.45999         osd.9           up  0.72157          1.00000
> -10 0.67000     host node14
>  10 0.67000         osd.10          up  0.70659          1.00000
> -13 0.67000     host node22
>   6 0.67000         osd.6           up  0.69070          1.00000
> -15 0.67000     host node21
>  11 0.67000         osd.11          up  0.69788          1.00000
>
> --
>
> For the most part, the data in my ceph cluster is not critical. Also, I have
> a recent backup. At this point I would be happy to resolve the pg
> problems "any way possible" in order to get it working again. Can I
> just delete the problematic pg (or the versions of it that are broken)?
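
For reference, the temporary min_size change suggested at the top of this reply
would look roughly like the following for pool 1 (cephfs_data, which pg 1.32
belongs to according to the listing above). This is only a sketch of the idea,
not a verified procedure; adjust the pool name if the stuck pg lives elsewhere:

    # temporarily allow the pool to serve I/O from a single replica
    ceph osd pool set cephfs_data min_size 1

    # once pg 1.32 goes active, re-issue the repair / deep-scrub
    ceph pg repair 1.32
    ceph pg deep-scrub 1.32

    # when the pg is active+clean again, restore the intended value
    ceph osd pool set cephfs_data min_size 2

With size 2 and min_size 2, a pool stops serving I/O as soon as one replica is
missing, which is why the pg sits in undersized+degraded+peered instead of going
active with only osd.6.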
>
> I tried some commands to "accept the missing objects as lost" but it
> tells me:
>
> # ceph pg 1.32 mark_unfound_lost delete
> pg has no unfound objects
>
> The osd log for that is:
> 2016-05-04 11:31:03.742453 9b088350  0 osd.6 350327 do_command r=0
> 2016-05-04 11:31:03.763017 9b088350  0 osd.6 350327 do_command r=0 pg
> has no unfound objects
> 2016-05-04 11:31:03.763066 9b088350  0 log_channel(cluster) log [INF] :
> pg has no unfound objects
>
> I also tried to "force create" the pg:
> # ceph pg force_create_pg 1.32
> pg 1.32 now creating, ok
>
> In that case, I do see a dispatch:
> 2016-05-04 11:32:42.073625 mon.4 [INF] from='client.?
> 192.168.2.224:0/208882728' entity='client.admin'
> cmd=[{"prefix": "pg force_create_pg", "pgid": "1.32"}]: dispatch
> 2016-05-04 11:32:42.075024 mon.0 [INF] from='client.17514719 :/0'
> entity='client.admin' cmd=[{"prefix": "pg force_create_pg", "pgid":
> "1.32"}]: dispatch
> 2016-05-04 11:32:42.183389 mon.0 [INF] from='client.17514719 :/0'
> entity='client.admin' cmd='[{"prefix": "pg force_create_pg", "pgid":
> "1.32"}]': finished
>
> That puts the pg in a new state for a while:
> # ceph health detail | grep 1.32
> pg 1.32 is stuck inactive since forever, current state creating, last
> acting []
> pg 1.32 is stuck unclean since forever, current state creating, last
> acting []
>
> But after a few minutes it returns to the previous state:
>
> # ceph health detail | grep 1.32
> pg 1.32 is stuck inactive for 160741.831891, current state
> undersized+degraded+peered, last acting [6]
> pg 1.32 is stuck unclean for 1093042.263678, current state
> undersized+degraded+peered, last acting [6]
> pg 1.32 is stuck undersized for 57229.481051, current state
> undersized+degraded+peered, last acting [6]
> pg 1.32 is stuck degraded for 57229.481382, current state
> undersized+degraded+peered, last acting [6]
> pg 1.32 is undersized+degraded+peered, acting [6]
>
> Blade.
>
>
> On Tue, May 3, 2016 at 10:45 AM, Oliver Dzombic <info@xxxxxxxxxxxxxxxxx> wrote:
>
> Hi Blade,
>
> if you don't see anything in the logs, then you should raise the debug
> level/frequency.
>
> You must at least see that the repair command has been issued
> (started).
>
> Also, I am wondering about the [6] from your output.
>
> That means that there is only 1 copy of it (on osd.6).
>
> What is your setting for the minimum required copies?
>
> osd_pool_default_min_size = ??
>
> And what is the setting for the number of copies to create?
>
> osd_pool_default_size = ???
>
> Please give us the output of
>
> ceph osd pool ls detail
>
>
> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:info@xxxxxxxxxxxxxxxxx
>
> Address:
>
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402, District Court of Hanau
> Managing director: Oliver Dzombic
>
> Tax No.: 35 236 3622 1
> VAT ID: DE274086107
>
>
> On 03.05.2016 at 19:11, Blade Doyle wrote:
> > Hi Oliver,
> >
> > Thanks for your reply.
> >
> > The problem could have been caused by crashing/flapping OSDs. The
> > cluster is stable now, but lots of pg problems remain.
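
The debug raise mentioned above can be done at runtime; something like the
following should work (the values are just examples, injected settings are lost
when the daemon restarts, and the admin-socket queries have to be run on the
host that carries osd.6, node22 in the tree above):

    # turn up logging on osd.6 while re-running the repair
    ceph tell osd.6 injectargs '--debug-osd 20 --debug-ms 1'
    ceph pg repair 1.32

    # default log location assumed; watch for lines mentioning pg 1.32
    tail -f /var/log/ceph/ceph-osd.6.log | grep '1\.32'

    # the pool defaults asked about above, read via the admin socket on node22
    ceph daemon osd.6 config get osd_pool_default_size
    ceph daemon osd.6 config get osd_pool_default_min_size

    # turn logging back down afterwards
    ceph tell osd.6 injectargs '--debug-osd 0/5 --debug-ms 0/5'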
> >
> > $ ceph health
> > HEALTH_ERR 4 pgs degraded; 158 pgs inconsistent; 4 pgs stuck degraded; 1
> > pgs stuck inactive; 10 pgs stuck unclean; 4 pgs stuck undersized; 4 pgs
> > undersized; recovery 1489/523934 objects degraded (0.284%); recovery
> > 2620/523934 objects misplaced (0.500%); 158 scrub errors
> >
> > Example: for pg 1.32:
> >
> > $ ceph health detail | grep "pg 1.32"
> > pg 1.32 is stuck inactive for 13260.118985, current state
> > undersized+degraded+peered, last acting [6]
> > pg 1.32 is stuck unclean for 945560.550800, current state
> > undersized+degraded+peered, last acting [6]
> > pg 1.32 is stuck undersized for 12855.304944, current state
> > undersized+degraded+peered, last acting [6]
> > pg 1.32 is stuck degraded for 12855.305305, current state
> > undersized+degraded+peered, last acting [6]
> > pg 1.32 is undersized+degraded+peered, acting [6]
> >
> > I tried various things like:
> >
> > $ ceph pg repair 1.32
> > instructing pg 1.32 on osd.6 to repair
> >
> > $ ceph pg deep-scrub 1.32
> > instructing pg 1.32 on osd.6 to deep-scrub
> >
> > It's odd that I never do see any log on osd.6 about scrubbing or
> > repairing that pg (after waiting many hours). I attached "ceph pg
> > query" and a grep of the osd logs for that pg. If there is a better way
> > to provide large logs, please let me know.
> >
> > For reference, the last mention of that pg in the logs is:
> >
> > 2016-04-30 09:24:44.703785 975b9350 20 osd.6 349418 kicking pg 1.32
> > 2016-04-30 09:24:44.703880 975b9350 30 osd.6 pg_epoch: 349418 pg[1.32( v
> > 338815'7745 (20981'4727,338815'7745] local-les=349347 n=435 ec=17 les/c
> > 349347/349347 349418/349418/349418) [] r=-1 lpr=349418
> > pi=349346-349417/1 crt=338815'7743 lcod 0'0 inactive NOTIFY] lock
> >
> >
> > Suggestions appreciated,
> > Blade.
> >
> >
> > On Sat, Apr 30, 2016 at 9:31 AM, Blade Doyle <blade.doyle@xxxxxxxxx> wrote:
> >
> > Hi Ceph-Users,
> >
> > Help with how to resolve these would be appreciated.
> >
> > 2016-04-30 09:25:58.399634 9b809350  0 log_channel(cluster) log
> > [INF] : 4.97 deep-scrub starts
> > 2016-04-30 09:26:00.041962 93009350  0 -- 192.168.2.52:6800/6640
> > >> 192.168.2.32:0/3983425916 pipe(0x27406000 sd=111 :6800 s=0
> > pgs=0 cs=0 l=0 c=0x272da0a0).accept peer addr is really
> > 192.168.2.32:0/3983425916 (socket is 192.168.2.32:38514/0)
> > 2016-04-30 09:26:15.415883 9b809350 -1 log_channel(cluster) log
> > [ERR] : 4.97 deep-scrub stat mismatch, got 284/282 objects, 0/0
> > clones, 145/145 dirty, 0/0 omap, 4/2 hit_set_archive, 137/137
> > whiteouts, 365855441/365855441 bytes, 340/340 hit_set_archive bytes.
> > 2016-04-30 09:26:15.415953 9b809350 -1 log_channel(cluster) log
> > [ERR] : 4.97 deep-scrub 1 errors
> > 2016-04-30 09:26:15.416425 9b809350  0 log_channel(cluster) log
> > [INF] : 4.97 scrub starts
> > 2016-04-30 09:26:15.682311 9b809350 -1 log_channel(cluster) log
> > [ERR] : 4.97 scrub stat mismatch, got 284/282 objects, 0/0 clones,
> > 145/145 dirty, 0/0 omap, 4/2 hit_set_archive, 137/137 whiteouts,
> > 365855441/365855441 bytes, 340/340 hit_set_archive bytes.
> > 2016-04-30 09:26:15.682392 9b809350 -1 log_channel(cluster) log
> > [ERR] : 4.97 scrub 1 errors
> >
> > Thanks Much,
> > Blade.
> >

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
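
As for the 158 inconsistent pgs and the 4.97 stat mismatch quoted earlier in the
thread: a stat mismatch found by (deep-)scrub can usually be cleared by issuing a
repair for each inconsistent pg. A rough sketch, assuming the usual
"pg <id> is ... inconsistent, acting [...]" line format in ceph health detail;
review the list before looping over it:

    # collect the inconsistent pgs and ask their primary OSDs to repair them
    ceph health detail | awk '$1 == "pg" && /inconsistent/ {print $2}' | sort -u |
    while read pg; do
        ceph pg repair "$pg"
    done

Repair only runs on pgs that are active, so pg 1.32 will stay untouched in its
undersized+degraded+peered state until the min_size / second-replica situation
above is sorted out.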