David, does this look familiar?
-Sam

On Fri, Aug 28, 2015 at 10:43 AM, Aaron Ten Clay <aarontc@xxxxxxxxxxx> wrote:
> Hi Cephers,
>
> I'm trying to resolve an inconsistent pg on an erasure-coded pool, running
> Ceph 9.0.2. I can't seem to get Ceph to run a repair or even deep-scrub the
> pg again. Here's the background, with my attempted resolution steps below.
> Hopefully someone can steer me in the right direction. Thanks in advance!
>
> Current state:
>
> # ceph health detail
> HEALTH_ERR 1 pgs inconsistent; 1 scrub errors; noout flag(s) set
> pg 2.36 is active+clean+inconsistent, acting
> [1,21,12,9,0,10,14,7,18,20,5,4,22,16]
> 1 scrub errors
> noout flag(s) set
>
> I started by looking at the log file for osd.1, where I found the cause of
> the inconsistent report:
>
> 2015-08-24 00:43:10.391621 7f09fcff9700 0 log_channel(cluster) log [INF] :
> 2.36 deep-scrub starts
> 2015-08-24 01:54:59.933532 7f09fcff9700 -1 log_channel(cluster) log [ERR] :
> 2.36s0 shard 21(1): soid 576340b6/10000005990.00000199/head//2 candidate had
> a read error
> 2015-08-24 02:34:41.380740 7f09fcff9700 -1 log_channel(cluster) log [ERR] :
> 2.36s0 deep-scrub 0 missing, 1 inconsistent objects
> 2015-08-24 02:34:41.380757 7f09fcff9700 -1 log_channel(cluster) log [ERR] :
> 2.36 deep-scrub 1 errors
>
> I checked osd.21, where this report appears:
>
> 2015-08-24 01:54:56.477020 7f707cbd4700 0 osd.21 pg_epoch: 31958 pg[2.36s1(
> v 31957'43013 (7132'39997,31957'43013] local-les=31951 n=34556 ec=136 les/c
> 31951/31954 31945/31945/31924) [1,21,12,9,0,10,14,7,18,20,5,4,22,16] r=1
> lpr=31945 pi=1131-31944/7827 luod=0'0 crt=31957'43011 active] _scan_list
> 576340b6/10000005990.00000199/head//2 got incorrect hash on read
>
> So, based on the Ceph documentation, I thought I could repair the pg by
> executing "ceph pg repair 2.36". When I run this while watching the mon
> log, I see the command dispatch:
>
> 2015-08-28 10:14:17.964017 mon.0 [INF] from='client.? 10.42.5.61:0/1002181'
> entity='client.admin' cmd=[{"prefix": "pg repair", "pgid": "2.36"}]:
> dispatch
>
> But I never see a "finish" in the mon log, like most ceph commands return.
> (Not sure if I should expect to see a finish; just noting it doesn't occur.)
>
> Also, tailing the logs for every OSD in the acting set for pg 2.36, I never
> see anything about a repair. The same holds when I try "ceph pg 2.36
> deep-scrub": the command is dispatched, but none of the OSDs react. In the
> past, on other clusters, I've seen "[INF] : pg.id repair starts" messages in
> the OSD log after executing "ceph pg nn.yy repair".
>
> Further confusing me, I do see osd.1 start and finish other pg deep-scrubs,
> both before and after executing "ceph pg 2.36 deep-scrub".
>
> I know EC pools are special in several ways, but nothing in the Ceph manual
> seems to indicate I can't deep-scrub or repair pgs in an EC pool...
>
> Thanks for reading and any suggestions. I'm happy to provide complete log
> files or more details if I've left out any information that could be
> helpful.
>
> ceph -s: http://hastebin.com/xetohugibi
> ceph pg dump: http://hastebin.com/bijehoheve
> ceph -v: ceph version 9.0.2 (be422c8f5b494c77ebcf0f7b95e5d728ecacb7f0)
> ceph osd dump: http://hastebin.com/fitajuzeca
>
> -Aaron
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
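
For reference, a minimal sketch of the commands under discussion in this
thread. It assumes the pg id (2.36) and OSD ids quoted above, the default
/var/log/ceph/ log location, and that the "ceph daemon" call is made on the
host where osd.1 actually runs; adjust for your own cluster.

  # confirm the up/acting set and the primary for the pg
  ceph pg map 2.36

  # ask the pg's primary to queue a deep-scrub, then a repair
  ceph pg deep-scrub 2.36
  ceph pg repair 2.36

  # how many concurrent scrubs/repairs this OSD will accept
  # (queried via the admin socket on the OSD's host)
  ceph daemon osd.1 config get osd_max_scrubs

  # watch the primary's log for the scrub/repair actually starting
  # (default log path assumed)
  tail -f /var/log/ceph/ceph-osd.1.log | grep -E '2\.36.*(scrub|repair)'

Watching the primary's log is the useful check because scrubs and repairs for
a pg are driven by its primary OSD; whether a manually requested scrub is
subject to the same scheduling limits as periodic ones varies by release, so
treat this only as a starting point.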