hi greg,

i attached the gzip output of the query and some more info below. if you
need more, let me know.

stijn

> [root@mds01 ~]# ceph -s
>     cluster 92beef0a-1239-4000-bacf-4453ab630e47
>      health HEALTH_ERR
>             1 pgs inconsistent
>             40 requests are blocked > 512 sec
>             1 scrub errors
>             mds0: Behind on trimming (2793/30)
>      monmap e1: 3 mons at {mds01=1.2.3.4:6789/0,mds02=1.2.3.5:6789/0,mds03=1.2.3.6:6789/0}
>             election epoch 326, quorum 0,1,2 mds01,mds02,mds03
>       fsmap e238677: 1/1/1 up {0=mds02=up:active}, 2 up:standby
>      osdmap e79554: 156 osds: 156 up, 156 in
>             flags sortbitwise,require_jewel_osds
>       pgmap v51003893: 4096 pgs, 3 pools, 387 TB data, 243 Mobjects
>             545 TB used, 329 TB / 874 TB avail
>                 4091 active+clean
>                    4 active+clean+scrubbing+deep
>                    1 active+clean+inconsistent
>   client io 284 kB/s rd, 146 MB/s wr, 145 op/s rd, 177 op/s wr
>   cache io 115 MB/s flush, 153 MB/s evict, 14 op/s promote, 3 PG(s) flushing
> [root@mds01 ~]# ceph health detail
> HEALTH_ERR 1 pgs inconsistent; 52 requests are blocked > 512 sec; 5 osds have slow requests; 1 scrub errors; mds0: Behind on trimming (2782/30)
> pg 5.5e3 is active+clean+inconsistent, acting [35,50,91,18,139,59,124,40,104,12,71]
> 34 ops are blocked > 524.288 sec on osd.8
> 6 ops are blocked > 524.288 sec on osd.67
> 6 ops are blocked > 524.288 sec on osd.27
> 1 ops are blocked > 524.288 sec on osd.107
> 5 ops are blocked > 524.288 sec on osd.116
> 5 osds have slow requests
> 1 scrub errors
> mds0: Behind on trimming (2782/30)(max_segments: 30, num_segments: 2782)
> # zgrep -C 1 ERR ceph-osd.35.log.*.gz
> ceph-osd.35.log.5.gz:2017-10-14 11:25:52.260668 7f34d6748700 0 -- 10.141.16.13:6801/1001792 >> 1.2.3.11:6803/1951 pipe(0x56412da80800 sd=273 :6801 s=2 pgs=3176 cs=31 l=0 c=0x564156e83b00).fault with nothing to send, going to standby
> ceph-osd.35.log.5.gz:2017-10-14 11:26:06.071011 7f3511be4700 -1 log_channel(cluster) log [ERR] : 5.5e3s0 shard 59(5) missing 5:c7ae919b:::10014d3184b.00000000:head
> ceph-osd.35.log.5.gz:2017-10-14 11:28:36.465684 7f34ffdf5700 0 -- 1.2.3.13:6801/1001792 >> 1.2.3.21:6829/1834 pipe(0x56414e2a2000 sd=37 :6801 s=0 pgs=0 cs=0 l=0 c=0x5641470d2a00).accept connect_seq 33 vs existing 33 state standby
> ceph-osd.35.log.5.gz:--
> ceph-osd.35.log.5.gz:2017-10-14 11:43:35.570711 7f3508efd700 0 -- 1.2.3.13:6801/1001792 >> 1.2.3.20:6825/1806 pipe(0x56413be34000 sd=138 :6801 s=2 pgs=2763 cs=45 l=0 c=0x564132999480).fault with nothing to send, going to standby
> ceph-osd.35.log.5.gz:2017-10-14 11:44:02.235548 7f3511be4700 -1 log_channel(cluster) log [ERR] : 5.5e3s0 deep-scrub 1 missing, 0 inconsistent objects
> ceph-osd.35.log.5.gz:2017-10-14 11:44:02.235554 7f3511be4700 -1 log_channel(cluster) log [ERR] : 5.5e3 deep-scrub 1 errors
> ceph-osd.35.log.5.gz:2017-10-14 11:59:02.331454 7f34d6d4e700 0 -- 1.2.3.13:6801/1001792 >> 1.2.3.11:6817/1941 pipe(0x56414d370800 sd=227 :42104 s=2 pgs=3238 cs=89 l=0 c=0x56413122d200).fault with nothing to send, going to standby

On 10/18/2017 10:19 PM, Gregory Farnum wrote:
> It would help if you can provide the exact output of "ceph -s", "pg query",
> and any other relevant data. You shouldn't need to do manual repair of
> erasure-coded pools, since it has checksums and can tell which bits are
> bad. Following that article may not have done you any good (though I
> wouldn't expect it to hurt, either...)...
> -Greg
>
> On Wed, Oct 18, 2017 at 5:56 AM Stijn De Weirdt <stijn.deweirdt@xxxxxxxx>
> wrote:
>
>> hi all,
>>
>> we have a ceph 10.2.7 cluster with an 8+3 EC pool.
>> in that pool, there is a pg in inconsistent state.
>>
>> we followed http://ceph.com/geen-categorie/ceph-manually-repair-object/,
>> however, we are unable to solve our issue.
>>
>> from the primary osd logs, the reported pg had a missing object.
>>
>> we found a related object on the primary osd, and then looked for
>> similar ones in the same path on the other osds (i guess it just has the
>> index of the osd in the pg's list of osds suffixed).
>>
>> one osd did not have such a file (the 10 others did).
>>
>> so we did the "stop osd / flush journal / start osd / pg repair" steps on
>> both the primary osd and on the osd with the missing EC shard.
>>
>> however, the scrub error still exists.
>>
>> does anyone have any hints on what to do in this case?
>>
>> stijn
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
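For reference, a minimal sketch of how the scrub errors reported in the ceph health detail output above can be inspected on a jewel (10.2.x) cluster without grepping OSD logs by hand; the only input it assumes is the pg id 5.5e3 taken from that output:

  # list the objects the last deep-scrub flagged for this pg; on an EC pool
  # this names the shard(s) reported missing or with a bad checksum, i.e.
  # which OSD actually lacks the object
  rados list-inconsistent-obj 5.5e3 --format=json-pretty

  # confirm the up/acting set and the primary OSD for the pg
  ceph pg map 5.5e3

If list-inconsistent-obj reports that no scrub information is available, a fresh "ceph pg deep-scrub 5.5e3" has to complete first.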
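Once the bad shard has been identified, a sketch of re-issuing the repair and verifying the result, again assuming only the pg id from above; note that a requested repair or deep-scrub may not start immediately, since it waits for a free scrub slot on the primary OSD:

  # ask the primary OSD to repair the pg; on an EC pool the missing shard
  # should be reconstructed from the surviving shards
  ceph pg repair 5.5e3

  # afterwards, re-verify with a deep-scrub and check the health output
  ceph pg deep-scrub 5.5e3
  ceph health detail | grep 5.5e3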
Attachment: query_5.5e3.gz (application/gzip)
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com