'ceph pg query' on those PGs hangs forever. We ended up using
*ceph-objectstore-tool* *mark-complete* on those PGs (a rough sketch of that
procedure is at the bottom of this mail, below the quoted thread).

On 06/22/2016 11:45 AM, 施柏安 wrote:
> Hi,
> You can use the command 'ceph pg query' to check what's going on with the
> PGs which have a problem, and use "ceph-objectstore-tool" to recover that PG.
>
> 2016-06-21 19:09 GMT+08:00 Paweł Sadowski <ceph@xxxxxxxxx>:
>
> Already restarted those OSDs and then the whole cluster (rack by rack,
> failure domain is rack in this setup).
> We would like to try the *ceph-objectstore-tool mark-complete* operation.
> Is there any way (other than checking mtime on files and querying PGs) to
> determine which replica has the most up-to-date data?
>
> On 06/21/2016 12:37 PM, M Ranga Swami Reddy wrote:
> > Try to restart OSDs 109 and 166 and check if that helps?
> >
> >
> > On Tue, Jun 21, 2016 at 4:05 PM, Paweł Sadowski <ceph@xxxxxxxxx> wrote:
> >> Thanks for the response.
> >>
> >> All OSDs seem to be OK: they have been restarted, joined the cluster
> >> after that, nothing weird in the logs.
> >>
> >> # ceph pg dump_stuck stale
> >> ok
> >>
> >> # ceph pg dump_stuck inactive
> >> ok
> >> pg_stat  state       up             up_primary  acting         acting_primary
> >> 3.2929   incomplete  [109,272,83]   109         [109,272,83]   109
> >> 3.1683   incomplete  [166,329,281]  166         [166,329,281]  166
> >>
> >> # ceph pg dump_stuck unclean
> >> ok
> >> pg_stat  state       up             up_primary  acting         acting_primary
> >> 3.2929   incomplete  [109,272,83]   109         [109,272,83]   109
> >> 3.1683   incomplete  [166,329,281]  166         [166,329,281]  166
> >>
> >>
> >> On OSD 166 there are 100 blocked ops (on 109 too); they all end on
> >> "event": "reached_pg"
> >>
> >> # ceph --admin-daemon /var/run/ceph/ceph-osd.166.asok dump_ops_in_flight
> >> ...
> >>         {
> >>             "description": "osd_op(client.958764031.0:18137113
> >> rbd_data.392585982ae8944a.0000000000000ad4 [set-alloc-hint object_size
> >> 4194304 write_size 4194304,write 2641920~8192] 3.d6195683 RETRY=15
> >> ack+ondisk+retry+write+known_if_redirected e613241)",
> >>             "initiated_at": "2016-06-21 10:19:59.894393",
> >>             "age": 828.025527,
> >>             "duration": 600.020809,
> >>             "type_data": [
> >>                 "reached pg",
> >>                 {
> >>                     "client": "client.958764031",
> >>                     "tid": 18137113
> >>                 },
> >>                 [
> >>                     {
> >>                         "time": "2016-06-21 10:19:59.894393",
> >>                         "event": "initiated"
> >>                     },
> >>                     {
> >>                         "time": "2016-06-21 10:29:59.915202",
> >>                         "event": "reached_pg"
> >>                     }
> >>                 ]
> >>             ]
> >>         }
> >>     ],
> >>     "num_ops": 100
> >> }
> >>
> >>
> >>
> >> On 06/21/2016 12:27 PM, M Ranga Swami Reddy wrote:
> >>> You can use the below commands:
> >>> ===
> >>> ceph pg dump_stuck stale
> >>> ceph pg dump_stuck inactive
> >>> ceph pg dump_stuck unclean
> >>> ===
> >>>
> >>> And then query the PGs which are in the unclean or stale state, checking
> >>> for any issue with a specific OSD.
> >>>
> >>> Thanks
> >>> Swami
> >>>
> >>> On Tue, Jun 21, 2016 at 3:02 PM, Paweł Sadowski <ceph@xxxxxxxxx> wrote:
> >>>> Hello,
> >>>>
> >>>> We have an issue on one of our clusters. One node with 9 OSDs was down
> >>>> for more than 12 hours. During that time the cluster recovered without
> >>>> problems. When the host came back into the cluster we got two PGs in
> >>>> the incomplete state. We decided to mark the OSDs on this host as out,
> >>>> but the two PGs are still incomplete. Trying to query those PGs hangs
> >>>> forever. We have already tried restarting the OSDs. Is there any way to
> >>>> solve this issue without losing data?
> >>>> Any help appreciated. :)
> >>>>
> >>>> # ceph health detail | grep incomplete
> >>>> HEALTH_WARN 2 pgs incomplete; 2 pgs stuck inactive; 2 pgs stuck unclean;
> >>>> 200 requests are blocked > 32 sec; 2 osds have slow requests;
> >>>> noscrub,nodeep-scrub flag(s) set
> >>>> pg 3.2929 is stuck inactive since forever, current state incomplete,
> >>>> last acting [109,272,83]
> >>>> pg 3.1683 is stuck inactive since forever, current state incomplete,
> >>>> last acting [166,329,281]
> >>>> pg 3.2929 is stuck unclean since forever, current state incomplete,
> >>>> last acting [109,272,83]
> >>>> pg 3.1683 is stuck unclean since forever, current state incomplete,
> >>>> last acting [166,329,281]
> >>>> pg 3.1683 is incomplete, acting [166,329,281] (reducing pool vms
> >>>> min_size from 2 may help; search ceph.com/docs for 'incomplete')
> >>>> pg 3.2929 is incomplete, acting [109,272,83] (reducing pool vms
> >>>> min_size from 2 may help; search ceph.com/docs for 'incomplete')
> >>>>
> >>>> The directory for PG 3.1683 is present on OSD 166 and contains ~8 GB.
> >>>>
> >>>> We didn't try setting min_size to 1 yet (we treat it as a last resort).
> >>>>
> >>>>
> >>>> Some cluster info:
> >>>> # ceph --version
> >>>> ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
> >>>>
> >>>> # ceph -s
> >>>>      health HEALTH_WARN
> >>>>             2 pgs incomplete
> >>>>             2 pgs stuck inactive
> >>>>             2 pgs stuck unclean
> >>>>             200 requests are blocked > 32 sec
> >>>>             noscrub,nodeep-scrub flag(s) set
> >>>>      monmap e7: 5 mons at
> >>>> {mon-03=*.2:6789/0,mon-04=*.36:6789/0,mon-05=*.81:6789/0,mon-06=*.0:6789/0,mon-07=*.40:6789/0}
> >>>>             election epoch 3250, quorum 0,1,2,3,4
> >>>> mon-06,mon-07,mon-04,mon-03,mon-05
> >>>>      osdmap e613040: 346 osds: 346 up, 337 in
> >>>>             flags noscrub,nodeep-scrub
> >>>>       pgmap v27163053: 18624 pgs, 6 pools, 138 TB data, 39062 kobjects
> >>>>             415 TB used, 186 TB / 601 TB avail
> >>>>                18622 active+clean
> >>>>                    2 incomplete
> >>>>   client io 9992 kB/s rd, 64867 kB/s wr, 8458 op/s
> >>>>
> >>>>
> >>>> # ceph osd pool get vms pg_num
> >>>> pg_num: 16384
> >>>>
> >>>> # ceph osd pool get vms size
> >>>> size: 3
> >>>>
> >>>> # ceph osd pool get vms min_size
> >>>> min_size: 2

-- 
PS
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
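
For the archives: the mark-complete procedure referenced at the top had roughly
the shape below. This is an outline rather than an exact transcript (the OSD
id, data/journal paths and backup file name are only examples, shown for
PG 3.1683 on OSD 166), and mark-complete should be run only against the
replica that turns out to hold the newest data, after exporting a backup.

Set noout and stop the OSD; ceph-objectstore-tool cannot be used while the
OSD daemon is running:

  # ceph osd set noout
  # stop ceph-osd id=166        (or: systemctl stop ceph-osd@166)

Compare the PG info on every OSD in the acting set; the replica reporting
the highest last_update should be the most recent one:

  # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-166 \
        --journal-path /var/lib/ceph/osd/ceph-166/journal \
        --pgid 3.1683 --op info

Export the PG as a backup before changing anything:

  # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-166 \
        --journal-path /var/lib/ceph/osd/ceph-166/journal \
        --pgid 3.1683 --op export --file /root/pg-3.1683.export

Then, on the OSD holding the most recent copy only:

  # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-166 \
        --journal-path /var/lib/ceph/osd/ceph-166/journal \
        --pgid 3.1683 --op mark-complete

  # start ceph-osd id=166
  # ceph osd unset noout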