Already restarted those OSDs and then the whole cluster (rack by rack; the
failure domain is rack in this setup).

We would like to try the *ceph-objectstore-tool mark-complete* operation.
Is there any way (other than checking mtime on files and querying PGs) to
determine which replica has the most up-to-date data?

On 06/21/2016 12:37 PM, M Ranga Swami Reddy wrote:
> Try to restart OSDs 109 and 166? Check if that helps.
>
>
> On Tue, Jun 21, 2016 at 4:05 PM, Paweł Sadowski <ceph@xxxxxxxxx> wrote:
>> Thanks for the response.
>>
>> All OSDs seem to be OK; they have been restarted and rejoined the
>> cluster after that, nothing weird in the logs.
>>
>> # ceph pg dump_stuck stale
>> ok
>>
>> # ceph pg dump_stuck inactive
>> ok
>> pg_stat  state       up             up_primary  acting         acting_primary
>> 3.2929   incomplete  [109,272,83]   109         [109,272,83]   109
>> 3.1683   incomplete  [166,329,281]  166         [166,329,281]  166
>>
>> # ceph pg dump_stuck unclean
>> ok
>> pg_stat  state       up             up_primary  acting         acting_primary
>> 3.2929   incomplete  [109,272,83]   109         [109,272,83]   109
>> 3.1683   incomplete  [166,329,281]  166         [166,329,281]  166
>>
>>
>> On OSD 166 there are 100 blocked ops (on 109 too); they all end on
>> "event": "reached_pg".
>>
>> # ceph --admin-daemon /var/run/ceph/ceph-osd.166.asok dump_ops_in_flight
>> ...
>>         {
>>             "description": "osd_op(client.958764031.0:18137113
>> rbd_data.392585982ae8944a.0000000000000ad4 [set-alloc-hint object_size
>> 4194304 write_size 4194304,write 2641920~8192] 3.d6195683 RETRY=15
>> ack+ondisk+retry+write+known_if_redirected e613241)",
>>             "initiated_at": "2016-06-21 10:19:59.894393",
>>             "age": 828.025527,
>>             "duration": 600.020809,
>>             "type_data": [
>>                 "reached pg",
>>                 {
>>                     "client": "client.958764031",
>>                     "tid": 18137113
>>                 },
>>                 [
>>                     {
>>                         "time": "2016-06-21 10:19:59.894393",
>>                         "event": "initiated"
>>                     },
>>                     {
>>                         "time": "2016-06-21 10:29:59.915202",
>>                         "event": "reached_pg"
>>                     }
>>                 ]
>>             ]
>>         }
>>     ],
>>     "num_ops": 100
>> }
>>
>>
>>
>> On 06/21/2016 12:27 PM, M Ranga Swami Reddy wrote:
>>> You can use the below cmds:
>>> ===
>>> ceph pg dump_stuck stale
>>> ceph pg dump_stuck inactive
>>> ceph pg dump_stuck unclean
>>> ===
>>>
>>> And then query the PGs which are in unclean or stale state, and check
>>> for any issue with a specific OSD.
>>>
>>> Thanks
>>> Swami
>>>
>>> On Tue, Jun 21, 2016 at 3:02 PM, Paweł Sadowski <ceph@xxxxxxxxx> wrote:
>>>> Hello,
>>>>
>>>> We have an issue on one of our clusters. One node with 9 OSDs was down
>>>> for more than 12 hours. During that time the cluster recovered without
>>>> problems. When the host came back to the cluster we got two PGs in
>>>> incomplete state. We decided to mark the OSDs on this host as out, but
>>>> the two PGs are still in incomplete state. Trying to query those PGs
>>>> hangs forever. We already tried restarting the OSDs. Is there any way
>>>> to solve this issue without losing data?
>>>> Any help appreciated :)
>>>>
>>>> # ceph health detail | grep incomplete
>>>> HEALTH_WARN 2 pgs incomplete; 2 pgs stuck inactive; 2 pgs stuck unclean;
>>>> 200 requests are blocked > 32 sec; 2 osds have slow requests;
>>>> noscrub,nodeep-scrub flag(s) set
>>>> pg 3.2929 is stuck inactive since forever, current state incomplete,
>>>> last acting [109,272,83]
>>>> pg 3.1683 is stuck inactive since forever, current state incomplete,
>>>> last acting [166,329,281]
>>>> pg 3.2929 is stuck unclean since forever, current state incomplete,
>>>> last acting [109,272,83]
>>>> pg 3.1683 is stuck unclean since forever, current state incomplete,
>>>> last acting [166,329,281]
>>>> pg 3.1683 is incomplete, acting [166,329,281] (reducing pool vms
>>>> min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>>> pg 3.2929 is incomplete, acting [109,272,83] (reducing pool vms
>>>> min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>>>
>>>> The directory for PG 3.1683 is present on OSD 166 and contains ~8 GB.
>>>>
>>>> We didn't try setting min_size to 1 yet (we treat it as a last resort).
>>>>
>>>>
>>>> Some cluster info:
>>>>
>>>> # ceph --version
>>>> ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
>>>>
>>>> # ceph -s
>>>>      health HEALTH_WARN
>>>>             2 pgs incomplete
>>>>             2 pgs stuck inactive
>>>>             2 pgs stuck unclean
>>>>             200 requests are blocked > 32 sec
>>>>             noscrub,nodeep-scrub flag(s) set
>>>>      monmap e7: 5 mons at
>>>> {mon-03=*.2:6789/0,mon-04=*.36:6789/0,mon-05=*.81:6789/0,mon-06=*.0:6789/0,mon-07=*.40:6789/0}
>>>>             election epoch 3250, quorum 0,1,2,3,4
>>>> mon-06,mon-07,mon-04,mon-03,mon-05
>>>>      osdmap e613040: 346 osds: 346 up, 337 in
>>>>             flags noscrub,nodeep-scrub
>>>>       pgmap v27163053: 18624 pgs, 6 pools, 138 TB data, 39062 kobjects
>>>>             415 TB used, 186 TB / 601 TB avail
>>>>                18622 active+clean
>>>>                    2 incomplete
>>>>   client io 9992 kB/s rd, 64867 kB/s wr, 8458 op/s
>>>>
>>>> # ceph osd pool get vms pg_num
>>>> pg_num: 16384
>>>>
>>>> # ceph osd pool get vms size
>>>> size: 3
>>>>
>>>> # ceph osd pool get vms min_size
>>>> min_size: 2
>>
>> --
>> PS
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
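
For the question at the top of this thread (how to tell which replica holds
the most up-to-date data), one option besides comparing mtimes is to read
each replica's PG metadata directly with ceph-objectstore-tool. This is only
a sketch, not a verified procedure: it assumes the default filestore layout
under /var/lib/ceph/osd/ceph-<id>, that the tool shipped with this hammer
build provides the info and log ops, and that the OSD being inspected is
stopped before the tool touches its store.

# stop the OSD holding the copy to inspect (upstart syntax here;
# on systemd hosts: systemctl stop ceph-osd@166)
stop ceph-osd id=166

# dump the PG's info structure (last_update, last_complete, log_tail, ...)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-166 \
    --journal-path /var/lib/ceph/osd/ceph-166/journal \
    --pgid 3.1683 --op info

# the PG log shows the newest entries if a finer comparison is needed
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-166 \
    --journal-path /var/lib/ceph/osd/ceph-166/journal \
    --pgid 3.1683 --op log

# bring the OSD back up once done
start ceph-osd id=166

Repeating this on OSDs 329 and 281 (and on 109/272/83 for PG 3.2929) and
comparing the last_update epoch/version of each copy should indicate which
replica is the most recent.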
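
If mark-complete is attempted, it would be safer to export the chosen copy
first so it can be re-imported if the repair goes wrong. Again only a rough
sketch under the same assumptions (default filestore paths, OSD stopped, an
export path picked here for illustration), and it further assumes that the
export and mark-complete ops are available in the ceph-objectstore-tool of
this 0.94.6 build; if not, a newer tool binary may be needed.

# with the OSD stopped, back up the PG from the replica chosen as authoritative
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-166 \
    --journal-path /var/lib/ceph/osd/ceph-166/journal \
    --pgid 3.1683 --op export --file /root/pg-3.1683.export

# only then mark the PG complete on that OSD and bring it back up
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-166 \
    --journal-path /var/lib/ceph/osd/ceph-166/journal \
    --pgid 3.1683 --op mark-complete

# upstart syntax; on systemd hosts: systemctl start ceph-osd@166
start ceph-osd id=166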