Try restarting OSDs 109 and 166 and check if that helps?

On Tue, Jun 21, 2016 at 4:05 PM, Paweł Sadowski <ceph@xxxxxxxxx> wrote:
> Thanks for the response.
>
> All OSDs seem to be OK: they have been restarted, joined the cluster
> after that, and there is nothing weird in the logs.
>
> # ceph pg dump_stuck stale
> ok
>
> # ceph pg dump_stuck inactive
> ok
> pg_stat  state       up             up_primary  acting         acting_primary
> 3.2929   incomplete  [109,272,83]   109         [109,272,83]   109
> 3.1683   incomplete  [166,329,281]  166         [166,329,281]  166
>
> # ceph pg dump_stuck unclean
> ok
> pg_stat  state       up             up_primary  acting         acting_primary
> 3.2929   incomplete  [109,272,83]   109         [109,272,83]   109
> 3.1683   incomplete  [166,329,281]  166         [166,329,281]  166
>
>
> On OSD 166 there are 100 blocked ops (on 109 too); they all end on
> "event": "reached_pg"
>
> # ceph --admin-daemon /var/run/ceph/ceph-osd.166.asok dump_ops_in_flight
> ...
>         {
>             "description": "osd_op(client.958764031.0:18137113
> rbd_data.392585982ae8944a.0000000000000ad4 [set-alloc-hint object_size
> 4194304 write_size 4194304,write 2641920~8192] 3.d6195683 RETRY=15
> ack+ondisk+retry+write+known_if_redirected e613241)",
>             "initiated_at": "2016-06-21 10:19:59.894393",
>             "age": 828.025527,
>             "duration": 600.020809,
>             "type_data": [
>                 "reached pg",
>                 {
>                     "client": "client.958764031",
>                     "tid": 18137113
>                 },
>                 [
>                     {
>                         "time": "2016-06-21 10:19:59.894393",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2016-06-21 10:29:59.915202",
>                         "event": "reached_pg"
>                     }
>                 ]
>             ]
>         }
>     ],
>     "num_ops": 100
> }
>
>
>
> On 06/21/2016 12:27 PM, M Ranga Swami Reddy wrote:
>> You can use the below commands:
>> ===
>> ceph pg dump_stuck stale
>> ceph pg dump_stuck inactive
>> ceph pg dump_stuck unclean
>> ===
>>
>> Then query the PGs which are in unclean or stale state, and check for
>> any issue with a specific OSD.
>>
>> Thanks
>> Swami
>>
>> On Tue, Jun 21, 2016 at 3:02 PM, Paweł Sadowski <ceph@xxxxxxxxx> wrote:
>>> Hello,
>>>
>>> We have an issue on one of our clusters. One node with 9 OSDs was down
>>> for more than 12 hours. During that time the cluster recovered without
>>> problems. When the host came back into the cluster we got two PGs in
>>> incomplete state. We decided to mark the OSDs on this host as out, but
>>> the two PGs are still in incomplete state. Trying to query those PGs
>>> hangs forever. We have already tried restarting the OSDs. Is there any
>>> way to solve this issue without losing data? Any help appreciated :)
>>>
>>> # ceph health detail | grep incomplete
>>> HEALTH_WARN 2 pgs incomplete; 2 pgs stuck inactive; 2 pgs stuck unclean;
>>> 200 requests are blocked > 32 sec; 2 osds have slow requests;
>>> noscrub,nodeep-scrub flag(s) set
>>> pg 3.2929 is stuck inactive since forever, current state incomplete,
>>> last acting [109,272,83]
>>> pg 3.1683 is stuck inactive since forever, current state incomplete,
>>> last acting [166,329,281]
>>> pg 3.2929 is stuck unclean since forever, current state incomplete,
>>> last acting [109,272,83]
>>> pg 3.1683 is stuck unclean since forever, current state incomplete,
>>> last acting [166,329,281]
>>> pg 3.1683 is incomplete, acting [166,329,281] (reducing pool vms
>>> min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>> pg 3.2929 is incomplete, acting [109,272,83] (reducing pool vms
>>> min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>>
>>> The directory for PG 3.1683 is present on OSD 166 and contains ~8 GB.
>>>
>>> We didn't try setting min_size to 1 yet (we treat it as a last resort).
>>>
>>>
>>> Some cluster info:
>>>
>>> # ceph --version
>>> ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
>>>
>>> # ceph -s
>>>     health HEALTH_WARN
>>>            2 pgs incomplete
>>>            2 pgs stuck inactive
>>>            2 pgs stuck unclean
>>>            200 requests are blocked > 32 sec
>>>            noscrub,nodeep-scrub flag(s) set
>>>     monmap e7: 5 mons at
>>> {mon-03=*.2:6789/0,mon-04=*.36:6789/0,mon-05=*.81:6789/0,mon-06=*.0:6789/0,mon-07=*.40:6789/0}
>>>            election epoch 3250, quorum 0,1,2,3,4
>>> mon-06,mon-07,mon-04,mon-03,mon-05
>>>     osdmap e613040: 346 osds: 346 up, 337 in
>>>            flags noscrub,nodeep-scrub
>>>     pgmap v27163053: 18624 pgs, 6 pools, 138 TB data, 39062 kobjects
>>>            415 TB used, 186 TB / 601 TB avail
>>>                18622 active+clean
>>>                    2 incomplete
>>>   client io 9992 kB/s rd, 64867 kB/s wr, 8458 op/s
>>>
>>> # ceph osd pool get vms pg_num
>>> pg_num: 16384
>>>
>>> # ceph osd pool get vms size
>>> size: 3
>>>
>>> # ceph osd pool get vms min_size
>>> min_size: 2
>
> --
> PS
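
To make the suggestions above concrete, here is a minimal command sketch for
a Hammer (0.94.x) cluster. The OSD ids (109, 166), PG ids (3.2929, 3.1683)
and pool name (vms) come from this thread; the init-system syntax, filesystem
paths and export file name are assumptions and will differ between installs.

Restart the primary OSDs of the two incomplete PGs (Upstart syntax as used by
the Ubuntu Hammer packages; with sysvinit it would be
"/etc/init.d/ceph restart osd.109", with systemd "systemctl restart ceph-osd@109"):

# restart ceph-osd id=109
# restart ceph-osd id=166

Query the stuck PGs (in this thread the query currently hangs while the PGs
are incomplete):

# ceph pg 3.2929 query
# ceph pg 3.1683 query

Before any riskier step, the ~8 GB of PG data still present on OSD 166 can be
exported with the OSD stopped (the data and journal paths below are the
default Hammer locations; the output file name is arbitrary):

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-166 \
      --journal-path /var/lib/ceph/osd/ceph-166/journal \
      --pgid 3.1683 --op export --file /root/pg-3.1683.export

Last resort suggested by "ceph health detail": let the pool go active with a
single replica, then raise min_size again once the PGs have recovered:

# ceph osd pool set vms min_size 1
# ceph osd pool set vms min_size 2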