Re: Inconsistent PGs

M Ranga Swami Reddy <swamireddy@xxxxxxxxx> · Tue, 21 Jun 2016 16:07:37 +0530

Try restarting OSDs 109 and 166, then check whether that helps.

On Tue, Jun 21, 2016 at 4:05 PM, Paweł Sadowski <ceph@xxxxxxxxx> wrote:
> Thanks for the response.
>
> All OSDs seem to be OK. They have been restarted and rejoined the cluster
> afterwards, with nothing unusual in the logs.
>
> # ceph pg dump_stuck stale
> ok
>
> # ceph pg dump_stuck inactive
> ok
> pg_stat    state    up    up_primary    acting    acting_primary
> 3.2929    incomplete    [109,272,83]    109    [109,272,83]    109
> 3.1683    incomplete    [166,329,281]    166    [166,329,281]    166
>
> # ceph pg dump_stuck unclean
> ok
> pg_stat    state    up    up_primary    acting    acting_primary
> 3.2929    incomplete    [109,272,83]    109    [109,272,83]    109
> 3.1683    incomplete    [166,329,281]    166    [166,329,281]    166
>
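
The two stuck PGs have different acting primaries (109 and 166), which is why those two OSDs are the ones worth looking at first. A minimal Python sketch that groups stuck PGs by acting primary, assuming the six whitespace-separated columns shown in the output above. The function name and the embedded sample are illustrative only, not part of any Ceph tooling:

```python
from collections import defaultdict

# Sample copied from the `ceph pg dump_stuck inactive` output above.
SAMPLE = """\
ok
pg_stat    state    up    up_primary    acting    acting_primary
3.2929    incomplete    [109,272,83]    109    [109,272,83]    109
3.1683    incomplete    [166,329,281]    166    [166,329,281]    166
"""


def stuck_by_primary(text):
    """Map acting primary OSD id -> list of (pgid, state) tuples."""
    by_primary = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        # Skip the "ok" status line, blank lines, and the column header.
        if len(fields) != 6 or fields[0] == "pg_stat":
            continue
        pgid, state, _up, _up_primary, _acting, acting_primary = fields
        by_primary[int(acting_primary)].append((pgid, state))
    return dict(by_primary)


for osd, pgs in sorted(stuck_by_primary(SAMPLE).items()):
    print("osd.%d: %s" % (osd, ", ".join("%s (%s)" % pg for pg in pgs)))
```

If the column layout differs on another release, the length check will simply skip the rows, so an empty result should be treated as "format not recognized" rather than "nothing stuck".
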
>
> On OSD 166 there are 100 blocked ops (on 109 too), and they all end with
> "event": "reached_pg"
>
> # ceph --admin-daemon /var/run/ceph/ceph-osd.166.asok dump_ops_in_flight
> ...
>         {
>             "description": "osd_op(client.958764031.0:18137113
> rbd_data.392585982ae8944a.0000000000000ad4 [set-alloc-hint object_size
> 4194304 write_size 4194304,write 2641920~8192] 3.d6195683 RETRY=15
> ack+ondisk+retry+write+known_if_redirected e613241)",
>             "initiated_at": "2016-06-21 10:19:59.894393",
>             "age": 828.025527,
>             "duration": 600.020809,
>             "type_data": [
>                 "reached pg",
>                 {
>                     "client": "client.958764031",
>                     "tid": 18137113
>                 },
>                 [
>                     {
>                         "time": "2016-06-21 10:19:59.894393",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2016-06-21 10:29:59.915202",
>                         "event": "reached_pg"
>                     }
>                 ]
>             ]
>         }
>     ],
>     "num_ops": 100
> }
>
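
To confirm that every blocked op really stops at the same event, the dump can be tallied by each op's last recorded event. A rough Python sketch, assuming the 0.94-style layout quoted above (type_data is a list whose nested list holds the events). The top-level "ops" key is an assumption, since that part of the output is elided in the excerpt. The sample is trimmed from the quoted output, and the function name is illustrative:

```python
import json
from collections import Counter

# Trimmed from the dump_ops_in_flight output above; the "ops" key is assumed.
SAMPLE = json.loads("""
{
    "ops": [
        {
            "initiated_at": "2016-06-21 10:19:59.894393",
            "age": 828.025527,
            "duration": 600.020809,
            "type_data": [
                "reached pg",
                {"client": "client.958764031", "tid": 18137113},
                [
                    {"time": "2016-06-21 10:19:59.894393", "event": "initiated"},
                    {"time": "2016-06-21 10:29:59.915202", "event": "reached_pg"}
                ]
            ]
        }
    ],
    "num_ops": 1
}
""")


def last_events(dump):
    """Count in-flight ops by the last event each one recorded."""
    counts = Counter()
    for op in dump.get("ops", []):
        # The event history is the only list-valued entry inside type_data.
        histories = [x for x in op.get("type_data", []) if isinstance(x, list)]
        if histories and histories[0]:
            counts[histories[0][-1]["event"]] += 1
        else:
            counts["<no events>"] += 1
    return counts


for event, n in last_events(SAMPLE).most_common():
    print("%5d  %s" % (n, event))
```

On a live OSD the sample could be replaced by the JSON produced by the admin-socket command shown above, for example by loading it with json.load from a saved file.
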
>
>
> On 06/21/2016 12:27 PM, M Ranga Swami Reddy wrote:
>> You can use the commands below:
>> ===
>>
>> ceph pg dump_stuck stale
>> ceph pg dump_stuck inactive
>> ceph pg dump_stuck unclean
>> ===
>>
>> Then query the PGs that are in an unclean or stale state, and check for
>> any issue with a specific OSD.
>>
>> Thanks
>> Swami
>>
>> On Tue, Jun 21, 2016 at 3:02 PM, Paweł Sadowski <ceph@xxxxxxxxx> wrote:
>>> Hello,
>>>
>>> We have an issue on one of our clusters. One node with 9 OSDs was down
>>> for more than 12 hours. During that time the cluster recovered without
>>> problems. When the host came back to the cluster, we got two PGs in the
>>> incomplete state. We decided to mark the OSDs on this host as out, but
>>> the two PGs are still incomplete. Trying to query those PGs hangs
>>> forever. We have already tried restarting the OSDs. Is there any way to
>>> solve this issue without losing data? Any help is appreciated :)
>>>
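
Since the PG query reportedly hangs, it may help to run it with a hard timeout so a script can move on to the next PG. A small Python sketch under that assumption: it wraps `ceph pg <pgid> query` in subprocess.run with a timeout. The helper names and the 30-second value are illustrative, and nothing here has been tested against a live cluster:

```python
import subprocess
import sys


def pg_query_cmd(pgid):
    """Build the argument list for `ceph pg <pgid> query`."""
    return ["ceph", "pg", pgid, "query"]


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its stdout, or None on timeout or non-zero exit."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        print("timed out after %ss: %s" % (timeout_s, " ".join(cmd)),
              file=sys.stderr)
        return None
    if result.returncode != 0:
        print("exit %d: %s" % (result.returncode, result.stderr.strip()),
              file=sys.stderr)
        return None
    return result.stdout
```

For the two PGs in this thread, the call would be run_with_timeout(pg_query_cmd("3.2929"), 30) and the same for "3.1683". A None result on timeout only confirms the hang; it does not say why the primary is not answering.
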
>>> # ceph health detail | grep incomplete
>>> HEALTH_WARN 2 pgs incomplete; 2 pgs stuck inactive; 2 pgs stuck unclean;
>>> 200 requests are blocked > 32 sec; 2 osds have slow requests;
>>> noscrub,nodeep-scrub flag(s) set
>>> pg 3.2929 is stuck inactive since forever, current state incomplete,
>>> last acting [109,272,83]
>>> pg 3.1683 is stuck inactive since forever, current state incomplete,
>>> last acting [166,329,281]
>>> pg 3.2929 is stuck unclean since forever, current state incomplete, last
>>> acting [109,272,83]
>>> pg 3.1683 is stuck unclean since forever, current state incomplete, last
>>> acting [166,329,281]
>>> pg 3.1683 is incomplete, acting [166,329,281] (reducing pool vms
>>> min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>> pg 3.2929 is incomplete, acting [109,272,83] (reducing pool vms min_size
>>> from 2 may help; search ceph.com/docs for 'incomplete')
>>>
>>> The directory for PG 3.1683 is present on OSD 166 and contains ~8 GB.
>>>
>>> We haven't tried setting min_size to 1 yet (we treat it as a last resort).
>>>
>>>
>>>
>>> Some cluster info:
>>> # ceph --version
>>>
>>> ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
>>>
>>> # ceph -s
>>>      health HEALTH_WARN
>>>             2 pgs incomplete
>>>             2 pgs stuck inactive
>>>             2 pgs stuck unclean
>>>             200 requests are blocked > 32 sec
>>>             noscrub,nodeep-scrub flag(s) set
>>>      monmap e7: 5 mons at
>>> {mon-03=*.2:6789/0,mon-04=*.36:6789/0,mon-05=*.81:6789/0,mon-06=*.0:6789/0,mon-07=*.40:6789/0}
>>>             election epoch 3250, quorum 0,1,2,3,4
>>> mon-06,mon-07,mon-04,mon-03,mon-05
>>>      osdmap e613040: 346 osds: 346 up, 337 in
>>>             flags noscrub,nodeep-scrub
>>>       pgmap v27163053: 18624 pgs, 6 pools, 138 TB data, 39062 kobjects
>>>             415 TB used, 186 TB / 601 TB avail
>>>                18622 active+clean
>>>                    2 incomplete
>>>   client io 9992 kB/s rd, 64867 kB/s wr, 8458 op/s
>>>
>>>
>>> # ceph osd pool get vms pg_num
>>> pg_num: 16384
>>>
>>> # ceph osd pool get vms size
>>> size: 3
>>>
>>> # ceph osd pool get vms min_size
>>> min_size: 2
>
> --
> PS
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com