Re: PGs stuck active+remapped and osds lose data?!

Ok, I understand, but how can I debug why they are not running as they should? I thought everything was fine because ceph -s said they are up and running.

I would suspect a problem with the crush map.
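
As a starting point, I would probably look at the crush map and at one of
the stuck PGs directly, roughly like this (just a sketch; pg 9.7 is one of
the four remapped PGs):

  # ceph osd tree
  # ceph osd crush rule dump
  # ceph osd getcrushmap -o crush.bin
  # crushtool -d crush.bin -o crush.txt
  # ceph pg 9.7 query

Decompiling the map with crushtool shows the rules and the bucket tree,
which should make it visible why CRUSH picks only two OSDs for the up set
of these PGs.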

> On 10.01.2017 at 08:06, Shinobu Kinjo <skinjo@xxxxxxxxxx> wrote:
> 
> e.g.,
> OSD.7 / 3 / 0 are in the same acting set. They should all be up if they
> are running properly.
> 
> # 9.7
> <snip>
>>   "up": [
>>       7,
>>       3
>>   ],
>>   "acting": [
>>       7,
>>       3,
>>       0
>>   ],
> <snip>
> 
> Here is an example:
> 
>  "up": [
>    1,
>    0,
>    2
>  ],
>  "acting": [
>    1,
>    0,
>    2
>   ],
> 
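> In a healthy mapping like this one, up and acting contain the same OSDs.
> A quick way to look at the current mapping of one of the stuck PGs (just
> a sketch, using pg 9.7 from your output) is:
> 
>   # ceph pg map 9.7
>   osdmap e3114 pg 9.7 (9.7) -> up [7,3] acting [7,3,0]
> 
> When an OSD appears in acting but not in up, CRUSH could not map the PG
> to a full-sized up set, which is why it stays active+remapped.
> 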
> Regards,
> 
> 
> On Tue, Jan 10, 2017 at 3:52 PM, Marcus Müller <mueller.marcus@xxxxxxxxx> wrote:
>>> 
>>> That's not perfectly correct.
>>> 
>>> OSD.0/1/2 seem to be down.
>> 
>> 
>> Sorry, but where do you see this? I think this line indicates that they are up: osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
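>> 
>> For a per-OSD view (just a sketch), the up/down state of each daemon can
>> also be checked with:
>> 
>>   # ceph osd tree
>> 
>> which lists every OSD with its crush location, up/down status and reweight.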
>> 
>> 
>>> On 10.01.2017 at 07:50, Shinobu Kinjo <skinjo@xxxxxxxxxx> wrote:
>>> 
>>> On Tue, Jan 10, 2017 at 3:44 PM, Marcus Müller <mueller.marcus@xxxxxxxxx> wrote:
>>>> All osds are currently up:
>>>> 
>>>>    health HEALTH_WARN
>>>>           4 pgs stuck unclean
>>>>           recovery 4482/58798254 objects degraded (0.008%)
>>>>           recovery 420522/58798254 objects misplaced (0.715%)
>>>>           noscrub,nodeep-scrub flag(s) set
>>>>    monmap e9: 5 mons at
>>>> {ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0,ceph5=192.168.60.11:6789/0}
>>>>           election epoch 478, quorum 0,1,2,3,4
>>>> ceph1,ceph2,ceph3,ceph4,ceph5
>>>>    osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
>>>>           flags noscrub,nodeep-scrub
>>>>     pgmap v9981077: 320 pgs, 3 pools, 4837 GB data, 19140 kobjects
>>>>           15070 GB used, 40801 GB / 55872 GB avail
>>>>           4482/58798254 objects degraded (0.008%)
>>>>           420522/58798254 objects misplaced (0.715%)
>>>>                316 active+clean
>>>>                  4 active+remapped
>>>> client io 56601 B/s rd, 45619 B/s wr, 0 op/s
>>>> 
>>>> This did not change for two days or so.
>>>> 
>>>> 
>>>> By the way, my ceph osd df now looks like this:
>>>> 
>>>> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR
>>>> 0 1.28899  1.00000  3724G  1699G  2024G 45.63 1.69
>>>> 1 1.57899  1.00000  3724G  1708G  2015G 45.87 1.70
>>>> 2 1.68900  1.00000  3724G  1695G  2028G 45.54 1.69
>>>> 3 6.78499  1.00000  7450G  1241G  6208G 16.67 0.62
>>>> 4 8.39999  1.00000  7450G  1228G  6221G 16.49 0.61
>>>> 5 9.51500  1.00000  7450G  1239G  6210G 16.64 0.62
>>>> 6 7.66499  1.00000  7450G  1265G  6184G 16.99 0.63
>>>> 7 9.75499  1.00000  7450G  2497G  4952G 33.52 1.24
>>>> 8 9.32999  1.00000  7450G  2495G  4954G 33.49 1.24
>>>>             TOTAL 55872G 15071G 40801G 26.97
>>>> MIN/MAX VAR: 0.61/1.70  STDDEV: 13.16
>>>> 
>>>> As you can see, osd2 has now also gone down to 45% use and "lost" data. But
>>>> I also think this is not a problem and ceph just cleans everything up after
>>>> backfilling.
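>>>> 
>>>> One way to check that the backfill really finishes (just a sketch) is to
>>>> watch the misplaced count and the four remapped PGs, e.g.:
>>>> 
>>>>   # ceph -w
>>>>   # ceph pg dump pgs_brief | grep remapped
>>>> 
>>>> If backfilling were still making progress, the misplaced object count in
>>>> ceph -s should keep dropping instead of sitting at 420522 for days.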
>>>> 
>>>> 
>>>> On 10.01.2017 at 07:29, Shinobu Kinjo <skinjo@xxxxxxxxxx> wrote:
>>>> 
>>>> Looking at the ``ceph -s`` output you originally provided, all OSDs are up.
>>>> 
>>>> osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
>>>> 
>>>> 
>>>> But looking at ``pg query``, OSD.0 / 1 are not in the up set. Are they somehow
>>> 
>>> That's not perfectly correct.
>>> 
>>> OSD.0/1/2 seem to be down.
>>> 
>>>> related to this?:
>>>> 
>>>> Ceph1, ceph2 and ceph3 are vms on one physical host
>>>> 
>>>> 
>>>> Are those OSDs running on vm instances?
>>>> 
>>>> # 9.7
>>>> <snip>
>>>> 
>>>> "state": "active+remapped",
>>>> "snap_trimq": "[]",
>>>> "epoch": 3114,
>>>> "up": [
>>>>     7,
>>>>     3
>>>> ],
>>>> "acting": [
>>>>     7,
>>>>     3,
>>>>     0
>>>> ],
>>>> 
>>>> <snip>
>>>> 
>>>> # 7.84
>>>> <snip>
>>>> 
>>>> "state": "active+remapped",
>>>> "snap_trimq": "[]",
>>>> "epoch": 3114,
>>>> "up": [
>>>>     4,
>>>>     8
>>>> ],
>>>> "acting": [
>>>>     4,
>>>>     8,
>>>>     1
>>>> ],
>>>> 
>>>> <snip>
>>>> 
>>>> # 8.1b
>>>> <snip>
>>>> 
>>>> "state": "active+remapped",
>>>> "snap_trimq": "[]",
>>>> "epoch": 3114,
>>>> "up": [
>>>>     4,
>>>>     7
>>>> ],
>>>> "acting": [
>>>>     4,
>>>>     7,
>>>>     2
>>>> ],
>>>> 
>>>> <snip>
>>>> 
>>>> # 7.7a
>>>> <snip>
>>>> 
>>>> "state": "active+remapped",
>>>> "snap_trimq": "[]",
>>>> "epoch": 3114,
>>>> "up": [
>>>>     7,
>>>>     4
>>>> ],
>>>> "acting": [
>>>>     7,
>>>>     4,
>>>>     2
>>>> ],
>>>> 
>>>> <snip>
>>>> 
>>>> 
>> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
