Yeah, Sam is correct. I had not looked at the crushmap, but I should have
noticed what the trouble was by looking at `ceph osd tree`. That's my bad,
sorry for that.

Again, please refer to:
http://www.anchor.com.au/blog/2013/02/pulling-apart-cephs-crush-algorithm/

Command sketches for checking the up/acting sets of the remapped PGs and
for pulling the crushmap apart are appended at the end of this mail.

Regards,

On Wed, Jan 11, 2017 at 1:50 AM, Samuel Just <sjust@xxxxxxxxxx> wrote:
> Shinobu isn't correct; you have 9/9 osds up and running. "up" does not
> equal "acting", because crush is having trouble fulfilling the weights in
> your crushmap, and the acting set is being padded out with an extra osd
> which happens to have the data, to keep you at the right number of
> replicas. Please refer back to Brad's post.
> -Sam
>
> On Mon, Jan 9, 2017 at 11:08 PM, Marcus Müller <mueller.marcus@xxxxxxxxx> wrote:
>> OK, I understand, but how can I debug why they are not running as they
>> should? I thought everything was fine because ceph -s said they are up
>> and running.
>>
>> I would think of a problem with the crush map.
>>
>>> On 10.01.2017 at 08:06, Shinobu Kinjo <skinjo@xxxxxxxxxx> wrote:
>>>
>>> e.g.,
>>> OSDs 7 / 3 / 0 are in the same acting set. They should be up, if they
>>> are properly running.
>>>
>>> # 9.7
>>> <snip>
>>>>     "up": [
>>>>         7,
>>>>         3
>>>>     ],
>>>>     "acting": [
>>>>         7,
>>>>         3,
>>>>         0
>>>>     ],
>>> <snip>
>>>
>>> Here is an example:
>>>
>>>     "up": [
>>>         1,
>>>         0,
>>>         2
>>>     ],
>>>     "acting": [
>>>         1,
>>>         0,
>>>         2
>>>     ],
>>>
>>> Regards,
>>>
>>>
>>> On Tue, Jan 10, 2017 at 3:52 PM, Marcus Müller <mueller.marcus@xxxxxxxxx> wrote:
>>>>>
>>>>> That's not perfectly correct.
>>>>>
>>>>> OSD.0/1/2 seem to be down.
>>>>
>>>>
>>>> Sorry, but where do you see this? I think this indicates that they are
>>>> up: osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs?
>>>>
>>>>
>>>>> On 10.01.2017 at 07:50, Shinobu Kinjo <skinjo@xxxxxxxxxx> wrote:
>>>>>
>>>>> On Tue, Jan 10, 2017 at 3:44 PM, Marcus Müller <mueller.marcus@xxxxxxxxx> wrote:
>>>>>> All osds are currently up:
>>>>>>
>>>>>>      health HEALTH_WARN
>>>>>>             4 pgs stuck unclean
>>>>>>             recovery 4482/58798254 objects degraded (0.008%)
>>>>>>             recovery 420522/58798254 objects misplaced (0.715%)
>>>>>>             noscrub,nodeep-scrub flag(s) set
>>>>>>      monmap e9: 5 mons at
>>>>>> {ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0,ceph5=192.168.60.11:6789/0}
>>>>>>             election epoch 478, quorum 0,1,2,3,4
>>>>>> ceph1,ceph2,ceph3,ceph4,ceph5
>>>>>>      osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
>>>>>>             flags noscrub,nodeep-scrub
>>>>>>       pgmap v9981077: 320 pgs, 3 pools, 4837 GB data, 19140 kobjects
>>>>>>             15070 GB used, 40801 GB / 55872 GB avail
>>>>>>             4482/58798254 objects degraded (0.008%)
>>>>>>             420522/58798254 objects misplaced (0.715%)
>>>>>>                  316 active+clean
>>>>>>                    4 active+remapped
>>>>>>   client io 56601 B/s rd, 45619 B/s wr, 0 op/s
>>>>>>
>>>>>> This did not change for two days or so.
>>>>>>
>>>>>>
>>>>>> By the way, my ceph osd df now looks like this:
>>>>>>
>>>>>> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR
>>>>>>  0 1.28899  1.00000  3724G  1699G  2024G 45.63 1.69
>>>>>>  1 1.57899  1.00000  3724G  1708G  2015G 45.87 1.70
>>>>>>  2 1.68900  1.00000  3724G  1695G  2028G 45.54 1.69
>>>>>>  3 6.78499  1.00000  7450G  1241G  6208G 16.67 0.62
>>>>>>  4 8.39999  1.00000  7450G  1228G  6221G 16.49 0.61
>>>>>>  5 9.51500  1.00000  7450G  1239G  6210G 16.64 0.62
>>>>>>  6 7.66499  1.00000  7450G  1265G  6184G 16.99 0.63
>>>>>>  7 9.75499  1.00000  7450G  2497G  4952G 33.52 1.24
>>>>>>  8 9.32999  1.00000  7450G  2495G  4954G 33.49 1.24
>>>>>>               TOTAL 55872G 15071G 40801G 26.97
>>>>>> MIN/MAX VAR: 0.61/1.70  STDDEV: 13.16
>>>>>>
>>>>>> As you can see, now osd2 also went down to 45% use and „lost“ data. But I
>>>>>> also think this is no problem and ceph just clears everything up after
>>>>>> backfilling.
>>>>>>
>>>>>>
>>>>>> On 10.01.2017 at 07:29, Shinobu Kinjo <skinjo@xxxxxxxxxx> wrote:
>>>>>>
>>>>>> Looking at ``ceph -s`` you originally provided, all OSDs are up.
>>>>>>
>>>>>>     osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
>>>>>>
>>>>>> But looking at ``pg query``, OSD.0 / 1 are not up. Is that somehow
>>>>>
>>>>> That's not perfectly correct.
>>>>>
>>>>> OSD.0/1/2 seem to be down.
>>>>>
>>>>>> related to this?:
>>>>>>
>>>>>>   Ceph1, ceph2 and ceph3 are VMs on one physical host
>>>>>>
>>>>>> Are those OSDs running on VM instances?
>>>>>>
>>>>>> # 9.7
>>>>>> <snip>
>>>>>>
>>>>>>     "state": "active+remapped",
>>>>>>     "snap_trimq": "[]",
>>>>>>     "epoch": 3114,
>>>>>>     "up": [
>>>>>>         7,
>>>>>>         3
>>>>>>     ],
>>>>>>     "acting": [
>>>>>>         7,
>>>>>>         3,
>>>>>>         0
>>>>>>     ],
>>>>>>
>>>>>> <snip>
>>>>>>
>>>>>> # 7.84
>>>>>> <snip>
>>>>>>
>>>>>>     "state": "active+remapped",
>>>>>>     "snap_trimq": "[]",
>>>>>>     "epoch": 3114,
>>>>>>     "up": [
>>>>>>         4,
>>>>>>         8
>>>>>>     ],
>>>>>>     "acting": [
>>>>>>         4,
>>>>>>         8,
>>>>>>         1
>>>>>>     ],
>>>>>>
>>>>>> <snip>
>>>>>>
>>>>>> # 8.1b
>>>>>> <snip>
>>>>>>
>>>>>>     "state": "active+remapped",
>>>>>>     "snap_trimq": "[]",
>>>>>>     "epoch": 3114,
>>>>>>     "up": [
>>>>>>         4,
>>>>>>         7
>>>>>>     ],
>>>>>>     "acting": [
>>>>>>         4,
>>>>>>         7,
>>>>>>         2
>>>>>>     ],
>>>>>>
>>>>>> <snip>
>>>>>>
>>>>>> # 7.7a
>>>>>> <snip>
>>>>>>
>>>>>>     "state": "active+remapped",
>>>>>>     "snap_trimq": "[]",
>>>>>>     "epoch": 3114,
>>>>>>     "up": [
>>>>>>         7,
>>>>>>         4
>>>>>>     ],
>>>>>>     "acting": [
>>>>>>         7,
>>>>>>         4,
>>>>>>         2
>>>>>>     ],
>>>>>>
>>>>>> <snip>
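
To make the up-vs-acting situation easier to see than in the raw ``pg query``
dumps above, here is a minimal sketch of the commands I would run (standard
ceph CLI; the PG id 9.7 is just the one from this thread):

  # list the PGs that are stuck unclean (the 4 active+remapped ones)
  ceph pg dump_stuck unclean

  # one line per PG with state, up set and acting set side by side
  ceph pg dump pgs_brief

  # full JSON detail for a single PG, e.g. 9.7
  ceph pg 9.7 query

  # the placement hierarchy and weights crush is working with
  ceph osd tree

For the four remapped PGs you should see an up set with only two OSDs and an
acting set padded out to three, which matches what Sam described.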
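If the crushmap weights are the suspect, the map can also be pulled apart and
test-mapped offline. This is only a sketch, assuming a replicated rule 0 and
size 3 pools (substitute your actual ruleset and pool size; the file names are
arbitrary):

  # fetch the binary crushmap and decompile it to readable text
  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt

  # simulate placements for 3 replicas and report any input that
  # crush cannot map to 3 distinct OSDs
  crushtool -i crushmap.bin --test --show-bad-mappings --rule 0 --num-rep 3

The decompiled crushmap.txt shows the per-host and per-osd weights, which is
where the gap between the ~1.3-1.7 weighted OSDs and the ~6.8-9.8 weighted
ones from your ceph osd df output is visible.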