Re: PGs stuck active+remapped and osds lose data?!

Marcus Müller <mueller.marcus@xxxxxxxxx> · Wed, 11 Jan 2017 14:50:05 +0100

Yes, but all I want to know is whether my way of changing the tunables is right or not.

> On 11.01.2017 at 13:11, Shinobu Kinjo <skinjo@xxxxxxxxxx> wrote:
> 
> Please refer to Jens's message.
> 
> Regards,
> 
>> On Wed, Jan 11, 2017 at 8:53 PM, Marcus Müller <mueller.marcus@xxxxxxxxx> wrote:
>> Ok, thank you. I thought I had to set ceph to a tunables profile. If I'm right, then I just have to export the current crush map, edit it and import it again, like this:
>> 
>> ceph osd getcrushmap -o /tmp/crush
>> crushtool -i /tmp/crush --set-choose-total-tries 100 -o /tmp/crush.new
>> ceph osd setcrushmap -i /tmp/crush.new
>> 
>> Is this right or not?
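
One way to check the edited map before importing it is to decompile it with `crushtool -d /tmp/crush.new -o /tmp/crush.txt` and read back the tunable lines. The following is a minimal Python sketch of that check, not an official tool: the sample text is an assumed, shortened excerpt of a decompiled map rather than this cluster's real map, and `parse_tunables` is a hypothetical helper name.

```python
import re

# Assumed, shortened excerpt of a decompiled CRUSH map (illustrative only,
# not the real map of the cluster discussed in this thread).
SAMPLE_CRUSH_TEXT = """\
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 100
tunable chooseleaf_descend_once 1

# devices
device 0 osd.0
"""


def parse_tunables(crush_text):
    """Return {name: value} for every 'tunable <name> <integer>' line."""
    tunables = {}
    for line in crush_text.splitlines():
        match = re.match(r"^\s*tunable\s+(\S+)\s+(-?\d+)\s*$", line)
        if match:
            tunables[match.group(1)] = int(match.group(2))
    return tunables


tunables = parse_tunables(SAMPLE_CRUSH_TEXT)
print("choose_total_tries =", tunables["choose_total_tries"])
```

In practice the text would be read from /tmp/crush.txt instead of the inline sample, and the import would only go ahead if choose_total_tries shows the intended value.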
>> 
>> I started this cluster with these 3 nodes, each with 3 osds. They are vms. I knew that this cluster would grow very large, which is why I chose ceph. Now I can't add more HDDs to the vm hypervisor, and I also want to separate the nodes physically. I bought a new node with these 4 drives and now another node with only 2 drives. As I now hear from several people, this was not a good idea. For this reason, I have now bought additional HDDs for the new node, so that I have two nodes with the same number and size of HDDs. In the next 1-2 months I will get the third physical node and then everything should be fine. But at the moment I have no other option.
>> 
>> Would adding the 2 new HDDs to the new ceph node help to solve this problem?
>> 
>> 
>> 
>>> On 11.01.2017 at 12:00, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
>>> 
>>> Your current problem has nothing to do with clients and neither does
>>> choose_total_tries.
>>> 
>>> Try setting just this value to 100 and see if your situation improves.
>>> 
>>> Ultimately you need to take a good look at your cluster configuration
>>> and how your crush map is configured to deal with that configuration
>>> but start with choose_total_tries as it has the highest probability of
>>> helping your situation. Your clients should not be affected.
>>> 
>>> Could you explain the reasoning behind having three hosts with one osd
>>> each, one host with two osds and one with four?
>>> 
>>> You likely need to tweak your crushmap to handle this configuration
>>> better or, preferably, move to a more uniform configuration.
>>> 
>>> 
>>>> On Wed, Jan 11, 2017 at 5:38 PM, Marcus Müller <mueller.marcus@xxxxxxxxx> wrote:
>>>> I have to thank you all. You give free support and this already helps me.
>>>> I don't know ceph that well yet, but it's getting better and better
>>>> every day ;-)
>>>> 
>>>> According to the article Brad posted, I have to change the ceph osd crush
>>>> tunables. But two questions remain, as I already wrote:
>>>> 
>>>> - According to
>>>> http://docs.ceph.com/docs/master/rados/operations/crush-map/#tunables there
>>>> are a few profiles. The profile I need would be BOBTAIL (CRUSH_TUNABLES2),
>>>> which would set choose_total_tries to 50. For a start, that is better than 19.
>>>> There I also see: "You can select a profile on a running cluster with the
>>>> command: ceph osd crush tunables {PROFILE}". My question on this is: Even
>>>> though I run hammer, is it possible and advisable to set it to bobtail?
>>>> 
>>>> - We can also read:
>>>> WHICH CLIENT VERSIONS SUPPORT CRUSH_TUNABLES2
>>>> - v0.55 or later, including bobtail series (v0.56.x)
>>>> - Linux kernel version v3.9 or later (for the file system and RBD kernel
>>>> clients)
>>>> 
>>>> And here my question is: If my clients use librados (version hammer), do I
>>>> need to have this required kernel version on the clients or the ceph nodes?
>>>> 
>>>> I don't want to end up with trouble on my clients. Can someone answer
>>>> this for me before I change the settings?
>>>> 
>>>> 
>>>> On 11.01.2017 at 06:47, Shinobu Kinjo <skinjo@xxxxxxxxxx> wrote:
>>>> 
>>>> 
>>>> Yeah, Sam is correct. I had not looked at the crushmap. But I should have
>>>> noticed what the trouble was by looking at `ceph osd tree`. That's my
>>>> bad, sorry for that.
>>>> 
>>>> Again please refer to:
>>>> 
>>>> http://www.anchor.com.au/blog/2013/02/pulling-apart-cephs-crush-algorithm/
>>>> 
>>>> Regards,
>>>> 
>>>> 
>>>> On Wed, Jan 11, 2017 at 1:50 AM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>>> 
>>>> Shinobu isn't correct, you have 9/9 osds up and running.  up does not
>>>> equal acting because crush is having trouble fulfilling the weights in
>>>> your crushmap and the acting set is being padded out with an extra osd
>>>> which happens to have the data to keep you up to the right number of
>>>> replicas.  Please refer back to Brad's post.
>>>> -Sam
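
The difference Sam describes can be made visible by comparing the two sets directly. The following is a minimal Python sketch, assuming only the "up" and "acting" lists from the pg query excerpts quoted further down in this thread; `padded_osds` is a hypothetical helper name, not part of any Ceph tooling.

```python
# "up" and "acting" sets copied from the pg query excerpts quoted in this thread.
PGS = {
    "9.7": {"up": [7, 3], "acting": [7, 3, 0]},
    "7.84": {"up": [4, 8], "acting": [4, 8, 1]},
    "8.1b": {"up": [4, 7], "acting": [4, 7, 2]},
    "7.7a": {"up": [7, 4], "acting": [7, 4, 2]},
}


def padded_osds(pg):
    """Return the OSDs that serve the PG (acting) but were not chosen by CRUSH (up)."""
    up = set(pg["up"])
    return [osd for osd in pg["acting"] if osd not in up]


for pgid, pg in PGS.items():
    print(pgid, "up:", pg["up"], "acting:", pg["acting"], "extra:", padded_osds(pg))
```

For every one of the four remapped PGs, CRUSH picked only two OSDs for the up set, and the third replica in the acting set sits on osd.0, osd.1 or osd.2.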
>>>> 
>>>> On Mon, Jan 9, 2017 at 11:08 PM, Marcus Müller <mueller.marcus@xxxxxxxxx>
>>>> wrote:
>>>> 
>>>> Ok, I understand, but how can I debug why they are not running as they
>>>> should? I thought everything was fine because ceph -s said they are up
>>>> and running.
>>>> 
>>>> I would suspect a problem with the crush map.
>>>> 
>>>> On 10.01.2017 at 08:06, Shinobu Kinjo <skinjo@xxxxxxxxxx> wrote:
>>>> 
>>>> e.g.,
>>>> OSD7 / 3 / 0 are in the same acting set. They should be up, if they
>>>> are properly running.
>>>> 
>>>> # 9.7
>>>> <snip>
>>>> 
>>>> "up": [
>>>>    7,
>>>>    3
>>>> ],
>>>> "acting": [
>>>>    7,
>>>>    3,
>>>>    0
>>>> ],
>>>> 
>>>> <snip>
>>>> 
>>>> Here is an example:
>>>> 
>>>> "up": [
>>>> 1,
>>>> 0,
>>>> 2
>>>> ],
>>>> "acting": [
>>>> 1,
>>>> 0,
>>>> 2
>>>> ],
>>>> 
>>>> Regards,
>>>> 
>>>> 
>>>> On Tue, Jan 10, 2017 at 3:52 PM, Marcus Müller <mueller.marcus@xxxxxxxxx>
>>>> wrote:
>>>> 
>>>> 
>>>> That's not perfectly correct.
>>>> 
>>>> OSD.0/1/2 seem to be down.
>>>> 
>>>> 
>>>> 
>>>> Sorry, but where do you see this? I think this indicates that they are up:
>>>> osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
>>>> 
>>>> 
>>>> On 10.01.2017 at 07:50, Shinobu Kinjo <skinjo@xxxxxxxxxx> wrote:
>>>> 
>>>> On Tue, Jan 10, 2017 at 3:44 PM, Marcus Müller <mueller.marcus@xxxxxxxxx>
>>>> wrote:
>>>> 
>>>> All osds are currently up:
>>>> 
>>>> health HEALTH_WARN
>>>>        4 pgs stuck unclean
>>>>        recovery 4482/58798254 objects degraded (0.008%)
>>>>        recovery 420522/58798254 objects misplaced (0.715%)
>>>>        noscrub,nodeep-scrub flag(s) set
>>>> monmap e9: 5 mons at
>>>> {ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0,ceph5=192.168.60.11:6789/0}
>>>>        election epoch 478, quorum 0,1,2,3,4
>>>> ceph1,ceph2,ceph3,ceph4,ceph5
>>>> osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
>>>>        flags noscrub,nodeep-scrub
>>>>  pgmap v9981077: 320 pgs, 3 pools, 4837 GB data, 19140 kobjects
>>>>        15070 GB used, 40801 GB / 55872 GB avail
>>>>        4482/58798254 objects degraded (0.008%)
>>>>        420522/58798254 objects misplaced (0.715%)
>>>>             316 active+clean
>>>>               4 active+remapped
>>>> client io 56601 B/s rd, 45619 B/s wr, 0 op/s
>>>> 
>>>> This has not changed for two days or so.
>>>> 
>>>> 
>>>> By the way, my ceph osd df now looks like this:
>>>> 
>>>> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR
>>>> 0 1.28899  1.00000  3724G  1699G  2024G 45.63 1.69
>>>> 1 1.57899  1.00000  3724G  1708G  2015G 45.87 1.70
>>>> 2 1.68900  1.00000  3724G  1695G  2028G 45.54 1.69
>>>> 3 6.78499  1.00000  7450G  1241G  6208G 16.67 0.62
>>>> 4 8.39999  1.00000  7450G  1228G  6221G 16.49 0.61
>>>> 5 9.51500  1.00000  7450G  1239G  6210G 16.64 0.62
>>>> 6 7.66499  1.00000  7450G  1265G  6184G 16.99 0.63
>>>> 7 9.75499  1.00000  7450G  2497G  4952G 33.52 1.24
>>>> 8 9.32999  1.00000  7450G  2495G  4954G 33.49 1.24
>>>>          TOTAL 55872G 15071G 40801G 26.97
>>>> MIN/MAX VAR: 0.61/1.70  STDDEV: 13.16
>>>> 
>>>> As you can see, osd2 has now also gone down to 45% use and "lost" data. But I
>>>> think this is not a problem and ceph is just cleaning everything up after
>>>> backfilling.
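
The %USE and VAR columns in the ceph osd df output above can be recomputed from SIZE and USE. The following is a minimal Python sketch, assuming that VAR is each OSD's utilization divided by the cluster-wide average utilization, which is consistent with the numbers in the table; `utilization_report` is a hypothetical helper name. Because the table values are rounded to whole gigabytes, the last digit of a recomputed value may differ slightly.

```python
# (osd id, SIZE in GB, USE in GB) copied from the `ceph osd df` output above.
OSDS = [
    (0, 3724, 1699),
    (1, 3724, 1708),
    (2, 3724, 1695),
    (3, 7450, 1241),
    (4, 7450, 1228),
    (5, 7450, 1239),
    (6, 7450, 1265),
    (7, 7450, 2497),
    (8, 7450, 2495),
]


def utilization_report(osds):
    """Return {osd_id: (percent_used, var)} where var = utilization / average utilization."""
    total_size = sum(size for _, size, _ in osds)
    total_used = sum(used for _, _, used in osds)
    average = total_used / total_size
    report = {}
    for osd_id, size, used in osds:
        utilization = used / size
        report[osd_id] = (100 * utilization, utilization / average)
    return report


for osd_id, (percent, var) in utilization_report(OSDS).items():
    print(f"osd.{osd_id}: {percent:.2f}% used, VAR {var:.2f}")
```

The three small OSDs come out at roughly 1.7 times the average utilization and osd.3 to osd.6 at roughly 0.6 times, which is the imbalance visible in the MIN/MAX VAR line.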
>>>> 
>>>> 
>>>> On 10.01.2017 at 07:29, Shinobu Kinjo <skinjo@xxxxxxxxxx> wrote:
>>>> 
>>>> Looking at ``ceph -s`` you originally provided, all OSDs are up.
>>>> 
>>>> osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
>>>> 
>>>> 
>>>> But looking at ``pg query``, OSD.0 / 1 are not up. Are they something
>>>> 
>>>> 
>>>> That's not perfectly correct.
>>>> 
>>>> OSD.0/1/2 seem to be down.
>>>> 
>>>> like related to ?:
>>>> 
>>>> Ceph1, ceph2 and ceph3 are vms on one physical host
>>>> 
>>>> 
>>>> Are those OSDs running on vm instances?
>>>> 
>>>> # 9.7
>>>> <snip>
>>>> 
>>>> "state": "active+remapped",
>>>> "snap_trimq": "[]",
>>>> "epoch": 3114,
>>>> "up": [
>>>>  7,
>>>>  3
>>>> ],
>>>> "acting": [
>>>>  7,
>>>>  3,
>>>>  0
>>>> ],
>>>> 
>>>> <snip>
>>>> 
>>>> # 7.84
>>>> <snip>
>>>> 
>>>> "state": "active+remapped",
>>>> "snap_trimq": "[]",
>>>> "epoch": 3114,
>>>> "up": [
>>>>  4,
>>>>  8
>>>> ],
>>>> "acting": [
>>>>  4,
>>>>  8,
>>>>  1
>>>> ],
>>>> 
>>>> <snip>
>>>> 
>>>> # 8.1b
>>>> <snip>
>>>> 
>>>> "state": "active+remapped",
>>>> "snap_trimq": "[]",
>>>> "epoch": 3114,
>>>> "up": [
>>>>  4,
>>>>  7
>>>> ],
>>>> "acting": [
>>>>  4,
>>>>  7,
>>>>  2
>>>> ],
>>>> 
>>>> <snip>
>>>> 
>>>> # 7.7a
>>>> <snip>
>>>> 
>>>> "state": "active+remapped",
>>>> "snap_trimq": "[]",
>>>> "epoch": 3114,
>>>> "up": [
>>>>  7,
>>>>  4
>>>> ],
>>>> "acting": [
>>>>  7,
>>>>  4,
>>>>  2
>>>> ],
>>>> 
>>>> <snip>
>>>> 
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Cheers,
>>> Brad
>> 
