Re: Beta testing crush optimization

On 06/01/2017 08:09 AM, han vincent wrote:
> Hi, Loic:
>   Thanks for your reply. I still have some questions that are troubling me.
> 
>>>>Hi,
> 
>>>>I found the reason for the map problem, thanks a lot for reporting it. In a nutshell the "stable" tunable was implemented after hammer and your ceph report does not mention it at all. python-crush incorrectly assumes this means it should default to 1. It must default to 0 instead. When I do that manually, all mappings are correct. I'll fix this and publish a new version by tomorrow.
> When you fix the bug, please let me know. Thanks.
> You said a "stable" tunable was implemented after hammer. Does that mean the "straw2" implementation in hammer is unstable?

The tunable is meant to improve straw2 placement; it does not indicate that the straw2 code is unstable. It is a different concept.
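
To make the fix I mentioned a little more concrete (this is only a sketch of the idea, not the actual python-crush code), the parsing change boils down to treating a tunable that is absent from the report as 0 instead of 1:

    import json

    # Hypothetical sketch only, not the real python-crush source. It assumes
    # the report exposes the crush tunables as a plain dict under "crushmap".
    with open('report.json') as f:        # output of: ceph report > report.json
        report = json.load(f)
    tunables = report.get('crushmap', {}).get('tunables', {})
    # A hammer-era report omits chooseleaf_stable entirely, so a missing value
    # must default to 0 (disabled), not to the post-jewel default of 1.
    chooseleaf_stable = tunables.get('chooseleaf_stable', 0)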

> Do you know in which version the "stable" tunable was published?

It was added on Fri Nov 13 09:21:03 2015 -0500 by commit https://github.com/ceph/ceph/commit/fdb3f664448e80d984470f32f04e2e6f03ab52ec

It was released with Jewel (http://docs.ceph.com/docs/master/release-notes/#v10.2.0-jewel):
crush: add chooseleaf_stable tunable (pr#6572, Sangdi Xu, Sage Weil)
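
For reference only (a sketch, not output from your cluster): once a cluster runs with jewel or later tunables, the tunable shows up explicitly when the crushmap is decompiled, along the lines of:

$ crushtool -d /tmp/cmap -o /tmp/cmap.txt
$ grep chooseleaf /tmp/cmap.txt
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1

A hammer map like yours has no chooseleaf_stable line at all, which is why your ceph report does not mention it.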

Cheers

>>>>Your cluster is not very unbalanced (less than 10% overfilled):
> 
>>>>(virtualenv) loic@fold:~/software/libcrush/python-crush$ crush analyze --crushmap /tmp/han-vincent-report.json --pool 49
>>>>         ~id~  ~weight~  ~PGs~  ~over/under filled %~ ~name~
>>>>node-6v    -4      1.08    427                   4.25
>>>>node-4     -2      1.08    416                   1.56
>>>>node-7v    -5      1.08    407                  -0.63
>>>>node-8v    -6      1.08    405                  -1.12
>>>>node-5v    -3      1.08    393                  -4.05
> 
>>>>Worst case scenario if a host fails:
> 
>>>>        ~over filled %~
>>>>~type~
>>>>device             7.81
>>>>host               4.10
>>>>root               0.00
>>>>(virtualenv) loic@fold:~/software/libcrush/python-crush$ crush analyze --type device --crushmap /tmp/han-vincent-report.json --pool 49
>>>>        ~id~  ~weight~  ~PGs~  ~over/under filled %~ ~name~
>>>>osd.5      5      0.54    221                   7.91
>>>>osd.0      0      0.54    211                   3.03
>>>>osd.7      7      0.54    210                   2.54
>>>>osd.4      4      0.54    206                   0.59
>>>>osd.8      8      0.54    206                   0.59
>>>>osd.1      1      0.54    205                   0.10
>>>>osd.3      3      0.54    200                  -2.34
>>>>osd.9      9      0.54    199                  -2.83
>>>>osd.6      6      0.54    197                  -3.81
>>>>osd.2      2      0.54    193                  -5.76
> 
>>>>Worst case scenario if a host fails:
> 
>>>>        ~over filled %~
>>>>~type~
>>>>device             7.81
>>>>host               4.10
>>>>root               0.00
> 
>>>>With optimization things will improve:
> 
>>>>$ crush optimize --crushmap /tmp/han-vincent-report.json --out-path /tmp/han-vincent-report-optimized.txt --out-format txt --pool 49
>>>>2017-05-31 15:17:59,917 argv = optimize --crushmap /tmp/han-vincent-report.json --out-path /tmp/han-vincent-report-optimized.txt --out-format txt --pool 49 --replication-count=2 --pg-num=1024 --pgp-num=1024 --rule=replicated_ruleset --out-version=h --no-positions --choose-args=49
>>>>2017-05-31 15:17:59,940 default optimizing
>>>>2017-05-31 15:18:05,007 default wants to swap 43 PGs
>>>>2017-05-31 15:18:05,013 node-6v optimizing
>>>>2017-05-31 15:18:05,013 node-4 optimizing
>>>>2017-05-31 15:18:05,016 node-8v optimizing
>>>>2017-05-31 15:18:05,016 node-7v optimizing
>>>>2017-05-31 15:18:05,018 node-5v optimizing
>>>>2017-05-31 15:18:05,369 node-4 wants to swap 8 PGs
>>>>2017-05-31 15:18:05,742 node-6v wants to swap 10 PGs
>>>>2017-05-31 15:18:06,382 node-5v wants to swap 7 PGs
>>>>2017-05-31 15:18:06,602 node-7v wants to swap 7 PGs
>>>>2017-05-31 15:18:07,346 node-8v already optimized
> Is "--pool" options must specified in this command? if not, will it optimize all the pools without "--pool" option?
> if there are several pools in my cluster and each pool has a lot of pgs. If I optimize one of the, will it affect the other pools?
> how to use it to optimize multiple pools in a cluster of hammer?
> 
>>>>(virtualenv) loic@fold:~/software/libcrush/python-crush$ crush analyze --crushmap /tmp/han-vincent-report-optimized.txt --pool 49 --replication-count=2 --pg-num=1024 --pgp-num=1024 --rule=replicated_ruleset --choose-args=49
>>>>         ~id~  ~weight~  ~PGs~  ~over/under filled %~ ~name~
>>>>node-4     -2      1.08    410                   0.10
>>>>node-5v    -3      1.08    410                   0.10
>>>>node-6v    -4      1.08    410                   0.10
>>>>node-7v    -5      1.08    409                  -0.15
>>>>node-8v    -6      1.08    409                  -0.15
> In this command the value of the "--choose-args" option is 49, the same as the pool id. What is the meaning of the "--choose-args" option?
> 
>>>>Worst case scenario if a host fails:
> 
>>>>        ~over filled %~
>>>>~type~
>>>>device             5.47
>>>>host               3.71
>>>>root               0.00
>>>>(virtualenv) loic@fold:~/software/libcrush/python-crush$ crush analyze --type device --crushmap /tmp/han-vincent-report-optimized.txt --pool 49 --replication-count=2 --pg-num=1024 --pgp-num=1024 --rule=replicated_ruleset --choose-args=49
>>>>        ~id~  ~weight~  ~PGs~  ~over/under filled %~ ~name~
>>>>osd.2      2      0.54    206                   0.59
>>>>osd.8      8      0.54    206                   0.59
>>>>osd.0      0      0.54    205                   0.10
>>>>osd.1      1      0.54    205                   0.10
>>>>osd.4      4      0.54    205                   0.10
>>>>osd.5      5      0.54    205                   0.10
>>>>osd.7      7      0.54    205                   0.10
>>>>osd.3      3      0.54    204                  -0.39
>>>>osd.6      6      0.54    204                  -0.39
>>>>osd.9      9      0.54    203                  -0.88
> 
>>>>Worst case scenario if a host fails:
> 
>>>>        ~over filled %~
>>>>~type~
>>>>device             5.47
> 
>>>>host               3.71
>>>>root               0.00
> 
>>>>Note that the other pools won't be optimized and their PGs will be moved around for no good reason. However, since they contain very few PGs each (8 for most of them, 32 for one of them) and very little data (less than 1MB total), it won't matter much.
> 
>>>>Cheers
> 
> 
> 
>>On 05/31/2017 02:34 PM, Loic Dachary wrote:
>> Hi,
>>
>> On 05/31/2017 12:32 PM, han vincent wrote:
>>> hello, loic:
>>>
>>> I had a cluster built with hammer 0.94.10, then I used the following commands to change the algorithm from "straw" to "straw2":
>>> 1. ceph osd crush tunables hammer
>>> 2. ceph osd getcrushmap -o /tmp/cmap
>>> 3. crushtool -d /tmp/cmap -o /tmp/cmap.txt
>>> 4. vim /tmp/cmap.txt and change the algorithm of each bucket from "straw" to "straw2"
>>> 5. crushtool -c /tmp/cmap.txt -o /tmp/cmap
>>> 6. ceph osd setcrushmap -i /tmp/cmap
>>> 7. ceph osd crush reweight-all
>>> After that, I used python-crush to optimize the cluster; the version of python-crush is 1.0.32
>>>
>>> 1. ceph report > report.json
>>> 2. crush optimize --crushmap report.json --out-path optimized.crush 
>>> Unfortunately, there was an error in the output:
>>>
>>> 2017-05-30 18:48:01,803 42.1 map to [4, 9] instead of [4, 8]
>>> 2017-05-30 18:48:01,838 49.3af map to [9, 2] instead of [9, 3]
>>> 2017-05-30 18:48:01,838 49.e3 map to [6, 4] instead of [6, 5]
>>> 2017-05-30 18:48:01,838 49.e1 map to [7, 2] instead of [7, 3]
>>> 2017-05-30 18:48:01,838 49.e0 map to [5, 1] instead of [5, 0]
>>> 2017-05-30 18:48:01,838 49.20d map to [3, 1] instead of [3, 0]
>>> 2017-05-30 18:48:01,838 49.20c map to [2, 9] instead of [2, 8]
>>> 2017-05-30 18:48:01,838 49.36e map to [6, 1] instead of [6, 0] ......
>>>
>>> Traceback (most recent call last):
>>>   File "/usr/bin/crush", line 25, in <module>
>>>     sys.exit(Ceph().main(sys.argv[1:]))
>>>   File "/usr/lib64/python2.7/site-packages/crush/main.py", line 136, in main
>>>     return self.constructor(argv).run()
>>>   File "/usr/lib64/python2.7/site-packages/crush/optimize.py", line 373, in run
>>>     crushmap = self.main.convert_to_crushmap(self.args.crushmap)
>>>   File "/usr/lib64/python2.7/site-packages/crush/ceph/__init__.py", line 690, in convert_to_crushmap
>>>     c.parse(crushmap)
>>>   File "/usr/lib64/python2.7/site-packages/crush/__init__.py", line 138, in parse
>>>     return self.parse_crushmap(self._convert_to_crushmap(something))
>>>   File "/usr/lib64/python2.7/site-packages/crush/ceph/__init__.py", line 416, in _convert_to_crushmap
>>>     crushmap = CephReport().parse_report(something)
>>>   File "/usr/lib64/python2.7/site-packages/crush/ceph/__init__.py", line 137, in parse_report
>>>     raise MappingError("some mapping failed, please file a bug at "
>>> crush.ceph.MappingError: some mapping failed, please file a bug at http://libcrush.org/main/python-crush/issues/new
>>> Do you know what the problem is? Can you help me? I would be very grateful to you.
>>
>> This is a safeguard to make sure python-crush maps exactly as expected. I'm not sure yet why there is a difference but I'll work on that, using the crush implementation found in hammer 0.94.10. For your information, the full output of:
>>
>> $ crush analyze --crushmap /tmp/han-vincent-report.json
>>
>> is at https://paste2.org/PyeHe2dC. What I find strange is that your output regarding pool 42 is different from mine. You have:
>>
>>
>> 2017-05-30 18:48:01,803 42.1 map to [4, 9] instead of [4, 8]
>>
>> and I have
>>
>> 2017-05-31 12:55:04,207 42.3 map to [4, 3] instead of [4, 2]
>> 2017-05-31 12:55:04,207 42.7 map to [8, 0] instead of [8, 1]
>> 2017-05-31 12:55:04,207 42.1 map to [4, 9] instead of [4, 8]
>>
>> I wonder if that's a sign that the changes to the crushmap following your change to straw2 are still going on. Would you mind sending me the output of ceph report (please run it again after receiving this mail)?
>>
>> Cheers
>>
> 
>>--
>>Loïc Dachary, Artisan Logiciel Libre

-- 
Loïc Dachary, Artisan Logiciel Libre


