Re: A problem when restarting OSD

I'm glad you mentioned this, because I've been running into the same
issue and it took me a while to figure out too.

Is this new behaviour? I don't remember running into it before...

Sage does mention multiple trees, but I've had this happen with a single
root.
It is definitely not my expectation that restarting an OSD would move
things around in the crush map.

I'm in the process of developing a crush map; it looks like this (note:
it's unfinished and doesn't make much sense as is):
http://pastebin.com/6vBUQTCk
This results in this tree:
# id	weight	type name	up/down	reweight
-1	18	root default
-2	9		host osd02
-4	2			disktype osd02_ssd
3	1				osd.3	up	1
9	1				osd.9	up	1
-5	7			disktype osd02_spinning
8	1				osd.8	up	1
17	1				osd.17	up	1
5	1				osd.5	up	1
11	1				osd.11	up	1
1	1				osd.1	up	1
13	1				osd.13	up	1
15	1				osd.15	up	1
-3	9		host osd01
-6	2			disktype osd01_ssd
2	1				osd.2	up	1
7	1				osd.7	up	1
-7	7			disktype osd01_spinning
0	1				osd.0	up	1
4	1				osd.4	up	1
12	1				osd.12	up	1
6	1				osd.6	up	1
14	1				osd.14	up	1
10	1				osd.10	up	1
16	1				osd.16	up	1

Simply restarting the OSDs on both hosts modifies the crush map:
http://pastebin.com/rP8Y8qcH
With the resulting tree:
# id	weight	type name	up/down	reweight
-1	18	root default
-2	9		host osd02
-4	0			disktype osd02_ssd
-5	0			disktype osd02_spinning
13	1			osd.13	up	1
3	1			osd.3	up	1
5	1			osd.5	up	1
1	1			osd.1	up	1
11	1			osd.11	up	1
15	1			osd.15	up	1
17	1			osd.17	up	1
8	1			osd.8	up	1
9	1			osd.9	up	1
-3	9		host osd01
-6	0			disktype osd01_ssd
-7	0			disktype osd01_spinning
0	1			osd.0	up	1
10	1			osd.10	up	1
12	1			osd.12	up	1
14	1			osd.14	up	1
16	1			osd.16	up	1
2	1			osd.2	up	1
4	1			osd.4	up	1
7	1			osd.7	up	1
6	1			osd.6	up	1

Would a hook really be the solution I need?
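If so, I imagine it would boil down to something along these lines (a
rough, untested sketch; I'm assuming the hook is invoked the same way as
the stock ceph-crush-location script, i.e. with --cluster/--id/--type,
and just has to print the "key=value ..." location on stdout):

#!/bin/sh
# Sketch of a crush location hook that derives the disktype bucket from
# the rotational flag of the device backing the OSD's data directory.
while [ $# -ge 1 ]; do
    case "$1" in
        --cluster) cluster="$2"; shift ;;
        --id)      id="$2"; shift ;;
        --type)    daemon="$2"; shift ;;
    esac
    shift
done

host=$(hostname -s)
data="/var/lib/ceph/osd/${cluster:-ceph}-${id}"

# Find the block device behind the data dir and check whether it spins.
dev=$(df -P "$data" | awk 'NR==2 {print $1}')
disk=$(basename "$dev" | sed 's/[0-9]*$//')
rotational=$(cat "/sys/block/${disk}/queue/rotational" 2>/dev/null)

if [ "$rotational" = "0" ]; then
    echo "root=default host=${host} disktype=${host}_ssd"
else
    echo "root=default host=${host} disktype=${host}_spinning"
fi

That way the disktype comes from the hardware instead of per-OSD entries
in ceph.conf, but it still feels odd that a restart can undo manual edits
to the map.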
-- 
David Moreau Simard

On 2014-08-21, 9:36 PM, "Wang, Zhiqiang" <zhiqiang.wang@xxxxxxxxx> wrote:

>Hi Sage,
>
>Yes, I understand that we can customize the crush location hook to send
>the OSD to the right location. But is the Ceph user aware of this if
>he/she has more than one root in the crush map? At least I didn't know
>about it at the beginning. We need to either emphasize this or handle it
>for the user in some way.
>
>One question about the hot-swapping support for moving an OSD to another
>host: what if the journal is not located on the same disk as the OSD? Is
>the OSD still able to become available in the cluster?
>
>-----Original Message-----
>From: Sage Weil [mailto:sweil@xxxxxxxxxx]
>Sent: Thursday, August 21, 2014 11:28 PM
>To: Wang, Zhiqiang
>Cc: 'ceph-devel@xxxxxxxxxxxxxxx'
>Subject: Re: A problem when restarting OSD
>
>On Thu, 21 Aug 2014, Wang, Zhiqiang wrote:
>> Hi all,
>> 
>> I ran into a problem when restarting an OSD.
>> 
>> Here is my OSD tree before restarting the OSD:
>> 
>> # id    weight  type name       up/down reweight
>> -6      8       root ssd
>> -4      4               host zqw-s1-ssd
>> 16      1                       osd.16  up      1
>> 17      1                       osd.17  up      1
>> 18      1                       osd.18  up      1
>> 19      1                       osd.19  up      1
>> -5      4               host zqw-s2-ssd
>> 20      1                       osd.20  up      1
>> 21      1                       osd.21  up      1
>> 22      1                       osd.22  up      1
>> 23      1                       osd.23  up      1
>> -1      14.56   root default
>> -2      7.28            host zqw-s1
>> 0       0.91                    osd.0   up      1
>> 1       0.91                    osd.1   up      1
>> 2       0.91                    osd.2   up      1
>> 3       0.91                    osd.3   up      1
>> 4       0.91                    osd.4   up      1
>> 5       0.91                    osd.5   up      1
>> 6       0.91                    osd.6   up      1
>> 7       0.91                    osd.7   up      1
>> -3      7.28            host zqw-s2
>> 8       0.91                    osd.8   up      1
>> 9       0.91                    osd.9   up      1
>> 10      0.91                    osd.10  up      1
>> 11      0.91                    osd.11  up      1
>> 12      0.91                    osd.12  up      1
>> 13      0.91                    osd.13  up      1
>> 14      0.91                    osd.14  up      1
>> 15      0.91                    osd.15  up      1
>> 
>> After I restart one of the OSDs with an id from 16 to 23, say
>> osd.16, it goes to 'root default' and 'host zqw-s1', and the Ceph
>> cluster begins to rebalance. This is surely not what I want.
>> 
>> # id    weight  type name       up/down reweight
>> -6      7       root ssd
>> -4      3               host zqw-s1-ssd
>> 17      1                       osd.17  up      1
>> 18      1                       osd.18  up      1
>> 19      1                       osd.19  up      1
>> -5      4               host zqw-s2-ssd
>> 20      1                       osd.20  up      1
>> 21      1                       osd.21  up      1
>> 22      1                       osd.22  up      1
>> 23      1                       osd.23  up      1
>> -1      15.56   root default
>> -2      8.28            host zqw-s1
>> 0       0.91                    osd.0   up      1
>> 1       0.91                    osd.1   up      1
>> 2       0.91                    osd.2   up      1
>> 3       0.91                    osd.3   up      1
>> 4       0.91                    osd.4   up      1
>> 5       0.91                    osd.5   up      1
>> 6       0.91                    osd.6   up      1
>> 7       0.91                    osd.7   up      1
>> 16      1                       osd.16  up      1
>> -3      7.28            host zqw-s2
>> 8       0.91                    osd.8   up      1
>> 9       0.91                    osd.9   up      1
>> 10      0.91                    osd.10  up      1
>> 11      0.91                    osd.11  up      1
>> 12      0.91                    osd.12  up      1
>> 13      0.91                    osd.13  up      1
>> 14      0.91                    osd.14  up      1
>> 15      0.91                    osd.15  up      1
>> 
>> After digging into the problem, I found that it's because the ceph init
>> script changes the OSD's crush location. It uses the 'ceph-crush-location'
>> script to get the crush location for the restarting OSD from ceph.conf.
>> If there is no such entry in ceph.conf, it falls back to the default
>> 'host=$(hostname -s) root=default'. Since I don't have a crush location
>> configured in my ceph.conf (I guess most people don't), when I restart
>> osd.16 it goes to 'root default' and 'host zqw-s1'.
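>> In other words, with no crush location in ceph.conf, on restart the
>> init script effectively runs something like
>>
>>   ceph osd crush create-or-move -- 16 <weight> host=zqw-s1 root=default
>>
>> which is why osd.16 lands under 'host zqw-s1' in 'root default'.
>> Pinning it would presumably mean adding something like
>>
>>   [osd.16]
>>           crush location = root=ssd host=zqw-s1-ssd
>>
>> to ceph.conf (assuming 'crush location' is the key that
>> ceph-crush-location looks up).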
>> 
>> Here is a fix for this:
>> When the ceph init script uses 'ceph osd crush create-or-move' to
>> change the OSD's crush location, do a check first: if the OSD already
>> exists in the crush map, return without changing its location. The
>> change is at:
>> https://github.com/wonzhq/ceph/commit/efdfa23664caa531390d141bd1539878761412fe
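>> Conceptually the guard is just something like this (a sketch of the
>> idea, not the actual patch; the variable names in init-ceph are
>> approximate):
>>
>>   # before calling create-or-move in the init script
>>   if ceph --cluster "$cluster" osd crush dump | \
>>           grep -q "\"osd.$id\""; then
>>       :   # osd.$id is already in the crush map; leave it where it is
>>   else
>>       ceph --cluster "$cluster" osd crush create-or-move -- "$id" \
>>           "$osd_weight" $osd_location
>>   fi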
>> 
>> What do you think?
>
>The goal of this behavior is to allow hot-swapping of devices.  You can
>pull disks out of one host and put them in another and the udev machinery
>will start up the daemon, update the crush location, and the disk and
>data will become available.  It's not 'ideal' in the sense that there
>will be rebalancing, but it does make the data available to the cluster
>to preserve data safety.
>
>We haven't come up with a great scheme for managing multiple trees yet.
>The idea is that the ceph-crush-location hook can be customized to do
>whatever is necessary, for example by putting root=ssd if the device type
>appears to be an ssd (maybe look at the sysfs metadata, or put a marker
>file in the osd data directory?).  You can point to your own hook for
>your environment with
>
>  osd crush location hook = /path/to/my/script
>
>sage
>
