Hi Wang,

Thanks, I'll try that for the time being.

This still raises a few questions I'd like to discuss.

I think we can agree that the CRUSH map is ultimately the authority on where the devices are currently located. My understanding is that, when restarting OSDs (as in this case), we are relying on another source for device location: the ceph.conf file.

1) Does this imply that we probably shouldn't specify device locations directly in the CRUSH map, but in our ceph.conf file instead?
2) If what is in the CRUSH map differs from what is configured in ceph.conf, how does Ceph decide which one is the authority? Shouldn't it be the CRUSH map? In this case, it appears to be the ceph.conf file.

Just trying to wrap my head around the vision of how things should be managed.
--
David Moreau Simard

On 2014-08-21, 10:57 PM, "Wang, Zhiqiang" <zhiqiang.wang@xxxxxxxxx> wrote:

>Hi David,
>
>Yes, I think adding a hook in your ceph.conf can solve your problem. At
>least this is what I did, and it solves the problem.
>
>For example:
>
>[osd.3]
>osd crush location = "host=osd02 root=default disktype=osd02_ssd"
>
>You need to add this for every osd.
>
>-----Original Message-----
>From: David Moreau Simard [mailto:dmsimard@xxxxxxxx]
>Sent: Friday, August 22, 2014 10:34 AM
>To: Wang, Zhiqiang; Sage Weil
>Cc: 'ceph-devel@xxxxxxxxxxxxxxx'
>Subject: Re: A problem when restarting OSD
>
>I'm glad you mention this because I've also been running into the same
>issue and this took me a while to figure out too.
>
>Is this new behaviour? I don't remember running into this before...
>
>Sage does mention multiple trees but I've had this happen with a single
>root.
>It is definitely not my expectation that restarting an OSD would move
>things around in the crush map.
>
>I'm in the process of developing a crush map, looks like this (note:
>unfinished and does not make much sense as is):
>http://pastebin.com/6vBUQTCk
>This results in this tree:
># id  weight  type name     up/down  reweight
>-1    18      root default
>-2    9         host osd02
>-4    2           disktype osd02_ssd
>3     1             osd.3   up       1
>9     1             osd.9   up       1
>-5    7           disktype osd02_spinning
>8     1             osd.8   up       1
>17    1             osd.17  up       1
>5     1             osd.5   up       1
>11    1             osd.11  up       1
>1     1             osd.1   up       1
>13    1             osd.13  up       1
>15    1             osd.15  up       1
>-3    9         host osd01
>-6    2           disktype osd01_ssd
>2     1             osd.2   up       1
>7     1             osd.7   up       1
>-7    7           disktype osd01_spinning
>0     1             osd.0   up       1
>4     1             osd.4   up       1
>12    1             osd.12  up       1
>6     1             osd.6   up       1
>14    1             osd.14  up       1
>10    1             osd.10  up       1
>16    1             osd.16  up       1
>
>Only restarting the OSDs on both hosts modifies the crush map:
>http://pastebin.com/rP8Y8qcH
>With the resulting tree:
># id  weight  type name     up/down  reweight
>-1    18      root default
>-2    9         host osd02
>-4    0           disktype osd02_ssd
>-5    0           disktype osd02_spinning
>13    1           osd.13    up       1
>3     1           osd.3     up       1
>5     1           osd.5     up       1
>1     1           osd.1     up       1
>11    1           osd.11    up       1
>15    1           osd.15    up       1
>17    1           osd.17    up       1
>8     1           osd.8     up       1
>9     1           osd.9     up       1
>-3    9         host osd01
>-6    0           disktype osd01_ssd
>-7    0           disktype osd01_spinning
>0     1           osd.0     up       1
>10    1           osd.10    up       1
>12    1           osd.12    up       1
>14    1           osd.14    up       1
>16    1           osd.16    up       1
>2     1           osd.2     up       1
>4     1           osd.4     up       1
>7     1           osd.7     up       1
>6     1           osd.6     up       1
>
>Would a hook really be the solution I need?
>--
>David Moreau Simard
>
>On 2014-08-21, 9:36 PM, "Wang, Zhiqiang" <zhiqiang.wang@xxxxxxxxx>
>wrote:
>
>>Hi Sage,
>>
>>Yes, I understand that we can customize the crush location hook to let
>>the OSD go to the right location. But does the ceph user have the idea
>>of this if he/she has more than 1 root in the crush map? At least I
>>don't know this at the beginning. We need to either emphasize this or
>>do it in some ways for the user.
>>
>>One question for the hot-swapping support of moving an OSD to another
>>host. What if the journal is not located at the same disk of the OSD?
>>Is the OSD still able to be available in the cluster?
>>
>>-----Original Message-----
>>From: Sage Weil [mailto:sweil@xxxxxxxxxx]
>>Sent: Thursday, August 21, 2014 11:28 PM
>>To: Wang, Zhiqiang
>>Cc: 'ceph-devel@xxxxxxxxxxxxxxx'
>>Subject: Re: A problem when restarting OSD
>>
>>On Thu, 21 Aug 2014, Wang, Zhiqiang wrote:
>>> Hi all,
>>>
>>> I ran into a problem when restarting an OSD.
>>>
>>> Here is my OSD tree before restarting the OSD:
>>>
>>> # id  weight  type name     up/down  reweight
>>> -6    8       root ssd
>>> -4    4         host zqw-s1-ssd
>>> 16    1           osd.16    up       1
>>> 17    1           osd.17    up       1
>>> 18    1           osd.18    up       1
>>> 19    1           osd.19    up       1
>>> -5    4         host zqw-s2-ssd
>>> 20    1           osd.20    up       1
>>> 21    1           osd.21    up       1
>>> 22    1           osd.22    up       1
>>> 23    1           osd.23    up       1
>>> -1    14.56   root default
>>> -2    7.28      host zqw-s1
>>> 0     0.91        osd.0     up       1
>>> 1     0.91        osd.1     up       1
>>> 2     0.91        osd.2     up       1
>>> 3     0.91        osd.3     up       1
>>> 4     0.91        osd.4     up       1
>>> 5     0.91        osd.5     up       1
>>> 6     0.91        osd.6     up       1
>>> 7     0.91        osd.7     up       1
>>> -3    7.28      host zqw-s2
>>> 8     0.91        osd.8     up       1
>>> 9     0.91        osd.9     up       1
>>> 10    0.91        osd.10    up       1
>>> 11    0.91        osd.11    up       1
>>> 12    0.91        osd.12    up       1
>>> 13    0.91        osd.13    up       1
>>> 14    0.91        osd.14    up       1
>>> 15    0.91        osd.15    up       1
>>>
>>> After I restart one of the OSD with id from 16 to 23, say restarting
>>>osd.16, osd.16 goes to 'root default' and 'host zqw-s1', and ceph
>>>cluster begins to do rebalance. This surely is not what I want.
>>>
>>> # id  weight  type name     up/down  reweight
>>> -6    7       root ssd
>>> -4    3         host zqw-s1-ssd
>>> 17    1           osd.17    up       1
>>> 18    1           osd.18    up       1
>>> 19    1           osd.19    up       1
>>> -5    4         host zqw-s2-ssd
>>> 20    1           osd.20    up       1
>>> 21    1           osd.21    up       1
>>> 22    1           osd.22    up       1
>>> 23    1           osd.23    up       1
>>> -1    15.56   root default
>>> -2    8.28      host zqw-s1
>>> 0     0.91        osd.0     up       1
>>> 1     0.91        osd.1     up       1
>>> 2     0.91        osd.2     up       1
>>> 3     0.91        osd.3     up       1
>>> 4     0.91        osd.4     up       1
>>> 5     0.91        osd.5     up       1
>>> 6     0.91        osd.6     up       1
>>> 7     0.91        osd.7     up       1
>>> 16    1           osd.16    up       1
>>> -3    7.28      host zqw-s2
>>> 8     0.91        osd.8     up       1
>>> 9     0.91        osd.9     up       1
>>> 10    0.91        osd.10    up       1
>>> 11    0.91        osd.11    up       1
>>> 12    0.91        osd.12    up       1
>>> 13    0.91        osd.13    up       1
>>> 14    0.91        osd.14    up       1
>>> 15    0.91        osd.15    up       1
>>>
>>> After digging into the problem, I find it's because in the ceph init
>>>script, we change the OSD's crush location in some way. It uses the
>>>script 'ceph-crush-location' to get the crush location from the
>>>ceph.conf file for the restarting OSD. If there isn't such an entry in
>>>ceph.conf, it uses the default one 'host=$(hostname -s) root=default'.
>>>Since I don't have the crush location configuration in my ceph.conf (I
>>>guess most of people don't have this in their ceph.conf), when I
>>>restarting osd.16, it goes to 'root default' and 'host zqw-s1'.
>>>
>>> Here is a fix for this:
>>> When the ceph init script uses 'ceph osd crush create-or-move' to
>>> change the OSD's crush location, do a check first, if this OSD is
>>> already existing in the crush map, return without making the location
>>> change. This change is at:
>>> https://github.com/wonzhq/ceph/commit/efdfa23664caa531390d141bd1539878761412fe
>>>
>>> What do you think?
>>
>>The goal of this behavior is to allow hot-swapping of devices. You can
>>pull disks out of one host and put them in another and the udev
>>machinery will start up the daemon, update the crush location, and the
>>disk and data will become available.
>>It's not 'ideal' in the sense that there will be rebalancing, but it
>>does make the data available to the cluster to preserve data safety.
>>
>>We haven't come up with a great scheme for managing multiple trees
>>yet.
>>The idea is that the ceph-crush-location hook can be customized to do
>>whatever is necessary, for example by putting root=ssd if the device
>>type appears to be an ssd (maybe look at the sysfs metadata, or put a
>>marker file in the osd data directory?). You can point to your own
>>hook for your environment with
>>
>> osd crush location hook = /path/to/my/script
>>
>>sage
>>
>>--
>>To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>in the body of a message to majordomo@xxxxxxxxxxxxxxx
>>More majordomo info at http://vger.kernel.org/majordomo-info.html
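
For reference, the custom hook Sage describes (choosing the root from a marker file in the OSD data directory) could be sketched roughly as follows. This is only an illustration, not the ceph-crush-location script shipped with Ceph. It assumes the hook is invoked with --cluster, --id and --type and prints a single line of key=value pairs, and that the OSD data directory is at the default /var/lib/ceph/osd/<cluster>-<id>. The marker file name "crush_root" and the "<host>-<root>" bucket naming (modelled on zqw-s1-ssd from the tree above) are hypothetical conventions, not Ceph features.

```python
#!/usr/bin/env python3
"""Illustrative CRUSH location hook; a sketch, not Ceph's own script."""
import argparse
import os
import socket
import sys


def read_marker(osd_data_dir, marker_name="crush_root"):
    """Return the root named in the marker file, or None if there is none.

    The marker file name is a hypothetical convention for this sketch.
    """
    path = os.path.join(osd_data_dir, marker_name)
    try:
        with open(path) as f:
            value = f.read().strip()
    except OSError:
        # No marker (or unreadable): let the caller fall back to the default.
        return None
    return value or None


def crush_location(osd_data_dir, short_hostname, default_root="default"):
    """Build the one-line location string for an OSD."""
    root = read_marker(osd_data_dir) or default_root
    # Hosts under a non-default root get a suffixed bucket name, mirroring
    # the zqw-s1 / zqw-s1-ssd naming from the thread (an assumption).
    if root == default_root:
        host = short_hostname
    else:
        host = f"{short_hostname}-{root}"
    return f"host={host} root={root}"


def main(argv=None):
    parser = argparse.ArgumentParser(description="CRUSH location hook sketch")
    parser.add_argument("--cluster", default="ceph")
    parser.add_argument("--id", required=True)
    parser.add_argument("--type", default="osd")
    # Tolerate any extra arguments rather than failing on them.
    args, _ = parser.parse_known_args(argv)
    # Assumed default data path layout: /var/lib/ceph/osd/<cluster>-<id>
    data_dir = f"/var/lib/ceph/osd/{args.cluster}-{args.id}"
    short_hostname = socket.gethostname().split(".")[0]
    print(crush_location(data_dir, short_hostname))


# The hook is expected to be called with arguments, so only act as a hook
# when some are present.
if __name__ == "__main__" and len(sys.argv) > 1:
    main()
```

To try it under these assumptions, write "ssd" into /var/lib/ceph/osd/ceph-16/crush_root, make the script executable, and point ceph.conf at it with the "osd crush location hook = /path/to/my/script" setting quoted above. An OSD without a marker file still gets the default 'host=<short hostname> root=default'.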