Ah, that does clear things up! I didn't even know there was a toggle for
'osd crush update on start' - my bad. I searched through the documentation
and couldn't find anything on that topic.

Perhaps we should add a bit about that here:
http://ceph.com/docs/master/rados/operations/crush-map/#crush-location

I'll open a pull request.
--
David Moreau Simard

On 2014-08-22, 10:06 AM, "Sage Weil" <sweil@xxxxxxxxxx> wrote:

>On Fri, 22 Aug 2014, David Moreau Simard wrote:
>> Hi Wang,
>>
>> Thanks, I'll try that for the time being. This still raises a few
>> questions I'd like to discuss.
>>
>> I'm convinced we can agree that the CRUSH map is ultimately the
>> authority on where the devices currently are.
>> My understanding is that we are relying on another source for device
>> location when (in this case) restarting OSDs: the ceph.conf file.
>>
>> 1) Does this imply that we probably shouldn't specify device locations
>> directly in the crush map but in our ceph.conf file instead?
>> 2) If what is in the crush map is different from what is configured in
>> ceph.conf, how does Ceph decide which is the authority? Shouldn't it be
>> the crush map? In this case, it appears to be the ceph.conf file.
>>
>> Just trying to wrap my head around the vision of how things should be
>> managed.
>
>Generally speaking, you have three options:
>
> - 'osd crush update on start = false' and do it all manually, like
>   you're used to.
> - set 'crush location = a=b c=d e=f' in ceph.conf. The expectation is
>   that chef or puppet or whatever will fill this in with "host=foo
>   rack=bar dc=asdf".
> - customize ceph-crush-location to do something trickier (like multiple
>   trees)
>
>sage
>
>
>> --
>> David Moreau Simard
>>
>>
>> On 2014-08-21, 10:57 PM, "Wang, Zhiqiang" <zhiqiang.wang@xxxxxxxxx>
>> wrote:
>>
>> >Hi David,
>> >
>> >Yes, I think adding a hook in your ceph.conf can solve your problem. At
>> >least this is what I did, and it solves the problem.
>> >
>> >For example:
>> >
>> >[osd.3]
>> >osd crush location = "host=osd02 root=default disktype=osd02_ssd"
>> >
>> >You need to add this for every osd.
>> >
>> >-----Original Message-----
>> >From: David Moreau Simard [mailto:dmsimard@xxxxxxxx]
>> >Sent: Friday, August 22, 2014 10:34 AM
>> >To: Wang, Zhiqiang; Sage Weil
>> >Cc: 'ceph-devel@xxxxxxxxxxxxxxx'
>> >Subject: Re: A problem when restarting OSD
>> >
>> >I'm glad you mention this because I've also been running into the same
>> >issue, and it took me a while to figure out too.
>> >
>> >Is this new behaviour? I don't remember running into this before...
>> >
>> >Sage does mention multiple trees, but I've had this happen with a
>> >single root.
>> >It is definitely not my expectation that restarting an OSD would move
>> >things around in the crush map.
>> >
>> >I'm in the process of developing a crush map that looks like this
>> >(note: unfinished and does not make much sense as is):
>> >http://pastebin.com/6vBUQTCk
>> >This results in this tree:
>> >
>> ># id  weight  type name                   up/down  reweight
>> >-1    18      root default
>> >-2    9           host osd02
>> >-4    2               disktype osd02_ssd
>> >3     1                   osd.3           up       1
>> >9     1                   osd.9           up       1
>> >-5    7               disktype osd02_spinning
>> >8     1                   osd.8           up       1
>> >17    1                   osd.17          up       1
>> >5     1                   osd.5           up       1
>> >11    1                   osd.11          up       1
>> >1     1                   osd.1           up       1
>> >13    1                   osd.13          up       1
>> >15    1                   osd.15          up       1
>> >-3    9           host osd01
>> >-6    2               disktype osd01_ssd
>> >2     1                   osd.2           up       1
>> >7     1                   osd.7           up       1
>> >-7    7               disktype osd01_spinning
>> >0     1                   osd.0           up       1
>> >4     1                   osd.4           up       1
>> >12    1                   osd.12          up       1
>> >6     1                   osd.6           up       1
>> >14    1                   osd.14          up       1
>> >10    1                   osd.10          up       1
>> >16    1                   osd.16          up       1
>> >
>> >Merely restarting the OSDs on both hosts modifies the crush map:
>> >http://pastebin.com/rP8Y8qcH
>> >With the resulting tree:
>> >
>> ># id  weight  type name                   up/down  reweight
>> >-1    18      root default
>> >-2    9           host osd02
>> >-4    0               disktype osd02_ssd
>> >-5    0               disktype osd02_spinning
>> >13    1               osd.13              up       1
>> >3     1               osd.3               up       1
>> >5     1               osd.5               up       1
>> >1     1               osd.1               up       1
>> >11    1               osd.11              up       1
>> >15    1               osd.15              up       1
>> >17    1               osd.17              up       1
>> >8     1               osd.8               up       1
>> >9     1               osd.9               up       1
>> >-3    9           host osd01
>> >-6    0               disktype osd01_ssd
>> >-7    0               disktype osd01_spinning
>> >0     1               osd.0               up       1
>> >10    1               osd.10              up       1
>> >12    1               osd.12              up       1
>> >14    1               osd.14              up       1
>> >16    1               osd.16              up       1
>> >2     1               osd.2               up       1
>> >4     1               osd.4               up       1
>> >7     1               osd.7               up       1
>> >6     1               osd.6               up       1
>> >
>> >Would a hook really be the solution I need?
>> >--
>> >David Moreau Simard
>> >
>> >On 2014-08-21, 9:36 PM, "Wang, Zhiqiang" <zhiqiang.wang@xxxxxxxxx>
>> >wrote:
>> >
>> >>Hi Sage,
>> >>
>> >>Yes, I understand that we can customize the crush location hook to let
>> >>the OSD go to the right location. But does a ceph user have any idea
>> >>of this if he/she has more than 1 root in the crush map? At least I
>> >>didn't know this at the beginning. We need to either emphasize this or
>> >>handle it in some way for the user.
>> >>
>> >>One question about the hot-swapping support for moving an OSD to
>> >>another host: what if the journal is not located on the same disk as
>> >>the OSD? Is the OSD still able to be available in the cluster?
>> >>
>> >>-----Original Message-----
>> >>From: Sage Weil [mailto:sweil@xxxxxxxxxx]
>> >>Sent: Thursday, August 21, 2014 11:28 PM
>> >>To: Wang, Zhiqiang
>> >>Cc: 'ceph-devel@xxxxxxxxxxxxxxx'
>> >>Subject: Re: A problem when restarting OSD
>> >>
>> >>On Thu, 21 Aug 2014, Wang, Zhiqiang wrote:
>> >>> Hi all,
>> >>>
>> >>> I ran into a problem when restarting an OSD.
>> >>>
>> >>> Here is my OSD tree before restarting the OSD:
>> >>>
>> >>> # id  weight  type name            up/down  reweight
>> >>> -6    8       root ssd
>> >>> -4    4           host zqw-s1-ssd
>> >>> 16    1               osd.16       up       1
>> >>> 17    1               osd.17       up       1
>> >>> 18    1               osd.18       up       1
>> >>> 19    1               osd.19       up       1
>> >>> -5    4           host zqw-s2-ssd
>> >>> 20    1               osd.20       up       1
>> >>> 21    1               osd.21       up       1
>> >>> 22    1               osd.22       up       1
>> >>> 23    1               osd.23       up       1
>> >>> -1    14.56   root default
>> >>> -2    7.28        host zqw-s1
>> >>> 0     0.91            osd.0        up       1
>> >>> 1     0.91            osd.1        up       1
>> >>> 2     0.91            osd.2        up       1
>> >>> 3     0.91            osd.3        up       1
>> >>> 4     0.91            osd.4        up       1
>> >>> 5     0.91            osd.5        up       1
>> >>> 6     0.91            osd.6        up       1
>> >>> 7     0.91            osd.7        up       1
>> >>> -3    7.28        host zqw-s2
>> >>> 8     0.91            osd.8        up       1
>> >>> 9     0.91            osd.9        up       1
>> >>> 10    0.91            osd.10       up       1
>> >>> 11    0.91            osd.11       up       1
>> >>> 12    0.91            osd.12       up       1
>> >>> 13    0.91            osd.13       up       1
>> >>> 14    0.91            osd.14       up       1
>> >>> 15    0.91            osd.15       up       1
>> >>>
>> >>> After I restart one of the OSDs with an id from 16 to 23, say
>> >>> osd.16, osd.16 goes to 'root default' and 'host zqw-s1', and the
>> >>> ceph cluster begins to rebalance. This is surely not what I want.
>> >>>
>> >>> # id  weight  type name            up/down  reweight
>> >>> -6    7       root ssd
>> >>> -4    3           host zqw-s1-ssd
>> >>> 17    1               osd.17       up       1
>> >>> 18    1               osd.18       up       1
>> >>> 19    1               osd.19       up       1
>> >>> -5    4           host zqw-s2-ssd
>> >>> 20    1               osd.20       up       1
>> >>> 21    1               osd.21       up       1
>> >>> 22    1               osd.22       up       1
>> >>> 23    1               osd.23       up       1
>> >>> -1    15.56   root default
>> >>> -2    8.28        host zqw-s1
>> >>> 0     0.91            osd.0        up       1
>> >>> 1     0.91            osd.1        up       1
>> >>> 2     0.91            osd.2        up       1
>> >>> 3     0.91            osd.3        up       1
>> >>> 4     0.91            osd.4        up       1
>> >>> 5     0.91            osd.5        up       1
>> >>> 6     0.91            osd.6        up       1
>> >>> 7     0.91            osd.7        up       1
>> >>> 16    1               osd.16       up       1
>> >>> -3    7.28        host zqw-s2
>> >>> 8     0.91            osd.8        up       1
>> >>> 9     0.91            osd.9        up       1
>> >>> 10    0.91            osd.10       up       1
>> >>> 11    0.91            osd.11       up       1
>> >>> 12    0.91            osd.12       up       1
>> >>> 13    0.91            osd.13       up       1
>> >>> 14    0.91            osd.14       up       1
>> >>> 15    0.91            osd.15       up       1
>> >>>
>> >>> After digging into the problem, I found it's because the ceph init
>> >>> script changes the OSD's crush location. It uses the script
>> >>> 'ceph-crush-location' to get the crush location for the restarting
>> >>> OSD from the ceph.conf file. If there isn't such an entry in
>> >>> ceph.conf, it uses the default one, 'host=$(hostname -s)
>> >>> root=default'. Since I don't have the crush location configuration
>> >>> in my ceph.conf (I guess most people don't have this in their
>> >>> ceph.conf), when I restart osd.16 it goes to 'root default' and
>> >>> 'host zqw-s1'.
>> >>>
>> >>> Here is a fix for this: when the ceph init script uses 'ceph osd
>> >>> crush create-or-move' to change the OSD's crush location, do a
>> >>> check first; if the OSD already exists in the crush map, return
>> >>> without changing its location. This change is at:
>> >>> https://github.com/wonzhq/ceph/commit/efdfa23664caa531390d141bd1539878761412fe
>> >>>
>> >>> What do you think?
>> >>
>> >>The goal of this behavior is to allow hot-swapping of devices. You can
>> >>pull disks out of one host and put them in another, and the udev
>> >>machinery will start up the daemon, update the crush location, and the
>> >>disk and data will become available. It's not 'ideal' in the sense
>> >>that there will be rebalancing, but it does make the data available to
>> >>the cluster to preserve data safety.
>> >>
>> >>We haven't come up with a great scheme for managing multiple trees
>> >>yet.
>> >>The idea is that the ceph-crush-location hook can be customized to do
>> >>whatever is necessary, for example by putting root=ssd if the device
>> >>type appears to be an ssd (maybe look at the sysfs metadata, or put a
>> >>marker file in the osd data directory?). You can point to your own
>> >>hook for your environment with
>> >>
>> >> osd crush location hook = /path/to/my/script
>> >>
>> >>sage
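
To tie the thread together with concrete snippets: Sage's first option is a
single ceph.conf setting. A minimal sketch, assuming you want the old fully
manual behaviour for every OSD (an [osd] section applies to all OSD daemons
reading this ceph.conf):

    [osd]
    # Never let the init script / udev machinery move this OSD within the
    # crush map on startup; crush locations are then managed entirely by
    # hand with 'ceph osd crush ...' commands.
    osd crush update on start = false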
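
Sage's second option is the one Wang shows above. Applied to the disktype
tree in David's first pastebin, it would look roughly like this; the osd.3
section is Wang's example verbatim, the osd.9 and osd.8 sections are
extrapolated from the tree output, and every OSD on both hosts would need a
matching entry:

    [osd.3]
    osd crush location = "host=osd02 root=default disktype=osd02_ssd"

    [osd.9]
    osd crush location = "host=osd02 root=default disktype=osd02_ssd"

    [osd.8]
    osd crush location = "host=osd02 root=default disktype=osd02_spinning"

With these entries in place, a restarting OSD is moved back to (or kept in)
its disktype bucket instead of falling back to the default
'host=$(hostname -s) root=default' location.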
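
Wang's proposed fix amounts to guarding the init script's create-or-move
call with an existence check. A rough sketch of the idea only; the
grep-based test and the variable names ($id, $osd_weight, $osd_location)
are illustrative assumptions, and the linked commit is the authoritative
version:

    # Skip the location update if osd.$id is already placed somewhere in
    # the crush map, so that a plain restart never moves it.
    if ceph osd tree | grep -qw "osd\.$id"; then
        :   # already in the crush map; leave its location alone
    else
        # $osd_location expands to key=value pairs such as
        # "host=osd02 root=default disktype=osd02_ssd"
        ceph osd crush create-or-move -- "$id" "$osd_weight" $osd_location
    fi

As Sage points out, a guard like this trades away the hot-swap behaviour: a
disk moved to another host would keep its old crush location until someone
moves it by hand.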
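
Finally, a sketch of the third option, a custom hook along the lines Sage
describes (a marker file in the OSD data directory). The --cluster/--id
argument handling assumes the hook is invoked like the stock
ceph-crush-location script, and the 'ssd_marker' file name is made up for
this example; adapt both to your environment:

    #!/bin/sh
    # Hypothetical crush location hook: report root=ssd for an OSD whose
    # data directory contains a marker file, root=default otherwise.
    cluster=ceph
    id=
    while [ $# -ge 1 ]; do
        case "$1" in
            --cluster) cluster="$2"; shift ;;
            --id)      id="$2"; shift ;;
            *)         ;;   # ignore anything else (e.g. --type osd)
        esac
        shift
    done
    root=default
    if [ -e "/var/lib/ceph/osd/${cluster}-${id}/ssd_marker" ]; then
        root=ssd
    fi
    echo "host=$(hostname -s) root=$root"

Point the OSDs at it with the setting Sage quotes:

    osd crush location hook = /path/to/my/script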