On Fri, 22 Aug 2014, David Moreau Simard wrote:
> Hi Wang,
>
> Thanks, I'll try that for the time being. This still raises a few
> questions I'd like to discuss.
>
> I'm convinced we can agree that the CRUSH map is ultimately the
> authority on where the devices currently are.
> My understanding is that we are relying on another source for device
> location when (in this case) restarting OSDs: the ceph.conf file.
>
> 1) Does this imply that we probably shouldn't specify device locations
> directly in the crush map but in our ceph.conf file instead?
> 2) If what is in the crush map is different from what is configured in
> ceph.conf, how does Ceph decide which is the authority? Shouldn't it be
> the crush map? In this case, it appears to be the ceph.conf file.
>
> Just trying to wrap my head around the vision of how things should be
> managed.

Generally speaking, you have three options:

- 'osd crush update on start = false' and do it all manually, like
  you're used to.
- set 'crush location = a=b c=d e=f' in ceph.conf.  The expectation is
  that chef or puppet or whatever will fill this in with "host=foo
  rack=bar dc=asdf".
- customize ceph-crush-location to do something trickier (like multiple
  trees).

sage
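In ceph.conf terms, those three options might look roughly like the
sketch below (a sketch only: the hook path and location values are
illustrative, and you would normally pick just one of the three
approaches):

    [osd]
    # Option 1: never update the CRUSH location when an OSD starts;
    # manage locations by hand with 'ceph osd crush ...' commands.
    osd crush update on start = false

    # Option 2: declare the location explicitly ('crush location' in
    # Sage's wording above, 'osd crush location' in Wang's working
    # example below), typically filled in by chef/puppet; per-OSD
    # sections such as [osd.3] can override it.
    osd crush location = host=foo rack=bar root=default

    # Option 3: delegate the decision to a custom script.
    osd crush location hook = /path/to/my/script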
> --
> David Moreau Simard
>
> On 2014-08-21, 10:57 PM, "Wang, Zhiqiang" <zhiqiang.wang@xxxxxxxxx>
> wrote:
>
> >Hi David,
> >
> >Yes, I think adding a hook in your ceph.conf can solve your problem.
> >At least this is what I did, and it solves the problem.
> >
> >For example:
> >
> >[osd.3]
> >osd crush location = "host=osd02 root=default disktype=osd02_ssd"
> >
> >You need to add this for every osd.
> >
> >-----Original Message-----
> >From: David Moreau Simard [mailto:dmsimard@xxxxxxxx]
> >Sent: Friday, August 22, 2014 10:34 AM
> >To: Wang, Zhiqiang; Sage Weil
> >Cc: 'ceph-devel@xxxxxxxxxxxxxxx'
> >Subject: Re: A problem when restarting OSD
> >
> >I'm glad you mention this because I've also been running into the same
> >issue, and it took me a while to figure out too.
> >
> >Is this new behaviour? I don't remember running into this before...
> >
> >Sage does mention multiple trees, but I've had this happen with a
> >single root.
> >It is definitely not my expectation that restarting an OSD would move
> >things around in the crush map.
> >
> >I'm in the process of developing a crush map that looks like this
> >(note: unfinished and does not make much sense as is):
> >http://pastebin.com/6vBUQTCk
> >This results in this tree:
> ># id  weight  type name  up/down  reweight
> >-1   18      root default
> >-2   9           host osd02
> >-4   2               disktype osd02_ssd
> >3    1                   osd.3   up  1
> >9    1                   osd.9   up  1
> >-5   7               disktype osd02_spinning
> >8    1                   osd.8   up  1
> >17   1                   osd.17  up  1
> >5    1                   osd.5   up  1
> >11   1                   osd.11  up  1
> >1    1                   osd.1   up  1
> >13   1                   osd.13  up  1
> >15   1                   osd.15  up  1
> >-3   9           host osd01
> >-6   2               disktype osd01_ssd
> >2    1                   osd.2   up  1
> >7    1                   osd.7   up  1
> >-7   7               disktype osd01_spinning
> >0    1                   osd.0   up  1
> >4    1                   osd.4   up  1
> >12   1                   osd.12  up  1
> >6    1                   osd.6   up  1
> >14   1                   osd.14  up  1
> >10   1                   osd.10  up  1
> >16   1                   osd.16  up  1
> >
> >Simply restarting the OSDs on both hosts modifies the crush map:
> >http://pastebin.com/rP8Y8qcH
> >With the resulting tree:
> ># id  weight  type name  up/down  reweight
> >-1   18      root default
> >-2   9           host osd02
> >-4   0               disktype osd02_ssd
> >-5   0               disktype osd02_spinning
> >13   1               osd.13  up  1
> >3    1               osd.3   up  1
> >5    1               osd.5   up  1
> >1    1               osd.1   up  1
> >11   1               osd.11  up  1
> >15   1               osd.15  up  1
> >17   1               osd.17  up  1
> >8    1               osd.8   up  1
> >9    1               osd.9   up  1
> >-3   9           host osd01
> >-6   0               disktype osd01_ssd
> >-7   0               disktype osd01_spinning
> >0    1               osd.0   up  1
> >10   1               osd.10  up  1
> >12   1               osd.12  up  1
> >14   1               osd.14  up  1
> >16   1               osd.16  up  1
> >2    1               osd.2   up  1
> >4    1               osd.4   up  1
> >7    1               osd.7   up  1
> >6    1               osd.6   up  1
> >
> >Would a hook really be the solution I need?
> >--
> >David Moreau Simard
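Applying Wang's suggestion from earlier in the thread to the layout
above would mean one entry per OSD in ceph.conf, along these lines (a
sketch only; the disktype assignments are read off the first tree
above):

    [osd.3]
    osd crush location = "host=osd02 root=default disktype=osd02_ssd"
    [osd.9]
    osd crush location = "host=osd02 root=default disktype=osd02_ssd"
    [osd.8]
    osd crush location = "host=osd02 root=default disktype=osd02_spinning"
    # ...and so on for every OSD on both hosts, so that a restart puts
    # each OSD back under its disktype bucket instead of directly under
    # the host.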
> >
> >On 2014-08-21, 9:36 PM, "Wang, Zhiqiang" <zhiqiang.wang@xxxxxxxxx>
> >wrote:
> >
> >>Hi Sage,
> >>
> >>Yes, I understand that we can customize the crush location hook to
> >>let the OSD go to the right location. But is the ceph user aware of
> >>this if he/she has more than one root in the crush map? At least I
> >>didn't know this at the beginning. We need to either emphasize this
> >>or handle it for the user in some way.
> >>
> >>One question about the hot-swapping support for moving an OSD to
> >>another host: what if the journal is not located on the same disk as
> >>the OSD? Is the OSD still able to be available in the cluster?
> >>
> >>-----Original Message-----
> >>From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> >>Sent: Thursday, August 21, 2014 11:28 PM
> >>To: Wang, Zhiqiang
> >>Cc: 'ceph-devel@xxxxxxxxxxxxxxx'
> >>Subject: Re: A problem when restarting OSD
> >>
> >>On Thu, 21 Aug 2014, Wang, Zhiqiang wrote:
> >>> Hi all,
> >>>
> >>> I ran into a problem when restarting an OSD.
> >>>
> >>> Here is my OSD tree before restarting the OSD:
> >>>
> >>> # id  weight  type name  up/down  reweight
> >>> -6   8       root ssd
> >>> -4   4           host zqw-s1-ssd
> >>> 16   1               osd.16  up  1
> >>> 17   1               osd.17  up  1
> >>> 18   1               osd.18  up  1
> >>> 19   1               osd.19  up  1
> >>> -5   4           host zqw-s2-ssd
> >>> 20   1               osd.20  up  1
> >>> 21   1               osd.21  up  1
> >>> 22   1               osd.22  up  1
> >>> 23   1               osd.23  up  1
> >>> -1   14.56   root default
> >>> -2   7.28        host zqw-s1
> >>> 0    0.91            osd.0   up  1
> >>> 1    0.91            osd.1   up  1
> >>> 2    0.91            osd.2   up  1
> >>> 3    0.91            osd.3   up  1
> >>> 4    0.91            osd.4   up  1
> >>> 5    0.91            osd.5   up  1
> >>> 6    0.91            osd.6   up  1
> >>> 7    0.91            osd.7   up  1
> >>> -3   7.28        host zqw-s2
> >>> 8    0.91            osd.8   up  1
> >>> 9    0.91            osd.9   up  1
> >>> 10   0.91            osd.10  up  1
> >>> 11   0.91            osd.11  up  1
> >>> 12   0.91            osd.12  up  1
> >>> 13   0.91            osd.13  up  1
> >>> 14   0.91            osd.14  up  1
> >>> 15   0.91            osd.15  up  1
> >>>
> >>> After I restart one of the OSDs with an id from 16 to 23, say
> >>> osd.16, osd.16 goes to 'root default' and 'host zqw-s1', and the
> >>> ceph cluster begins to rebalance. This surely is not what I want.
> >>>
> >>> # id  weight  type name  up/down  reweight
> >>> -6   7       root ssd
> >>> -4   3           host zqw-s1-ssd
> >>> 17   1               osd.17  up  1
> >>> 18   1               osd.18  up  1
> >>> 19   1               osd.19  up  1
> >>> -5   4           host zqw-s2-ssd
> >>> 20   1               osd.20  up  1
> >>> 21   1               osd.21  up  1
> >>> 22   1               osd.22  up  1
> >>> 23   1               osd.23  up  1
> >>> -1   15.56   root default
> >>> -2   8.28        host zqw-s1
> >>> 0    0.91            osd.0   up  1
> >>> 1    0.91            osd.1   up  1
> >>> 2    0.91            osd.2   up  1
> >>> 3    0.91            osd.3   up  1
> >>> 4    0.91            osd.4   up  1
> >>> 5    0.91            osd.5   up  1
> >>> 6    0.91            osd.6   up  1
> >>> 7    0.91            osd.7   up  1
> >>> 16   1               osd.16  up  1
> >>> -3   7.28        host zqw-s2
> >>> 8    0.91            osd.8   up  1
> >>> 9    0.91            osd.9   up  1
> >>> 10   0.91            osd.10  up  1
> >>> 11   0.91            osd.11  up  1
> >>> 12   0.91            osd.12  up  1
> >>> 13   0.91            osd.13  up  1
> >>> 14   0.91            osd.14  up  1
> >>> 15   0.91            osd.15  up  1
> >>>
> >>> After digging into the problem, I found that it's because the ceph
> >>> init script changes the OSD's crush location. It uses the script
> >>> 'ceph-crush-location' to get the crush location for the restarting
> >>> OSD from the ceph.conf file. If there isn't such an entry in
> >>> ceph.conf, it uses the default one 'host=$(hostname -s)
> >>> root=default'. Since I don't have the crush location configuration
> >>> in my ceph.conf (I guess most people don't have this in their
> >>> ceph.conf), when I restart osd.16, it goes to 'root default' and
> >>> 'host zqw-s1'.
> >>>
> >>> Here is a fix for this: when the ceph init script uses 'ceph osd
> >>> crush create-or-move' to change the OSD's crush location, do a
> >>> check first; if this OSD already exists in the crush map, return
> >>> without making the location change. This change is at:
> >>> https://github.com/wonzhq/ceph/commit/efdfa23664caa531390d141bd1539878761412fe
> >>>
> >>> What do you think?
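For reference, the location update Wang describes boils down to the
init script running something like the following for each OSD it
starts, with the location coming from ceph-crush-location (the weight
shown here is made up; the script supplies its own):

    ceph osd crush create-or-move osd.16 1.0 host=zqw-s1 root=default

    # Since create-or-move also relocates an OSD that already exists in
    # the map, osd.16 is pulled out of 'root ssd' and placed under
    # 'host zqw-s1' in 'root default', which triggers the rebalance.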
> >>The goal of this behavior is to allow hot-swapping of devices. You
> >>can pull disks out of one host and put them in another, and the udev
> >>machinery will start up the daemon, update the crush location, and
> >>the disk and data will become available. It's not 'ideal' in the
> >>sense that there will be rebalancing, but it does make the data
> >>available to the cluster to preserve data safety.
> >>
> >>We haven't come up with a great scheme for managing multiple trees
> >>yet.
> >>The idea is that the ceph-crush-location hook can be customized to do
> >>whatever is necessary, for example by putting root=ssd if the device
> >>type appears to be an ssd (maybe look at the sysfs metadata, or put a
> >>marker file in the osd data directory?). You can point to your own
> >>hook for your environment with
> >>
> >>  osd crush location hook = /path/to/my/script
> >>
> >>sage
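To illustrate the marker-file idea above, a custom hook might look
something like the sketch below. This assumes the hook is invoked the
same way as the stock ceph-crush-location script (with --cluster, --id
and --type arguments) and must print the desired key=value location
pairs on stdout; the marker-file name, data path and bucket names are
made up:

    #!/bin/sh
    # Hypothetical crush location hook (sketch only).
    while [ $# -gt 0 ]; do
        case "$1" in
            --cluster) cluster="$2"; shift ;;
            --id)      id="$2"; shift ;;
            --type)    type="$2"; shift ;;
        esac
        shift
    done

    osd_data="/var/lib/ceph/${type:-osd}/${cluster:-ceph}-$id"

    # If a marker file was dropped into the OSD data directory, place
    # the OSD under the ssd tree; otherwise fall back to the usual
    # default location.
    if [ -e "$osd_data/ssd" ]; then
        echo "host=$(hostname -s)-ssd root=ssd"
    else
        echo "host=$(hostname -s) root=default"
    fi

Such a script would then be wired up with the 'osd crush location hook'
setting shown above.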