Can you confirm from the admin socket that all monitors are running the same version?
-Sam

On Wed, Jul 2, 2014 at 4:15 PM, Pierre BLONDEAU <pierre.blondeau at unicaen.fr> wrote:
> On 03/07/2014 00:55, Samuel Just wrote:
>
>> Ah,
>>
>> ~/logs $ for i in 20 23; do ../ceph/src/osdmaptool --export-crush /tmp/crush$i osd-$i*;
>> ../ceph/src/crushtool -d /tmp/crush$i > /tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
>> ../ceph/src/osdmaptool: osdmap file 'osd-20_osdmap.13258__0_4E62BB79__none'
>> ../ceph/src/osdmaptool: exported crush map to /tmp/crush20
>> ../ceph/src/osdmaptool: osdmap file 'osd-23_osdmap.13258__0_4E62BB79__none'
>> ../ceph/src/osdmaptool: exported crush map to /tmp/crush23
>> 6d5
>> < tunable chooseleaf_vary_r 1
>>
>> Looks like the chooseleaf_vary_r tunable somehow ended up divergent?
>>
>> Pierre: do you recall how and when that got set?
>
> I am not sure I understand, but if I remember correctly, after the upgrade to
> firefly the cluster was in the state "HEALTH_WARN crush map has legacy tunables"
> and I saw "feature set mismatch" in the logs.
>
> So, if I remember correctly, I ran "ceph osd crush tunables optimal" to fix the
> "crush map" warning, and I upgraded my client and server kernels to 3.16rc.
>
> Could that be it?
>
> Pierre
>
>> -Sam
>>
>> On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just <sam.just at inktank.com> wrote:
>>>
>>> Yeah, divergent osdmaps:
>>> 555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_4E62BB79__none
>>> 6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0_4E62BB79__none
>>>
>>> Joao: thoughts?
>>> -Sam
>>>
>>> On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU
>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>
>>>> The files are attached.
>>>>
>>>> When I upgraded:
>>>> ceph-deploy install --stable firefly servers...
>>>> then on each server: service ceph restart mon
>>>> then on each server: service ceph restart osd
>>>> then on each server: service ceph restart mds
>>>>
>>>> I upgraded from emperor to firefly. After repair, remap, replace, etc.,
>>>> I still had some PGs stuck in the peering state.
>>>>
>>>> I thought: why not try version 0.82, it might solve my problem (that was
>>>> my mistake). So I upgraded from firefly to 0.83 with:
>>>> ceph-deploy install --testing servers...
>>>>
>>>> Now, all programs are at version 0.82.
>>>> I have 3 mons, 36 OSDs and 3 MDSes.
>>>>
>>>> Pierre
>>>>
>>>> PS: I also find "inc\uosdmap.13258__0_469271DE__none" in each meta
>>>> directory.
>>>>
>>>> On 03/07/2014 00:10, Samuel Just wrote:
>>>>
>>>>> Also, what version did you upgrade from, and how did you upgrade?
>>>>> -Sam
>>>>>
>>>>> On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just <sam.just at inktank.com> wrote:
>>>>>>
>>>>>> Ok, in current/meta on osd 20 and osd 23, please attach all files
>>>>>> matching
>>>>>>
>>>>>> ^osdmap.13258.*
>>>>>>
>>>>>> There should be one such file on each osd. (It should look something
>>>>>> like osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory;
>>>>>> you'll want to use find.)
>>>>>>
>>>>>> What version of ceph is running on your mons? How many mons do you
>>>>>> have?
>>>>>> -Sam
>>>>>>
>>>>>> On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
>>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I did it; the log files are available here:
>>>>>>> https://blondeau.users.greyc.fr/cephlog/debug20/
>>>>>>>
>>>>>>> The OSD log files are really big, +/- 80M each.
>>>>>>>
>>>>>>> After starting osd.20, some other OSDs crashed. I went from 31 OSDs up
>>>>>>> to 16.
>>>>>>> I noticed that after this the number of down+peering PGs decreased
>>>>>>> from 367 to 248. Is that "normal"? Maybe it is temporary, while the
>>>>>>> cluster verifies all the PGs?
>>>>>>>
>>>>>>> Regards
>>>>>>> Pierre
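One way to carry out Sam's request above, locating the osdmap epoch 13258 files under each OSD's current/meta directory and checksumming them, is sketched below. The /var/lib/ceph/osd/ceph-N paths are assumptions for a default deployment; adjust them to the actual OSD data directories.

  # find the full osdmap for epoch 13258 in each OSD's meta directory and checksum it
  find /var/lib/ceph/osd/ceph-20/current/meta -name 'osdmap.13258*' -exec md5sum {} \;
  find /var/lib/ceph/osd/ceph-23/current/meta -name 'osdmap.13258*' -exec md5sum {} \;

OSDs in the same cluster normally hold identical copies of a given osdmap epoch, so matching checksums would rule this out; differing checksums, as in Sam's output above, confirm the divergence.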
>>>>>>> On 02/07/2014 19:16, Samuel Just wrote:
>>>>>>>
>>>>>>>> You should add
>>>>>>>>
>>>>>>>> debug osd = 20
>>>>>>>> debug filestore = 20
>>>>>>>> debug ms = 1
>>>>>>>>
>>>>>>>> to the [osd] section of the ceph.conf and restart the osds. I'd like
>>>>>>>> all three logs if possible.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> -Sam
>>>>>>>>
>>>>>>>> On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
>>>>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>>>>
>>>>>>>>> Yes, but how do I do that?
>>>>>>>>>
>>>>>>>>> With a command like this?
>>>>>>>>>
>>>>>>>>> ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20
>>>>>>>>> --debug-ms 1'
>>>>>>>>>
>>>>>>>>> Or by modifying /etc/ceph/ceph.conf? That file is almost empty
>>>>>>>>> because I use udev detection.
>>>>>>>>>
>>>>>>>>> Once I have made these changes, do you want all three log files or
>>>>>>>>> only osd.20's?
>>>>>>>>>
>>>>>>>>> Thank you so much for the help
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Pierre
>>>>>>>>>
>>>>>>>>> On 01/07/2014 23:51, Samuel Just wrote:
>>>>>>>>>
>>>>>>>>>> Can you reproduce with
>>>>>>>>>> debug osd = 20
>>>>>>>>>> debug filestore = 20
>>>>>>>>>> debug ms = 1
>>>>>>>>>> ?
>>>>>>>>>> -Sam
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
>>>>>>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I attach:
>>>>>>>>>>> - osd.20: one of the OSDs that I found makes other OSDs crash.
>>>>>>>>>>> - osd.23: one of the OSDs that crashes when I start osd.20.
>>>>>>>>>>> - mds: one of my MDSes.
>>>>>>>>>>>
>>>>>>>>>>> I truncated the log files because they are too big, but everything
>>>>>>>>>>> is here: https://blondeau.users.greyc.fr/cephlog/
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>>
>>>>>>>>>>> On 30/06/2014 17:35, Gregory Farnum wrote:
>>>>>>>>>>>
>>>>>>>>>>>> What's the backtrace from the crashing OSDs?
>>>>>>>>>>>>
>>>>>>>>>>>> Keep in mind that as a dev release, it's generally best not to
>>>>>>>>>>>> upgrade to unnamed versions like 0.82 (but it's probably too late
>>>>>>>>>>>> to go back now).
>>>>>>>>>>>
>>>>>>>>>>> I will remember it the next time ;)
>>>>>>>>>>>
>>>>>>>>>>>> -Greg
>>>>>>>>>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU
>>>>>>>>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> After the upgrade to firefly, I have some PGs stuck in the
>>>>>>>>>>>>> peering state. I saw the release of 0.82, so I tried upgrading
>>>>>>>>>>>>> to solve my problem.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My three MDSes crash, and some OSDs trigger a chain reaction
>>>>>>>>>>>>> that kills other OSDs.
>>>>>>>>>>>>> I think my MDSes will not start because their metadata are on
>>>>>>>>>>>>> the OSDs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have 36 OSDs on three servers and I identified 5 OSDs which
>>>>>>>>>>>>> make the others crash. If I do not start them, the cluster goes
>>>>>>>>>>>>> into a recovery state with 31 OSDs, but I have 378 PGs in the
>>>>>>>>>>>>> down+peering state.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What can I do? Would you like more information (OS, crash logs,
>>>>>>>>>>>>> etc.)?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards
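A rough sketch of commands that can quantify the state Pierre describes above (how many OSDs are up, and which PGs are stuck in down+peering); exact output varies by release:

  ceph osd stat                 # summary of how many OSDs are up/in
  ceph health detail            # lists the PGs behind the HEALTH_WARN/ERR
  ceph pg dump_stuck inactive   # PGs stuck in down/peering states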
>>>>>>>
>>>>>>> --
>>>>>>> ----------------------------------------------
>>>>>>> Pierre BLONDEAU
>>>>>>> Administrateur Systèmes & réseaux
>>>>>>> Université de Caen
>>>>>>> Laboratoire GREYC, Département d'informatique
>>>>>>>
>>>>>>> tel : 02 31 56 75 42
>>>>>>> bureau : Campus 2, Science 3, 406
>>>>>>> ----------------------------------------------
>>>>
>>>> --
>>>> ----------------------------------------------
>>>> Pierre BLONDEAU
>>>> Administrateur Systèmes & réseaux
>>>> Université de Caen
>>>> Laboratoire GREYC, Département d'informatique
>>>>
>>>> tel : 02 31 56 75 42
>>>> bureau : Campus 2, Science 3, 406
>>>> ----------------------------------------------
>
> --
> ----------------------------------------------
> Pierre BLONDEAU
> Administrateur Systèmes & réseaux
> Université de Caen
> Laboratoire GREYC, Département d'informatique
>
> tel : 02 31 56 75 42
> bureau : Campus 2, Science 3, 406
> ----------------------------------------------
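To answer the question at the top of the thread, each monitor's version can be read from its admin socket, and the active CRUSH tunables (including chooseleaf_vary_r) can be displayed. A sketch, assuming the default admin socket path and a monitor id of "a":

  ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok version   # version of this monitor daemon
  ceph osd crush show-tunables                                # current tunables, including chooseleaf_vary_r

Running the first command on every monitor host and comparing the reported versions shows directly whether all mons are on the same release.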