Yes, thanks.
-Sam

On Wed, Jul 2, 2014 at 4:21 PM, Pierre BLONDEAU
<pierre.blondeau at unicaen.fr> wrote:
> Like this?
>
> # ceph --admin-daemon /var/run/ceph/ceph-mon.william.asok version
> {"version":"0.82"}
> # ceph --admin-daemon /var/run/ceph/ceph-mon.jack.asok version
> {"version":"0.82"}
> # ceph --admin-daemon /var/run/ceph/ceph-mon.joe.asok version
> {"version":"0.82"}
>
> Pierre
>
> On 03/07/2014 01:17, Samuel Just wrote:
>
>> Can you confirm from the admin socket that all monitors are running
>> the same version?
>> -Sam
>>
>> On Wed, Jul 2, 2014 at 4:15 PM, Pierre BLONDEAU
>> <pierre.blondeau at unicaen.fr> wrote:
>>>
>>> On 03/07/2014 00:55, Samuel Just wrote:
>>>
>>>> Ah,
>>>>
>>>> ~/logs $ for i in 20 23; do ../ceph/src/osdmaptool --export-crush
>>>> /tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i >
>>>> /tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
>>>> ../ceph/src/osdmaptool: osdmap file 'osd-20_osdmap.13258__0_4E62BB79__none'
>>>> ../ceph/src/osdmaptool: exported crush map to /tmp/crush20
>>>> ../ceph/src/osdmaptool: osdmap file 'osd-23_osdmap.13258__0_4E62BB79__none'
>>>> ../ceph/src/osdmaptool: exported crush map to /tmp/crush23
>>>> 6d5
>>>> < tunable chooseleaf_vary_r 1
>>>>
>>>> Looks like the chooseleaf_vary_r tunable somehow ended up divergent?
>>>>
>>>> Pierre: do you recall how and when that got set?
>>>
>>> I am not sure I understand, but if I remember correctly, after the
>>> update to firefly I was in the state "HEALTH_WARN crush map has legacy
>>> tunables" and I saw "feature set mismatch" in the logs.
>>>
>>> So, if I remember correctly, I ran "ceph osd crush tunables optimal"
>>> to fix the crush map warning, and I updated my client and server
>>> kernels to 3.16rc.
>>>
>>> Could that be it?
>>>
>>> Pierre
>>>
>>>> -Sam
>>>>
>>>> On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just <sam.just at inktank.com> wrote:
>>>>>
>>>>> Yeah, divergent osdmaps:
>>>>> 555ed048e73024687fc8b106a570db4f osd-20_osdmap.13258__0_4E62BB79__none
>>>>> 6037911f31dc3c18b05499d24dcdbe5c osd-23_osdmap.13258__0_4E62BB79__none
>>>>>
>>>>> Joao: thoughts?
>>>>> -Sam
>>>>>
>>>>> On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU
>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>
>>>>>> Here are the files.
>>>>>>
>>>>>> When I upgraded, I did:
>>>>>> ceph-deploy install --stable firefly servers...
>>>>>> on each server: service ceph restart mon
>>>>>> on each server: service ceph restart osd
>>>>>> on each server: service ceph restart mds
>>>>>>
>>>>>> I upgraded from emperor to firefly. After repair, remap, replace,
>>>>>> etc., I had some PGs stuck in the peering state.
>>>>>>
>>>>>> I thought "why not try version 0.82, it could solve my problem"
>>>>>> (that was my mistake). So I upgraded from firefly to 0.83 with:
>>>>>> ceph-deploy install --testing servers...
>>>>>>
>>>>>> Now all daemons are at version 0.82.
>>>>>> I have 3 mons, 36 OSDs and 3 MDSes.
>>>>>>
>>>>>> Pierre
>>>>>>
>>>>>> PS: I also find "inc\uosdmap.13258__0_469271DE__none" in each meta
>>>>>> directory.
>>>>>>
>>>>>> On 03/07/2014 00:10, Samuel Just wrote:
>>>>>>
>>>>>>> Also, what version did you upgrade from, and how did you upgrade?
>>>>>>> -Sam
>>>>>>>
>>>>>>> On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just <sam.just at inktank.com> wrote:
>>>>>>>>
>>>>>>>> Ok, in current/meta on osd 20 and osd 23, please attach all files
>>>>>>>> matching
>>>>>>>>
>>>>>>>> ^osdmap.13258.*
>>>>>>>>
>>>>>>>> There should be one such file on each osd (it should look something
>>>>>>>> like osdmap.6__0_FD6E4C01__none, probably hashed into a
>>>>>>>> subdirectory; you'll want to use find).
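For reference, a find invocation along these lines should turn up those
map copies (a sketch only; the /var/lib/ceph/osd/ceph-NN paths are the
default OSD data locations and are an assumption, adjust if the OSDs
live elsewhere):

  # one osdmap.13258 copy is expected per OSD, hashed somewhere under current/meta
  find /var/lib/ceph/osd/ceph-20/current/meta -name 'osdmap.13258*'
  find /var/lib/ceph/osd/ceph-23/current/meta -name 'osdmap.13258*'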
>>>>>>>>
>>>>>>>> What version of ceph is running on your mons? How many mons do
>>>>>>>> you have?
>>>>>>>> -Sam
>>>>>>>>
>>>>>>>> On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
>>>>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I did it; the log files are available here:
>>>>>>>>> https://blondeau.users.greyc.fr/cephlog/debug20/
>>>>>>>>>
>>>>>>>>> The OSDs' log files are really big, +/- 80M.
>>>>>>>>>
>>>>>>>>> After starting osd.20, some other OSDs crashed. I went from 31
>>>>>>>>> OSDs up to 16.
>>>>>>>>> I noticed that after this the number of down+peering PGs decreased
>>>>>>>>> from 367 to 248. Is that "normal"? Maybe it's temporary, while the
>>>>>>>>> cluster verifies all the PGs?
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Pierre
>>>>>>>>>
>>>>>>>>> On 02/07/2014 19:16, Samuel Just wrote:
>>>>>>>>>
>>>>>>>>>> You should add
>>>>>>>>>>
>>>>>>>>>> debug osd = 20
>>>>>>>>>> debug filestore = 20
>>>>>>>>>> debug ms = 1
>>>>>>>>>>
>>>>>>>>>> to the [osd] section of the ceph.conf and restart the osds. I'd
>>>>>>>>>> like all three logs if possible.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> -Sam
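For reference, the corresponding ceph.conf fragment would look roughly
like the sketch below; the same values can also be injected at runtime
(the osd.* wildcard form is an assumption, fall back to one "ceph tell
osd.N injectargs ..." per OSD if it is not accepted):

  [osd]
      debug osd = 20
      debug filestore = 20
      debug ms = 1

  # at runtime, without editing ceph.conf or restarting:
  ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'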
>>>>>>>>>>
>>>>>>>>>> On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
>>>>>>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Yes, but how do I do that?
>>>>>>>>>>>
>>>>>>>>>>> With a command like this?
>>>>>>>>>>>
>>>>>>>>>>> ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20
>>>>>>>>>>> --debug-ms 1'
>>>>>>>>>>>
>>>>>>>>>>> Or by modifying /etc/ceph/ceph.conf? That file is almost empty
>>>>>>>>>>> because I use udev detection.
>>>>>>>>>>>
>>>>>>>>>>> Once I have made these changes, do you want all three log files
>>>>>>>>>>> or only osd.20's?
>>>>>>>>>>>
>>>>>>>>>>> Thank you so much for the help.
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>> Pierre
>>>>>>>>>>>
>>>>>>>>>>> On 01/07/2014 23:51, Samuel Just wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Can you reproduce with
>>>>>>>>>>>> debug osd = 20
>>>>>>>>>>>> debug filestore = 20
>>>>>>>>>>>> debug ms = 1
>>>>>>>>>>>> ?
>>>>>>>>>>>> -Sam
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
>>>>>>>>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I attach:
>>>>>>>>>>>>>   - osd.20, one of the OSDs that I found to make other OSDs crash.
>>>>>>>>>>>>>   - osd.23, one of the OSDs which crash when I start osd.20.
>>>>>>>>>>>>>   - mds, one of my MDSes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I truncated the log files because they are too big. Everything
>>>>>>>>>>>>> is here:
>>>>>>>>>>>>> https://blondeau.users.greyc.fr/cephlog/
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 30/06/2014 17:35, Gregory Farnum wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> What's the backtrace from the crashing OSDs?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Keep in mind that as a dev release, it's generally best not to
>>>>>>>>>>>>>> upgrade to unnamed versions like 0.82 (but it's probably too
>>>>>>>>>>>>>> late to go back now).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I will remember it the next time ;)
>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Greg
>>>>>>>>>>>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU
>>>>>>>>>>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> After the upgrade to firefly, I have some PGs in the peering
>>>>>>>>>>>>>>> state. I saw the release of 0.82, so I tried to upgrade to
>>>>>>>>>>>>>>> solve my problem.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> My three MDSes crash, and some OSDs trigger a chain reaction
>>>>>>>>>>>>>>> that kills other OSDs.
>>>>>>>>>>>>>>> I think my MDSes will not start because the metadata are on
>>>>>>>>>>>>>>> the OSDs.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have 36 OSDs on three servers, and I identified 5 OSDs that
>>>>>>>>>>>>>>> make the others crash. If I do not start them, the cluster
>>>>>>>>>>>>>>> goes into a recovering state with 31 OSDs, but I have 378 PGs
>>>>>>>>>>>>>>> in down+peering state.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What can I do? Would you like more information (OS, crash
>>>>>>>>>>>>>>> logs, etc.)?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards
>
> --
> ----------------------------------------------
> Pierre BLONDEAU
> Systems & Network Administrator
> Université de Caen
> Laboratoire GREYC, Computer Science Department
>
> tel: 02 31 56 75 42
> office: Campus 2, Science 3, 406
> ----------------------------------------------
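Given the divergent chooseleaf_vary_r tunable identified above, a quick
way to see what the cluster's current crush map advertises is to dump it
and its tunables; a sketch, assuming admin credentials on a client node
(show-tunables may not exist on very old releases; the getcrushmap and
crushtool pair works regardless):

  # summary of the tunables the monitors currently serve
  ceph osd crush show-tunables
  # or decompile the map and inspect the tunable lines directly
  ceph osd getcrushmap -o /tmp/crush.current
  crushtool -d /tmp/crush.current | grep tunable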