On 03/07/2014 13:49, Joao Eduardo Luis wrote:
> On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote:
>> On 03/07/2014 00:55, Samuel Just wrote:
>>> Ah,
>>>
>>> ~/logs ? for i in 20 23; do ../ceph/src/osdmaptool --export-crush
>>> /tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i >
>>> /tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
>>> ../ceph/src/osdmaptool: osdmap file 'osd-20_osdmap.13258__0_4E62BB79__none'
>>> ../ceph/src/osdmaptool: exported crush map to /tmp/crush20
>>> ../ceph/src/osdmaptool: osdmap file 'osd-23_osdmap.13258__0_4E62BB79__none'
>>> ../ceph/src/osdmaptool: exported crush map to /tmp/crush23
>>> 6d5
>>> < tunable chooseleaf_vary_r 1
>>>
>>> Looks like the chooseleaf_vary_r tunable somehow ended up divergent?
>
> The only thing that comes to mind that could cause this is if we changed
> the leader's in-memory map, proposed it, it failed, and only the leader
> got to write the map to disk somehow. This happened once on a totally
> different issue (although I can't pinpoint right now which).
>
> In such a scenario, the leader would serve the incorrect osdmap to
> whoever asked osdmaps from it, while the remaining quorum would serve the
> correct osdmaps to all the others. This could cause this divergence. Or
> it could be something else.
>
> Are there logs for the monitors for the timeframe this may have happened
> in?

Which timeframe exactly do you want? I have 7 days of logs, so I should
have information covering the upgrade from firefly to 0.82.

Which mon's logs do you want? All three?

Regards

> -Joao
>
>>>
>>> Pierre: do you recall how and when that got set?
>>
>> I am not sure I understand, but if I remember correctly, after the
>> update to firefly I was in the state "HEALTH_WARN crush map has legacy
>> tunables" and I saw "feature set mismatch" in the logs.
>>
>> So, if I remember correctly, I ran "ceph osd crush tunables optimal"
>> to fix the crush map warning, and I updated my client and server
>> kernels to 3.16rc.
>>
>> Could that be it?
>>
>> Pierre
>>
>>> -Sam
>>>
>>> On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just <sam.just at inktank.com> wrote:
>>>> Yeah, divergent osdmaps:
>>>> 555ed048e73024687fc8b106a570db4f osd-20_osdmap.13258__0_4E62BB79__none
>>>> 6037911f31dc3c18b05499d24dcdbe5c osd-23_osdmap.13258__0_4E62BB79__none
>>>>
>>>> Joao: thoughts?
>>>> -Sam
>>>>
>>>> On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU
>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>> The files are attached.
>>>>>
>>>>> When I upgraded:
>>>>> ceph-deploy install --stable firefly servers...
>>>>> then, on each server:
>>>>> service ceph restart mon
>>>>> service ceph restart osd
>>>>> service ceph restart mds
>>>>>
>>>>> I upgraded from emperor to firefly. After repair, remap, replace,
>>>>> etc., I still had some PGs stuck in the peering state.
>>>>>
>>>>> I thought, why not try version 0.82, it might solve my problem
>>>>> (that was my mistake). So I upgraded from firefly to 0.82 with:
>>>>> ceph-deploy install --testing servers...
>>>>> ...
>>>>>
>>>>> Now all daemons are at version 0.82.
>>>>> I have 3 mons, 36 OSDs and 3 MDSs.
>>>>>
>>>>> Pierre
>>>>>
>>>>> PS: I also find "inc\uosdmap.13258__0_469271DE__none" in each meta
>>>>> directory.
>>>>>
>>>>> On 03/07/2014 00:10, Samuel Just wrote:
>>>>>
>>>>>> Also, what version did you upgrade from, and how did you upgrade?
>>>>>> -Sam
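
For reference, here is a minimal sketch of reproducing Sam's comparison
directly on the OSD hosts. The store path /var/lib/ceph/osd/ceph-<id> and the
use of the packaged osdmaptool/crushtool binaries are assumptions (the thread
only shows source-tree binaries run against copied map files); epoch 13258 and
OSD ids 20 and 23 are taken from this thread.

  for i in 20 23; do
      # locate the stored full osdmap for epoch 13258 in the OSD's meta dir
      # (the object is hashed into a subdirectory, hence the find)
      map=$(find /var/lib/ceph/osd/ceph-$i/current/meta -name 'osdmap.13258__*' | head -n 1)
      # extract the crush map from that osdmap, then decompile it to text
      osdmaptool --export-crush /tmp/crush$i "$map"
      crushtool -d /tmp/crush$i -o /tmp/crush$i.txt
  done
  # any difference (e.g. "tunable chooseleaf_vary_r 1") means the two OSDs
  # hold divergent maps for the same epoch
  diff /tmp/crush20.txt /tmp/crush23.txt
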
>>>>>>
>>>>>> On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just <sam.just at inktank.com> wrote:
>>>>>>>
>>>>>>> Ok, in current/meta on osd 20 and osd 23, please attach all files
>>>>>>> matching
>>>>>>>
>>>>>>> ^osdmap.13258.*
>>>>>>>
>>>>>>> There should be one such file on each osd. (It should look something
>>>>>>> like osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory;
>>>>>>> you'll want to use find.)
>>>>>>>
>>>>>>> What version of ceph is running on your mons? How many mons do
>>>>>>> you have?
>>>>>>> -Sam
>>>>>>>
>>>>>>> On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
>>>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I did it; the log files are available here:
>>>>>>>> https://blondeau.users.greyc.fr/cephlog/debug20/
>>>>>>>>
>>>>>>>> The OSD log files are really big, around 80 MB each.
>>>>>>>>
>>>>>>>> After starting osd.20, some other OSDs crashed; I went from 31
>>>>>>>> OSDs up to 16.
>>>>>>>> I noticed that after this the number of down+peering PGs decreased
>>>>>>>> from 367 to 248. Is that "normal"? Maybe it is temporary, while the
>>>>>>>> cluster verifies all the PGs?
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Pierre
>>>>>>>>
>>>>>>>> On 02/07/2014 19:16, Samuel Just wrote:
>>>>>>>>
>>>>>>>>> You should add
>>>>>>>>>
>>>>>>>>> debug osd = 20
>>>>>>>>> debug filestore = 20
>>>>>>>>> debug ms = 1
>>>>>>>>>
>>>>>>>>> to the [osd] section of the ceph.conf and restart the osds. I'd
>>>>>>>>> like all three logs if possible.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> -Sam
>>>>>>>>>
>>>>>>>>> On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
>>>>>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>>>>>
>>>>>>>>>> Yes, but how do I do that?
>>>>>>>>>>
>>>>>>>>>> With a command like this?
>>>>>>>>>>
>>>>>>>>>> ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20
>>>>>>>>>> --debug-ms 1'
>>>>>>>>>>
>>>>>>>>>> Or by modifying /etc/ceph/ceph.conf? That file is really sparse
>>>>>>>>>> because I use udev detection.
>>>>>>>>>>
>>>>>>>>>> Once I have made these changes, do you want the three log files
>>>>>>>>>> or only osd.20's?
>>>>>>>>>>
>>>>>>>>>> Thank you so much for the help.
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> Pierre
>>>>>>>>>>
>>>>>>>>>> On 01/07/2014 23:51, Samuel Just wrote:
>>>>>>>>>>
>>>>>>>>>>> Can you reproduce with
>>>>>>>>>>> debug osd = 20
>>>>>>>>>>> debug filestore = 20
>>>>>>>>>>> debug ms = 1
>>>>>>>>>>> ?
>>>>>>>>>>> -Sam
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
>>>>>>>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I attach:
>>>>>>>>>>>> - osd.20, one of the OSDs that I identified as making other
>>>>>>>>>>>>   OSDs crash.
>>>>>>>>>>>> - osd.23, one of the OSDs which crashes when I start osd.20.
>>>>>>>>>>>> - mds, one of my MDSs.
>>>>>>>>>>>>
>>>>>>>>>>>> I cut the log files because they are too big. Everything is here:
>>>>>>>>>>>> https://blondeau.users.greyc.fr/cephlog/
>>>>>>>>>>>>
>>>>>>>>>>>> Regards
>>>>>>>>>>>>
>>>>>>>>>>>> On 30/06/2014 17:35, Gregory Farnum wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> What's the backtrace from the crashing OSDs?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Keep in mind that as a dev release, it's generally best not to
>>>>>>>>>>>>> upgrade to unnamed versions like 0.82 (but it's probably too
>>>>>>>>>>>>> late to go back now).
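
As a reference for the debug settings discussed above, both ways of raising
the OSD log levels are shown below. This is a sketch: osd ids 20 and 23 are
the ones under discussion in this thread, and the persistent variant assumes
you can edit /etc/ceph/ceph.conf on each host and restart the daemons.

  # Persistent: add to the [osd] section of /etc/ceph/ceph.conf on each host,
  # then restart the OSDs (this is what Sam asked for):
  #
  #   [osd]
  #       debug osd = 20
  #       debug filestore = 20
  #       debug ms = 1

  # At runtime, without a restart (not persistent across restarts), the
  # injectargs form Pierre mentions:
  for i in 20 23; do
      ceph tell osd.$i injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'
  done
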
>>>>>>>>>>>>
>>>>>>>>>>>> I will remember that next time ;)
>>>>>>>>>>>>
>>>>>>>>>>>>> -Greg
>>>>>>>>>>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU
>>>>>>>>>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> After the upgrade to firefly, I have some PGs stuck in the
>>>>>>>>>>>>>> peering state.
>>>>>>>>>>>>>> I saw the 0.82 release, so I tried upgrading to it to solve
>>>>>>>>>>>>>> my problem.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My three MDSs crash, and some OSDs trigger a chain reaction
>>>>>>>>>>>>>> that kills other OSDs.
>>>>>>>>>>>>>> I think my MDSs will not start because the metadata are on
>>>>>>>>>>>>>> the OSDs.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have 36 OSDs on three servers and I identified 5 OSDs which
>>>>>>>>>>>>>> make the others crash. If I do not start them, the cluster
>>>>>>>>>>>>>> goes into a recovery state with 31 OSDs, but I have 378 PGs
>>>>>>>>>>>>>> in the down+peering state.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What can I do? Do you need more information (OS, crash logs,
>>>>>>>>>>>>>> etc.)?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards
>>>>>>>>
>>>>>>>> --
>>>>>>>> ----------------------------------------------
>>>>>>>> Pierre BLONDEAU
>>>>>>>> Administrateur Systèmes & réseaux
>>>>>>>> Université de Caen
>>>>>>>> Laboratoire GREYC, Département d'informatique
>>>>>>>>
>>>>>>>> tel : 02 31 56 75 42
>>>>>>>> bureau : Campus 2, Science 3, 406
>>>>>>>> ----------------------------------------------
>>>>>
>>>>> --
>>>>> ----------------------------------------------
>>>>> Pierre BLONDEAU
>>>>> Administrateur Systèmes & réseaux
>>>>> Université de Caen
>>>>> Laboratoire GREYC, Département d'informatique
>>>>>
>>>>> tel : 02 31 56 75 42
>>>>> bureau : Campus 2, Science 3, 406
>>>>> ----------------------------------------------
>>
>

-- 
----------------------------------------------
Pierre BLONDEAU
Administrateur Systèmes & réseaux
Université de Caen
Laboratoire GREYC, Département d'informatique

tel : 02 31 56 75 42
bureau : Campus 2, Science 3, 406
----------------------------------------------