Ah,

~/logs $ for i in 20 23; do ../ceph/src/osdmaptool --export-crush /tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i > /tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
../ceph/src/osdmaptool: osdmap file 'osd-20_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush20
../ceph/src/osdmaptool: osdmap file 'osd-23_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush23
6d5
< tunable chooseleaf_vary_r 1

Looks like the chooseleaf_vary_r tunable somehow ended up divergent?

Pierre: do you recall how and when that got set?
-Sam
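For reference, chooseleaf_vary_r is normally set either by switching the cluster to the firefly tunable profile or by hand-editing a decompiled CRUSH map; the following is a minimal sketch of both paths, assuming a default firefly cluster (the /tmp paths are placeholders, not files from this thread):

  # Profile route: the firefly profile enables chooseleaf_vary_r=1 (among other tunables)
  ceph osd crush tunables firefly

  # Manual route: extract, decompile, edit, recompile and re-inject the CRUSH map
  ceph osd getcrushmap -o /tmp/crush.bin
  crushtool -d /tmp/crush.bin -o /tmp/crush.txt
  # add "tunable chooseleaf_vary_r 1" to the tunables section of /tmp/crush.txt, then:
  crushtool -c /tmp/crush.txt -o /tmp/crush.new
  ceph osd setcrushmap -i /tmp/crush.new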
On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just <sam.just at inktank.com> wrote:
> Yeah, divergent osdmaps:
> 555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_4E62BB79__none
> 6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0_4E62BB79__none
>
> Joao: thoughts?
> -Sam
>
> On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU
> <pierre.blondeau at unicaen.fr> wrote:
>> Here are the files.
>>
>> When I upgraded:
>>   ceph-deploy install --stable firefly servers...
>>   on each server: service ceph restart mon
>>   on each server: service ceph restart osd
>>   on each server: service ceph restart mds
>>
>> I upgraded from emperor to firefly. After repair, remap, replace, etc., I
>> have some PGs which stay in peering state.
>>
>> I thought: why not try version 0.82, it might solve my problem (that was
>> my mistake). So I upgraded from firefly to 0.82 with:
>>   ceph-deploy install --testing servers...
>>   ..
>>
>> Now, all programs are at version 0.82.
>> I have 3 mons, 36 OSDs and 3 MDSs.
>>
>> Pierre
>>
>> PS: I also find "inc\uosdmap.13258__0_469271DE__none" in each meta
>> directory.
>>
>> On 03/07/2014 00:10, Samuel Just wrote:
>>
>>> Also, what version did you upgrade from, and how did you upgrade?
>>> -Sam
>>>
>>> On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just <sam.just at inktank.com> wrote:
>>>>
>>>> Ok, in current/meta on osd 20 and osd 23, please attach all files
>>>> matching
>>>>
>>>> ^osdmap.13258.*
>>>>
>>>> There should be one such file on each osd. (It should look something
>>>> like osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory;
>>>> you'll want to use find.)
>>>>
>>>> What version of ceph is running on your mons? How many mons do you have?
>>>> -Sam
>>>>
>>>> On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I did it; the log files are available here:
>>>>> https://blondeau.users.greyc.fr/cephlog/debug20/
>>>>>
>>>>> The OSD log files are really big, +/- 80M.
>>>>>
>>>>> After starting osd.20, some other OSDs crash; I go from 31 OSDs down to
>>>>> 16. I notice that after this the number of down+peering PGs decreases
>>>>> from 367 to 248. Is that "normal"? Maybe it's temporary, while the
>>>>> cluster verifies all the PGs?
>>>>>
>>>>> Regards
>>>>> Pierre
>>>>>
>>>>> On 02/07/2014 19:16, Samuel Just wrote:
>>>>>
>>>>>> You should add
>>>>>>
>>>>>> debug osd = 20
>>>>>> debug filestore = 20
>>>>>> debug ms = 1
>>>>>>
>>>>>> to the [osd] section of the ceph.conf and restart the osds. I'd like
>>>>>> all three logs if possible.
>>>>>>
>>>>>> Thanks
>>>>>> -Sam
>>>>>>
>>>>>> On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
>>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>>
>>>>>>> Yes, but how do I do that?
>>>>>>>
>>>>>>> With a command like this?
>>>>>>>
>>>>>>> ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20
>>>>>>> --debug-ms 1'
>>>>>>>
>>>>>>> Or by modifying /etc/ceph/ceph.conf? This file is really sparse
>>>>>>> because I use udev detection.
>>>>>>>
>>>>>>> Once I have made these changes, do you want all three log files or
>>>>>>> only osd.20's?
>>>>>>>
>>>>>>> Thank you so much for the help
>>>>>>>
>>>>>>> Regards
>>>>>>> Pierre
>>>>>>>
>>>>>>> On 01/07/2014 23:51, Samuel Just wrote:
>>>>>>>
>>>>>>>> Can you reproduce with
>>>>>>>> debug osd = 20
>>>>>>>> debug filestore = 20
>>>>>>>> debug ms = 1
>>>>>>>> ?
>>>>>>>> -Sam
>>>>>>>>
>>>>>>>> On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
>>>>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I attach:
>>>>>>>>>   - osd.20 is one of the OSDs that I identified as making other
>>>>>>>>>     OSDs crash.
>>>>>>>>>   - osd.23 is one of the OSDs which crash when I start osd.20
>>>>>>>>>   - mds is one of my MDSs
>>>>>>>>>
>>>>>>>>> I cut the log files because they are too big. Everything is here:
>>>>>>>>> https://blondeau.users.greyc.fr/cephlog/
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>>
>>>>>>>>> On 30/06/2014 17:35, Gregory Farnum wrote:
>>>>>>>>>
>>>>>>>>>> What's the backtrace from the crashing OSDs?
>>>>>>>>>>
>>>>>>>>>> Keep in mind that as a dev release, it's generally best not to
>>>>>>>>>> upgrade to unnamed versions like 0.82 (but it's probably too late
>>>>>>>>>> to go back now).
>>>>>>>>>
>>>>>>>>> I will remember that next time ;)
>>>>>>>>>
>>>>>>>>>> -Greg
>>>>>>>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>>>>>>>
>>>>>>>>>> On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU
>>>>>>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> After the upgrade to firefly, I have some PGs in peering state.
>>>>>>>>>>> I saw the release of 0.82, so I tried to upgrade to solve my
>>>>>>>>>>> problem.
>>>>>>>>>>>
>>>>>>>>>>> My three MDSs crash, and some OSDs trigger a chain reaction that
>>>>>>>>>>> kills other OSDs.
>>>>>>>>>>> I think my MDSs will not start because their metadata are on the
>>>>>>>>>>> OSDs.
>>>>>>>>>>>
>>>>>>>>>>> I have 36 OSDs on three servers, and I identified 5 OSDs which
>>>>>>>>>>> make the others crash. If I do not start them, the cluster goes
>>>>>>>>>>> into a recovery state with 31 OSDs, but I have 378 PGs in
>>>>>>>>>>> down+peering state.
>>>>>>>>>>>
>>>>>>>>>>> What can I do? Would you like more information (OS, crash logs,
>>>>>>>>>>> etc.)?
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>
>>>>>
>>>>> --
>>>>> ----------------------------------------------
>>>>> Pierre BLONDEAU
>>>>> Systems & networks administrator
>>>>> Université de Caen
>>>>> Laboratoire GREYC, Département d'informatique
>>>>>
>>>>> tel : 02 31 56 75 42
>>>>> office : Campus 2, Science 3, 406
>>>>> ----------------------------------------------
>>
>>
>> --
>> ----------------------------------------------
>> Pierre BLONDEAU
>> Systems & networks administrator
>> Université de Caen
>> Laboratoire GREYC, Département d'informatique
>>
>> tel : 02 31 56 75 42
>> office : Campus 2, Science 3, 406
>> ----------------------------------------------
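For the osdmap files requested earlier in the thread (^osdmap.13258.*), a minimal sketch of locating and comparing them with find and md5sum; the /var/lib/ceph/osd/ceph-<id> data directory is an assumption based on a default deployment, so adjust it to the actual mount points:

  # The osdmap objects live under current/meta, usually nested in hashed subdirectories
  for id in 20 23; do
    find /var/lib/ceph/osd/ceph-$id/current/meta -name 'osdmap.13258*' -exec md5sum {} \;
  done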