Some OSDs and MDSes crash

Yeah, divergent osdmaps:
555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_4E62BB79__none
6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0_4E62BB79__none
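
(For reference, a sketch of how these checksums could be reproduced on each
node, assuming the default OSD data paths; the osd-NN_ prefix on the names
above looks like it was added when the files were attached:)

 find /var/lib/ceph/osd/ceph-20/current/meta -name 'osdmap.13258*' -exec md5sum {} \;
 find /var/lib/ceph/osd/ceph-23/current/meta -name 'osdmap.13258*' -exec md5sum {} \;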

Joao: thoughts?
-Sam

On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU
<pierre.blondeau at unicaen.fr> wrote:
> The files are attached.
>
> When I upgraded:
>  ceph-deploy install --stable firefly servers...
>  then, on each server: service ceph restart mon
>  then, on each server: service ceph restart osd
>  then, on each server: service ceph restart mds
>
> I upgraded from emperor to firefly. After repair, remap, replace, etc., I
> had some PGs stuck in a peering state.
>
> I thought: why not try version 0.82, it could solve my problem (that was
> my mistake). So I upgraded from firefly to 0.82 with:
>  ceph-deploy install --testing servers...
>  ..
>
> Now all the daemons are at version 0.82.
> I have 3 mons, 36 OSDs and 3 MDSes.
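>
> (To confirm on a node, a sketch using the standard CLI:)
>  ceph --version
>  ceph tell osd.20 version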
>
> Pierre
>
> PS: I also find "inc\uosdmap.13258__0_469271DE__none" in each meta
> directory.
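>
> (They can be located the same way, a sketch assuming default paths; the
> "\u" appears to be the filestore's escaping of "_":)
>
>  find /var/lib/ceph/osd/ceph-*/current/meta -name 'inc*osdmap.13258*'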
>
> On 03/07/2014 00:10, Samuel Just wrote:
>
>> Also, what version did you upgrade from, and how did you upgrade?
>> -Sam
>>
>> On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just <sam.just at inktank.com> wrote:
>>>
>>> Ok, in current/meta on osd 20 and osd 23, please attach all files
>>> matching
>>>
>>> ^osdmap.13258.*
>>>
>>> There should be one such file on each osd. (should look something like
>>> osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory,
>>> you'll want to use find).
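>>>
>>> For example (a sketch, assuming the default OSD data path):
>>>
>>>  find /var/lib/ceph/osd/ceph-20/current/meta -name 'osdmap.13258*'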
>>>
>>> What version of ceph is running on your mons?  How many mons do you have?
>>> -Sam
>>>
>>> On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I did it; the log files are available here:
>>>> https://blondeau.users.greyc.fr/cephlog/debug20/
>>>>
>>>> The OSD log files are really big, around 80 MB each.
>>>>
>>>> After starting osd.20, some other OSDs crashed; the number of OSDs up
>>>> went from 31 down to 16.
>>>> I noticed that after this, the number of down+peering PGs decreased
>>>> from 367 to 248. Is that "normal"? Maybe it's temporary, just the time
>>>> the cluster needs to verify all the PGs?
>>>>
>>>> Regards
>>>> Pierre
>>>>
>>>> On 02/07/2014 19:16, Samuel Just wrote:
>>>>
>>>>> You should add
>>>>>
>>>>> debug osd = 20
>>>>> debug filestore = 20
>>>>> debug ms = 1
>>>>>
>>>>> to the [osd] section of the ceph.conf and restart the osds.  I'd like
>>>>> all three logs if possible.
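>>>>>
>>>>> The section would end up looking like this (a sketch, assuming a
>>>>> standard /etc/ceph/ceph.conf), followed by a restart on each server:
>>>>>
>>>>> [osd]
>>>>>   debug osd = 20
>>>>>   debug filestore = 20
>>>>>   debug ms = 1
>>>>>
>>>>>  service ceph restart osd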
>>>>>
>>>>> Thanks
>>>>> -Sam
>>>>>
>>>>> On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>
>>>>>>
>>>>>> Yes, but how do I do that?
>>>>>>
>>>>>> With a command like this?
>>>>>>
>>>>>> ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20
>>>>>> --debug-ms 1'
>>>>>>
>>>>>> Or by modifying /etc/ceph/ceph.conf? That file is almost empty
>>>>>> because I rely on udev detection.
>>>>>>
>>>>>> Once I have made these changes, do you want all three log files or
>>>>>> only osd.20's?
>>>>>>
>>>>>> Thank you so much for the help
>>>>>>
>>>>>> Regards
>>>>>> Pierre
>>>>>>
>>>>>> On 01/07/2014 23:51, Samuel Just wrote:
>>>>>>
>>>>>>> Can you reproduce with
>>>>>>> debug osd = 20
>>>>>>> debug filestore = 20
>>>>>>> debug ms = 1
>>>>>>> ?
>>>>>>> -Sam
>>>>>>>
>>>>>>> On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
>>>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I attach:
>>>>>>>>     - osd.20, one of the OSDs that I found makes other OSDs crash.
>>>>>>>>     - osd.23, one of the OSDs which crashes when I start osd.20.
>>>>>>>>     - mds, one of my MDSes.
>>>>>>>>
>>>>>>>> I truncated the log files because they are too big, but everything
>>>>>>>> is here:
>>>>>>>> https://blondeau.users.greyc.fr/cephlog/
>>>>>>>>
>>>>>>>> Regards
>>>>>>>>
>>>>>>>> On 30/06/2014 17:35, Gregory Farnum wrote:
>>>>>>>>
>>>>>>>>> What's the backtrace from the crashing OSDs?
>>>>>>>>>
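>>>>>>>>> (A sketch for pulling it out of the logs, assuming the default log
>>>>>>>>> path; the 'FAILED assert' marker is typical for Ceph crashes but
>>>>>>>>> not guaranteed:)
>>>>>>>>>
>>>>>>>>>  grep -B 2 -A 30 'FAILED assert' /var/log/ceph/ceph-osd.20.log
>>>>>>>>>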
>>>>>>>>> Keep in mind that as a dev release, it's generally best not to
>>>>>>>>> upgrade to unnamed versions like 0.82 (but it's probably too late
>>>>>>>>> to go back now).
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I will remember that for next time ;)
>>>>>>>>
>>>>>>>>> -Greg
>>>>>>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>>>>>>
>>>>>>>>> On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU
>>>>>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> After the upgrade to firefly, I had some PGs stuck in a peering
>>>>>>>>>> state. I saw that 0.82 was out, so I tried upgrading to solve my
>>>>>>>>>> problem.
>>>>>>>>>>
>>>>>>>>>> My three MDSes crash, and some OSDs trigger a chain reaction that
>>>>>>>>>> kills other OSDs.
>>>>>>>>>> I think my MDSes will not start because their metadata are on the
>>>>>>>>>> OSDs.
>>>>>>>>>>
>>>>>>>>>> I have 36 OSDs on three servers, and I identified 5 OSDs which
>>>>>>>>>> make the others crash. If I do not start those, the cluster goes
>>>>>>>>>> into a recovering state with 31 OSDs, but I have 378 PGs in
>>>>>>>>>> down+peering state.
>>>>>>>>>>
>>>>>>>>>> What can I do? Would you like more information (OS, crash logs,
>>>>>>>>>> etc.)?
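>>>>>>>>>>
>>>>>>>>>> (For instance, a sketch of the basic state dumps I could provide,
>>>>>>>>>> using the standard CLI:)
>>>>>>>>>>
>>>>>>>>>>  ceph -s
>>>>>>>>>>  ceph health detail
>>>>>>>>>>  ceph osd tree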
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>
>>>>
>>>>
>>>> --
>>>> ----------------------------------------------
>>>> Pierre BLONDEAU
>>>> Systems & Networks Administrator
>>>> Université de Caen
>>>> Laboratoire GREYC, Département d'informatique
>>>>
>>>> tel     : 02 31 56 75 42
>>>> office  : Campus 2, Science 3, 406
>>>> ----------------------------------------------
>>>>
>
>
> --
> ----------------------------------------------
> Pierre BLONDEAU
> Systems & Networks Administrator
> Université de Caen
> Laboratoire GREYC, Département d'informatique
>
> tel     : 02 31 56 75 42
> office  : Campus 2, Science 3, 406
> ----------------------------------------------

