Some OSD and MDS crash

Joao: this looks like divergent osdmaps, osd 20 and osd 23 have
differing ideas of the acting set for pg 2.11.  Did we add hashes to
the incremental maps?  What would you want to know from the mons?
-Sam
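[Editor's note: once the epoch-13258 map files have been copied off each OSD, a quick sanity check for divergence is to compare them byte-for-byte. The filenames below are illustrative, not from the thread; `ceph osd getmap` is the standard way to fetch the monitors' authoritative copy for the same epoch.]

```shell
# Hypothetical filenames: the osdmap.13258__* file copied from osd.20's
# and osd.23's current/meta directories, renamed for clarity.
# Optionally fetch the mon's copy of the same epoch for a three-way check:
#   ceph osd getmap 13258 -o osdmap.13258.mon
md5sum osdmap.13258.osd20 osdmap.13258.osd23

# Differing checksums mean the two OSDs hold different maps for the same
# epoch; cmp reports the offset of the first differing byte.
cmp osdmap.13258.osd20 osdmap.13258.osd23
```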

On Wed, Jul 2, 2014 at 3:10 PM, Samuel Just <sam.just at inktank.com> wrote:
> Also, what version did you upgrade from, and how did you upgrade?
> -Sam
>
> On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just <sam.just at inktank.com> wrote:
>> Ok, in current/meta on osd 20 and osd 23, please attach all files matching
>>
>> ^osdmap.13258.*
>>
>> There should be one such file on each osd. (should look something like
>> osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory,
>> you'll want to use find).
>>
>> What version of ceph is running on your mons?  How many mons do you have?
>> -Sam
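[Editor's note: a find invocation along these lines should turn the files up. The data path assumes the default /var/lib/ceph layout, which may differ on this cluster; adjust the OSD ids as needed.]

```shell
# Search each OSD's current/meta dir for the epoch-13258 full map.
# The object file is usually hashed into a DIR_* subdirectory
# (e.g. osdmap.13258__0_FD6E4C01__none), hence find rather than ls.
for osd in 20 23; do
  find "/var/lib/ceph/osd/ceph-$osd/current/meta" \
       -name 'osdmap.13258*' 2>/dev/null
done
```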
>>
>> On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
>> <pierre.blondeau at unicaen.fr> wrote:
>>> Hi,
>>>
>>> I did it; the log files are available here:
>>> https://blondeau.users.greyc.fr/cephlog/debug20/
>>>
>>> The OSD log files are really big, around 80 MB each.
>>>
>>> After starting osd.20, some other OSDs crashed; I went from 31 OSDs up to 16.
>>> I noticed that afterwards the number of down+peering PGs decreased from 367
>>> to 248. Is that "normal"? Maybe it's temporary, while the cluster verifies
>>> all the PGs?
>>>
>>> Regards
>>> Pierre
>>>
On 02/07/2014 19:16, Samuel Just wrote:
>>>
>>>> You should add
>>>>
>>>> debug osd = 20
>>>> debug filestore = 20
>>>> debug ms = 1
>>>>
>>>> to the [osd] section of the ceph.conf and restart the osds.  I'd like
>>>> all three logs if possible.
>>>>
>>>> Thanks
>>>> -Sam
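[Editor's note: in ceph.conf terms, the change Sam describes would look like the fragment below; merge the three debug lines into any existing [osd] section rather than adding a second section header.]

```ini
[osd]
    debug osd = 20
    debug filestore = 20
    debug ms = 1
```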
>>>>
>>>> On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>
>>>>> Yes, but how do I do that?
>>>>>
>>>>> With a command like this?
>>>>>
>>>>> ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20
>>>>> --debug-ms 1'
>>>>>
>>>>> Or by modifying /etc/ceph/ceph.conf? That file is really sparse because
>>>>> I use udev detection.
>>>>>
>>>>> Once I have made these changes, do you want all three log files or only
>>>>> osd.20's?
>>>>>
>>>>> Thank you so much for the help
>>>>>
>>>>> Regards
>>>>> Pierre
>>>>>
On 01/07/2014 23:51, Samuel Just wrote:
>>>>>
>>>>>> Can you reproduce with
>>>>>> debug osd = 20
>>>>>> debug filestore = 20
>>>>>> debug ms = 1
>>>>>> ?
>>>>>> -Sam
>>>>>>
>>>>>> On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
>>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I attach:
>>>>>>>    - osd.20 is one of the OSDs that I identified as making other OSDs crash.
>>>>>>>    - osd.23 is one of the OSDs that crashes when I start osd.20.
>>>>>>>    - mds is one of my MDSes.
>>>>>>>
>>>>>>> I truncated the log files because they are too big. Everything is here:
>>>>>>> https://blondeau.users.greyc.fr/cephlog/
>>>>>>>
>>>>>>> Regards
>>>>>>>
On 30/06/2014 17:35, Gregory Farnum wrote:
>>>>>>>
>>>>>>>> What's the backtrace from the crashing OSDs?
>>>>>>>>
>>>>>>>> Keep in mind that as a dev release, it's generally best not to upgrade
>>>>>>>> to unnamed versions like 0.82 (but it's probably too late to go back
>>>>>>>> now).
>>>>>>>
>>>>>>>
>>>>>>> I will remember that for next time ;)
>>>>>>>
>>>>>>>> -Greg
>>>>>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>>>>>
>>>>>>>> On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU
>>>>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> After the upgrade to firefly, I have some PGs stuck in the peering state.
>>>>>>>>> I saw that 0.82 was out, so I tried upgrading to solve my problem.
>>>>>>>>>
>>>>>>>>> My three MDSes crash, and some OSDs trigger a chain reaction that kills
>>>>>>>>> other OSDs.
>>>>>>>>> I think my MDSes will not start because their metadata is on the OSDs.
>>>>>>>>>
>>>>>>>>> I have 36 OSDs across three servers, and I identified 5 OSDs that make
>>>>>>>>> the others crash. If I don't start those, the cluster goes into a
>>>>>>>>> recovering state with 31 OSDs, but I have 378 PGs in down+peering
>>>>>>>>> state.
>>>>>>>>>
>>>>>>>>> What can I do? Would you like more information (OS, crash logs, etc.)?
>>>>>>>>>
>>>>>>>>> Regards
>>>
>>>
>>> --
>>> ----------------------------------------------
>>> Pierre BLONDEAU
>>> Systems & network administrator
>>> Université de Caen
>>> Laboratoire GREYC, Département d'informatique
>>>
>>> tel     : 02 31 56 75 42
>>> office  : Campus 2, Science 3, 406
>>> ----------------------------------------------
>>>

