Some OSD and MDS crash

pierre.blondeau@xxxxxxxxxx (Pierre BLONDEAU) · Thu, 03 Jul 2014 00:39:39 +0200

The files are attached.

When I upgraded, I ran:
  ceph-deploy install --stable firefly servers...
  on each server: service ceph restart mon
  on each server: service ceph restart osd
  on each server: service ceph restart mds
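
(A dry-run sketch of the restart order listed above: all mons first, then all OSDs, then all MDSs. It only prints the commands and executes nothing. The use of ssh and the host names passed in are assumptions for illustration, not part of the original procedure.)

```shell
#!/bin/sh
# Print the restart commands in the order used above: mon, then osd, then mds,
# finishing one daemon type on every server before moving to the next type.
# Assumption: servers are reachable over ssh; host names are placeholders.
print_restart_plan() {
    for daemon in mon osd mds; do
        for host in "$@"; do
            echo "ssh $host service ceph restart $daemon"
        done
    done
}
```

For example, `print_restart_plan server1 server2 server3` (hypothetical host names) prints nine lines that can be reviewed before being run by hand.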

I upgraded from emperor to firefly. After repair, remap, replace, etc.,
some PGs went into the peering state.

I thought, why not try version 0.82; it could solve my problem (that was
my mistake). So I upgraded from firefly to 0.82 with:
  ceph-deploy install --testing servers...
  ..

Now, all programs are at version 0.82.
I have 3 mons, 36 OSDs and 3 MDSs.

Pierre

PS: I also found "inc\uosdmap.13258__0_469271DE__none" in each meta
directory.
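
(A minimal sketch of how the osdmap files for one epoch could be located with find, as suggested in the quoted message below. The function name is hypothetical, and the OSD data path shown in the usage note is the usual default, which is an assumption about this cluster. The pattern also matches the "inc\uosdmap" incremental file mentioned in the PS.)

```shell
#!/bin/sh
# List every file for a given osdmap epoch under an OSD's current/meta
# directory, including files hashed into subdirectories.
# $1: path to the meta directory, $2: osdmap epoch number.
find_osdmap_files() {
    meta_dir=$1
    epoch=$2
    # -name tests the file name only, so subdirectory depth does not matter.
    # The trailing "__" keeps epoch 13258 from also matching 132580 and so on.
    find "$meta_dir" -type f -name "*osdmap.${epoch}__*" | sort
}
```

For osd.20 this might be called as `find_osdmap_files /var/lib/ceph/osd/ceph-20/current/meta 13258`, assuming the default data directory layout.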

On 03/07/2014 00:10, Samuel Just wrote:
> Also, what version did you upgrade from, and how did you upgrade?
> -Sam
>
> On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just <sam.just at inktank.com> wrote:
>> Ok, in current/meta on osd 20 and osd 23, please attach all files matching
>>
>> ^osdmap.13258.*
>>
>> There should be one such file on each osd. (should look something like
>> osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory,
>> you'll want to use find).
>>
>> What version of ceph is running on your mons?  How many mons do you have?
>> -Sam
>>
>> On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
>> <pierre.blondeau at unicaen.fr> wrote:
>>> Hi,
>>>
>>> I did it; the log files are available here:
>>> https://blondeau.users.greyc.fr/cephlog/debug20/
>>>
>>> The OSD files are really big, around 80M each.
>>>
>>> After starting osd.20, some other OSDs crashed. I went from 31 OSDs up to 16.
>>> I noticed that after this the number of down+peering PGs decreased from 367 to
>>> 248. Is that "normal"? Maybe it is temporary, while the cluster
>>> verifies all the PGs?
>>>
>>> Regards
>>> Pierre
>>>
>>> On 02/07/2014 19:16, Samuel Just wrote:
>>>
>>>> You should add
>>>>
>>>> debug osd = 20
>>>> debug filestore = 20
>>>> debug ms = 1
>>>>
>>>> to the [osd] section of the ceph.conf and restart the osds.  I'd like
>>>> all three logs if possible.
>>>>
>>>> Thanks
>>>> -Sam
>>>>
>>>> On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>
>>>>> Yes, but how do I do that?
>>>>>
>>>>> With a command like this?
>>>>>
>>>>> ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'
>>>>>
>>>>> Or by modifying /etc/ceph/ceph.conf? That file is nearly empty because I
>>>>> use udev detection.
>>>>>
>>>>> Once I have made these changes, do you want all three log files or only
>>>>> osd.20's?
>>>>>
>>>>> Thank you so much for the help
>>>>>
>>>>> Regards
>>>>> Pierre
>>>>>
>>>>> On 01/07/2014 23:51, Samuel Just wrote:
>>>>>
>>>>>> Can you reproduce with
>>>>>> debug osd = 20
>>>>>> debug filestore = 20
>>>>>> debug ms = 1
>>>>>> ?
>>>>>> -Sam
>>>>>>
>>>>>> On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
>>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I attach:
>>>>>>>     - osd.20, one of the OSDs that I found makes other OSDs crash.
>>>>>>>     - osd.23, one of the OSDs that crash when I start osd.20.
>>>>>>>     - mds, one of my MDSs.
>>>>>>>
>>>>>>> I cut the log files because they are too big. Everything is here:
>>>>>>> https://blondeau.users.greyc.fr/cephlog/
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> On 30/06/2014 17:35, Gregory Farnum wrote:
>>>>>>>
>>>>>>>> What's the backtrace from the crashing OSDs?
>>>>>>>>
>>>>>>>> Keep in mind that as a dev release, it's generally best not to upgrade
>>>>>>>> to unnamed versions like 0.82 (but it's probably too late to go back
>>>>>>>> now).
>>>>>>>
>>>>>>>
>>>>>>> I will remember it the next time ;)
>>>>>>>
>>>>>>>> -Greg
>>>>>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>>>>>
>>>>>>>> On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU
>>>>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> After the upgrade to firefly, I have some PGs in peering state.
>>>>>>>>> I saw the 0.82 release, so I tried upgrading to solve my problem.
>>>>>>>>>
>>>>>>>>> My three MDSs crash, and some OSDs trigger a chain reaction that kills
>>>>>>>>> other OSDs.
>>>>>>>>> I think my MDSs will not start because the metadata is on the OSDs.
>>>>>>>>>
>>>>>>>>> I have 36 OSDs on three servers and I identified 5 OSDs that make
>>>>>>>>> others crash. If I do not start them, the cluster goes into a recovery
>>>>>>>>> state with 31 OSDs, but I have 378 PGs in down+peering state.
>>>>>>>>>
>>>>>>>>> What can I do? Would you like more information (OS, crash log, etc.)?
>>>>>>>>>
>>>>>>>>> Regards
>>>
>>>
>>> --
>>> ----------------------------------------------
>>> Pierre BLONDEAU
>>> Administrateur Systèmes & réseaux
>>> Université de Caen
>>> Laboratoire GREYC, Département d'informatique
>>>
>>> tel     : 02 31 56 75 42
>>> bureau  : Campus 2, Science 3, 406
>>> ----------------------------------------------
>>>

-- 
----------------------------------------------
Pierre BLONDEAU
Administrateur Systèmes & réseaux
Université de Caen
Laboratoire GREYC, Département d'informatique

tel	: 02 31 56 75 42
bureau	: Campus 2, Science 3, 406
----------------------------------------------
-------------- next part --------------
A non-text attachment was scrubbed...
Name: osd-20_osdmap.13258__0_4E62BB79__none
Type: application/octet-stream
Size: 25423 bytes
Desc: not available
URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20140703/6187a83b/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: osd-23_osdmap.13258__0_4E62BB79__none
Type: application/octet-stream
Size: 25423 bytes
Desc: not available
URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20140703/6187a83b/attachment-0001.obj>