On 03/07/2014 00:55, Samuel Just wrote:
> Ah,
>
> ~/logs $ for i in 20 23; do ../ceph/src/osdmaptool --export-crush /tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i > /tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
> ../ceph/src/osdmaptool: osdmap file 'osd-20_osdmap.13258__0_4E62BB79__none'
> ../ceph/src/osdmaptool: exported crush map to /tmp/crush20
> ../ceph/src/osdmaptool: osdmap file 'osd-23_osdmap.13258__0_4E62BB79__none'
> ../ceph/src/osdmaptool: exported crush map to /tmp/crush23
> 6d5
> < tunable chooseleaf_vary_r 1
>
> Looks like the chooseleaf_vary_r tunable somehow ended up divergent?
>
> Pierre: do you recall how and when that got set?

I am not sure I fully understand, but if I remember correctly, after the upgrade to firefly the cluster was in the state:

HEALTH_WARN crush map has legacy tunables

and I saw "feature set mismatch" messages in the logs. So, again if I remember correctly, I ran:

ceph osd crush tunables optimal

to get rid of the crush map warning, and I upgraded my client and server kernels to 3.16-rc. Could that be the cause?
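To double-check on my side, I will compare what the monitors currently report with the decompiled crush map; something like this should do it (commands written from memory and not yet verified on this cluster, the /tmp paths are just examples):

  # show the tunables the cluster currently reports
  ceph osd crush show-tunables

  # decompile the current crush map and look for the tunable lines
  ceph osd getcrushmap -o /tmp/crush.current
  crushtool -d /tmp/crush.current -o /tmp/crush.current.txt
  grep tunable /tmp/crush.current.txt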
Pierre

> -Sam
>
> On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just <sam.just at inktank.com> wrote:
>> Yeah, divergent osdmaps:
>> 555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_4E62BB79__none
>> 6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0_4E62BB79__none
>>
>> Joao: thoughts?
>> -Sam
>>
>> On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU
>> <pierre.blondeau at unicaen.fr> wrote:
>>> The files are attached.
>>>
>>> When I upgraded, I did:
>>>   ceph-deploy install --stable firefly servers...
>>>   on each server: service ceph restart mon
>>>   on each server: service ceph restart osd
>>>   on each server: service ceph restart mds
>>>
>>> I upgraded from emperor to firefly. After repair, remap, replace, etc.,
>>> I still had some PGs stuck in the peering state.
>>>
>>> I thought I would try version 0.82, hoping it might solve my problem
>>> (that was my mistake). So I upgraded from firefly to 0.82 with:
>>>   ceph-deploy install --testing servers...
>>>
>>> Now all daemons are at version 0.82.
>>> I have 3 mons, 36 OSDs and 3 MDSs.
>>>
>>> Pierre
>>>
>>> PS: I also find "inc\uosdmap.13258__0_469271DE__none" in each meta
>>> directory.
>>>
>>> On 03/07/2014 00:10, Samuel Just wrote:
>>>
>>>> Also, what version did you upgrade from, and how did you upgrade?
>>>> -Sam
>>>>
>>>> On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just <sam.just at inktank.com> wrote:
>>>>>
>>>>> Ok, in current/meta on osd 20 and osd 23, please attach all files
>>>>> matching
>>>>>
>>>>> ^osdmap.13258.*
>>>>>
>>>>> There should be one such file on each osd. (should look something like
>>>>> osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory,
>>>>> you'll want to use find).
>>>>>
>>>>> What version of ceph is running on your mons? How many mons do you have?
>>>>> -Sam
>>>>>
>>>>> On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I did it; the log files are available here:
>>>>>> https://blondeau.users.greyc.fr/cephlog/debug20/
>>>>>>
>>>>>> The OSD log files are really big, around 80 MB each.
>>>>>>
>>>>>> After starting osd.20, some other OSDs crashed: the number of OSDs up
>>>>>> went from 31 down to 16. I noticed that after this the number of
>>>>>> down+peering PGs decreased from 367 to 248. Is that "normal"? Maybe it
>>>>>> is temporary, while the cluster verifies all the PGs?
>>>>>>
>>>>>> Regards
>>>>>> Pierre
>>>>>>
>>>>>> On 02/07/2014 19:16, Samuel Just wrote:
>>>>>>
>>>>>>> You should add
>>>>>>>
>>>>>>> debug osd = 20
>>>>>>> debug filestore = 20
>>>>>>> debug ms = 1
>>>>>>>
>>>>>>> to the [osd] section of the ceph.conf and restart the osds. I'd like
>>>>>>> all three logs if possible.
>>>>>>>
>>>>>>> Thanks
>>>>>>> -Sam
>>>>>>>
>>>>>>> On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
>>>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>>>
>>>>>>>> Yes, but how do I do that?
>>>>>>>>
>>>>>>>> With a command like this?
>>>>>>>>
>>>>>>>> ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'
>>>>>>>>
>>>>>>>> Or by modifying /etc/ceph/ceph.conf? That file is very minimal
>>>>>>>> because I use udev detection.
>>>>>>>>
>>>>>>>> Once I have made these changes, do you want the three log files or
>>>>>>>> only osd.20's?
>>>>>>>>
>>>>>>>> Thank you so much for the help.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Pierre
>>>>>>>>
>>>>>>>> On 01/07/2014 23:51, Samuel Just wrote:
>>>>>>>>
>>>>>>>>> Can you reproduce with
>>>>>>>>> debug osd = 20
>>>>>>>>> debug filestore = 20
>>>>>>>>> debug ms = 1
>>>>>>>>> ?
>>>>>>>>> -Sam
>>>>>>>>>
>>>>>>>>> On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
>>>>>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I attach:
>>>>>>>>>>   - osd.20, one of the OSDs that I identified as making other OSDs crash.
>>>>>>>>>>   - osd.23, one of the OSDs which crashes when I start osd.20.
>>>>>>>>>>   - mds, one of my MDSs.
>>>>>>>>>>
>>>>>>>>>> I truncated the log files because they are too big. Everything is here:
>>>>>>>>>> https://blondeau.users.greyc.fr/cephlog/
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>>
>>>>>>>>>> On 30/06/2014 17:35, Gregory Farnum wrote:
>>>>>>>>>>
>>>>>>>>>>> What's the backtrace from the crashing OSDs?
>>>>>>>>>>>
>>>>>>>>>>> Keep in mind that as a dev release, it's generally best not to upgrade
>>>>>>>>>>> to unnamed versions like 0.82 (but it's probably too late to go back
>>>>>>>>>>> now).
>>>>>>>>>>
>>>>>>>>>> I will remember that next time ;)
>>>>>>>>>>
>>>>>>>>>>> -Greg
>>>>>>>>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU
>>>>>>>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> After the upgrade to firefly, I had some PGs stuck in the peering state.
>>>>>>>>>>>> I saw that 0.82 had been released, so I tried upgrading to solve my
>>>>>>>>>>>> problem.
>>>>>>>>>>>>
>>>>>>>>>>>> My three MDSs crash, and some OSDs trigger a chain reaction that kills
>>>>>>>>>>>> other OSDs.
>>>>>>>>>>>> I think my MDSs will not start because their metadata are on the OSDs.
>>>>>>>>>>>>
>>>>>>>>>>>> I have 36 OSDs on three servers, and I identified 5 OSDs which make the
>>>>>>>>>>>> others crash. If I do not start those, the cluster goes into a recovery
>>>>>>>>>>>> state with 31 OSDs, but I have 378 PGs in down+peering state.
>>>>>>>>>>>>
>>>>>>>>>>>> What can I do? Would you like more information (OS, crash logs, etc.)?
>>>>>>>>>>>>
>>>>>>>>>>>> Regards
>>>>>>
>>>>>> --
>>>>>> ----------------------------------------------
>>>>>> Pierre BLONDEAU
>>>>>> Systems & network administrator
>>>>>> Université de Caen
>>>>>> Laboratoire GREYC, Département d'informatique
>>>>>>
>>>>>> tel : 02 31 56 75 42
>>>>>> office : Campus 2, Science 3, 406
>>>>>> ----------------------------------------------
>>>
>>> --
>>> ----------------------------------------------
>>> Pierre BLONDEAU
>>> Systems & network administrator
>>> Université de Caen
>>> Laboratoire GREYC, Département d'informatique
>>>
>>> tel : 02 31 56 75 42
>>> office : Campus 2, Science 3, 406
>>> ----------------------------------------------
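PS: In case it helps, this is roughly how I located the osdmap files you asked for (assuming the default OSD data path /var/lib/ceph/osd/ceph-<id>; adjust if your layout differs, and the pattern is deliberately loose because the file names contain escape characters):

  # locate the osdmap 13258 files in current/meta on osd.20 and osd.23
  find /var/lib/ceph/osd/ceph-20/current/meta -name 'osdmap*13258*'
  find /var/lib/ceph/osd/ceph-23/current/meta -name 'osdmap*13258*'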
--
----------------------------------------------
Pierre BLONDEAU
Systems & network administrator
Université de Caen
Laboratoire GREYC, Département d'informatique

tel : 02 31 56 75 42
office : Campus 2, Science 3, 406
----------------------------------------------