Some OSD and MDS crash

Can you confirm from the admin socket that all monitors are running
the same version?
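
For example (a sketch only -- adjust the mon id and the socket path for your
setup; the mon id is often, but not always, the short hostname), something like
this on each monitor host should report the version over the admin socket:

ceph --admin-daemon /var/run/ceph/ceph-mon.$(hostname).asok version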
-Sam

On Wed, Jul 2, 2014 at 4:15 PM, Pierre BLONDEAU
<pierre.blondeau at unicaen.fr> wrote:
> On 03/07/2014 00:55, Samuel Just wrote:
>
>> Ah,
>>
>> ~/logs $ for i in 20 23; do ../ceph/src/osdmaptool --export-crush
>> /tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i >
>> /tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
>> ../ceph/src/osdmaptool: osdmap file
>> 'osd-20_osdmap.13258__0_4E62BB79__none'
>> ../ceph/src/osdmaptool: exported crush map to /tmp/crush20
>> ../ceph/src/osdmaptool: osdmap file
>> 'osd-23_osdmap.13258__0_4E62BB79__none'
>> ../ceph/src/osdmaptool: exported crush map to /tmp/crush23
>> 6d5
>> < tunable chooseleaf_vary_r 1
>>
>> Looks like the chooseleaf_vary_r tunable somehow ended up divergent?
>>
>> Pierre: do you recall how and when that got set?
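>>
>> To double-check what the cluster is using now, something like this should
>> print the current tunables (assuming your ceph version has the command;
>> exact output varies by version):
>>
>> ceph osd crush show-tunables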
>
>
> I am not sure I understand, but if I remember correctly, after the upgrade to
> firefly I was in the state "HEALTH_WARN crush map has legacy tunables" and I
> saw "feature set mismatch" in the logs.
>
> So, if I remember correctly, I ran "ceph osd crush tunables optimal" for the
> "crush map" problem, and I upgraded my client and server kernels to 3.16rc.
>
> Could that be it?
>
> Pierre
>
>
>> -Sam
>>
>> On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just <sam.just at inktank.com> wrote:
>>>
>>> Yeah, divergent osdmaps:
>>> 555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_4E62BB79__none
>>> 6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0_4E62BB79__none
>>>
>>> Joao: thoughts?
>>> -Sam
>>>
>>> On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU
>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>
>>>> The files are attached.
>>>>
>>>> When I upgraded:
>>>>   ceph-deploy install --stable firefly servers...
>>>>   on each server: service ceph restart mon
>>>>   on each server: service ceph restart osd
>>>>   on each server: service ceph restart mds
>>>>
>>>> I upgraded from emperor to firefly. After repair, remap, replace, etc.,
>>>> I had some PGs which got stuck in the peering state.
>>>>
>>>> I thought: why not try version 0.82, it might solve my problem (that was
>>>> my mistake). So I upgraded from firefly to 0.82 with:
>>>>   ceph-deploy install --testing servers...
>>>>   ..
>>>>
>>>> Now, all daemons are at version 0.82.
>>>> I have 3 mons, 36 OSDs and 3 MDSs.
>>>>
>>>> Pierre
>>>>
>>>> PS: I also found "inc\uosdmap.13258__0_469271DE__none" in each meta
>>>> directory.
>>>>
>>>> On 03/07/2014 00:10, Samuel Just wrote:
>>>>
>>>>> Also, what version did you upgrade from, and how did you upgrade?
>>>>> -Sam
>>>>>
>>>>> On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just <sam.just at inktank.com>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> Ok, in current/meta on osd 20 and osd 23, please attach all files
>>>>>> matching
>>>>>>
>>>>>> ^osdmap.13258.*
>>>>>>
>>>>>> There should be one such file on each osd. (should look something like
>>>>>> osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory,
>>>>>> you'll want to use find).
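>>>>>>
>>>>>> For example, assuming the default OSD data path, something like:
>>>>>>
>>>>>> find /var/lib/ceph/osd/ceph-20/current/meta -name 'osdmap.13258*'
>>>>>> find /var/lib/ceph/osd/ceph-23/current/meta -name 'osdmap.13258*'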
>>>>>>
>>>>>> What version of ceph is running on your mons?  How many mons do you
>>>>>> have?
>>>>>> -Sam
>>>>>>
>>>>>> On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
>>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I did it; the log files are available here:
>>>>>>> https://blondeau.users.greyc.fr/cephlog/debug20/
>>>>>>>
>>>>>>> The OSD log files are really big, +/- 80 MB.
>>>>>>>
>>>>>>> After starting osd.20, some other OSDs crashed. I went from 31 OSDs up
>>>>>>> to 16.
>>>>>>> I noticed that after this the number of down+peering PGs decreased from
>>>>>>> 367 to 248. Is that "normal"? Maybe it is temporary, while the cluster
>>>>>>> verifies all the PGs?
>>>>>>>
>>>>>>> Regards
>>>>>>> Pierre
>>>>>>>
>>>>>>> On 02/07/2014 19:16, Samuel Just wrote:
>>>>>>>
>>>>>>>> You should add
>>>>>>>>
>>>>>>>> debug osd = 20
>>>>>>>> debug filestore = 20
>>>>>>>> debug ms = 1
>>>>>>>>
>>>>>>>> to the [osd] section of the ceph.conf and restart the osds.  I'd
>>>>>>>> like
>>>>>>>> all three logs if possible.
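>>>>>>>>
>>>>>>>> For example, in /etc/ceph/ceph.conf (or wherever your conf lives),
>>>>>>>> something like:
>>>>>>>>
>>>>>>>> [osd]
>>>>>>>>     debug osd = 20
>>>>>>>>     debug filestore = 20
>>>>>>>>     debug ms = 1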
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> -Sam
>>>>>>>>
>>>>>>>> On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
>>>>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yes, but how do I do that?
>>>>>>>>>
>>>>>>>>> With a command like this?
>>>>>>>>>
>>>>>>>>> ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'
>>>>>>>>>
>>>>>>>>> Or by modifying /etc/ceph/ceph.conf? This file is very minimal because
>>>>>>>>> I use udev detection.
>>>>>>>>>
>>>>>>>>> Once I have made these changes, do you want the three log files or
>>>>>>>>> only osd.20's?
>>>>>>>>>
>>>>>>>>> Thank you so much for the help
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Pierre
>>>>>>>>>
>>>>>>>>> On 01/07/2014 23:51, Samuel Just wrote:
>>>>>>>>>
>>>>>>>>>> Can you reproduce with
>>>>>>>>>> debug osd = 20
>>>>>>>>>> debug filestore = 20
>>>>>>>>>> debug ms = 1
>>>>>>>>>> ?
>>>>>>>>>> -Sam
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
>>>>>>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I attach:
>>>>>>>>>>>      - osd.20: one of the OSDs that I identified as making other
>>>>>>>>>>> OSDs crash.
>>>>>>>>>>>      - osd.23: one of the OSDs that crashes when I start osd.20.
>>>>>>>>>>>      - mds: one of my MDSs.
>>>>>>>>>>>
>>>>>>>>>>> I truncated the log files because they are too big, but everything is here:
>>>>>>>>>>> https://blondeau.users.greyc.fr/cephlog/
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>>
>>>>>>>>>>> On 30/06/2014 17:35, Gregory Farnum wrote:
>>>>>>>>>>>
>>>>>>>>>>>> What's the backtrace from the crashing OSDs?
>>>>>>>>>>>>
>>>>>>>>>>>> Keep in mind that as a dev release, it's generally best not to
>>>>>>>>>>>> upgrade
>>>>>>>>>>>> to unnamed versions like 0.82 (but it's probably too late to go
>>>>>>>>>>>> back
>>>>>>>>>>>> now).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I will remember that next time ;)
>>>>>>>>>>>
>>>>>>>>>>>> -Greg
>>>>>>>>>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU
>>>>>>>>>>>> <pierre.blondeau at unicaen.fr> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> After the upgrade to firefly, I had some PGs stuck in the peering
>>>>>>>>>>>>> state. I saw that 0.82 was out, so I tried upgrading to solve my
>>>>>>>>>>>>> problem.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My three MDSs crash, and some OSDs trigger a chain reaction that
>>>>>>>>>>>>> kills other OSDs.
>>>>>>>>>>>>> I think my MDSs will not start because their metadata are on the
>>>>>>>>>>>>> OSDs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have 36 OSDs on three servers and I identified 5 OSDs which make
>>>>>>>>>>>>> the others crash. If I do not start them, the cluster goes into a
>>>>>>>>>>>>> recovery state with 31 OSDs, but I have 378 PGs in the down+peering
>>>>>>>>>>>>> state.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What can I do? Would you like more information (OS, crash logs,
>>>>>>>>>>>>> etc.)?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> ----------------------------------------------
>>>>>>> Pierre BLONDEAU
>>>>>>> Systems & Network Administrator
>>>>>>> Université de Caen
>>>>>>> Laboratoire GREYC, Département d'informatique
>>>>>>>
>>>>>>> tel     : 02 31 56 75 42
>>>>>>> office  : Campus 2, Science 3, 406
>>>>>>> ----------------------------------------------
>>>>>>>
>>>>
>>>>
>>>> --
>>>> ----------------------------------------------
>>>> Pierre BLONDEAU
>>>> Systems & Network Administrator
>>>> Université de Caen
>>>> Laboratoire GREYC, Département d'informatique
>>>>
>>>> tel     : 02 31 56 75 42
>>>> office  : Campus 2, Science 3, 406
>>>> ----------------------------------------------
>
>
>
> --
> ----------------------------------------------
> Pierre BLONDEAU
> Systems & Network Administrator
> Université de Caen
> Laboratoire GREYC, Département d'informatique
>
> tel     : 02 31 56 75 42
> office  : Campus 2, Science 3, 406
> ----------------------------------------------
>

