Some OSD and MDS crash

Hi,

Is there any chance to restore my data?

Regards
Pierre

On 07/07/2014 15:42, Pierre BLONDEAU wrote:
> There is no chance of getting those logs, and even less in debug mode:
> I made that change 3 weeks ago.
>
> I put all my logs here, in case it helps:
> https://blondeau.users.greyc.fr/cephlog/all/
>
> Do I have a chance of recovering my +/- 20 TB of data?
>
> Regards
>
> On 03/07/2014 21:48, Joao Luis wrote:
>> Do those logs have a higher debugging level than the default? If not,
>> never mind, as they will not have enough information. If they do, however,
>> we'd be interested in the portion around the moment you set the
>> tunables: say, from before the upgrade until a bit after you set the tunable.
>> If you want to be finer grained, then ideally it would be the moment
>> where those maps were created, but you'd have to grep the logs for that.
>>
>> Or drop the logs somewhere and I'll take a look.
>>
>>    -Joao
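>>
>> (For reference, a rough way to narrow that timeframe down, assuming the
>> default monitor log location and that the relevant epoch is 13258; the
>> exact messages present depend on the debug level that was in effect:)
>>
>>     # search current and rotated mon logs for the epoch and for tunables changes
>>     zgrep -h 'e13258' /var/log/ceph/ceph-mon.*.log* | head
>>     zgrep -h 'tunables' /var/log/ceph/ceph-mon.*.log* | head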
>>
>> On Jul 3, 2014 5:48 PM, "Pierre BLONDEAU" <pierre.blondeau at unicaen.fr> wrote:
>>
>>     On 03/07/2014 13:49, Joao Eduardo Luis wrote:
>>
>>         On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote:
>>
>>             On 03/07/2014 00:55, Samuel Just wrote:
>>
>>                 Ah,
>>
>>                 ~/logs ? for i in 20 23; do ../ceph/src/osdmaptool --export-crush /tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i > /tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
>>                 ../ceph/src/osdmaptool: osdmap file 'osd-20_osdmap.13258__0_4E62BB79__none'
>>                 ../ceph/src/osdmaptool: exported crush map to /tmp/crush20
>>                 ../ceph/src/osdmaptool: osdmap file 'osd-23_osdmap.13258__0_4E62BB79__none'
>>                 ../ceph/src/osdmaptool: exported crush map to /tmp/crush23
>>                 6d5
>>                 < tunable chooseleaf_vary_r 1
>>
>>                   Looks like the chooseleaf_vary_r tunable somehow ended up divergent?
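>>
>>                 (A quick way to see the full tunables difference, using the
>>                 decompiled maps produced above; tunables show up as lines
>>                 starting with "tunable":)
>>
>>                     grep '^tunable' /tmp/crush20.d /tmp/crush23.d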
>>
>>
>>         The only thing that comes to mind that could cause this is if we
>>         changed the leader's in-memory map, proposed it, it failed, and
>>         only the leader somehow got to write the map to disk.  This
>>         happened once on a totally different issue (although I can't
>>         pinpoint right now which).
>>
>>         In such a scenario, the leader would serve the incorrect osdmap
>>         to whoever asked it for osdmaps, while the remaining quorum would
>>         serve the correct osdmaps to all the others.  This could cause
>>         this divergence.  Or it could be something else.
>>
>>         Are there logs for the monitors for the timeframe this may have
>>         happened in?
>>
>>
>>     Which exact timeframe do you want? I have 7 days of logs; I should
>>     have information about the upgrade from firefly to 0.82.
>>     Which mon's logs do you want? All three?
>>
>>     Regards
>>
>>             -Joao
>>
>>
>>                 Pierre: do you recall how and when that got set?
>>
>>
>>             I am not sure I understand, but if I remember correctly, after
>>             the update to firefly I was in the state "HEALTH_WARN crush map
>>             has legacy tunables" and I saw "feature set mismatch" in the
>>             logs.
>>
>>             So, if I remember correctly, I ran "ceph osd crush tunables
>>             optimal" to address the "crush map" warning, and I updated my
>>             client and server kernels to 3.16rc.
>>
>>             Could that be it?
>>
>>             Pierre
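>>
>>             (If it helps, the tunables the cluster currently advertises can
>>             be checked without touching the OSDs; "ceph osd crush
>>             show-tunables" should be available on firefly and later, as far
>>             as I know, and the second form works on older versions too:)
>>
>>                 ceph osd crush show-tunables
>>                 ceph osd getcrushmap -o /tmp/crushmap.cur
>>                 crushtool -d /tmp/crushmap.cur | grep '^tunable'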
>>
>>                 -Sam
>>
>>                 On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just <sam.just at inktank.com> wrote:
>>
>>                     Yeah, divergent osdmaps:
>>                     555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_4E62BB79__none
>>                     6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0_4E62BB79__none
>>
>>                     Joao: thoughts?
>>                     -Sam
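>>
>>                     (A small sketch for checking whether any other OSDs hold a
>>                     divergent copy of that epoch, assuming the default data path
>>                     /var/lib/ceph/osd/ceph-<id>; run it on each host and compare
>>                     the checksums:)
>>
>>                         for osd in /var/lib/ceph/osd/ceph-*; do
>>                             find "$osd/current/meta" -name 'osdmap.13258*' -exec md5sum {} \;
>>                         done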
>>
>>                     On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU <pierre.blondeau at unicaen.fr> wrote:
>>
>>                         The files are attached.
>>
>>                         When I upgraded:
>>                            ceph-deploy install --stable firefly servers...
>>                            on each server: service ceph restart mon
>>                            on each server: service ceph restart osd
>>                            on each server: service ceph restart mds
>>
>>                         I upgraded from emperor to firefly. After repair,
>>                         remap, replace, etc ... I had some PGs which stayed
>>                         in the peering state.
>>
>>                         I thought: why not try version 0.82, it could solve
>>                         my problem (that was my mistake). So I upgraded from
>>                         firefly to 0.83 with:
>>                            ceph-deploy install --testing servers...
>>                            ..
>>
>>                         Now, all programs are at version 0.82.
>>                         I have 3 mons, 36 OSDs and 3 MDSes.
>>
>>                         Pierre
>>
>>                         PS : I also find "inc\uosdmap.13258__0_469271DE__none"
>>                         in each meta directory.
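>>
>>                         (For what it's worth, a sketch of the usual rolling-restart
>>                         order after installing new packages; host names here are
>>                         placeholders. The common advice is all monitors first, then
>>                         the OSDs, then the MDSes:)
>>
>>                             ceph-deploy install --stable firefly mon1 mon2 mon3 osd1 osd2 osd3
>>                             # then, on each monitor host:
>>                             service ceph restart mon
>>                             # then, on each OSD host, one host at a time:
>>                             service ceph restart osd
>>                             # and finally, on each MDS host:
>>                             service ceph restart mds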
>>
>>                         On 03/07/2014 00:10, Samuel Just wrote:
>>
>>                             Also, what version did you upgrade from, and how did you upgrade?
>>                             -Sam
>>
>>                             On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just <sam.just at inktank.com> wrote:
>>
>>
>>                                 Ok, in current/meta on osd 20 and osd 23,
>>                                 please attach all files matching
>>
>>                                 ^osdmap.13258.*
>>
>>                                 There should be one such file on each osd
>>                                 (it should look something like
>>                                 osdmap.6__0_FD6E4C01__none, probably hashed
>>                                 into a subdirectory; you'll want to use find).
>>
>>                                 What version of ceph is running on your mons?
>>                                 How many mons do you have?
>>                                 -Sam
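>>
>>                                 (A minimal sketch of that search, assuming the
>>                                 default OSD data path /var/lib/ceph/osd/ceph-<id>:)
>>
>>                                     find /var/lib/ceph/osd/ceph-20/current/meta -name 'osdmap.13258*'
>>                                     find /var/lib/ceph/osd/ceph-23/current/meta -name 'osdmap.13258*'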
>>
>>                                 On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU <pierre.blondeau at unicaen.fr> wrote:
>>
>>
>>                                     Hi,
>>
>>                                     I did it; the log files are available here:
>>                                     https://blondeau.users.greyc.fr/cephlog/debug20/
>>
>>                                     The OSDs' log files are really big, +/- 80M.
>>
>>                                     After starting osd.20, some other OSDs crash:
>>                                     I went from 31 OSDs up to 16.
>>                                     I noticed that after this the number of
>>                                     down+peering PGs decreased from 367 to 248.
>>                                     Is that "normal"? Maybe it is temporary, the
>>                                     time the cluster takes to verify all the PGs?
>>
>>                                     Regards
>>                                     Pierre
>>
>>                                     On 02/07/2014 19:16, Samuel Just wrote:
>>
>>                                         You should add
>>
>>                                         debug osd = 20
>>                                         debug filestore = 20
>>                                         debug ms = 1
>>
>>                                         to the [osd] section of the ceph.conf
>>                                         and restart the osds.  I'd like all
>>                                         three logs if possible.
>>
>>                                         Thanks
>>                                         -Sam
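>>
>>                                         (A minimal sketch of that ceph.conf change,
>>                                         plus the runtime alternative if a restart is
>>                                         not wanted; paths are the usual defaults:)
>>
>>                                             # /etc/ceph/ceph.conf (excerpt)
>>                                             [osd]
>>                                                 debug osd = 20
>>                                                 debug filestore = 20
>>                                                 debug ms = 1
>>
>>                                             # or inject into the running daemons:
>>                                             ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'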
>>
>>                                         On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU <pierre.blondeau at unicaen.fr> wrote:
>>
>>
>>
>>                                             Yes, but how do I do that?
>>
>>                                             With a command like this?
>>
>>                                             ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'
>>
>>                                             Or by modifying /etc/ceph/ceph.conf?
>>                                             That file is really sparse because
>>                                             I use udev detection.
>>
>>                                             Once I have made these changes, do
>>                                             you want all three log files or
>>                                             only osd.20's?
>>
>>                                             Thank you so much for the help.
>>
>>                                             Regards
>>                                             Pierre
>>
>>                                             On 01/07/2014 23:51, Samuel Just wrote:
>>
>>                                                 Can you reproduce with
>>                                                 debug osd = 20
>>                                                 debug filestore = 20
>>                                                 debug ms = 1
>>                                                 ?
>>                                                 -Sam
>>
>>                                                 On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU <pierre.blondeau at unicaen.fr> wrote:
>>
>>
>>
>>
>>                                                     Hi,
>>
>>                                                     I attach:
>>                                                       - osd.20 is one of the OSDs that I
>>                                                         identified as making other OSDs crash.
>>                                                       - osd.23 is one of the OSDs which crash
>>                                                         when I start osd.20.
>>                                                       - mds is one of my MDSes.
>>
>>                                                     I cut the log files because they are
>>                                                     too big, but everything is here:
>>                                                     https://blondeau.users.greyc.fr/cephlog/
>>
>>                                                     Regards
>>
>>                                                     On 30/06/2014 17:35, Gregory Farnum wrote:
>>
>>                                                         What's the backtrace from the
>>                                                         crashing OSDs?
>>
>>                                                         Keep in mind that as a dev release,
>>                                                         it's generally best not to upgrade to
>>                                                         unnamed versions like 0.82 (but it's
>>                                                         probably too late to go back now).
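>>
>>                                                         (If it helps for pulling the backtrace
>>                                                         out of those large logs, something along
>>                                                         these lines usually works; the log path
>>                                                         is an assumption:)
>>
>>                                                             grep -n -A 40 -E 'FAILED assert|Caught signal' /var/log/ceph/ceph-osd.20.log | tail -n 80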
>>
>>
>>
>>
>>                                                     I will remember that next time ;)
>>
>>                                                         -Greg
>>                                                         Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>                                                         On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU <pierre.blondeau at unicaen.fr> wrote:
>>
>>
>>
>>                                                             Hi,
>>
>>                                                             After the upgrade to firefly, I
>>                                                             have some PGs in the peering
>>                                                             state. I saw the release of 0.82,
>>                                                             so I tried upgrading to solve my
>>                                                             problem.
>>
>>                                                             My three MDSes crash, and some
>>                                                             OSDs trigger a chain reaction
>>                                                             that kills other OSDs. I think my
>>                                                             MDSes will not start because
>>                                                             their metadata are on the OSDs.
>>
>>                                                             I have 36 OSDs on three servers,
>>                                                             and I identified 5 OSDs which
>>                                                             make the others crash. If I do
>>                                                             not start them, the cluster goes
>>                                                             into a recovering state with 31
>>                                                             OSDs, but I have 378 PGs in the
>>                                                             down+peering state.
>>
>>                                                             What can I do? Would you like
>>                                                             more information (OS, crash
>>                                                             logs, etc.)?
>>
>>                                                             Regards
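>>
>>                                                             (If useful, a couple of commands
>>                                                             that may show which PGs are stuck
>>                                                             and why; 2.3f below is only a
>>                                                             placeholder PG id taken from that
>>                                                             output:)
>>
>>                                                                 ceph health detail | grep -i peering
>>                                                                 ceph pg dump_stuck inactive
>>                                                                 ceph pg 2.3f query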
>>
>>
>>
>>
>>                                     --
>>                                     ----------------------------------------------
>>                                     Pierre BLONDEAU
>>                                     Systems & Network Administrator
>>                                     Université de Caen
>>                                     Laboratoire GREYC, Computer Science Department
>>
>>                                     tel     : 02 31 56 75 42
>>                                     office  : Campus 2, Science 3, 406
>>                                     ----------------------------------------------
>>
>>
>>
>>                         --
>>                         ----------------------------------------------
>>                         Pierre BLONDEAU
>>                         Systems & Network Administrator
>>                         Université de Caen
>>                         Laboratoire GREYC, Computer Science Department
>>
>>                         tel     : 02 31 56 75 42
>>                         office  : Campus 2, Science 3, 406
>>                         ----------------------------------------------
>>
>>
>>
>>
>>
>>
>>
>>     --
>>     ----------------------------------------------
>>     Pierre BLONDEAU
>>     Systems & Network Administrator
>>     Université de Caen
>>     Laboratoire GREYC, Computer Science Department
>>
>>     tel     : 02 31 56 75 42
>>     office  : Campus 2, Science 3, 406
>>     ----------------------------------------------
>>
>>
>
>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


--
----------------------------------------------
Pierre BLONDEAU
Systems & Network Administrator
Université de Caen
Laboratoire GREYC, Computer Science Department

tel     : 02 31 56 75 42
office  : Campus 2, Science 3, 406
----------------------------------------------


