Some OSD and MDS crash


 



There is no chance I still have those logs, and even less at debug level; I 
made that change three weeks ago.

I put all my logs here, in case it helps:
https://blondeau.users.greyc.fr/cephlog/all/

Is there any chance of recovering my roughly 20 TB of data?

Regards

On 03/07/2014 21:48, Joao Luis wrote:
> Do those logs have a higher debugging level than the default? If not,
> never mind, as they will not have enough information. If they do, however,
> we'd be interested in the portion around the moment you set the
> tunables; say, from before the upgrade until a bit after you set the tunable.
> If you want to be finer-grained, then ideally it would be the moment
> when those maps were created, but you'd have to grep the logs for that.
>
> Or drop the logs somewhere and I'll take a look.
>
>    -Joao
>
> On Jul 3, 2014 5:48 PM, "Pierre BLONDEAU" <pierre.blondeau at unicaen.fr> wrote:
>
>     On 03/07/2014 13:49, Joao Eduardo Luis wrote:
>
>         On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote:
>
>             On 03/07/2014 00:55, Samuel Just wrote:
>
>                 Ah,
>
>                 ~/logs $ for i in 20 23; do
>                     ../ceph/src/osdmaptool --export-crush /tmp/crush$i osd-$i*;
>                     ../ceph/src/crushtool -d /tmp/crush$i > /tmp/crush$i.d;
>                 done; diff /tmp/crush20.d /tmp/crush23.d
>                 ../ceph/src/osdmaptool: osdmap file 'osd-20_osdmap.13258__0_4E62BB79__none'
>                 ../ceph/src/osdmaptool: exported crush map to /tmp/crush20
>                 ../ceph/src/osdmaptool: osdmap file 'osd-23_osdmap.13258__0_4E62BB79__none'
>                 ../ceph/src/osdmaptool: exported crush map to /tmp/crush23
>                 6d5
>                 < tunable chooseleaf_vary_r 1
>
>                   Looks like the chooseleaf_vary_r tunable somehow ended
>                 up divergent?
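
For reference, a minimal sketch (assuming a reachable cluster and default admin
credentials; the output paths are arbitrary) of how one could also dump the
monitors' view of the tunables and CRUSH map, to compare against the copies
decompiled from the OSD osdmaps above:

    # Show the tunables the monitors currently advertise (chooseleaf_vary_r among them).
    ceph osd crush show-tunables

    # Export and decompile the monitors' CRUSH map for a text diff against
    # the /tmp/crush20.d and /tmp/crush23.d files produced above.
    ceph osd getcrushmap -o /tmp/crush.mon
    crushtool -d /tmp/crush.mon -o /tmp/crush.mon.d
    diff /tmp/crush.mon.d /tmp/crush20.d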
>
>
>         The only thing that comes to mind that could cause this is if we
>         changed the leader's in-memory map, proposed it, the proposal failed,
>         and somehow only the leader got to write the map to disk.  This
>         happened once on a totally different issue (although I can't pinpoint
>         which one right now).
>
>         In such a scenario, the leader would serve the incorrect osdmap to
>         whoever requested osdmaps from it, while the remaining quorum would
>         serve the correct osdmaps to all the others.  That could cause this
>         divergence.  Or it could be something else.
>
>         Are there monitor logs for the timeframe this may have happened in?
>
>
>     Exactly which timeframe do you want? I have 7 days of logs, so I should
>     have information about the upgrade from firefly to 0.82.
>     Which monitor's logs do you want? All three?
>
>     Regards
>
>             -Joao
>
>
>                 Pierre: do you recall how and when that got set?
>
>
>             I am not sure I understand, but if I remember correctly, after
>             the update to firefly I was in the state "HEALTH_WARN crush map
>             has legacy tunables" and I saw "feature set mismatch" in the logs.
>
>             So, if I remember correctly, I ran "ceph osd crush tunables
>             optimal" for the "crush map" warning, and I updated my client and
>             server kernels to 3.16rc.
>
>             Could it be that?
>
>             Pierre
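
A minimal sketch (not something suggested in the thread, and only worth doing
if one actually decided to go back) of how to check which tunables profile the
cluster currently advertises and how to revert to the pre-firefly defaults;
note that changing the profile triggers data movement:

    # Print the CRUSH tunables the monitors currently advertise.
    ceph osd crush show-tunables

    # Revert to the legacy (pre-firefly) tunables profile; this causes rebalancing.
    ceph osd crush tunables legacy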
>
>                 -Sam
>
>                 On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just
>                 <sam.just at inktank.com> wrote:
>
>                     Yeah, divergent osdmaps:
>                     555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_4E62BB79__none
>                     6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0_4E62BB79__none
>
>                     Joao: thoughts?
>                     -Sam
>
>                     On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU
>                     <pierre.blondeau at unicaen.fr> wrote:
>
>                         The files are attached.
>
>                         When I upgraded:
>                            ceph-deploy install --stable firefly servers...
>                            on each server: service ceph restart mon
>                            on each server: service ceph restart osd
>                            on each server: service ceph restart mds
>
>                         I had upgraded from emperor to firefly. After repair,
>                         remap, replace, etc., I had some PGs stuck in the
>                         peering state.
>
>                         I thought: why not try version 0.82, it might solve my
>                         problem (that was my mistake). So I upgraded from
>                         firefly to 0.82 with:
>                            ceph-deploy install --testing servers...
>                            ..
>
>                         Now all daemons are running version 0.82.
>                         I have 3 mons, 36 OSDs and 3 MDSes.
>
>                         Pierre
>
>                         PS: I also find "inc\uosdmap.13258__0_469271DE__none"
>                         in each meta directory.
>
>                         On 03/07/2014 00:10, Samuel Just wrote:
>
>                             Also, what version did you upgrade from, and
>                             how did you upgrade?
>                             -Sam
>
>                             On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just
>                             <sam.just at inktank.com> wrote:
>
>
>                                 Ok, in current/meta on osd 20 and osd 23,
>                                 please attach all files matching
>
>                                 ^osdmap.13258.*
>
>                                 There should be one such file on each osd.
>                                 (It should look something like
>                                 osdmap.6__0_FD6E4C01__none, probably hashed
>                                 into a subdirectory; you'll want to use find.)
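
A minimal sketch of that find, assuming the default filestore data path
/var/lib/ceph/osd/ceph-<id> (adjust if your OSDs are mounted elsewhere); it
locates the epoch-13258 osdmap object for osd.20 and copies it out under a
name that records its origin:

    find /var/lib/ceph/osd/ceph-20/current/meta -name 'osdmap.13258*' \
         -exec cp {} /tmp/osd-20_osdmap.13258 \;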
>
>                                 What version of ceph is running on your
>                                 mons?  How many mons do
>                                 you have?
>                                 -Sam
>
>                                 On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
>                                 <pierre.blondeau at unicaen.fr> wrote:
>
>
>                                     Hi,
>
>                                     I did it; the log files are available here:
>                                     https://blondeau.users.greyc.fr/cephlog/debug20/
>
>                                     The OSD log files are really big, around 80 MB each.
>
>                                     After starting osd.20, some other OSDs crashed: the
>                                     number of OSDs up went from 31 to 16. I noticed that
>                                     afterwards the number of down+peering PGs decreased
>                                     from 367 to 248. Is that "normal"? Maybe it is
>                                     temporary, while the cluster verifies all the PGs?
>
>                                     Regards
>                                     Pierre
>
>                                     On 02/07/2014 19:16, Samuel Just wrote:
>
>                                         You should add
>
>                                         debug osd = 20
>                                         debug filestore = 20
>                                         debug ms = 1
>
>                                         to the [osd] section of the ceph.conf
>                                         and restart the osds.  I'd like all
>                                         three logs if possible.
>
>                                         Thanks
>                                         -Sam
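
A minimal sketch of what that ceph.conf change could look like (the surrounding
contents of the file are assumptions; only the three debug lines come from the
thread):

    [osd]
        # Temporary debug logging for the crash investigation; these levels
        # are very verbose, so remove them once the logs have been captured.
        debug osd = 20
        debug filestore = 20
        debug ms = 1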
>
>                                         On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
>                                         <pierre.blondeau at unicaen.fr> wrote:
>
>
>
>                                             Yes, but how do I do that?
>
>                                             With a command like this?
>
>                                             ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'
>
>                                             Or by modifying /etc/ceph/ceph.conf? That
>                                             file is really sparse in my setup because
>                                             I use udev detection.
>
>                                             Once I have made these changes, do you want
>                                             all three log files or only osd.20's?
>
>                                             Thank you so much for the help
>
>                                             Regards
>                                             Pierre
>
>                                             On 01/07/2014 23:51, Samuel Just wrote:
>
>                                                 Can you reproduce with
>                                                 debug osd = 20
>                                                 debug filestore = 20
>                                                 debug ms = 1
>                                                 ?
>                                                 -Sam
>
>                                                 On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
>                                                 <pierre.blondeau at unicaen.fr> wrote:
>
>
>
>
>                                                     Hi,
>
>                                                     I attach:
>                                                       - osd.20: one of the OSDs that I identified as making other OSDs crash.
>                                                       - osd.23: one of the OSDs that crashes when I start osd.20.
>                                                       - mds: one of my MDSes.
>
>                                                     I cut the log files because they are too big. Everything is here:
>                                                     https://blondeau.users.greyc.fr/cephlog/
>
>                                                     Regards
>
>                                                     On 30/06/2014 17:35, Gregory Farnum wrote:
>
>                                                         What's the backtrace from the crashing OSDs?
>
>                                                         Keep in mind that as a dev release, it's generally
>                                                         best not to upgrade to unnamed versions like 0.82
>                                                         (but it's probably too late to go back now).
>
>
>
>
>                                                     I will remember that next time ;)
>
>                                                         -Greg
>                                                         Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>                                                         On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU
>                                                         <pierre.blondeau at unicaen.fr> wrote:
>
>
>
>                                                             Hi,
>
>                                                             After the upgrade to firefly, I have some PGs stuck
>                                                             in the peering state. I saw that 0.82 had been
>                                                             released, so I tried upgrading to it to solve my
>                                                             problem.
>
>                                                             My three MDSes crash, and some OSDs trigger a chain
>                                                             reaction that kills other OSDs. I think my MDSes will
>                                                             not start because the metadata are on the OSDs.
>
>                                                             I have 36 OSDs across three servers, and I identified
>                                                             5 OSDs that make the others crash. If I do not start
>                                                             them, the cluster goes into a recovering state with
>                                                             31 OSDs, but I have 378 PGs in the down+peering state.
>
>                                                             What can I do? Would you like more information (OS,
>                                                             crash logs, etc.)?
>
>                                                             Regards
>
>
>
>
>                                     --
>                                     ----------------------------------------------
>                                     Pierre BLONDEAU
>                                     Administrateur Systèmes & réseaux
>                                     Université de Caen
>                                     Laboratoire GREYC, Département d'informatique
>
>                                     tel     : 02 31 56 75 42
>                                     bureau  : Campus 2, Science 3, 406
>                                     ----------------------------------------------
>
>
>
>                         --
>                         ----------------------------------------------
>                         Pierre BLONDEAU
>                         Administrateur Systèmes & réseaux
>                         Université de Caen
>                         Laboratoire GREYC, Département d'informatique
>
>                         tel     : 02 31 56 75 42
>                         bureau  : Campus 2, Science 3, 406
>                         ----------------------------------------------
>
>
>
>
>
>
>
>     --
>     ----------------------------------------------
>     Pierre BLONDEAU
>     Administrateur Systèmes & réseaux
>     Université de Caen
>     Laboratoire GREYC, Département d'informatique
>
>     tel     : 02 31 56 75 42
>     bureau  : Campus 2, Science 3, 406
>     ----------------------------------------------
>
>


-- 
----------------------------------------------
Pierre BLONDEAU
Administrateur Systèmes & réseaux
Université de Caen
Laboratoire GREYC, Département d'informatique

tel	: 02 31 56 75 42
bureau	: Campus 2, Science 3, 406
----------------------------------------------


