Hi,

Great. All my OSDs restarted:

  osdmap e438044: 36 osds: 36 up, 36 in

All PGs are active and some are in recovery:

  1604040/49575206 objects degraded (3.236%)
  1780 active+clean
    17 active+degraded+remapped+backfilling
    61 active+degraded+remapped+wait_backfill
    11 active+clean+scrubbing+deep
    34 active+remapped+backfilling
    21 active+remapped+wait_backfill
     4 active+clean+replay

But all the MDS crash. Logs are here:
https://blondeau.users.greyc.fr/cephlog/legacy/

In any case, thank you very much for your help.

Pierre

On 09/07/2014 19:34, Joao Eduardo Luis wrote:
> On 07/09/2014 02:22 PM, Pierre BLONDEAU wrote:
>> Hi,
>>
>> Is there any chance to restore my data?
>
> Okay, I talked to Sam and here's what you could try before anything else:
>
> - Make sure you have everything running on the same version.
> - Unset the chooseleaf_vary_r flag -- this can be accomplished by
>   setting tunables to legacy.
> - Have the osds join the cluster.
> - You should then either upgrade to firefly (if you haven't done so by
>   now) or wait for the point release before you move on to setting
>   tunables to optimal again.
>
> Let us know how it goes.
>
> -Joao
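For reference, Joao's suggestion above maps to roughly the following commands. This is only a sketch: the grep check and the version check are illustrative additions, and the restart line just reuses the sysvinit-style command mentioned earlier in this thread.

  # revert to the legacy tunables profile, which clears chooseleaf_vary_r
  ceph osd crush tunables legacy

  # confirm the tunable is gone from the decompiled crush map
  ceph osd getcrushmap -o /tmp/crushmap
  crushtool -d /tmp/crushmap | grep chooseleaf_vary_r

  # bring the osds back in, then check that every daemon reports the same version
  service ceph restart osd
  ceph tell osd.* version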
>
>> Regards
>> Pierre
>>
>> On 07/07/2014 15:42, Pierre BLONDEAU wrote:
>>> No chance of getting those logs, and even less in debug mode. I made this change 3 weeks ago.
>>>
>>> I put all my logs here in case it can help:
>>> https://blondeau.users.greyc.fr/cephlog/all/
>>>
>>> Is there a chance to recover my +/- 20 TB of data?
>>>
>>> Regards
>>>
>>> On 03/07/2014 21:48, Joao Luis wrote:
>>>> Do those logs have a higher debugging level than the default? If not, never mind, as they will not have enough information. If they do, however, we'd be interested in the portion around the moment you set the tunables. Say, before the upgrade and a bit after you set the tunable. If you want to be finer grained, then ideally it would be the moment where those maps were created, but you'd have to grep the logs for that.
>>>>
>>>> Or drop the logs somewhere and I'll take a look.
>>>>
>>>> -Joao
>>>>
>>>> On Jul 3, 2014 5:48 PM, "Pierre BLONDEAU" <pierre.blondeau at unicaen.fr> wrote:
>>>> On 03/07/2014 13:49, Joao Eduardo Luis wrote:
>>>> On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote:
>>>> On 03/07/2014 00:55, Samuel Just wrote:
>>>>
>>>> Ah,
>>>>
>>>> ~/logs $ for i in 20 23; do ../ceph/src/osdmaptool --export-crush /tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i > /tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
>>>> ../ceph/src/osdmaptool: osdmap file 'osd-20_osdmap.13258__0___4E62BB79__none'
>>>> ../ceph/src/osdmaptool: exported crush map to /tmp/crush20
>>>> ../ceph/src/osdmaptool: osdmap file 'osd-23_osdmap.13258__0___4E62BB79__none'
>>>> ../ceph/src/osdmaptool: exported crush map to /tmp/crush23
>>>> 6d5
>>>> < tunable chooseleaf_vary_r 1
>>>>
>>>> Looks like the chooseleaf_vary_r tunable somehow ended up divergent?
>>>>
>>>> (Joao) The only thing that comes to mind that could cause this is if we changed the leader's in-memory map, proposed it, it failed, and only the leader got to write the map to disk somehow. This happened once on a totally different issue (although I can't pinpoint right now which). In such a scenario, the leader would serve the incorrect osdmap to whoever asked it for osdmaps, while the remaining quorum would serve the correct osdmaps to all the others. This could cause this divergence. Or it could be something else.
>>>>
>>>> Are there logs for the monitors for the timeframe this may have happened in?
>>>>
>>>> (Pierre) Exactly which timeframe do you want? I have 7 days of logs, so I should have information about the upgrade from firefly to 0.82. Which mon's log do you want? All three?
>>>>
>>>> Regards
>>>>
>>>> -Joao
>>>>
>>>> (Sam) Pierre: do you recall how and when that got set?
>>>>
>>>> (Pierre) I am not sure I understand, but if I remember correctly, after the update to firefly I was in the state "HEALTH_WARN crush map has legacy tunables" and I saw "feature set mismatch" in the logs.
>>>>
>>>> So, if I remember correctly, I ran "ceph osd crush tunables optimal" for the "crush map" problem and updated my client and server kernels to 3.16rc.
>>>>
>>>> Could it be that?
>>>>
>>>> Pierre
>>>>
>>>> -Sam
>>>>
>>>> On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just <sam.just at inktank.com> wrote:
>>>> Yeah, divergent osdmaps:
>>>>   555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0___4E62BB79__none
>>>>   6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0___4E62BB79__none
>>>>
>>>> Joao: thoughts?
>>>> -Sam
>>>>
>>>> On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU <pierre.blondeau at unicaen.fr> wrote:
>>>> The files.
>>>>
>>>> When I upgraded:
>>>>   ceph-deploy install --stable firefly servers...
>>>>   on each server: service ceph restart mon
>>>>   on each server: service ceph restart osd
>>>>   on each server: service ceph restart mds
>>>>
>>>> I upgraded from emperor to firefly. After repair, remap, replace, etc., I had some PGs stuck in the peering state.
>>>>
>>>> I thought: why not try version 0.82, it might solve my problem (my mistake). So I upgraded from firefly to 0.82 with:
>>>>   ceph-deploy install --testing servers...
>>>>
>>>> Now all daemons are at version 0.82. I have 3 mons, 36 OSDs and 3 MDS.
>>>>
>>>> Pierre
>>>>
>>>> PS: I also find "inc\uosdmap.13258__0___469271DE__none" in each meta directory.
>>>>
>>>> On 03/07/2014 00:10, Samuel Just wrote:
>>>> Also, what version did you upgrade from, and how did you upgrade?
>>>> -Sam
>>>>
>>>> On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just <sam.just at inktank.com> wrote:
>>>> Ok, in current/meta on osd 20 and osd 23, please attach all files matching
>>>>
>>>>   ^osdmap.13258.*
>>>>
>>>> There should be one such file on each osd. (It should look something like osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory; you'll want to use find.)
>>>>
>>>> What version of ceph is running on your mons? How many mons do you have?
>>>> -Sam
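A sketch of how those osdmap files could be located and checksummed. The /var/lib/ceph/osd/ceph-<id> paths are the default mount points and an assumption here; the thread does not say where the OSD data directories live.

  # locate the epoch 13258 osdmap object under each OSD's current/meta directory and checksum it
  find /var/lib/ceph/osd/ceph-20/current/meta -name 'osdmap.13258*' -exec md5sum {} \;
  find /var/lib/ceph/osd/ceph-23/current/meta -name 'osdmap.13258*' -exec md5sum {} \;

Differing checksums for the same epoch, as in the md5sums quoted above, mean the two OSDs hold divergent maps.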
>>>> -Sam >>>> >>>> On Wed, Jul 2, 2014 at 2:21 PM, Pierre >>>> BLONDEAU >>>> <pierre.blondeau at unicaen.fr >>>> <mailto:pierre.blondeau at unicaen.fr>> >>>> wrote: >>>> >>>> >>>> Hi, >>>> >>>> I do it, the log files are >>>> available >>>> here : >>>> >>>> https://blondeau.users.greyc.__fr/cephlog/debug20/ >>>> >>>> <https://blondeau.users.greyc.fr/cephlog/debug20/> >>>> >>>> The OSD's files are really big +/- >>>> 80M . >>>> >>>> After starting the osd.20 some >>>> other >>>> osd crash. I pass from 31 >>>> osd up to >>>> 16. >>>> I remark that after this the number >>>> of down+peering PG decrease >>>> from 367 >>>> to >>>> 248. It's "normal" ? May be it's >>>> temporary, the time that the >>>> cluster >>>> verifies all the PG ? >>>> >>>> Regards >>>> Pierre >>>> >>>> Le 02/07/2014 19:16, Samuel Just a >>>> ?crit : >>>> >>>> You should add >>>> >>>> debug osd = 20 >>>> debug filestore = 20 >>>> debug ms = 1 >>>> >>>> to the [osd] section of the >>>> ceph.conf and restart the >>>> osds. I'd >>>> like >>>> all three logs if possible. >>>> >>>> Thanks >>>> -Sam >>>> >>>> On Wed, Jul 2, 2014 at 5:03 AM, >>>> Pierre BLONDEAU >>>> <pierre.blondeau at unicaen.fr >>>> >>>> <mailto:pierre.blondeau at unicaen.fr>> >>>> wrote: >>>> >>>> >>>> >>>> Yes, but how i do that ? >>>> >>>> With a command like that ? >>>> >>>> ceph tell osd.20 injectargs >>>> '--debug-osd 20 >>>> --debug-filestore 20 >>>> --debug-ms >>>> 1' >>>> >>>> By modify the >>>> /etc/ceph/ceph.conf ? This >>>> file is really poor >>>> because I >>>> use >>>> udev detection. >>>> >>>> When I have made these >>>> changes, you want the three >>>> log files or >>>> only >>>> osd.20's ? >>>> >>>> Thank you so much for the >>>> help >>>> >>>> Regards >>>> Pierre >>>> >>>> Le 01/07/2014 23:51, Samuel >>>> Just a ?crit : >>>> >>>> Can you reproduce with >>>> debug osd = 20 >>>> debug filestore = 20 >>>> debug ms = 1 >>>> ? >>>> -Sam >>>> >>>> On Tue, Jul 1, 2014 at >>>> 1:21 AM, Pierre >>>> BLONDEAU >>>> >>>> <pierre.blondeau at unicaen.fr >>>> >>>> <mailto:pierre.blondeau at unicaen.fr>> >>>> wrote: >>>> >>>> >>>> >>>> >>>> Hi, >>>> >>>> I join : >>>> - osd.20 is >>>> one of osd that I >>>> detect which makes >>>> crash >>>> other >>>> OSD. >>>> - osd.23 is >>>> one of osd which >>>> crash when i start >>>> osd.20 >>>> - mds, is one >>>> of my MDS >>>> >>>> I cut log file >>>> because they are to >>>> big but. All is >>>> here : >>>> >>>> https://blondeau.users.greyc.__fr/cephlog/ >>>> >>>> <https://blondeau.users.greyc.fr/cephlog/> >>>> >>>> Regards >>>> >>>> Le 30/06/2014 >>>> 17:35, >>>> Gregory Farnum a >>>> ?crit : >>>> >>>> What's the >>>> backtrace from >>>> the crashing >>>> OSDs? >>>> >>>> Keep in mind >>>> that as a dev >>>> release, it's >>>> generally best >>>> not to >>>> upgrade >>>> to unnamed >>>> versions like >>>> 0.82 (but it's >>>> probably too >>>> late >>>> to go >>>> back >>>> now). >>>> >>>> >>>> >>>> >>>> I will remember it >>>> the next time ;) >>>> >>>> -Greg >>>> Software >>>> Engineer #42 @ >>>> >>>> http://inktank.com >>>> | >>>> http://ceph.com >>>> >>>> On Mon, Jun 30, >>>> 2014 at 8:06 >>>> AM, >>>> Pierre BLONDEAU >>>> >>>> <pierre.blondeau at unicaen.fr >>>> >>>> <mailto:pierre.blondeau at unicaen.fr>> >>>> wrote: >>>> >>>> >>>> >>>> Hi, >>>> >>>> After the >>>> upgrade to >>>> firefly, I >>>> have >>>> some PG >>>> in peering >>>> state. >>>> I seen the >>>> output of >>>> 0.82 so I >>>> try to >>>> upgrade for >>>> solved my >>>> problem. 
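For completeness, a sketch of the two ways those debug settings can be applied. The osd.* wildcard and the final restart line are illustrative additions, not commands taken from this thread:

  # at runtime, for all osds at once (or osd.20 only, as above)
  ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'

  # or persistently, in /etc/ceph/ceph.conf, followed by a restart of the osds
  [osd]
      debug osd = 20
      debug filestore = 20
      debug ms = 1

  service ceph restart osd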
>>>> On 01/07/2014 23:51, Samuel Just wrote:
>>>> Can you reproduce with
>>>>
>>>>   debug osd = 20
>>>>   debug filestore = 20
>>>>   debug ms = 1
>>>>
>>>> ?
>>>> -Sam
>>>>
>>>> On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU <pierre.blondeau at unicaen.fr> wrote:
>>>> Hi,
>>>>
>>>> I attach:
>>>>   - osd.20: one of the osds that I identified as making other OSDs crash
>>>>   - osd.23: one of the osds that crashes when I start osd.20
>>>>   - mds: one of my MDS
>>>>
>>>> I cut the log files because they are too big, but everything is here:
>>>> https://blondeau.users.greyc.fr/cephlog/
>>>>
>>>> Regards
>>>>
>>>> On 30/06/2014 17:35, Gregory Farnum wrote:
>>>> What's the backtrace from the crashing OSDs?
>>>>
>>>> Keep in mind that as a dev release, it's generally best not to upgrade to unnamed versions like 0.82 (but it's probably too late to go back now).
>>>>
>>>> (Pierre) I will remember that next time ;)
>>>>
>>>> -Greg
>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>
>>>> On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU <pierre.blondeau at unicaen.fr> wrote:
>>>> Hi,
>>>>
>>>> After the upgrade to firefly, I have some PGs in peering state. I saw the release of 0.82, so I tried upgrading to solve my problem.
>>>>
>>>> My three MDS crash, and some OSDs trigger a chain reaction that kills other OSDs. I think my MDS will not start because their metadata are on the OSDs.
>>>>
>>>> I have 36 OSDs on three servers, and I identified 5 OSDs which make the others crash. If I do not start those, the cluster goes into a recovering state with 31 OSDs, but I have 378 PGs in down+peering state.
>>>>
>>>> What can I do? Would you like more information (OS, crash logs, etc.)?
>>>>
>>>> Regards

--
----------------------------------------------
Pierre BLONDEAU
Administrateur Systèmes & réseaux
Université de Caen
Laboratoire GREYC, Département d'informatique

tel : 02 31 56 75 42
bureau : Campus 2, Science 3, 406
----------------------------------------------