On 5/27/16, 11:27 AM, "Gregory Farnum" <gfarnum@xxxxxxxxxx> wrote:

>On Fri, May 27, 2016 at 9:44 AM, Stillwell, Bryan J
><Bryan.Stillwell@xxxxxxxxxxx> wrote:
>> I have a Ceph cluster at home that I've been running CephFS on for the
>> last few years. Recently my MDS server became damaged, and while
>> attempting to fix it I believe I've destroyed my CephFS journal, based
>> on this:
>>
>> 2016-05-25 16:48:23.882095 7f8d2fac2700 -1 log_channel(cluster) log [ERR] : Error recovering journal 200: (2) No such file or directory
>>
>> As far as I can tell the data and metadata are still intact, so I'm
>> wondering if there's a way to rebuild the CephFS journal or, if that's not
>> possible, a way to start extracting the data?
>
>Check out http://docs.ceph.com/docs/master/cephfs/disaster-recovery/
>
>You'll want to make sure you've actually lost the whole journal (how
>did you manage that?!?!), reset it, and quite possibly run the data
>scan tools. Be careful!

So I actually got into this mess by following that page and not being as careful as I should have been. I started off by trying to back up the journal, but it failed for this reason:

# cephfs-journal-tool journal export backup.bin
2016-05-25 15:25:26.541767 7f2932ee5bc0 -1 Missing object 200.00000197
2016-05-25 15:25:26.543896 7f2932ee5bc0 -1 journal_export: Journal not readable, attempt object-by-object dump with `rados`
Error ((5) Input/output error)

I took a look at http://tracker.ceph.com/issues/9902, but scanning that page I didn't see a way to do an object-by-object dump.
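For reference, here is a minimal sketch of how the journal objects could be enumerated for such an object-by-object dump. It is not a documented Ceph procedure. It assumes the default 4 MiB journal object size and the `<inode in hex>.<object number as 8 hex digits>` naming visible in `200.00000197` and `200.00000000`, and it takes the old journal extent (1666720764~48749572) from the `journal reset` output quoted below. The pool name `metadata` in the printed `rados get` commands is a hypothetical placeholder for the actual CephFS metadata pool.

```python
# Sketch only: list the RADOS objects holding the rank 0 MDS journal so they
# can be fetched one at a time with `rados get`.
# Assumptions: 4 MiB journal objects and a metadata pool named "metadata".

OBJECT_SIZE = 4 * 1024 * 1024  # assumed journal object size (4 MiB)
JOURNAL_INO = 0x200            # rank 0 journal, as in "200.00000197"
HEADER_OBJECT = f"{JOURNAL_INO:x}.{0:08x}"  # journal header, "200.00000000"


def journal_objects(start, length, object_size=OBJECT_SIZE, ino=JOURNAL_INO):
    """Return object names covering journal bytes [start, start + length)."""
    first = start // object_size
    last = (start + length - 1) // object_size
    return [f"{ino:x}.{n:08x}" for n in range(first, last + 1)]


if __name__ == "__main__":
    # Old extent as printed by `journal reset`: 1666720764~48749572
    old_start, old_len = 1666720764, 48749572
    old_end = old_start + old_len
    print("old journal end:", old_end)
    print("end + one object:", old_end + OBJECT_SIZE)
    for name in [HEADER_OBJECT] + journal_objects(old_start, old_len):
        # Hypothetical pool name; replace with the real metadata pool.
        print(f"rados -p metadata get {name} {name}.bin")
```

With these numbers the old extent spans 200.0000018d through 200.00000198, so the missing 200.00000197 sat inside the live journal, and 1666720764 + 48749572 + 4194304 = 1719664640, which matches the new start that `journal reset` reported.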
Now if I attempt to export the journal I get:

# cephfs-journal-tool journal export backup.bin
Error ((5) Input/output error)
2016-05-27 14:19:49.807482 7f06fa378bc0 -1 Header 200.00000000 is unreadable
2016-05-27 14:19:49.807491 7f06fa378bc0 -1 journal_export: Journal not readable, attempt object-by-object dump with `rados`

I believe the 'Missing object 200.00000197' error had something to do with this problem that I was trying to deal with:

http://comments.gmane.org/gmane.comp.file-systems.ceph.user/29844

The missing object was probably caused by being a little too aggressive with running mark_unfound_lost.

Anyway, I continued on with the disaster recovery steps without making a backup first. The next step identified the missing object again:

# cephfs-journal-tool event recover_dentries summary
2016-05-25 15:36:35.455989 7fa37b8b1bc0 -1 Missing object 200.00000197
Events by type:
  OPEN: 12548
  SESSION: 24
  SUBTREEMAP: 29
  UPDATE: 12254
Errors: 0

I then tried truncating the journal:

# cephfs-journal-tool journal reset
old journal was 1666720764~48749572
new journal start will be 1719664640 (4194304 bytes past old end)
writing journal head
writing EResetJournal entry
done

Reset the session map:

# cephfs-table-tool all reset session
{
    "0": {
        "data": {},
        "result": 0
    }
}

And then, because I was still having problems starting the MDS, I ran:

# ceph fs reset cephfs --yes-i-really-mean-it

That's when I believe Header 200.00000000 went missing (I could be wrong; I don't have good notes around this part).

So would the next steps be to run the following commands?

cephfs-table-tool 0 reset session
cephfs-table-tool 0 reset snap
cephfs-table-tool 0 reset inode
cephfs-journal-tool --rank=0 journal reset
cephfs-data-scan init
cephfs-data-scan scan_extents data
cephfs-data-scan scan_inodes data

Thanks,
Bryan
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com