After your comment about the dual mds servers, I decided to just give up
trying to get the second one restarted.  After eyeballing what I had on one
of the new Ryzen boxes for drive space, I decided to just dump the
filesystem.  That will also make things go faster if and when I flip
everything over to bluestore.

So far so good... I just took a peek and saw the files being owned by
Mr. root, though.  Is there going to be an ownership reset at some point,
or will I have to resolve that by hand?

On 10/12/2017 06:09 AM, John Spray wrote:
> On Thu, Oct 12, 2017 at 12:23 AM, Bill Sharer <bsharer@xxxxxxxxxxxxxx> wrote:
>> I was wondering, in case I can't get the second mds back up: that offline
>> backward scrub check sounds like it should also be able to salvage what
>> it can of the two pools to a normal filesystem.  Is there an option for
>> that, or has someone written some form of salvage tool?
>
> Yep, cephfs-data-scan can do that.
>
> To scrape the files out of a CephFS data pool to a local filesystem, do this:
>
>   cephfs-data-scan scan_extents <data pool name>    # this discovers all the file sizes
>   cephfs-data-scan scan_inodes --output-dir /tmp/my_output <data pool name>
>
> The time taken by both these commands scales linearly with the number
> of objects in your data pool.
>
> This tool may not see the correct filename for recently created files
> (any file whose metadata is in the journal but not yet flushed); these
> files will go into a lost+found directory, named after their inode
> number.
>
> John
>
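(For the archives, this is more or less the sequence I'm running: John's
commands from above, filled in with the data pool name from my cluster and
his example output directory.  The user:group in the chown is just a
placeholder for whatever I settle on if nothing resets ownership for me.)

  cephfs-data-scan scan_extents cephfs_data
  cephfs-data-scan scan_inodes --output-dir /tmp/my_output cephfs_data
  # everything under the output dir currently shows up owned by root, so
  # the by-hand fallback is simply to re-own the tree afterwards, e.g.:
  chown -R someuser:somegroup /tmp/my_output
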
>> On 10/11/2017 07:07 AM, John Spray wrote:
>>> On Wed, Oct 11, 2017 at 1:42 AM, Bill Sharer <bsharer@xxxxxxxxxxxxxx> wrote:
>>>> I've been in the process of updating my Gentoo-based cluster, both with
>>>> new hardware and a somewhat postponed update.  This includes some major
>>>> stuff, including the switch from gcc 4.x to 5.4.0 on existing hardware
>>>> and using gcc 6.4.0 to make better use of AMD Ryzen on the new
>>>> hardware.  The existing cluster was on 10.2.2, but I was going to
>>>> 10.2.7-r1 as an interim step before moving on to 12.2.0 to begin
>>>> transitioning to bluestore on the OSDs.
>>>>
>>>> The Ryzen units are slated to be bluestore-based OSD servers if and when
>>>> I get to that point.  Up until the mds failure, they were simply cephfs
>>>> clients.  I had three OSD servers updated to 10.2.7-r1 (one is also a
>>>> MON) and had two servers left to update.  Both of these are also MONs
>>>> and were acting as a pair of dual active MDS servers running 10.2.2.
>>>> Monday morning I found out the hard way that the UPS one of them was on
>>>> has a dead battery.  After I fsck'd and came back up, I saw the
>>>> following assertion error when it was trying to start its mds.B server:
>>>>
>>>>  ==== mdsbeacon(64162/B up:replay seq 3 v4699) v7 ==== 126+0+0 (709014160 0 0) 0x7f6fb4001bc0 con 0x55f94779d8d0
>>>>      0> 2017-10-09 11:43:06.935662 7f6fa9ffb700 -1 mds/journal.cc: In function 'virtual void EImportStart::replay(MDSRank*)' thread 7f6fa9ffb700 time 2017-10-09 11:43:06.934972
>>>> mds/journal.cc: 2929: FAILED assert(mds->sessionmap.get_version() == cmapv)
>>>>
>>>>  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x55f93d64a122]
>>>>  2: (EImportStart::replay(MDSRank*)+0x9ce) [0x55f93d52a5ce]
>>>>  3: (MDLog::_replay_thread()+0x4f4) [0x55f93d4a8e34]
>>>>  4: (MDLog::ReplayThread::entry()+0xd) [0x55f93d25bd4d]
>>>>  5: (()+0x74a4) [0x7f6fd009b4a4]
>>>>  6: (clone()+0x6d) [0x7f6fce5a598d]
>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>
>>>> --- logging levels ---
>>>>    0/ 5 none
>>>>    0/ 1 lockdep
>>>>    0/ 1 context
>>>>    1/ 1 crush
>>>>    1/ 5 mds
>>>>    1/ 5 mds_balancer
>>>>    1/ 5 mds_locker
>>>>    1/ 5 mds_log
>>>>    1/ 5 mds_log_expire
>>>>    1/ 5 mds_migrator
>>>>    0/ 1 buffer
>>>>    0/ 1 timer
>>>>    0/ 1 filer
>>>>    0/ 1 striper
>>>>    0/ 1 objecter
>>>>    0/ 5 rados
>>>>    0/ 5 rbd
>>>>    0/ 5 rbd_mirror
>>>>    0/ 5 rbd_replay
>>>>    0/ 5 journaler
>>>>    0/ 5 objectcacher
>>>>    0/ 5 client
>>>>    0/ 5 osd
>>>>    0/ 5 optracker
>>>>    0/ 5 objclass
>>>>    1/ 3 filestore
>>>>    1/ 3 journal
>>>>    0/ 5 ms
>>>>    1/ 5 mon
>>>>    0/10 monc
>>>>    1/ 5 paxos
>>>>    0/ 5 tp
>>>>    1/ 5 auth
>>>>    1/ 5 crypto
>>>>    1/ 1 finisher
>>>>    1/ 5 heartbeatmap
>>>>    1/ 5 perfcounter
>>>>    1/ 5 rgw
>>>>    1/10 civetweb
>>>>    1/ 5 javaclient
>>>>    1/ 5 asok
>>>>    1/ 1 throttle
>>>>    0/ 0 refs
>>>>    1/ 5 xio
>>>>    1/ 5 compressor
>>>>    1/ 5 newstore
>>>>    1/ 5 bluestore
>>>>    1/ 5 bluefs
>>>>    1/ 3 bdev
>>>>    1/ 5 kstore
>>>>    4/ 5 rocksdb
>>>>    4/ 5 leveldb
>>>>    1/ 5 kinetic
>>>>    1/ 5 fuse
>>>>   -2/-2 (syslog threshold)
>>>>   -1/-1 (stderr threshold)
>>>>   max_recent     10000
>>>>   max_new         1000
>>>>   log_file /var/log/ceph/ceph-mds.B.log
>>>>
>>>> When I was googling around, I ran into this CERN presentation and tried
>>>> out the offline backward scrubbing commands on slide 25 first:
>>>>
>>>> https://indico.cern.ch/event/531810/contributions/2309925/attachments/1357386/2053998/GoncaloBorges-HEPIX16-v3.pdf
>>>>
>>>> Both ran without any messages, so I'm assuming I have sane contents in
>>>> the cephfs_data and cephfs_metadata pools.  Still no luck getting things
>>>> restarted, so I tried the cephfs-journal-tool journal reset on slide 23.
>>>> That didn't work either.  Just for giggles, I tried setting up the two
>>>> Ryzen boxes as new mds.C and mds.D servers which would run on 10.2.7-r1
>>>> instead of using mds.A and mds.B (10.2.2).  The D server fails with the
>>>> same assert as follows:
>>>
>>> Because this system was running multiple active MDSs on Jewel (based
>>> on seeing an EImportStart journal entry), and that was known to be
>>> unstable, I would advise you to blow away the filesystem and create a
>>> fresh one using luminous (where multi-mds is stable), rather than
>>> trying to debug it.  Going back to try to work out what went wrong
>>> with the Jewel code is probably not a very valuable activity unless
>>> you have irreplaceable data.
>>>
>>> If you do want to get this filesystem back on its feet in place
>>> (first stopping all MDSs): I'm guessing that your cephfs-journal-tool
>>> reset didn't help because you had multiple MDS ranks, and that tool
>>> just operates on rank 0 by default.  You need to work out which rank's
>>> journal is actually damaged (it's part of the prefix to MDS log
>>> messages), and then pass a --rank argument to cephfs-journal-tool.
>>> You will also need to reset all the other ranks' journals to keep
>>> things consistent, and then do a "ceph fs reset" so that it will start
>>> up with a single MDS next time.  If you get the filesystem up and
>>> running again, I'd still recommend copying anything important off it
>>> and creating a new one using luminous, rather than continuing to run
>>> with maybe-still-subtly-damaged metadata.
>>>
>>> John
>>>
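(Filling in the commands for anyone who finds this thread later; this is
just my reading of John's advice above, not something I've tested.  "cephfs"
is a placeholder for the filesystem name, the pool names are the ones from
my cluster, and with dual active MDSs the ranks here would be 0 and 1.
Exact option syntax may differ between Jewel and Luminous, so check
cephfs-journal-tool --help and the docs first.)

  # in-place route, with every mds daemon stopped first:
  cephfs-journal-tool --rank=0 journal reset
  cephfs-journal-tool --rank=1 journal reset
  ceph fs reset cephfs --yes-i-really-mean-it    # come back up with a single active mds

  # or the blow-away-and-recreate route on luminous, with the mds daemons stopped/failed:
  ceph fs rm cephfs --yes-i-really-mean-it
  ceph fs new cephfs cephfs_metadata cephfs_data # reusing non-empty pools may need --force, or use fresh pools
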
>>>>  === 132+0+1979520 (4198351460 0 1611007530) 0x7fffc4000a70 con 0x7fffe0013310
>>>>      0> 2017-10-09 13:01:31.571195 7fffd99f5700 -1 mds/journal.cc: In function 'virtual void EImportStart::replay(MDSRank*)' thread 7fffd99f5700 time 2017-10-09 13:01:31.570608
>>>> mds/journal.cc: 2949: FAILED assert(mds->sessionmap.get_version() == cmapv)
>>>>
>>>>  ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
>>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x555555b7ebc8]
>>>>  2: (EImportStart::replay(MDSRank*)+0x9ea) [0x555555a5674a]
>>>>  3: (MDLog::_replay_thread()+0xe51) [0x5555559cef21]
>>>>  4: (MDLog::ReplayThread::entry()+0xd) [0x5555557778cd]
>>>>  5: (()+0x7364) [0x7ffff7bc5364]
>>>>  6: (clone()+0x6d) [0x7ffff6051ccd]
>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com