Please try again with debug_mds=10 and send the log to me.
Regards
Yan, Zheng
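[For anyone following along: a minimal sketch of raising the MDS debug level as Zheng asks. The daemon name mds04 is taken from the log path later in the thread; adjust for your deployment. These commands assume a live cluster with the client.admin keyring, so they are shown as a fragment rather than something runnable standalone.]

```shell
# Raise MDS debug logging to 10 on the running daemon
# (injectargs takes effect immediately, no restart needed)
ceph tell mds.mds04 injectargs '--debug_mds 10'

# Or persist it in ceph.conf on the MDS host so it survives a restart:
#   [mds]
#   debug mds = 10

# After reproducing the crash, the log to send is typically
# /var/log/ceph/ceph-mds.mds04.log
```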
On Mon, Oct 29, 2018 at 6:30 PM Jon Morby (Fido) <jon@xxxxxxxx> wrote:
fyi, downgrading to 13.2.1 doesn't seem to have fixed the issue either :(

--- end dump of recent events ---
2018-10-29 10:27:50.440 7feb58b43700 -1 *** Caught signal (Aborted) **
 in thread 7feb58b43700 thread_name:md_log_replay

 ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable)
 1: (()+0x3ebf40) [0x55deff8e0f40]
 2: (()+0x11390) [0x7feb68246390]
 3: (gsignal()+0x38) [0x7feb67993428]
 4: (abort()+0x16a) [0x7feb6799502a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x250) [0x7feb689a5630]
 6: (()+0x2e26a7) [0x7feb689a56a7]
 7: (EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)+0x5f4b) [0x55deff8ccc8b]
 8: (EUpdate::replay(MDSRank*)+0x39) [0x55deff8ce1c9]
 9: (MDLog::_replay_thread()+0x864) [0x55deff876974]
 10: (MDLog::ReplayThread::entry()+0xd) [0x55deff61a95d]
 11: (()+0x76ba) [0x7feb6823c6ba]
 12: (clone()+0x6d) [0x7feb67a6541d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
     0> 2018-10-29 10:27:50.440 7feb58b43700 -1 *** Caught signal (Aborted) **
 in thread 7feb58b43700 thread_name:md_log_replay
 [backtrace identical to the one above]

--- logging levels ---
   0/ 5 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   3/ 3 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 0 filer
   0/ 1 striper
   0/ 0 objecter
   0/ 0 rados
   0/ 0 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 0 journaler
   0/ 5 objectcacher
   0/ 0 client
   0/ 0 osd
   0/ 0 optracker
   0/ 0 objclass
   0/ 0 filestore
   0/ 0 journal
   0/ 0 ms
   0/ 0 mon
   0/ 0 monc
   0/ 0 paxos
   0/ 0 tp
   0/ 0 auth
   1/ 5 crypto
   0/ 0 finisher
   1/ 1 reserver
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   0/ 0 rgw
   1/ 5 rgw_sync
   1/10 civetweb
   1/ 5 javaclient
   0/ 0 asok
   0/ 0 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  99/99 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-mds.mds04.log
--- end dump of recent events ---

----- On 29 Oct, 2018, at 09:25, Jon Morby <jon@xxxxxxxx> wrote:

Hi

Ideally we'd like to undo the whole accidental upgrade to 13.x, and ensure that ceph-deploy doesn't do another major release upgrade without a lot of warning.

Either way, I'm currently getting errors that 13.2.1 isn't available / shaman is offline / etc.
What's the best / recommended way of doing this downgrade across our estate?
----- On 29 Oct, 2018, at 08:19, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
We backported a wrong patch to 13.2.2. Downgrade ceph to 13.2.1, then run 'ceph mds repaired fido_fs:1'.

Sorry for the trouble.
Yan, Zheng

On Mon, Oct 29, 2018 at 7:48 AM Jon Morby <jon@xxxxxxxx> wrote:

We accidentally found ourselves upgraded from 12.2.8 to 13.2.2 after a ceph-deploy install went awry (we were expecting it to upgrade to 12.2.9, not to jump a major release without warning).

Anyway, as a result we ended up with an mds journal error and 1 daemon reporting as damaged. Having got nowhere trying to ask for help on irc, we followed various forum posts and disaster recovery guides, and ended up resetting the journal, which left the daemon no longer "damaged"; however, we're now seeing the mds segfault whilst trying to replay:

/build/ceph-13.2.2/src/mds/journal.cc: 1572: FAILED assert(g_conf->mds_wipe_sessions)

 ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7fad637f70f2]
 2: (()+0x3162b7) [0x7fad637f72b7]
 3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)+0x5f4b) [0x7a7a6b]
 4: (EUpdate::replay(MDSRank*)+0x39) [0x7a8fa9]
 5: (MDLog::_replay_thread()+0x864) [0x752164]
 6: (MDLog::ReplayThread::entry()+0xd) [0x4f021d]
 7: (()+0x76ba) [0x7fad6305a6ba]
 8: (clone()+0x6d) [0x7fad6288341d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

full logs

We've been unable to access the cephfs file system since all of this started ….
Attempts to mount fail with reports that "mds probably not available":

Oct 28 23:47:02 mirrors kernel: [115602.911193] ceph: probably no mds server is up

root@mds02:~# ceph -s
  cluster:
    id:     78d5bf7d-b074-47ab-8d73-bd4d99df98a5
    health: HEALTH_WARN
            1 filesystem is degraded
            insufficient standby MDS daemons available
            too many PGs per OSD (276 > max 250)

  services:
    mon: 3 daemons, quorum mon01,mon02,mon03
    mgr: mon01(active), standbys: mon02, mon03
    mds: fido_fs-2/2/1 up {0=mds01=up:resolve,1=mds02=up:replay(laggy or crashed)}
    osd: 27 osds: 27 up, 27 in

  data:
    pools:   15 pools, 3168 pgs
    objects: 16.97 M objects, 30 TiB
    usage:   71 TiB used, 27 TiB / 98 TiB avail
    pgs:     3168 active+clean

  io:
    client: 680 B/s rd, 1.1 MiB/s wr, 0 op/s rd, 345 op/s wr

Before I just trash the entire fs and give up on ceph, does anyone have any suggestions as to how we can fix this?

root@mds02:~# ceph versions
{
    "mon": {
        "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)": 3
    },
    "mgr": {
        "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)": 3
    },
    "osd": {
        "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 27
    },
    "mds": {
        "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)": 2
    },
    "overall": {
        "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 27,
        "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)": 8
    }
}
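[The downgrade-and-repair sequence Zheng suggests might look roughly like this on an apt-based host. This is a sketch only: the 13.2.1 package version string "13.2.1-1xenial" is an assumption to be checked against your repo, while fido_fs:1 is the damaged rank from this thread. The commands act on a live cluster, so they are shown as a fragment rather than something runnable standalone.]

```shell
# Downgrade the MDS packages to 13.2.1 (--allow-downgrades is needed
# because 13.2.2 is already installed; verify the exact version string
# with `apt-cache policy ceph-mds` first)
apt-get update
apt-get install --allow-downgrades ceph-mds=13.2.1-1xenial

# Restart the daemon, then clear the damaged flag on rank 1 of fido_fs
systemctl restart ceph-mds.target
ceph mds repaired fido_fs:1

# Watch rank 1 move through replay/resolve towards active
ceph -s
```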
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Jon Morby
FidoNet - the internet made simple!
10 - 16 Tiller Road, London, E14 8PX
tel: 0345 004 3050 / fax: 0345 004 3051

Need more rack space?
Check out our Co-Lo offerings at http://www.fido.net/services/colo/
32 amp racks in London and Brighton
Linx ConneXions available at all Fido sites! https://www.fido.net/services/backbone/connexions/

PGP Key: 26DC B618 DE9E F9CB F8B7 1EFA 2A64 BA69 B3B5 AD3A - http://jonmorby.com/B3B5AD3A.asc