I have a semi-corrupted CephFS filesystem (most directories are OK, but a few are broken). Trying to read or delete anything from the broken directories causes the MDS daemons to crash. I have followed all of the disaster recovery steps, but I still cannot keep the MDS daemons up, and there are still corrupt directories in the FS.
I can usually get the MDS to come back if I run "cephfs-data-scan scan_links" a couple of times, but it's not consistent. Any suggestions on how to resolve this issue?
The MDS crashes with the following traces in the log:
-401> 2019-09-11 13:43:24.768 7fa71112b700 0 log_channel(cluster) do_log log to syslog
-401> 2019-09-11 13:43:24.768 7fa71112b700 -1 log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x100146dfe3b
-401> 2019-09-11 13:43:24.768 7fa71112b700 0 log_channel(cluster) do_log log to syslog
-401> 2019-09-11 13:43:24.768 7fa719aeb700 1 -- 10.10.30.115:6800/1442163404 <== osd.139 10.10.30.51:6800/142614 1 ==== osd_op_reply(84 100148d4cf3.00000000 [omap-get-header,omap-get-vals,getxattr (94)] v0'0 uv35566 _ondisk_ = 0) v8 ==== 248+0+5722 (1809603995 0 3125985462) 0x8b5c340 con 0x5cc7800
-401> 2019-09-11 13:43:24.772 7fa71aaed700 1 -- 10.10.30.115:6800/1442163404 <== osd.76 10.10.30.55:6833/15548 1 ==== osd_op_reply(80 100146dfe3d.00000000 [omap-get-header,omap-get-vals,getxattr] v0'0 uv37154 _ondisk_ = 0) v8 ==== 248+0+3667 (486108846 0 420775557) 0x2d30700 con 0x5bb8300
-401> 2019-09-11 13:43:24.772 7fa71112b700 -1 log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x100146dfe3d
....
-401> 2019-09-11 13:43:25.844 7fa71292e700 -1 /build/ceph-13.2.6/src/mds/Server.cc: In function 'void Server::_unlink_local(MDRequestRef&, CDentry*, CDentry*)' thread 7fa71292e700 time 2019-09-11 13:43:25.843472
/build/ceph-13.2.6/src/mds/Server.cc: 6599: FAILED assert(in->first <= straydn->first)
ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7fa71f7a597e]
2: (()+0x2fab07) [0x7fa71f7a5b07]
3: (Server::_unlink_local(boost::intrusive_ptr<MDRequestImpl>&, CDentry*, CDentry*)+0x15e8) [0x548fa8]
4: (Server::handle_client_unlink(boost::intrusive_ptr<MDRequestImpl>&)+0x961) [0x549991]
5: (Server::handle_client_request(MClientRequest*)+0x49b) [0x563beb]
6: (Server::dispatch(Message*)+0x2fb) [0x5678cb]
7: (MDSRank::handle_deferrable_message(Message*)+0x434) [0x4da3c4]
8: (MDSRank::_dispatch(Message*, bool)+0x89b) [0x4f17db]
9: (MDSRank::retry_dispatch(Message*)+0x12) [0x4f1ec2]
10: (MDSInternalContextBase::complete(int)+0x67) [0x74faf7]
11: (MDSRank::_advance_queues()+0xf1) [0x4f0781]
12: (MDSRank::ProgressThread::entry()+0x43) [0x4f0e03]
13: (()+0x76ba) [0x7fa71f0216ba]
14: (clone()+0x6d) [0x7fa71e84a41d]
-401> 2019-09-11 13:43:25.844 7fa719aeb700 1 -- 10.10.30.115:6800/1442163404 <== osd.49 10.10.30.56:6838/15753 3 ==== osd_op_reply(90 600.00000000 [omap-get-header,omap-get-vals,getxattr (62)] v0'0 uv98420 _ondisk_ = 0) v8 ==== 240+0+437012 (2786733188 0 4243776564) 0x8b5e080 con 0x3addc00
-401> 2019-09-11 13:43:25.848 7fa71292e700 -1 *** Caught signal (Aborted) **
in thread 7fa71292e700 thread_name:mds_rank_progr
ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
1: (()+0x11390) [0x7fa71f02b390]
2: (gsignal()+0x38) [0x7fa71e778428]
3: (abort()+0x16a) [0x7fa71e77a02a]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x7fa71f7a5a86]
5: (()+0x2fab07) [0x7fa71f7a5b07]
6: (Server::_unlink_local(boost::intrusive_ptr<MDRequestImpl>&, CDentry*, CDentry*)+0x15e8) [0x548fa8]
7: (Server::handle_client_unlink(boost::intrusive_ptr<MDRequestImpl>&)+0x961) [0x549991]
8: (Server::handle_client_request(MClientRequest*)+0x49b) [0x563beb]
9: (Server::dispatch(Message*)+0x2fb) [0x5678cb]
10: (MDSRank::handle_deferrable_message(Message*)+0x434) [0x4da3c4]
11: (MDSRank::_dispatch(Message*, bool)+0x89b) [0x4f17db]
12: (MDSRank::retry_dispatch(Message*)+0x12) [0x4f1ec2]
13: (MDSInternalContextBase::complete(int)+0x67) [0x74faf7]
14: (MDSRank::_advance_queues()+0xf1) [0x4f0781]
15: (MDSRank::ProgressThread::entry()+0x43) [0x4f0e03]
16: (()+0x76ba) [0x7fa71f0216ba]
17: (clone()+0x6d) [0x7fa71e84a41d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
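In case it helps with triage, the directory inodes flagged in the log can be collected with a short script. This is only a minimal sketch: it assumes the error lines have the "bad backtrace on directory inode 0x..." form shown above, and the function name and sample lines are illustrative, not part of any Ceph tool.

```python
import re

# Matches the MDS error lines shown in the trace above.
BAD_BACKTRACE = re.compile(r"bad backtrace on directory inode (0x[0-9a-fA-F]+)")

def bad_backtrace_inodes(log_lines):
    """Return the flagged inode numbers as hex strings, unique, in first-seen order."""
    seen = []
    for line in log_lines:
        match = BAD_BACKTRACE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen

# Illustrative sample built from the two inodes reported in the log above.
sample = [
    "log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x100146dfe3b",
    "log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x100146dfe3d",
    "log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x100146dfe3b",
]
print(bad_backtrace_inodes(sample))
```

For a real log, the same function can be fed an open file object, since it only iterates over lines.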