Re: Problematic inode preventing ceph-mds from starting

Hello Patrick Donnelly, Ph.D. Thank you very much for your response.


After removing those objects, the MDS does start up correctly, but before long it goes back into a crash loop.
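(For the record, the removal was along these lines; a sketch only, reconstructed from the mds%d_openfiles.%x pattern you gave and our metadata pool name:)

    # Sketch: enumerate and delete the per-MDS openfiles objects in the
    # metadata pool (object names follow the mds%d_openfiles.%x pattern).
    rados -p cephfs_metadata ls | grep -E '^mds[0-9]+_openfiles\.' | while read obj; do
        rados -p cephfs_metadata rm "$obj"
    done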


Over the last week we made a few changes to the offline filesystem in an attempt to fix what we thought was an inode problem (a sketch of the parallel invocation follows the commands):


cephfs-data-scan scan_extents   # about 1 day with 64 processes

cephfs-data-scan scan_inodes   # about 1 day with 64 processes

cephfs-data-scan scan_links   # about 1 day
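
The "64 processes" above means each scan pass was sharded across parallel workers, roughly as follows (a sketch; the data pool name cephfs_data is an assumption here, adjust to your pool):

    # Sketch of the parallel invocation: --worker_n/--worker_m shard the
    # scan across 64 workers (data pool name cephfs_data is assumed).
    for n in $(seq 0 63); do
        cephfs-data-scan scan_extents --worker_n $n --worker_m 64 cephfs_data &
    done
    wait
    for n in $(seq 0 63); do
        cephfs-data-scan scan_inodes --worker_n $n --worker_m 64 cephfs_data &
    done
    wait
    cephfs-data-scan scan_links   # single-threaded, no pool argument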


After these three, we tried to start an MDS and it stayed up. We then ran:

ceph tell mds.a scrub start / recursive repair


The repair ran for about 3 days, spewing logs to `ceph -w` about duplicated inodes, until it stopped. All looked well until we began bringing production services back online, at which point many error messages appeared, the MDS was marked damaged again, and the fs went back to degraded. At that point I removed the objects you suggested, which brought everything back, but only briefly.

The latest crash is:

    -1> 2019-10-25 18:47:50.731 7fc1f3b56700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/mds/MDCache.cc: In function 'void MDCache::add_inode(CInode*)' thread 7fc1f3b56700 time 2019-1...

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/mds/MDCache.cc: 258: FAILED ceph_assert(!p)

Messages appearing in `ceph -w`:

2019-10-25 18:42:47.398631 mon.coffee [INF] daemon mds.coffee is now active in filesystem cephfs as rank 0
2019-10-25 18:47:02.496206 mds.coffee [ERR] loaded dup inode 0x1000098aa6b [2,head] v6685 at /srv/prod/gitlab/data/gitlab-rails/etc/.chef-smtp_settings20191025-30-ke5vsa.rb, but inode 0x1000098aa6b.head v2084 already exists at /home/apope/projects/fame-docker/data/rabbitmq/mnesia/rabbit@apope_fame-docker_rabbitmq.1/LATEST.LOG
2019-10-25 18:47:16.263440 mds.coffee [ERR] loaded dup inode 0x1000ff2c320 [2,head] v220594822 at /srv/prod/gitlab/data/postgresql/data/pg_stat_tmp/db_16400.stat, but inode 0x1000ff2c320.head v220594841 already exists at /srv/prod/gitlab/data/postgresql/data/pg_stat_tmp/db_16400.tmp
2019-10-25 18:47:16.263507 mds.coffee [ERR] loaded dup inode 0x1000ff2c31d [2,head] v220594820 at /srv/prod/gitlab/data/postgresql/data/pg_stat_tmp/db_12405.stat, but inode 0x1000ff2c31d.head v220594843 already exists at /srv/prod/gitlab/data/postgresql/data/pg_stat_tmp/db_12405.tmp
2019-10-25 18:47:16.263598 mds.coffee [ERR] loaded dup inode 0x1000ff2c2fe [2,head] v220594695 at /srv/prod/gitlab/data/postgresql/data/pg_stat_tmp/db_1.stat, but inode 0x1000ff2c2fe.head v220594842 already exists at /srv/prod/gitlab/data/postgresql/data/pg_stat_tmp/db_1.tmp
2019-10-25 18:47:16.263682 mds.coffee [ERR] loaded dup inode 0x1000ff2c323 [2,head] v220594839 at /srv/prod/gitlab/data/postgresql/data/pg_stat_tmp/db_0.stat, but inode 0x1000ff2c323.head v220594831 already exists at /srv/prod/gitlab/data/postgresql/data/pg_stat_tmp/db_0.tmp
2019-10-25 18:47:18.639076 mds.coffee [ERR] unmatched fragstat size on single dirfrag 0x10000de30c9, inode has f(v1337 m2019-10-25 18:47:18.637644 4=4+0), dirfrag has f(v1337 m2019-10-25 18:47:18.637644 5=5+0)
2019-10-25 18:47:18.639098 mds.coffee [ERR] unmatched rstat rbytes on single dirfrag 0x10000de30c9, inode has n(v57092 rc2019-10-25 18:47:18.637644 b267050 5=4+1), dirfrag has n(v57092 rc2019-10-25 18:47:18.637644 b269722 5=5+0)
2019-10-25 18:48:04.763712 mon.coffee [WRN] daemon mds.coffee is not responding, replacing it as rank 0 with standby daemon mds.frigga
2019-10-25 18:48:04.799983 mon.coffee [WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)
2019-10-25 18:48:05.393098 mds.frigga [ERR]  replayed ESubtreeMap at 30403632686996 subtree root 0x1 is not mine in cache (it's -2,-2)
2019-10-25 18:48:05.602690 mds.frigga [ERR]  replayed ESubtreeMap at 30403633475072 subtree root 0x1 is not mine in cache (it's -2,-2)
2019-10-25 18:48:05.611228 mds.frigga [ERR]  replayed ESubtreeMap at 30403634542522 subtree root 0x1 is not mine in cache (it's -2,-2)
2019-10-25 18:48:05.619702 mds.frigga [ERR]  replayed ESubtreeMap at 30403635686154 subtree root 0x1 is not mine in cache (it's -2,-2)
2019-10-25 18:48:05.628950 mds.frigga [ERR]  replayed ESubtreeMap at 30403636977978 subtree root 0x1 is not mine in cache (it's -2,-2)
2019-10-25 18:48:05.701862 mds.frigga [ERR]  replayed ESubtreeMap at 30403637601359 subtree root 0x1 is not mine in cache (it's -2,-2)
2019-10-25 18:48:05.707658 mds.frigga [ERR]  replayed ESubtreeMap at 30403638411960 subtree root 0x1 is not mine in cache (it's -2,-2)
2019-10-25 18:48:05.719397 mds.frigga [ERR]  replayed ESubtreeMap at 30403640128321 subtree root 0x1 is not mine in cache (it's -2,-2)
2019-10-25 18:48:05.727447 mds.frigga [ERR]  replayed ESubtreeMap at 30403641312085 subtree root 0x1 is not mine in cache (it's -2,-2)
2019-10-25 18:48:05.728678 mds.frigga [ERR]  replayed ESubtreeMap at 30403641478175 subtree root 0x1 is not mine in cache (it's -2,-2)
2019-10-25 18:48:05.734157 mds.frigga [ERR] failure replaying journal (EMetaBlob)
2019-10-25 18:48:06.676985 mon.coffee [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2019-10-25 18:48:06.677063 mon.coffee [ERR] Health check failed: 1 mds daemon damaged (MDS_DAMAGE)
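
(As an aside, a way to cross-check where RADOS thinks one of those dup inodes actually lives would be something like the following; a sketch only, assuming the data pool is named cephfs_data and using the first reported inode 0x1000098aa6b:)

    # Fetch the backtrace xattr stored on the inode's first object in the
    # data pool, then decode it with ceph-dencoder:
    rados -p cephfs_data getxattr 1000098aa6b.00000000 parent > parent.bin
    ceph-dencoder type inode_backtrace_t import parent.bin decode dump_json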

The cluster status now shows:

    mds: cephfs:0/1 5 up:standby, 1 damaged

(we only have 5 mds servers)

This still smells like an inode problem to me, but I have completely run out of ideas, so I will do nothing more to ceph while I await a reply from the list, anxiously hoping I am not fired over this 14-days-and-counting outage.

Thank you very much!

Neale


From: Patrick Donnelly <pdonnell@xxxxxxxxxx>
Sent: Thursday, October 24, 2019 17:57
To: Pickett, Neale T
Cc: ceph-users
Subject: Re: Problematic inode preventing ceph-mds from starting
 
Hi Neale,

On Fri, Oct 18, 2019 at 9:31 AM Pickett, Neale T <neale@xxxxxxxx> wrote:
>
> Last week I asked about a rogue inode that was causing ceph-mds to segfault during replay. We didn't get any suggestions from this list, so we have been familiarizing ourselves with the ceph source code, and have added the following patch:
>
>
>
> --- a/src/mds/CInode.cc
> +++ b/src/mds/CInode.cc
> @@ -736,6 +736,13 @@ CDir *CInode::get_approx_dirfrag(frag_t fg)
>
>  CDir *CInode::get_or_open_dirfrag(MDCache *mdcache, frag_t fg)
>  {
> +  if (!is_dir()) {
> +    ostringstream oss;
> +    JSONFormatter f(true);

f.open_object_section("inode");

(otherwise the output is hard to read)

> +    dump(&f, DUMP_PATH | DUMP_INODE_STORE_BASE | DUMP_MDS_CACHE_OBJECT | DUMP_LOCKS | DUMP_STATE | DUMP_CAPS | DUMP_DIRFRAGS);

f.close_object_section("inode")

> +    f.flush(oss);
> +    dout(0) << oss.str() << dendl;
> +  }
>    ceph_assert(is_dir());
>
>    // have it?

This seems like a generally useful patch. Feel free to submit a PR.

>
> This has given us a culprit:
>
>
>
>     -2> 2019-10-18 16:19:06.934 7faefa470700  0 mds.0.cache.ino(0x10000995e63) "/unimportant/path/we/can/tolerate/losing/compat.py"10995216789470"2018-03-24 03:18:17.621969""2018-03-24 03:18:17.620969"3318855521001{
>     "dir_hash": 0
> }
> {
>     "stripe_unit": 4194304,
>     "stripe_count": 1,
>     "object_size": 4194304,
>     "pool_id": 1,
>     "pool_ns": ""
> }
> []
> 3411844674407370955161500"2015-01-27 16:01:52.467669""2018-03-24 03:18:17.621969"21-1[]
> {
>     "version": 0,
>     "mtime": "0.000000",
>     "num_files": 0,
>     "num_subdirs": 0
> }
> {
>     "version": 0,
>     "rbytes": 34,
>     "rfiles": 1,
>     "rsubdirs": 0,
>     "rsnaps": 0,
>     "rctime": "0.000000"
> }
> {
>     "version": 0,
>     "rbytes": 34,
>     "rfiles": 1,
>     "rsubdirs": 0,
>     "rsnaps": 0,
>     "rctime": "0.000000"
> }
> 2540123""""[]
> {
>     "splits": []
> }
> true{
>     "replicas": {}
> }
> {
>     "authority": [
>         0,
>         -2
>     ],
>     "replica_nonce": 0
> }
> 0falsefalse{}
> 0{
>     "gather_set": [],
>     "state": "lock",
>     "is_leased": false,
>     "num_rdlocks": 0,
>     "num_wrlocks": 0,
>     "num_xlocks": 0,
>     "xlock_by": {}
> }
> {}
> {}
> {}
> {}
> {}
> {}
> {}
> {}
> {}
> [
>     "auth"
> ]
> []
> -1-1[]
> []
>
>     -1> 2019-10-18 16:19:06.964 7faefa470700 -1 /opt/app-root/src/ceph/src/mds/CInode.cc: In function 'CDir* CInode::get_or_open_dirfrag(MDCache*, frag_t)' thread 7faefa470700 time 2019-10-18 16:19:06.934662
> /opt/app-root/src/ceph/src/mds/CInode.cc: 746: FAILED ceph_assert(is_dir())
>
>  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1aa) [0x7faf0a9ce39e]
>  2: (()+0x12a8620) [0x7faf0a9ce620]
>  3: (CInode::get_or_open_dirfrag(MDCache*, frag_t)+0x253) [0x557562a4b1ad]
>  4: (OpenFileTable::_prefetch_dirfrags()+0x4db) [0x557562b63d63]
>  5: (OpenFileTable::_open_ino_finish(inodeno_t, int)+0x16a) [0x557562b63720]
>  6: (C_OFT_OpenInoFinish::finish(int)+0x2d) [0x557562b67699]
>  7: (Context::complete(int)+0x27) [0x557562657fbf]
>  8: (MDSContext::complete(int)+0x152) [0x557562b04aa4]
>  9: (void finish_contexts<std::vector<MDSContext*, std::allocator<MDSContext*> > >(CephContext*, std::vector<MDSContext*, std::allocator<MDSContext*> >&, int)+0x2c8) [0x557562660e36]
>  10: (MDCache::open_ino_finish(inodeno_t, MDCache::open_ino_info_t&, int)+0x185) [0x557562844c4d]
>  11: (MDCache::_open_ino_backtrace_fetched(inodeno_t, ceph::buffer::v14_2_0::list&, int)+0xbbf) [0x557562842785]
>  12: (C_IO_MDC_OpenInoBacktraceFetched::finish(int)+0x37) [0x557562886a31]
>  13: (Context::complete(int)+0x27) [0x557562657fbf]
>  14: (MDSContext::complete(int)+0x152) [0x557562b04aa4]
>  15: (MDSIOContextBase::complete(int)+0x345) [0x557562b0522d]
>  16: (Finisher::finisher_thread_entry()+0x38b) [0x7faf0a9033e1]
>  17: (Finisher::FinisherThread::entry()+0x1c) [0x5575626a2772]
>  18: (Thread::entry_wrapper()+0x78) [0x7faf0a97203c]
>  19: (Thread::_entry_func(void*)+0x18) [0x7faf0a971fba]
>  20: (()+0x7dd5) [0x7faf07844dd5]
>  21: (clone()+0x6d) [0x7faf064f502d]
>
> I tried removing it, but it does not show up in the omapkeys for that inode:
>
> lima:/home/neale$ ceph -- rados -p cephfs_metadata listomapkeys 10000995e63.00000000
> __about__.py_head
> __init__.py_head
> __pycache___head
> _compat.py_head
> _structures.py_head
> markers.py_head
> requirements.py_head
> specifiers.py_head
> utils.py_head
> version.py_head
> lima:/home/neale$ ceph -- rados -p cephfs_metadata rmomapkey 10000995e63.00000000 _compat.py_head
> lima:/home/neale$ ceph -- rados -p cephfs_metadata rmomapkey 10000995e63.00000000 compat.py_head
> lima:/home/neale$ ceph -- rados -p cephfs_metadata rmomapkey 10000995e63.00000000 file-does-not-exist_head
> lima:/home/neale$ ceph -- rados -p cephfs_metadata listomapkeys 10000995e63.00000000
> __about__.py_head
> __init__.py_head
> __pycache___head
> _structures.py_head
> markers.py_head
> requirements.py_head
> specifiers.py_head
> utils.py_head
> version.py_head
>
> Predictably, this did nothing to solve our problem, and ceph-mds is still dying during startup.
>
> Any suggestions?

Looks like the openfiletable is corrupt. It is not needed in order to
start the FS. You can simply delete all of the metadata pool objects
matching this format: "mds%d_openfiles.%x". No data loss will occur.

--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
