jewel/CephFS - misc problems (duplicate strays, mismatch between head items and fnode.fragstat)

Hi,

Context (i.e. what we're doing): we're migrating (or trying to migrate) off of an nfs server onto cephfs, for a workload that's best described as "big piles" of hardlinks. Essentially, we have a set of "sources":
foo/01/<aa><rest-of-md5>
foo/0b/<0b><rest-of-md5>
.. and so on
bar/02/..
bar/0c/..
.. and so on

foo/bar/friends have been "cloned" numerous times to a set of names that, over the course of weeks, end up being recycled again; the clone is essentially cp -L foo copy-1-of-foo.

We're doing "incremental" rsyncs of this onto cephfs, so the sense of "the original source of the hardlink" will end up moving around, depending on the whims of rsync. (If it matters, I found some allusion to "if the original file hardlinked is deleted, ...".)
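
For reference, the sync itself is just rsync run repeatedly against the same destination, roughly along these lines (the paths are placeholders and the flags are from memory; -H, i.e. "preserve hardlinks", in particular is how I'd sketch it rather than a verbatim copy of our scripts):

  # incremental sync of one "pile" onto the cephfs mount;
  # -H recreates the hardlink relationships on the destination
  rsync -a -H --delete /srv/nfs/foo/ /mnt/cephfs/foo/
  rsync -a -H --delete /srv/nfs/copy-1-of-foo/ /mnt/cephfs/copy-1-of-foo/

With -H, whichever path rsync happens to transfer first becomes the "first" link on the destination, which I assume is what makes the notion of the original source move around between runs.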

For RBD the ceph cluster has mostly been rather well behaved; the problems we have had have for the most part been self-inflicted. Before introducing the hardlink spectacle to cephfs, the same filesystem was used for light-ish read-mostly loads, being mostly uneventful. (That being said, we did patch it for

Cluster is v10.2.2 (mds v10.2.2+4d15eb12298e007744486e28924a6f0ae071bd06), clients are ubuntu's 4.4.0-32 kernel(s), and elrepo v4.4.4.

The problems we're facing:
  • Maybe a "non-problem": I have ~6M strays sitting around.
  • Slightly more problematic: I have duplicate stray(s)? See the log excerpts below. Also, rados -p cephfs_metadata listomapkeys 60X.00000000 did/does seem to agree that there are duplicate strays (assuming the 60X.00000000 objects are the directory indexes for the stray catalogs), with the caveat that this isn't a perfect snapshot, since the listomapkeys calls were issued serially. (See the sketch just after this list for how I checked.)
  • We stumbled across http://tracker.ceph.com/issues/17177 (mostly here for more context).
  • There have been a couple of instances of invalid backtrace(s), mostly solved by either mds:scrub_path or just unlinking the files/directories in question and re-rsyncing.
  • Mismatch between head items and fnode.fragstat (see below for more of the log excerpt); this appeared to have been solved by mds:scrub_path.
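
For reference, roughly how I poked at the stray catalogs with rados (a sketch; the assumptions baked in are that rank 0's stray dirs are inodes 0x600 through 0x609, hence the 600.00000000 .. 609.00000000 objects in the metadata pool, and that the omap keys are the dentry names - i.e. hex inode numbers for strays - with a "_head" suffix):

  # count entries per stray dir
  for i in 0 1 2 3 4 5 6 7 8 9; do
      echo -n "stray${i}: "
      rados -p cephfs_metadata listomapkeys 60${i}.00000000 | wc -l
  done

  # flag inode numbers that show up in more than one stray dir
  for i in 0 1 2 3 4 5 6 7 8 9; do
      rados -p cephfs_metadata listomapkeys 60${i}.00000000
  done | sed 's/_head$//' | sort | uniq -d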

Duplicate stray(s), ceph-mds complains (a lot, during rsync):
2016-09-30 20:00:21.978314 7ffb653b8700  0 mds.0.cache.dir(603) _fetched  badness: got (but i already had) [inode 10003f25eaf [...2,head] ~mds0/stray0/10003f25eaf auth v38836572 s=8998 nl=5 n(v0 b8998 1=1+0) (iversion lock) 0x561082e6b520] mode 33188 mtime 2016-07-25 03:02:50.000000
2016-09-30 20:00:21.978336 7ffb653b8700 -1 log_channel(cluster) log [ERR] : loaded dup inode 10003f25eaf [2,head] v36792929 at ~mds0/stray3/10003f25eaf, but inode 10003f25eaf.head v38836572 already exists at ~mds0/stray0/10003f25eaf

I briefly ran ceph-mds with debug_mds=20/20, which didn't yield anything immediately useful beyond making the control flow of src/mds/CDir.cc slightly easier to follow; I didn't become much wiser.
2016-09-30 20:43:51.910754 7ffb653b8700 20 mds.0.cache.dir(606) _fetched pos 310473 marker 'I' dname '100022e8617 [2,head]
2016-09-30 20:43:51.910757 7ffb653b8700 20 mds.0.cache.dir(606) lookup (head, '100022e8617')
2016-09-30 20:43:51.910759 7ffb653b8700 20 mds.0.cache.dir(606)   miss -> (10002a81c10,head)
2016-09-30 20:43:51.910762 7ffb653b8700  0 mds.0.cache.dir(606) _fetched  badness: got (but i already had) [inode 100022e8617 [...2,head] ~mds0/stray9/100022e8617 auth v39303851 s=11470 nl=10 n(v0 b11470 1=1+0) (iversion lock) 0x560c013904b8] mode 33188 mtime 2016-07-25 03:38:01.000000
2016-09-30 20:43:51.910772 7ffb653b8700 -1 log_channel(cluster) log [ERR] : loaded dup inode 100022e8617 [2,head] v39284583 at ~mds0/stray6/100022e8617, but inode 100022e8617.head v39303851 already exists at ~mds0/stray9/100022e8617
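
To look at the two copies of such a dentry directly, something along these lines should work (same 60X.00000000 / "_head" assumptions as above; 600 and 603 being stray0 and stray3 from the first excerpt):

  rados -p cephfs_metadata getomapval 600.00000000 10003f25eaf_head /tmp/stray0.bin
  rados -p cephfs_metadata getomapval 603.00000000 10003f25eaf_head /tmp/stray3.bin
  cmp /tmp/stray0.bin /tmp/stray3.bin || echo "dentry values differ"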


2016-09-25 06:23:50.947761 7ffb653b8700  1 mds.0.cache.dir(10003439a33) mismatch between head items and fnode.fragstat! printing dentries
2016-09-25 06:23:50.947779 7ffb653b8700  1 mds.0.cache.dir(10003439a33) get_num_head_items() = 36; fnode.fragstat.nfiles=53 fnode.fragstat.nsubdirs=0
2016-09-25 06:23:50.947782 7ffb653b8700  1 mds.0.cache.dir(10003439a33) mismatch between child accounted_rstats and my rstats!
2016-09-25 06:23:50.947803 7ffb653b8700  1 mds.0.cache.dir(10003439a33) total of child dentrys: n(v0 b19365007 36=36+0)
2016-09-25 06:23:50.947806 7ffb653b8700  1 mds.0.cache.dir(10003439a33) my rstats:              n(v2 rc2016-08-28 04:48:37.685854 b49447206 53=53+0)
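
For completeness, the scrub_path runs mentioned above were along these lines, via the mds admin socket (the path is a placeholder, and the exact set of scrub flags - force / recursive / repair - accepted may depend on the build):

  # ask the active mds to scrub (and repair) the directory in question
  ceph daemon mds.<id> scrub_path /path/to/suspect/dir recursive repair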

The slightly sad thing is that I suspect all of this is probably from something that "happened at some time in the past", and running the mds with debugging will make my users very unhappy, as writing/formatting all that log is not exactly cheap (with debug_mds=20/20 the mds beacon quickly ended up being marked as laggy).
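
In case anyone wants to reproduce this without restarting the mds, a sketch of how the debug level can be toggled at runtime (mds.0 standing in for whatever the active mds is):

  # crank mds debugging up temporarily ...
  ceph tell mds.0 injectargs '--debug_mds 20/20'
  # ... and back down once enough log has been captured
  ceph tell mds.0 injectargs '--debug_mds 1/5'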

Bonus question: in terms of "understanding how cephfs works", is doc/dev/mds_internals it? :) Given that making "minimal reproducible test-cases" has so far turned out to be quite elusive from the "top down" approach, I'm finding myself looking inside the box to try to figure out how we got where we are.

(And many thanks for ceph-dencoder, it satisfies my pathological need to look inside of things).
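
The sort of thing I mean, for anyone curious: e.g. eyeballing a file's backtrace straight out of the data pool ("cephfs_data" is a guess at the data pool name here, <ino-in-hex>.00000000 is a file's first object, and 10003f25eaf is the inode from the excerpt above):

  rados -p cephfs_data getxattr 10003f25eaf.00000000 parent > parent.bin
  ceph-dencoder type inode_backtrace_t import parent.bin decode dump_json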

Cheers,
--
Kjetil Joergensen <kjetil@xxxxxxxxxxxx>
SRE, Medallia Inc
Phone: +1 (650) 739-6580
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
