Re: jewel/CephFS - misc problems (duplicate strays, mismatch between head items and fnode.fragstat)

On Fri, Oct 7, 2016 at 8:20 AM, Kjetil Jørgensen <kjetil@xxxxxxxxxxxx> wrote:
> And I just saw another recent thread, http://tracker.ceph.com/issues/17177 -
> could that be an explanation for most/all of the above?
>
> Next question(s) would then be:
>
> How would one deal with duplicate stray(s)?

Here is an untested method:

List the omap keys in objects 600.00000000 ~ 609.00000000 and find all
duplicated keys.
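
Roughly, for that first step, something like this (an untested sketch; the pool
name cephfs_metadata and the stray dirfrag object names 600.00000000 ~
609.00000000 are taken from this thread):

# dump "object key" pairs for each stray dirfrag object, then keep the
# dentry keys that show up under more than one stray directory
POOL=cephfs_metadata
TMP=$(mktemp -d)

for i in 0 1 2 3 4 5 6 7 8 9; do
    obj="60${i}.00000000"
    rados -p "$POOL" listomapkeys "$obj" | awk -v o="$obj" '{print o, $0}'
done > "$TMP/all_keys"

awk '{print $2}' "$TMP/all_keys" | sort | uniq -d > "$TMP/dup_keys"

grep -F -f "$TMP/dup_keys" "$TMP/all_keys" will then show which stray
directories hold each duplicated key.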

For each duplicated key, use ceph-dencoder to decode its values, find the one
with the biggest version, and delete the rest
(ceph-dencoder type inode_t skip 9 import /tmp/ decode dump_json).
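
And for the second step, continuing with the files from the sketch above (also
untested; it assumes getomapval writes the raw omap value when given an output
file, and that the inode_t JSON dump contains a "version" field):

# for every duplicated key, decode each copy, keep the one with the highest
# inode version, and print rmomapkey commands for the rest
while read -r key; do
    best_obj=""; best_ver=-1
    for obj in $(awk -v k="$key" '$2 == k {print $1}' "$TMP/all_keys"); do
        rados -p "$POOL" getomapval "$obj" "$key" "$TMP/val"
        ver=$(ceph-dencoder type inode_t skip 9 import "$TMP/val" decode dump_json |
              python -c 'import json,sys; print(json.load(sys.stdin)["version"])')
        [ -z "$ver" ] && continue
        if [ "$ver" -gt "$best_ver" ]; then best_ver=$ver; best_obj=$obj; fi
    done
    for obj in $(awk -v k="$key" '$2 == k {print $1}' "$TMP/all_keys"); do
        [ "$obj" = "$best_obj" ] && continue
        # review these before running any of them
        echo rados -p "$POOL" rmomapkey "$obj" "$key"
    done
done < "$TMP/dup_keys"

Please double check the echoed rmomapkey commands against the MDS log (the
"loaded dup inode ... already exists at ..." lines) before running them.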

Regards
Yan, Zheng

> How would one deal with a mismatch between head items and fnode.fragstat?
> ceph daemon mds.foo scrub_path?
>
> -KJ
>
> On Thu, Oct 6, 2016 at 5:05 PM, Kjetil Jørgensen <kjetil@xxxxxxxxxxxx>
> wrote:
>>
>> Hi,
>>
>> Context (i.e. what we're doing): we're migrating (or trying to migrate)
>> off of an nfs server onto cephfs, for a workload that's best described as
>> "big piles" of hardlinks. Essentially, we have a set of "sources":
>> foo/01/<aa><rest-of-md5>
>> foo/0b/<0b><rest-of-md5>
>> .. and so on
>> bar/02/..
>> bar/0c/..
>> .. and so on
>>
>> foo/bar/friends have been "cloned" numerous times to a set of names that
>> over the course of weeks end up being recycled again; the clone is
>> essentially cp -L foo copy-1-of-foo.
>>
>> We're doing "incremental" rsyncs of this onto cephfs, so the sense of "the
>> original source of the hardlink" will end up moving around, depending on the
>> whims of rsync. (If it matters, I found some allusion to "if the original
>> file hardlinked is deleted, ...".)
>>
>> For RBD the ceph cluster has mostly been rather well behaved; the
>> problems we have had have for the most part been self-inflicted. Before
>> introducing the hardlink spectacle to cephfs, the same filesystem was used
>> for light-ish, read-mostly loads, being mostly uneventful. (That being said,
>> we did patch it for
>>
>> Cluster is v10.2.2 (mds v10.2.2+4d15eb12298e007744486e28924a6f0ae071bd06),
>> clients are ubuntu's 4.4.0-32 kernel(s), and elrepo v4.4.4.
>>
>> The problems we're facing:
>>
>> - Maybe a "non-problem": I have ~6M strays sitting around.
>> - Slightly more problematic: I have duplicate stray(s)? See the log excerpts
>>   below. Also, rados -p cephfs_metadata listomapkeys 60X.00000000 did/does
>>   seem to agree that there are duplicate strays (assuming 60X.00000000 is
>>   the directory index for the stray catalogs), caveat "not a perfect
>>   snapshot", listomapkeys issued in serial fashion.
>> - We stumbled across http://tracker.ceph.com/issues/17177 (mostly here for
>>   more context).
>> - There have been a couple of instances of invalid backtrace(s), mostly
>>   solved by either mds:scrub_path or just unlinking the files/directories in
>>   question and re-rsync-ing.
>> - A mismatch between head items and fnode.fragstat (see below for more of
>>   the log excerpt), which appeared to have been solved by mds:scrub_path.
>>
>>
>> Duplicate stray(s), ceph-mds complains (a lot, during rsync):
>> 2016-09-30 20:00:21.978314 7ffb653b8700  0 mds.0.cache.dir(603) _fetched
>> badness: got (but i already had) [inode 10003f25eaf [...2,head]
>> ~mds0/stray0/10003f25eaf auth v38836572 s=8998 nl=5 n(v0 b8998 1=1+0)
>> (iversion lock) 0x561082e6b520] mode 33188 mtime 2016-07-25 03:02:50.000000
>> 2016-09-30 20:00:21.978336 7ffb653b8700 -1 log_channel(cluster) log [ERR]
>> : loaded dup inode 10003f25eaf [2,head] v36792929 at
>> ~mds0/stray3/10003f25eaf, but inode 10003f25eaf.head v38836572 already
>> exists at ~mds0/stray0/10003f25eaf
>>
>> I briefly ran ceph-mds with debug_mds=20/20, which didn't yield anything
>> immediately useful beyond making the control flow of src/mds/CDir.cc
>> slightly easier to follow, without my becoming much wiser.
>> 2016-09-30 20:43:51.910754 7ffb653b8700 20 mds.0.cache.dir(606) _fetched
>> pos 310473 marker 'I' dname '100022e8617 [2,head]
>> 2016-09-30 20:43:51.910757 7ffb653b8700 20 mds.0.cache.dir(606) lookup
>> (head, '100022e8617')
>> 2016-09-30 20:43:51.910759 7ffb653b8700 20 mds.0.cache.dir(606)   miss ->
>> (10002a81c10,head)
>> 2016-09-30 20:43:51.910762 7ffb653b8700  0 mds.0.cache.dir(606) _fetched
>> badness: got (but i already had) [inode 100022e8617 [...2,head]
>> ~mds0/stray9/100022e8617 auth v39303851 s=11470 nl=10 n(v0 b11470 1=1+0)
>> (iversion lock) 0x560c013904b8] mode 33188 mtime 2016-07-25 03:38:01.000000
>> 2016-09-30 20:43:51.910772 7ffb653b8700 -1 log_channel(cluster) log [ERR]
>> : loaded dup inode 100022e8617 [2,head] v39284583 at
>> ~mds0/stray6/100022e8617, but inode 100022e8617.head v39303851 already
>> exists at ~mds0/stray9/100022e8617
>>
>>
>> 2016-09-25 06:23:50.947761 7ffb653b8700  1 mds.0.cache.dir(10003439a33)
>> mismatch between head items and fnode.fragstat! printing dentries
>> 2016-09-25 06:23:50.947779 7ffb653b8700  1 mds.0.cache.dir(10003439a33)
>> get_num_head_items() = 36; fnode.fragstat.nfiles=53
>> fnode.fragstat.nsubdirs=0
>> 2016-09-25 06:23:50.947782 7ffb653b8700  1 mds.0.cache.dir(10003439a33)
>> mismatch between child accounted_rstats and my rstats!
>> 2016-09-25 06:23:50.947803 7ffb653b8700  1 mds.0.cache.dir(10003439a33)
>> total of child dentrys: n(v0 b19365007 36=36+0)
>> 2016-09-25 06:23:50.947806 7ffb653b8700  1 mds.0.cache.dir(10003439a33) my
>> rstats:              n(v2 rc2016-08-28 04:48:37.685854 b49447206 53=53+0)
>>
>> The slightly sad thing is that I suspect all of this is probably from
>> something that "happened at some time in the past", and running the mds with
>> debugging will make my users very unhappy, as writing/formatting all that
>> log is not exactly cheap (with debug_mds=20/20, the mds beacon quickly ended
>> up being marked as laggy).
>>
>> Bonus question: in terms of "understanding how cephfs works", is
>> doc/dev/mds_internals it? :) Given that making "minimal reproducible
>> test-cases" is so far turning out to be quite elusive from the "top down"
>> approach, I'm finding myself looking inside the box to try to figure out how
>> we got where we are.
>>
>> (And many thanks for ceph-dencoder, it satisfies my pathological need to
>> look inside of things).
>>
>> Cheers,
>> --
>> Kjetil Joergensen <kjetil@xxxxxxxxxxxx>
>> SRE, Medallia Inc
>> Phone: +1 (650) 739-6580
>
>
>
>
> --
> Kjetil Joergensen <kjetil@xxxxxxxxxxxx>
> SRE, Medallia Inc
> Phone: +1 (650) 739-6580
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



