Re: jewel/CephFS - misc problems (duplicate strays, mismatch between head items and fnode.fragstat)

On Fri, Oct 7, 2016 at 4:46 AM, John Spray <jspray@xxxxxxxxxx> wrote:
On Fri, Oct 7, 2016 at 1:05 AM, Kjetil Jørgensen <kjetil@xxxxxxxxxxxx> wrote:
> Hi,
>
> context (i.e. what we're doing): We're migrating (or trying to migrate) off
> of an nfs server onto cephfs, for a workload that's best described as "big
> piles" of hardlinks. Essentially, we have a set of "sources":
> foo/01/<01><rest-of-md5>
> foo/0b/<0b><rest-of-md5>
> .. and so on
> bar/02/..
> bar/0c/..
> .. and so on
>
> foo, bar, and friends have been "cloned" numerous times to a set of names that
> over the course of weeks end up being recycled again; the clone is
> essentially cp -l foo copy-1-of-foo.
>
> We're doing "incremental" rsyncs of this onto cephfs, so the sense of "the
> original source of the hardlink" will end up moving around, depending on the
> whims of rsync. (if it matters, I found some allusion to "if the original
> file hardlinked is deleted, ...".

This might not be much help but... have you thought about making your
application use hardlinks less aggressively?  They have an intrinsic
overhead in any system that stores inodes locally to directories (like
we do) because you have to take an extra step to resolve them.


Under "normal" circumstances, this isn't "all that bad", the serious hammering is
coming from trying migrate to cephfs, where I think we've for the time being
abandoned using hardlinks and take the space-penalty for now. Under "normal"
circumstances it isn't that bad (if my nfs-server stats is to be believed, it's between
5e5 - and 1.5e6 hardlinks created and unlinked per day, it actually seems a bit low).
 
In CephFS, resolving a hard link involves reading the dentry (where we
would usually have the inode inline), and then going and finding an
object from the data pool by the inode number, reading the "backtrace"
(i.e. path) from that object and then going back to the metadata pool
to traverse that path.  It's all very fast if your metadata fits in
your MDS cache, but will slow down a lot otherwise, especially as your
metadata IOs are now potentially getting held up by anything hammering
your data pool.
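If you want to see what that looks like on disk, something along these
lines will show the backtrace the MDS follows for a given inode (a sketch
off the top of my head, so adjust to taste; assumes the data pool is called
cephfs_data and <ino> is the inode number in hex, e.g. ls -i piped through
printf '%x\n'):

  # the first object of the file in the data pool carries the backtrace xattr
  rados -p cephfs_data getxattr <ino>.00000000 parent > parent.bin

  # decode it (an inode_backtrace_t) to see the ancestry the MDS traverses
  ceph-dencoder type inode_backtrace_t import parent.bin decode dump_json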

By the way, if your workload is relatively little code and you can
share it, it sounds like it would be a useful hardlink stress test for
our test suite.

I'll let you know if I manage to reproduce it; I'm on-and-off trying to tease this
out on a separate ceph cluster with a "synthetic" load that's close to equivalent.
 
...

> For RBD the ceph cluster has mostly been rather well behaved; the problems
> we have had have for the most part been self-inflicted. Before introducing
> the hardlink spectacle to cephfs, the same filesystem was used for
> light-ish read-mostly loads, being mostly un-eventful. (That being said, we
> did patch it for ...)
>
> Cluster is v10.2.2 (mds v10.2.2+4d15eb12298e007744486e28924a6f0ae071bd06),
> clients are ubuntu's 4.4.0-32 kernel(s), and elrepo v4.4.4.
>
> The problems we're facing:
>
> Maybe a "non-problem" I have ~6M strays sitting around

So as you hint above, when the original file is deleted, the inode
goes into a stray dentry.  The next time someone reads the file via
one of its other links, the inode gets "reintegrated" (via
eval_remote_stray()) into the dentry it was read from.
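If you want to keep an eye on how many strays the MDS thinks it has (and
how many are being purged or reintegrated), the perf counters on the admin
socket are the cheapest way to watch it. Something like the following
(exact counter names may vary a bit between versions, so treat it as a
sketch):

  ceph daemon mds.<id> perf dump | python -m json.tool | grep -i stray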

> Slightly more problematic, I have duplicate stray(s)? See log excerpts
> below. Also, rados -p cephfs_metadata listomapkeys 60X.00000000 did/does
> seem to agree with there being duplicate strays (assuming the 60X.00000000
> objects are the directory indexes for the stray catalogs), caveat "not a
> perfect snapshot", as the listomapkeys were issued in serial fashion.
> We stumbled across (http://tracker.ceph.com/issues/17177 - mostly here for
> more context)

When you say you stumbled across it, do you mean that you actually had
this same deep scrub error on your system, or just that you found the
ticket?

No - we have done "ceph pg repair", as we did end up with single degraded objects
in the metadata pool during the heavy rsync of "lots of hardlinks".
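For completeness, the duplicate check I mentioned was roughly along these
lines (not an atomic snapshot, so grain of salt; mds rank 0's stray dirs
stray0..stray9 are objects 600.00000000 through 609.00000000 in the
metadata pool, and the head dentry omap keys look like <ino-hex>_head):

  for i in 0 1 2 3 4 5 6 7 8 9; do
      rados -p cephfs_metadata listomapkeys 60${i}.00000000
  done | sed 's/_head$//' | sort | uniq -d

Anything that comes out of that is an inode listed in more than one stray
directory.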
 
> There have been a couple of instances of invalid backtrace(s), mostly solved by
> either mds:scrub_path or just unlinking the files/directories in question
> and re-rsync-ing.
>
> The mismatch between head items and fnode.fragstat (see below for more of the
> log excerpt) appeared to have been solved by mds:scrub_path.
>
>
> Duplicate stray(s), ceph-mds complains (a lot, during rsync):
> 2016-09-30 20:00:21.978314 7ffb653b8700  0 mds.0.cache.dir(603) _fetched
> badness: got (but i already had) [inode 10003f25eaf [...2,head]
> ~mds0/stray0/10003f25eaf auth v38836572 s=8998 nl=5 n(v0 b8998 1=1+0)
> (iversion lock) 0x561082e6b520] mode 33188 mtime 2016-07-25 03:02:50.000000
> 2016-09-30 20:00:21.978336 7ffb653b8700 -1 log_channel(cluster) log [ERR] :
> loaded dup inode 10003f25eaf [2,head] v36792929 at ~mds0/stray3/10003f25eaf,
> but inode 10003f25eaf.head v38836572 already exists at
> ~mds0/stray0/10003f25eaf

Is your workload doing lots of delete/create cycles of hard links to
the same inode?

Yes. Essentially, every few days we create a snapshot of our application's
state and turn it into templates that can be deployed for testing. The snapshot
contains, among other things, this tree of files/hardlinks. The individual
files we hardlink never mutate; they're either created or unlinked. The
templates are instantiated a number of times (where we hardlink back to the
templates) and used for testing; some live 2 hours, some live months/years.
When we do create the snapshots, we hardlink back again to the previous
snapshot where possible, and the previous snapshot falls off a cliff when it's
2 cycles old. So the "origin file" slides over time. (For NFS-exported ext4,
this worked out fabulously, as it saved us some terabytes and some amount of
network IO.)
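The "synthetic" load I mentioned above is roughly the sketch below (paths
and sizes made up, but the shape matches what our snapshot rotation plus
rsync does: hardlink-clone a tree, drop the oldest generation, re-read the
survivors so the "origin" of each link keeps moving):

  #!/bin/sh
  # rough approximation of the churn: a pile of immutable content files,
  # hardlink-cloned into "snapshot" generations that get rotated
  BASE=/mnt/cephfs/hardlink-test
  mkdir -p "$BASE/snap.0"
  for i in $(seq 1 10000); do
      head -c 8192 /dev/urandom > "$BASE/snap.0/f$i"
  done
  for gen in $(seq 1 100); do
      prev=$((gen - 1)); old=$((gen - 2))
      cp -al "$BASE/snap.$prev" "$BASE/snap.$gen"      # hardlink clone
      # dropping the old generation unlinks the "original" dentries ...
      [ "$old" -ge 0 ] && rm -rf "$BASE/snap.$old"
      # ... and re-reading the survivors should force stray reintegration
      find "$BASE/snap.$gen" -type f | head -n 1000 | xargs -r stat > /dev/null
  done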
 

I wonder if we are seeing a bug where a new stray is getting created
before the old one has been properly removed, due to some bogus
assumption in the code that stray unlinks don't need to be persisted
as rigorously.

>
> I briefly ran ceph-mds with debug_mds=20/20, which didn't yield anything
> immediately useful beyond making the control flow of src/mds/CDir.cc slightly
> easier to follow, without me becoming much wiser.
> 2016-09-30 20:43:51.910754 7ffb653b8700 20 mds.0.cache.dir(606) _fetched pos
> 310473 marker 'I' dname '100022e8617 [2,head]
> 2016-09-30 20:43:51.910757 7ffb653b8700 20 mds.0.cache.dir(606) lookup
> (head, '100022e8617')
> 2016-09-30 20:43:51.910759 7ffb653b8700 20 mds.0.cache.dir(606)   miss ->
> (10002a81c10,head)
> 2016-09-30 20:43:51.910762 7ffb653b8700  0 mds.0.cache.dir(606) _fetched
> badness: got (but i already had) [inode 100022e8617 [...2,head]
> ~mds0/stray9/100022e8617 auth v39303851 s=11470 nl=10 n(v0 b11470 1=1+0)
> (iversion lock) 0x560c013904b8] mode 33188 mtime 2016-07-25 03:38:01.000000
> 2016-09-30 20:43:51.910772 7ffb653b8700 -1 log_channel(cluster) log [ERR] :
> loaded dup inode 100022e8617 [2,head] v39284583 at ~mds0/stray6/100022e8617,
> but inode 100022e8617.head v39303851 already exists at
> ~mds0/stray9/100022e8617
>
>
> 2016-09-25 06:23:50.947761 7ffb653b8700  1 mds.0.cache.dir(10003439a33)
> mismatch between head items and fnode.fragstat! printing dentries
> 2016-09-25 06:23:50.947779 7ffb653b8700  1 mds.0.cache.dir(10003439a33)
> get_num_head_items() = 36; fnode.fragstat.nfiles=53
> fnode.fragstat.nsubdirs=0
> 2016-09-25 06:23:50.947782 7ffb653b8700  1 mds.0.cache.dir(10003439a33)
> mismatch between child accounted_rstats and my rstats!
> 2016-09-25 06:23:50.947803 7ffb653b8700  1 mds.0.cache.dir(10003439a33)
> total of child dentrys: n(v0 b19365007 36=36+0)
> 2016-09-25 06:23:50.947806 7ffb653b8700  1 mds.0.cache.dir(10003439a33) my
> rstats:              n(v2 rc2016-08-28 04:48:37.685854 b49447206 53=53+0)
>
> The slightly sad thing is that I suspect all of this is probably from something
> that "happened at some time in the past", and running the mds with debugging
> will make my users very unhappy, as writing/formatting all that log is not
> exactly cheap (with debug_mds=20/20 the mds beacon quickly ended up marked
> as laggy).
>
> Bonus question: In terms of "understanding how cephfs works", is
> doc/dev/mds_internals it? :) Given that making "minimal reproducible
> test-cases" is so far turning out to be quite elusive from the "top down"
> approach, I'm finding myself looking inside the box to try to figure out how
> we got where we are.

There isn't a comprehensive set of up-to-date internals docs anywhere,
unfortunately.  The original papers are still somewhat useful for a
high-level view (http://ceph.com/papers/weil-ceph-osdi06.pdf), although
in the case of hard links in particular the mechanism has changed
completely since then.

However you should feel free to ask about any specific things (either
here or on IRC).

If you could narrow down any of these issues into reproducers it would
be extremely useful.


I'll let you know if/when we do :)
 
Cheers,
--
Kjetil Joergensen <kjetil@xxxxxxxxxxxx>
SRE, Medallia Inc
Phone: +1 (650) 739-6580
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
