Hi all,
with our Ceph Luminous CephFS we're plagued with "failed to open ino"
messages. They don't seem to affect daily business (in terms of file
access). (There's a backup performance issue that may eventually be
related, but I'll report on that in a separate thread.)
Our Ceph is currently at v12.2.1 (git.1507910930.aea79b8b7a), on
openSUSE Leap 42.3: three Ceph nodes, 12 HDD OSDs, two SSD OSDs,
status "HEALTH_OK".
We have a single CephFS and two MDS daemons (one active, one
standby-replay); the metadata pool is on the SSD OSDs, the content is
on the HDD OSDs (all FileStore). That CephFS is mounted by several
clients via kernel CephFS support (mostly kernel 4.4.76) and exported
via NFS (kernel nfsd on a kernel-mounted CephFS).
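For reference, this is roughly how I collect the status quoted above
(a small sketch wrapping the standard ceph CLI; it only assumes a
readable /etc/ceph/ceph.conf and a usable keyring on the node):
--- cut here ---
#!/usr/bin/env python
# Small sketch: print cluster, filesystem and MDS status via the standard
# ceph CLI (assumes /etc/ceph/ceph.conf and a usable keyring).
import subprocess

for cmd in (["ceph", "-s"],
            ["ceph", "fs", "ls"],
            ["ceph", "mds", "stat"]):
    print("$ " + " ".join(cmd))
    print(subprocess.check_output(cmd).decode())
--- cut here ---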
In the log of the active MDS, we currently see the following two
inodes reported over and over again, about every 30 seconds:
--- cut here ---
2017-11-24 18:24:16.496397 7fa308cf0700 0 mds.0.cache failed to open ino 0x10001e45e1d err -22/0
2017-11-24 18:24:16.497037 7fa308cf0700 0 mds.0.cache failed to open ino 0x10001e4d6a1 err -22/-22
2017-11-24 18:24:16.500645 7fa308cf0700 0 mds.0.cache failed to open ino 0x10001e45e1d err -22/0
2017-11-24 18:24:16.501218 7fa308cf0700 0 mds.0.cache failed to open ino 0x10001e4d6a1 err -22/-22
2017-11-24 18:24:46.506210 7fa308cf0700 0 mds.0.cache failed to open ino 0x10001e45e1d err -22/0
2017-11-24 18:24:46.506926 7fa308cf0700 0 mds.0.cache failed to open ino 0x10001e4d6a1 err -22/-22
2017-11-24 18:24:46.510354 7fa308cf0700 0 mds.0.cache failed to open ino 0x10001e45e1d err -22/0
2017-11-24 18:24:46.510891 7fa308cf0700 0 mds.0.cache failed to open ino 0x10001e4d6a1 err -22/-22
--- cut here ---
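To get a feeling for how regular this is, I count the messages per
inode with a small script (just a quick sketch: the log path is where
our MDS log lives, and the regex simply matches the lines quoted
above). It prints one summary line per (inode, error) pair, with the
first and last occurrence:
--- cut here ---
#!/usr/bin/env python
# Quick sketch: count "failed to open ino" messages per inode/error in the
# MDS log and show the first and last occurrence of each.
import re
from collections import defaultdict

LOG = "/var/log/ceph/ceph-mds.server1.log"   # our MDS log; adjust as needed
pattern = re.compile(r"^(\S+ \S+) \S+ +\d+ mds\.\d+\.cache "
                     r"failed to open ino (0x[0-9a-f]+) err (\S+)")

seen = defaultdict(list)
with open(LOG) as f:
    for line in f:
        m = pattern.match(line)
        if m:
            timestamp, ino, err = m.groups()
            seen[(ino, err)].append(timestamp)

for (ino, err), stamps in sorted(seen.items()):
    print("%s err %s: %d messages, first %s, last %s"
          % (ino, err, len(stamps), stamps[0], stamps[-1]))
--- cut here ---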
There were other inodes reported with other errors, too ("err -5/0",
for instance), but the root cause seems to be the same (see below).
Inode 0x10001e4d6a1 was first mentioned in the MDS log as follows, and
from then on every 30 seconds as "failed to open"; 0x10001e45e1d just
appeared at the same time, out of the blue:
--- cut here ---
2017-11-23 19:12:28.440107 7f3586cbd700 0 mds.0.bal replicating dir [dir 0x10001e4d6a1 /some/path/in/our/cephfs/ [2,head] auth pv=3473 v=3471 cv=2963/2963 ap=1+6+6 state=1610612738|complete f(v0 m2017-11-23 19:12:25.005299 15=4+11) n(v74 rc2017-11-23 19:12:28.429258 b137317605 5935=4875+1060)/n(v74 rc2017-11-23 19:12:28.337259 b139723963 5969=4898+1071) hs=15+23,ss=0+0 dirty=30 | child=1 dirty=1 authpin=1 0x5649e811e9c0] pop 10223.1 .. rdp 373 adj 0
2017-11-23 19:15:55.015347 7f3580cb1700 0 mds.0.cache failed to open ino 0x10001e45e1d err -22/0
2017-11-23 19:15:55.016056 7f3580cb1700 0 mds.0.cache failed to open ino 0x10001e4d6a1 err -22/-22
--- cut here ---
According to a message on this ML, "replicating dir" is supposed to
indicate that the directory is hot and being replicated to another
active MDS to spread the load. But there is no other active MDS, as we
have only two MDS in total (and Ceph properly reports "mds:
cephfs-1/1/1 up {0=server1=up:active}, 1 up:standby-replay").
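If I read the balancer output right, the "pop 10223.1" at the end of
the "replicating dir" line is the load figure the balancer compares
against its replication threshold. I'm assuming
mds_bal_replicate_threshold is the option in play here; this is how I
look at it on the active MDS (sketch; "mds.server1" is our daemon
name):
--- cut here ---
#!/usr/bin/env python
# Sketch: query the (assumed) balancer replication threshold on the active
# MDS via its admin socket.  "mds.server1" is our daemon name.
import json
import subprocess

out = subprocess.check_output(
    ["ceph", "daemon", "mds.server1",
     "config", "get", "mds_bal_replicate_threshold"])
print(json.loads(out.decode()))
--- cut here ---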
From what I can tell from the logs, all error messages relate to that
same path "/some/path/in/our/cephfs/", which gets deleted and
recreated every evening - thus changing inode numbers. But what can
make that same path, buried far down in the CephFS hierarchy, cause
these "failed to open ino" messages? It's just one of many
directories, and not a particularly populated one (about 40 files and
directories).
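In case it helps, this is how I try to map a reported inode number
back to a path, by reading the "parent" backtrace xattr of the
corresponding RADOS object (a sketch using python-rados; the pool name
is a placeholder - for a regular file the first data object lives in
the data pool, for a directory the corresponding object is in the
metadata pool - and the raw backtrace still needs ceph-dencoder to
become readable):
--- cut here ---
#!/usr/bin/env python
# Sketch: fetch the backtrace ("parent" xattr) stored on the object that
# belongs to an inode, to see which path the cluster thinks it maps to.
# For a plain file the object "<hex ino>.00000000" lives in the data pool;
# for a directory it lives in the metadata pool.  The xattr is a binary
# inode_backtrace_t; decode it with e.g.
#   ceph-dencoder type inode_backtrace_t import /tmp/backtrace.bin decode dump_json
import rados

INO = 0x10001e4d6a1                     # one of the inodes from the log
POOL = "cephfs_metadata"                # placeholder pool name
obj = "%x.00000000" % INO

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx(POOL)
    try:
        # For an already-deleted inode the object may be gone, in which
        # case this raises rados.ObjectNotFound.
        backtrace = ioctx.get_xattr(obj, "parent")
        with open("/tmp/backtrace.bin", "wb") as f:
            f.write(backtrace)
        print("wrote %d bytes of raw backtrace for object %s"
              % (len(backtrace), obj))
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
--- cut here ---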
Another possibly interesting side note: the messages for the previous
day's inodes sometimes stop when our evening/night job ends (no more
active accesses to that directory), and sometimes they continue
throughout the day. I've even seen them cross the directory recreation
time (the messages for an old inode don't stop when the old directory
is deleted, except after a new "replicating dir" message appears).
Today I deleted the directory (including its parent) during the day,
but the messages wouldn't stop. They also survive switching to the
other MDS. By now the next daily instance of our job has started,
recreating the directories (which I had deleted during the day). The
"failed to open ino" messages for the old inode numbers still appear
every 30 seconds, and I've not yet seen any "replicating dir" message
for that area of the CephFS tree. I have seen a few for other areas of
the tree, but no other inode numbers were reported as "failed to open"
- only the two from above:
--- cut here ---
2017-11-24 19:22:13.000050 7fa308cf0700 0 mds.0.cache failed to open ino 0x10001e45e1d err -22/0
2017-11-24 19:22:13.000770 7fa308cf0700 0 mds.0.cache failed to open ino 0x10001e4d6a1 err -22/-22
2017-11-24 19:22:13.003918 7fa308cf0700 0 mds.0.cache failed to open ino 0x10001e45e1d err -22/0
2017-11-24 19:22:13.004469 7fa308cf0700 0 mds.0.cache failed to open ino 0x10001e4d6a1 err -22/-22
2017-11-24 19:22:18.635372 7fa30ecfc700 0 mds.0.bal replicating dir [dir 0x1000113bcf8 /some/path/somewhere/else [2,head] auth v=101522 cv=101423/101423 ap=0+1+3 state=1610612738|complete f(v0 m2017-11-12 05:28:27.337616 95=0+95) n(v5624 rc2017-11-24 19:22:18.483203 b242650331 6165=4415+1750)/n(v5624 rc2017-11-24 19:22:18.147208 b243468738 6165=4415+1750) hs=95+0,ss=0+0 dirty=7 | child=1 dirty=1 waiter=0 authpin=0 0x55754c144000] pop 8182.78 .. rdp 102 adj 0
2017-11-24 19:22:43.008134 7fa308cf0700 0 mds.0.cache failed to open ino 0x10001e45e1d err -22/0
2017-11-24 19:22:43.008768 7fa308cf0700 0 mds.0.cache failed to open ino 0x10001e4d6a1 err -22/-22
2017-11-24 19:22:43.011780 7fa308cf0700 0 mds.0.cache failed to open ino 0x10001e45e1d err -22/0
2017-11-24 19:22:43.012355 7fa308cf0700 0 mds.0.cache failed to open ino 0x10001e4d6a1 err -22/-22
--- cut here ---
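A next step I can think of is dumping the MDS cache and grepping it
for the two inodes, to see what (if anything) still references them
(sketch; "mds.server1" is our daemon name, and "dump cache" writes its
output to the file given on the command line, on the MDS host):
--- cut here ---
#!/usr/bin/env python
# Sketch: dump the active MDS cache to a file via the admin socket and grep
# it for the two stuck inode numbers, to see what (if anything) still
# references them.  "mds.server1" is our daemon name.
import subprocess

DUMP = "/tmp/mds-cache.txt"
subprocess.check_call(["ceph", "daemon", "mds.server1",
                       "dump", "cache", DUMP])

wanted = ("0x10001e45e1d", "0x10001e4d6a1")
with open(DUMP) as f:
    for line in f:
        if any(ino in line for ino in wanted):
            print(line.rstrip())
--- cut here ---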
The affected directory looks fine via "ls -l", and I'd be happy if
you'd share any ideas on how to handle this situation.
Regards,
Jens