Hi,
a user just stumbled across a problem with directory content in CephFS
(kernel client, Ceph 12.2.8, one active and one standby-replay MDS):
root@host1:~# ls /ceph/sge-tmp/db/work/06/ | wc -l
224
root@host1:~# uname -a
Linux host1 4.13.0-32-generic #35~16.04.1-Ubuntu SMP Thu Jan 25 10:13:43
UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
root@host2:~# ls /ceph/sge-tmp/db/work/06/ | wc -l
224
root@host2:~# uname -a
Linux host2 4.15.0-32-generic #35~16.04.1-Ubuntu SMP Fri Aug 10 21:54:34
UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
root@host3:~# ls /ceph/sge-tmp/db/work/6c | wc -l
225
root@host3:~# uname -a
Linux host3 4.13.0-19-generic #22~16.04.1-Ubuntu SMP Mon Dec 4 15:35:18
UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Three hosts, different kernel versions, and one extra directory entry on
the third host. All hosts use the same mount configuration:
# mount | grep ceph
<monitors>:/volumes on /ceph type ceph
(rw,relatime,name=volumes,secret=<hidden>,acl,readdir_max_entries=8192,readdir_max_bytes=4104304)
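For completeness, the mount call on all clients is roughly equivalent to
the following (the secret file path is just a placeholder):

mount -t ceph <monitors>:/volumes /ceph \
  -o name=volumes,secretfile=/etc/ceph/volumes.secret,acl,readdir_max_entries=8192,readdir_max_bytes=4104304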
The MDS logs only contain messages like '2018-10-05 12:43:55.565598
7f2b7c578700 1 mds.ceph-storage-04 Updating MDS map to version 325550
from mon.0' every few minutes, with increasing version numbers. 'ceph -w'
also shows the following warnings:
2018-10-05 12:25:06.955085 mon.ceph-storage-03 [WRN] Health check
failed: 2 clients failing to respond to cache pressure (MDS_CLIENT_RECALL)
2018-10-05 12:26:18.895358 mon.ceph-storage-03 [INF] MDS health message
cleared (mds.0): Client host1:volumes failing to respond to cache pressure
2018-10-05 12:26:18.895401 mon.ceph-storage-03 [INF] MDS health message
cleared (mds.0): Client cb-pc10:volumes failing to respond to cache pressure
2018-10-05 12:26:19.415890 mon.ceph-storage-03 [INF] Health check
cleared: MDS_CLIENT_RECALL (was: 2 clients failing to respond to cache
pressure)
2018-10-05 12:26:19.415919 mon.ceph-storage-03 [INF] Cluster is now healthy
The timestamps of the MDS log messages match those of the cache pressure
messages, so I assume the MDS map contains a list of failing clients and
is therefore updated.
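While one of these warnings is active, the affected clients can be listed
directly; this is simply the check I use:

# lists the details of active health checks, including the clients
# behind MDS_CLIENT_RECALL
ceph health detail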
But this does not explain the difference in the directory content. All
entries are subdirectories. I also tried to force a renewal of the cached
information by dropping the kernel caches on the affected host, but to no
avail so far. The caps on the MDS dropped from 3.2 million to 800k, so
dropping the caches itself was effective.
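For reference, this is roughly what I ran (drop_caches on the client, cap
counts per session on the active MDS host):

# on the affected client: drop dentries and inodes
sync; echo 2 > /proc/sys/vm/drop_caches

# on the active MDS host: per-session cap counts
ceph daemon mds.ceph-storage-04 session ls | grep num_caps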
Any hints on the root cause of this problem? I have also tested various
other clients; some show 224 entries, some 225.
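In case it helps with debugging, this is a sketch of how I compare one
client's view with another client and with the directory object in the
metadata pool (the pool name 'cephfs_metadata' is an assumption, <dir> is
the affected directory, and this only covers an unfragmented directory):

# compare what two clients return for the same directory
diff <(ssh host1 'ls <dir> | sort') <(ssh host3 'ls <dir> | sort')

# list the dentries stored in the directory's omap object
ino=$(stat -c %i <dir>)
rados -p cephfs_metadata listomapkeys "$(printf '%x' "$ino").00000000"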
Regards,
Burkhard