Try running a scrub on that directory; that might yield more information:

  ceph daemon mds.XXX scrub_path /path/in/cephfs recursive

Afterwards you can try to repair it if the scrub finds an error. It could
also be something completely different, such as a bug in the clients.

Paul

On Fri, Oct 5, 2018 at 12:57, Burkhard Linke
<Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
>
> a user just stumbled across a problem with directory content in cephfs
> (kernel client, ceph 12.2.8, one active, one standby-replay instance):
>
>
> root@host1:~# ls /ceph/sge-tmp/db/work/06/ | wc -l
> 224
> root@host1:~# uname -a
> Linux host1 4.13.0-32-generic #35~16.04.1-Ubuntu SMP Thu Jan 25 10:13:43
> UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>
>
> root@host2:~# ls /ceph/sge-tmp/db/work/06/ | wc -l
> 224
> root@host2:~# uname -a
> Linux host2 4.15.0-32-generic #35~16.04.1-Ubuntu SMP Fri Aug 10 21:54:34
> UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>
>
> root@host3:~# ls /ceph/sge-tmp/db/work/6c | wc -l
> 225
> root@host3:~# uname -a
> Linux host3 4.13.0-19-generic #22~16.04.1-Ubuntu SMP Mon Dec 4 15:35:18
> UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
>
>
> Three hosts, different kernel versions, and one extra directory entry on
> the third host. All hosts used the same mount configuration:
>
> # mount | grep ceph
> <monitors>:/volumes on /ceph type ceph
> (rw,relatime,name=volumes,secret=<hidden>,acl,readdir_max_entries=8192,readdir_max_bytes=4104304)
>
> The MDS log only contains entries like '2018-10-05 12:43:55.565598
> 7f2b7c578700 1 mds.ceph-storage-04 Updating MDS map to version 325550
> from mon.0' every few minutes, with increasing version numbers. ceph -w
> also shows the following warnings:
>
> 2018-10-05 12:25:06.955085 mon.ceph-storage-03 [WRN] Health check
> failed: 2 clients failing to respond to cache pressure (MDS_CLIENT_RECALL)
> 2018-10-05 12:26:18.895358 mon.ceph-storage-03 [INF] MDS health message
> cleared (mds.0): Client host1:volumes failing to respond to cache pressure
> 2018-10-05 12:26:18.895401 mon.ceph-storage-03 [INF] MDS health message
> cleared (mds.0): Client cb-pc10:volumes failing to respond to cache pressure
> 2018-10-05 12:26:19.415890 mon.ceph-storage-03 [INF] Health check
> cleared: MDS_CLIENT_RECALL (was: 2 clients failing to respond to cache
> pressure)
> 2018-10-05 12:26:19.415919 mon.ceph-storage-03 [INF] Cluster is now healthy
>
> The timestamps of the MDS log messages and of the cache pressure messages
> coincide, so I assume the MDS map carries a list of failing clients and
> is updated for that reason.
>
>
> But this does not explain the difference in the directory content. All
> entries are subdirectories. I also tried to force a refresh of the cached
> information by dropping the kernel caches on the affected host, but to no
> avail so far. The caps held on the MDS dropped from 3.2 million to 800k,
> so the cache drop itself was effective.
>
>
> Any hints on the root cause of this problem? I've also tested various
> other clients... some show 224 entries, some 225.
>
>
> Regards,
>
> Burkhard

--
Paul Emmerich

Looking for help with your Ceph cluster?
Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
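
For reference, a rough sketch of the scrub sequence suggested above, run
against the admin socket of the active MDS. The daemon name (mds.XXX) is a
placeholder, and the in-filesystem path is only an assumption derived from
the /volumes-on-/ceph mount shown above; verify both against your layout,
and only add 'repair' after reviewing what the scrub reports:

  # recursive forward scrub of the suspect directory (path is an assumed
  # mapping of /ceph/sge-tmp/db/work/06 back into the /volumes tree)
  ceph daemon mds.XXX scrub_path /volumes/sge-tmp/db/work/06 recursive

  # list anything the scrub flagged as damaged
  ceph daemon mds.XXX damage ls

  # only if damage was found and understood, re-run the scrub with repair
  ceph daemon mds.XXX scrub_path /volumes/sge-tmp/db/work/06 recursive repair

On the client side, 'echo 2 > /proc/sys/vm/drop_caches' (run as root) drops
cached dentries and inodes before re-listing the directory, which corresponds
to the cache drop already tried above.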