Re: CEPHFS file or directories disappear when ls (metadata problem)

Dear Greg, Lincoln, and all,

Thank you for your suggestion.

I think I'll leave the kernel as-is and use ceph-fuse on most computing nodes, given the compatibility requirements of the other software our jobs use. I hope that will improve stability. We will certainly try a newer kernel on a few nodes as an experiment.

BTW, here's the result of a simple performance test with the kernel mount and the FUSE mount on our CephFS, FYI. We have a 10GbE network, and I just ran the "cat" command to read/write a single 4 GB file on a client node (e.g. "time -p cat /cephfs/a4GBfile > /dev/null"):

Kernel mount: read 600~700 MB/s, write 800~900 MB/s
Fuse mount: read 100~200 MB/s, write 200~300 MB/s
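
The write test was the same idea in the opposite direction; roughly the
following, where the local source path is just illustrative:

# read: stream the file out of CephFS, timing it
time -p cat /cephfs/a4GBfile > /dev/null
# write: stream a local 4 GB copy into CephFS
time -p cat /tmp/a4GBfile > /cephfs/a4GBfile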

Best Regards,
FaHui

On 2016/3/24 at 01:02 AM, Gregory Farnum wrote:
On Wed, Mar 23, 2016 at 9:18 AM, Lincoln Bryant <lincolnb@xxxxxxxxxxxx> wrote:
Hi,

If you are using the kernel client, I would suggest trying something newer
than 3.10.x. I ran into this issue in the past, but it was fixed by updating
my kernel to something newer. You may want to check the OS recommendations
page as well: http://docs.ceph.com/docs/master/start/os-recommendations/
I presume that's actually the CentOS7 kernel, but I'm not sure which
point release it might be. In any case, an upcoming el7 kernel will
have the up-to-date CephFS client backported onto it but I don't think
any of the currently-released ones do. So I suspect this is indeed one
of the known and fixed issues. ceph-fuse runs into that kind of issue
a lot less often than the kernel client does, and can be upgraded more
easily, but it has different performance characteristics that you may
or may not care about. (They aren't well-quantified, just definitely
different — it runs through FUSE, but gets optimizations first, etc.)

Regarding mounting 100+ clients, it shouldn't be a problem for the
MDS. If you do run into issues, please report them!
-Greg

ELRepo maintains mainline RPMs for EL6 and EL7:
http://elrepo.org/tiki/kernel-ml
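
On EL7 that is roughly the following, assuming you have already installed
the elrepo-release package per the instructions on that page:

# install the mainline kernel from the ELRepo kernel repository
yum --enablerepo=elrepo-kernel install kernel-ml
# reboot into the new kernel, then confirm the running version
uname -r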

Alternatively, you could try the FUSE client.
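
A minimal sketch, with a placeholder monitor address and mountpoint:

# install and mount with the FUSE client instead of the kernel client
yum install ceph-fuse
ceph-fuse -m mon-host:6789 /mnt/cephfs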

—Lincoln

On Mar 23, 2016, at 11:12 AM, FaHui Lin <fahui.lin@xxxxxxxxxx> wrote:

Dear Ceph experts,

We run into a nasty problem with our CephFS from time to time:

When we try to list a directory under CephFS, some files or directories do
not show up. For example:

This is the complete directory content:
# ll /cephfs/ies/home/mika
drwxr-xr-x 1 10035 100001 1559018781 Feb  2 07:43 dir-A
drwxr-xr-x 1 10035 100001    9061906 Apr 15  2015 dir-B
-rw-r--r-- 1 10035 100001  130750361 Aug  6  2015 file-1
-rw-r--r-- 1 10035 100001   72640608 Apr 15  2015 file-2

But sometimes we get only part of files/directories when listing, say:
# ll /cephfs/ies/home/mika
drwxr-xr-x 1 10035 100001 1559018781 Feb  2 07:43 dir-A
-rw-r--r-- 1 10035 100001   72640608 Apr 15  2015 file-2
Here dir-B and file-1 are missing.

We found the files themselves are still intact, since we can still see them
on another node mounting the same cephfs, or simply at another time. So we
think this is a metadata problem.

One thing we found interesting is that remounting cephfs or restarting the
MDS service will NOT help, but creating a new file under the directory may help:

# ll /cephfs/ies/home/mika
drwxr-xr-x 1 10035 100001 1559018781 Feb  2 07:43 dir-A
-rw-r--r-- 1 10035 100001   72640608 Apr 15  2015 file-2
# touch /cephfs/ies/home/mika/file-tmp
# ll /cephfs/ies/home/mika
drwxr-xr-x 1 10035 100001 1559018781 Feb  2 07:43 dir-A
drwxr-xr-x 1 10035 100001    9061906 Apr 15  2015 dir-B
-rw-r--r-- 1 10035 100001  130750361 Aug  6  2015 file-1
-rw-r--r-- 1 10035 100001   72640608 Apr 15  2015 file-2
-rw-r--r-- 1 root  root            0 Mar 23 15:34 file-tmp
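
One extra check we can run on an affected client, in case stale kernel
dentry caches are involved, is to drop the client's reclaimable
dentry/inode caches and list again:

# on the client: flush, drop dentries and inodes, then re-list
sync
echo 2 > /proc/sys/vm/drop_caches
ll /cephfs/ies/home/mika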


Strangely, when this happens, the cluster health usually shows HEALTH_OK,
and there are no significant errors in the MDS or other service logs.

One thing we tried was increasing the MDS mds_cache_size to 1600000 (16x the
default value), which does help alleviate warnings like "mds0: Client failing
to respond to cache pressure", but still does not solve the missing-metadata
problem.
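
For reference, here is how a value like ours can be applied, either in
ceph.conf or at runtime via the admin socket:

# in ceph.conf on the MDS host:
#   [mds]
#   mds cache size = 1600000
# or injected at runtime through the admin socket:
ceph daemon mds.as-ceph02 config set mds_cache_size 1600000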

Here's our ceph server info:

# ceph -s
     cluster d15a2cdb-354c-4bcd-a246-23521f1a7122
      health HEALTH_OK
      monmap e1: 3 mons at
{as-ceph01=117.103.102.128:6789/0,as-ceph02=117.103.103.93:6789/0,as-ceph03=117.103.109.124:6789/0}
             election epoch 6, quorum 0,1,2 as-ceph01,as-ceph02,as-ceph03
      mdsmap e144: 1/1/1 up {0=as-ceph02=up:active}, 1 up:standby
      osdmap e178: 10 osds: 10 up, 10 in
             flags sortbitwise
       pgmap v105168: 256 pgs, 4 pools, 505 GB data, 1925 kobjects
             1083 GB used, 399 TB / 400 TB avail
                  256 active+clean
   client io 614 B/s rd, 0 op/s

# ceph --version
ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)

(We also hit the same problem on the Hammer release.)

# uname -r
3.10.0-327.10.1.el7.x86_64

We're using CentOS 7 servers.

# ceph daemon mds.as-ceph02 perf dump
{
     "mds": {
         "request": 76066,
         "reply": 76066,
         "reply_latency": {
             "avgcount": 76066,
             "sum": 61.151796797
         },
         "forward": 0,
         "dir_fetch": 1050,
         "dir_commit": 1017,
         "dir_split": 0,
         "inode_max": 1600000,
         "inodes": 130657,
         "inodes_top": 110882,
         "inodes_bottom": 19775,
         "inodes_pin_tail": 0,
         "inodes_pinned": 99670,
         "inodes_expired": 0,
         "inodes_with_caps": 99606,
         "caps": 105119,
         "subtrees": 2,
         "traverse": 81583,
         "traverse_hit": 74090,
         "traverse_forward": 0,
         "traverse_discover": 0,
         "traverse_dir_fetch": 24,
         "traverse_remote_ino": 0,
         "traverse_lock": 80,
         "load_cent": 7606600,
         "q": 0,
         "exported": 0,
         "exported_inodes": 0,
         "imported": 0,
         "imported_inodes": 0
     },
     "mds_cache": {
         "num_strays": 120,
         "num_strays_purging": 0,
         "num_strays_delayed": 0,
         "num_purge_ops": 0,
         "strays_created": 17276,
         "strays_purged": 17155,
         "strays_reintegrated": 1,
         "strays_migrated": 0,
         "num_recovering_processing": 0,
         "num_recovering_enqueued": 0,
         "num_recovering_prioritized": 0,
         "recovery_started": 0,
         "recovery_completed": 0
     },
     "mds_log": {
         "evadd": 116253,
         "evex": 123148,
         "evtrm": 123148,
         "ev": 22378,
         "evexg": 0,
         "evexd": 17,
         "segadd": 157,
         "segex": 157,
         "segtrm": 157,
         "seg": 31,
         "segexg": 0,
         "segexd": 1,
         "expos": 53624211952,
         "wrpos": 53709306372,
         "rdpos": 53354921818,
         "jlat": 0
     },
     "mds_mem": {
         "ino": 129334,
         "ino+": 146489,
         "ino-": 17155,
         "dir": 3961,
         "dir+": 4741,
         "dir-": 780,
         "dn": 130657,
         "dn+": 163760,
         "dn-": 33103,
         "cap": 105119,
         "cap+": 122281,
         "cap-": 17162,
         "rss": 444444,
         "heap": 50108,
         "malloc": 402511,
         "buf": 0
     },
     "mds_server": {
         "handle_client_request": 76066,
         "handle_slave_request": 0,
         "handle_client_session": 176954,
         "dispatch_client_request": 80245,
         "dispatch_server_request": 0
     },
     "objecter": {
         "op_active": 0,
         "op_laggy": 0,
         "op_send": 61860,
         "op_send_bytes": 0,
         "op_resend": 0,
         "op_ack": 7719,
         "op_commit": 54141,
         "op": 61860,
         "op_r": 7719,
         "op_w": 54141,
         "op_rmw": 0,
         "op_pg": 0,
         "osdop_stat": 119,
         "osdop_create": 26905,
         "osdop_read": 21,
         "osdop_write": 8537,
         "osdop_writefull": 254,
         "osdop_append": 0,
         "osdop_zero": 1,
         "osdop_truncate": 0,
         "osdop_delete": 17325,
         "osdop_mapext": 0,
         "osdop_sparse_read": 0,
         "osdop_clonerange": 0,
         "osdop_getxattr": 7695,
         "osdop_setxattr": 53810,
         "osdop_cmpxattr": 0,
         "osdop_rmxattr": 0,
         "osdop_resetxattrs": 0,
         "osdop_tmap_up": 0,
         "osdop_tmap_put": 0,
         "osdop_tmap_get": 0,
         "osdop_call": 0,
         "osdop_watch": 0,
         "osdop_notify": 0,
         "osdop_src_cmpxattr": 0,
         "osdop_pgls": 0,
         "osdop_pgls_filter": 0,
         "osdop_other": 1111,
         "linger_active": 0,
         "linger_send": 0,
         "linger_resend": 0,
         "linger_ping": 0,
         "poolop_active": 0,
         "poolop_send": 0,
         "poolop_resend": 0,
         "poolstat_active": 0,
         "poolstat_send": 0,
         "poolstat_resend": 0,
         "statfs_active": 0,
         "statfs_send": 0,
         "statfs_resend": 0,
         "command_active": 0,
         "command_send": 0,
         "command_resend": 0,
         "map_epoch": 178,
         "map_full": 0,
         "map_inc": 3,
         "osd_sessions": 55,
         "osd_session_open": 182,
         "osd_session_close": 172,
         "osd_laggy": 0,
         "omap_wr": 1972,
         "omap_rd": 2102,
         "omap_del": 40
     },
     "throttle-msgr_dispatch_throttler-mds": {
         "val": 0,
         "max": 104857600,
         "get": 450630,
         "get_sum": 135500995,
         "get_or_fail_fail": 0,
         "get_or_fail_success": 0,
         "take": 0,
         "take_sum": 0,
         "put": 450630,
         "put_sum": 135500995,
         "wait": {
             "avgcount": 0,
             "sum": 0.000000000
         }
     },
     "throttle-objecter_bytes": {
         "val": 0,
         "max": 104857600,
         "get": 0,
         "get_sum": 0,
         "get_or_fail_fail": 0,
         "get_or_fail_success": 0,
         "take": 61860,
         "take_sum": 453992030,
         "put": 44433,
         "put_sum": 453992030,
         "wait": {
             "avgcount": 0,
             "sum": 0.000000000
         }
     },
     "throttle-objecter_ops": {
         "val": 0,
         "max": 1024,
         "get": 0,
         "get_sum": 0,
         "get_or_fail_fail": 0,
         "get_or_fail_success": 0,
         "take": 61860,
         "take_sum": 61860,
         "put": 61860,
         "put_sum": 61860,
         "wait": {
             "avgcount": 0,
             "sum": 0.000000000
         }
     }
}


This problem troubles us a great deal, since our cephfs serves as a shared
network file system for 100+ computing nodes (mounted with mount.ceph), and
it causes jobs doing I/O on cephfs to fail.


I'd like to ask:

1) What could be the main cause of this problem? Or, how can we trace it?
Unfortunately, we cannot reproduce the problem on purpose; it just happens
occasionally.
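
Would raising the MDS debug level when it recurs be a reasonable way to
capture it? We assume something like the following, using our active MDS:

# temporarily raise MDS debug logging, reproduce the bad listing, lower it again
ceph daemon mds.as-ceph02 config set debug_mds 20
ceph daemon mds.as-ceph02 config set debug_ms 1
# ... reproduce, then collect /var/log/ceph/ceph-mds.as-ceph02.log ...
ceph daemon mds.as-ceph02 config set debug_mds 1
ceph daemon mds.as-ceph02 config set debug_ms 0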

2) Since our cephfs is now in production use, do you have any advice for
improving its stability?
We have 100+ computing nodes requiring a shared file system containing tens
of millions of files, and I wonder whether a single MDS server can handle
them well.
Should we use the ceph-fuse mount or the kernel mount? Should we mount cephfs
on only 3~5 servers and re-export the mountpoint to the other nodes via NFS,
in order to reduce the load on the MDS server? What is a proper cluster
structure for using cephfs?
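
If the NFS route is sensible, we imagine the gateway side would look roughly
like this (the fsid value is arbitrary; NFS requires one when exporting a
filesystem that has no block-device UUID of its own):

# /etc/exports on a gateway node that kernel-mounts cephfs at /cephfs
/cephfs  *(rw,no_root_squash,fsid=100)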

Any advice or comment will be appreciated. Thank you.

Best Regards,
FaHui


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



