Dear Ceph experts,

We run into a nasty problem with our CephFS from time to time: when we try to list a directory under CephFS, some files or directories do not show up. For example, this is the complete directory content:

# ll /cephfs/ies/home/mika
drwxr-xr-x 1 10035 100001 1559018781 Feb  2 07:43 dir-A
drwxr-xr-x 1 10035 100001    9061906 Apr 15  2015 dir-B
-rw-r--r-- 1 10035 100001  130750361 Aug  6  2015 file-1
-rw-r--r-- 1 10035 100001   72640608 Apr 15  2015 file-2

But sometimes listing returns only part of the files/directories, say:

# ll /cephfs/ies/home/mika
drwxr-xr-x 1 10035 100001 1559018781 Feb  2 07:43 dir-A
-rw-r--r-- 1 10035 100001   72640608 Apr 15  2015 file-2

Here dir-B and file-1 are missing. The files themselves are still intact, since we can still see them on another node mounting the same CephFS, or simply at another time, so we think this is a metadata problem. One thing we found interesting is that remounting CephFS or restarting the MDS service does NOT help, but creating a new file under the directory may:

# ll /cephfs/ies/home/mika
drwxr-xr-x 1 10035 100001 1559018781 Feb  2 07:43 dir-A
-rw-r--r-- 1 10035 100001   72640608 Apr 15  2015 file-2
# touch /cephfs/ies/home/mika/file-tmp
# ll /cephfs/ies/home/mika
drwxr-xr-x 1 10035 100001 1559018781 Feb  2 07:43 dir-A
drwxr-xr-x 1 10035 100001    9061906 Apr 15  2015 dir-B
-rw-r--r-- 1 10035 100001  130750361 Aug  6  2015 file-1
-rw-r--r-- 1 10035 100001   72640608 Apr 15  2015 file-2
-rw-r--r-- 1 root  root            0 Mar 23 15:34 file-tmp

Strangely, when this happens the cluster health usually shows HEALTH_OK, and there are no significant errors in the MDS or other service logs.

One thing we tried is increasing the MDS mds_cache_size to 1600000 (16x the default; a sketch of how we applied the change follows the cluster info below). This does help to alleviate warnings like "mds0: Client failing to respond to cache pressure", but it does not solve the missing-metadata problem.

Here's our ceph server info:

# ceph -s
    cluster d15a2cdb-354c-4bcd-a246-23521f1a7122
     health HEALTH_OK
     monmap e1: 3 mons at {as-ceph01=117.103.102.128:6789/0,as-ceph02=117.103.103.93:6789/0,as-ceph03=117.103.109.124:6789/0}
            election epoch 6, quorum 0,1,2 as-ceph01,as-ceph02,as-ceph03
     mdsmap e144: 1/1/1 up {0=as-ceph02=up:active}, 1 up:standby
     osdmap e178: 10 osds: 10 up, 10 in
            flags sortbitwise
      pgmap v105168: 256 pgs, 4 pools, 505 GB data, 1925 kobjects
            1083 GB used, 399 TB / 400 TB avail
                 256 active+clean
  client io 614 B/s rd, 0 op/s

# ceph --version
ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)
(We also hit the same problem on the Hammer release.)

# uname -r
3.10.0-327.10.1.el7.x86_64

We're using CentOS 7 servers.
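For reference, this is roughly how the cache-size change was applied (a sketch only: the exact injectargs syntax may differ slightly between releases, and /etc/ceph/ceph.conf is simply our default config path):

# runtime change on the active MDS (lost after a restart)
ceph tell mds.as-ceph02 injectargs '--mds_cache_size 1600000'

# persisted in /etc/ceph/ceph.conf on the MDS hosts
[mds]
    mds cache size = 1600000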
# ceph daemon mds.as-ceph02 perf dump
{ "mds": { "request": 76066, "reply": 76066, "reply_latency": { "avgcount": 76066, "sum": 61.151796797 }, "forward": 0, "dir_fetch": 1050, "dir_commit": 1017, "dir_split": 0, "inode_max": 1600000, "inodes": 130657, "inodes_top": 110882, "inodes_bottom": 19775, "inodes_pin_tail": 0, "inodes_pinned": 99670, "inodes_expired": 0, "inodes_with_caps": 99606, "caps": 105119, "subtrees": 2, "traverse": 81583, "traverse_hit": 74090, "traverse_forward": 0, "traverse_discover": 0, "traverse_dir_fetch": 24, "traverse_remote_ino": 0, "traverse_lock": 80, "load_cent": 7606600, "q": 0, "exported": 0, "exported_inodes": 0, "imported": 0, "imported_inodes": 0 },
  "mds_cache": { "num_strays": 120, "num_strays_purging": 0, "num_strays_delayed": 0, "num_purge_ops": 0, "strays_created": 17276, "strays_purged": 17155, "strays_reintegrated": 1, "strays_migrated": 0, "num_recovering_processing": 0, "num_recovering_enqueued": 0, "num_recovering_prioritized": 0, "recovery_started": 0, "recovery_completed": 0 },
  "mds_log": { "evadd": 116253, "evex": 123148, "evtrm": 123148, "ev": 22378, "evexg": 0, "evexd": 17, "segadd": 157, "segex": 157, "segtrm": 157, "seg": 31, "segexg": 0, "segexd": 1, "expos": 53624211952, "wrpos": 53709306372, "rdpos": 53354921818, "jlat": 0 },
  "mds_mem": { "ino": 129334, "ino+": 146489, "ino-": 17155, "dir": 3961, "dir+": 4741, "dir-": 780, "dn": 130657, "dn+": 163760, "dn-": 33103, "cap": 105119, "cap+": 122281, "cap-": 17162, "rss": 444444, "heap": 50108, "malloc": 402511, "buf": 0 },
  "mds_server": { "handle_client_request": 76066, "handle_slave_request": 0, "handle_client_session": 176954, "dispatch_client_request": 80245, "dispatch_server_request": 0 },
  "objecter": { "op_active": 0, "op_laggy": 0, "op_send": 61860, "op_send_bytes": 0, "op_resend": 0, "op_ack": 7719, "op_commit": 54141, "op": 61860, "op_r": 7719, "op_w": 54141, "op_rmw": 0, "op_pg": 0, "osdop_stat": 119, "osdop_create": 26905, "osdop_read": 21, "osdop_write": 8537, "osdop_writefull": 254, "osdop_append": 0, "osdop_zero": 1, "osdop_truncate": 0, "osdop_delete": 17325, "osdop_mapext": 0, "osdop_sparse_read": 0, "osdop_clonerange": 0, "osdop_getxattr": 7695, "osdop_setxattr": 53810, "osdop_cmpxattr": 0, "osdop_rmxattr": 0, "osdop_resetxattrs": 0, "osdop_tmap_up": 0, "osdop_tmap_put": 0, "osdop_tmap_get": 0, "osdop_call": 0, "osdop_watch": 0, "osdop_notify": 0, "osdop_src_cmpxattr": 0, "osdop_pgls": 0, "osdop_pgls_filter": 0, "osdop_other": 1111, "linger_active": 0, "linger_send": 0, "linger_resend": 0, "linger_ping": 0, "poolop_active": 0, "poolop_send": 0, "poolop_resend": 0, "poolstat_active": 0, "poolstat_send": 0, "poolstat_resend": 0, "statfs_active": 0, "statfs_send": 0, "statfs_resend": 0, "command_active": 0, "command_send": 0, "command_resend": 0, "map_epoch": 178, "map_full": 0, "map_inc": 3, "osd_sessions": 55, "osd_session_open": 182, "osd_session_close": 172, "osd_laggy": 0, "omap_wr": 1972, "omap_rd": 2102, "omap_del": 40 },
  "throttle-msgr_dispatch_throttler-mds": { "val": 0, "max": 104857600, "get": 450630, "get_sum": 135500995, "get_or_fail_fail": 0, "get_or_fail_success": 0, "take": 0, "take_sum": 0, "put": 450630, "put_sum": 135500995, "wait": { "avgcount": 0, "sum": 0.000000000 } },
  "throttle-objecter_bytes": { "val": 0, "max": 104857600, "get": 0, "get_sum": 0, "get_or_fail_fail": 0, "get_or_fail_success": 0, "take": 61860, "take_sum": 453992030, "put": 44433, "put_sum": 453992030, "wait": { "avgcount": 0, "sum": 0.000000000 } },
"max": 1024, "get": 0, "get_sum": 0, "get_or_fail_fail": 0, "get_or_fail_success": 0, "take": 61860, "take_sum": 61860, "put": 61860, "put_sum": 61860, "wait": { "avgcount": 0, "sum": 0.000000000 } } } This problem troubles us much since our cephfs is serving as a network shared file-system of 100+ computing nodes (mounting with mount.ceph), and it causes jobs running with I/O on cephfs to fail. I'd like to ask that: 1) What could be the main cause of this problem? Or, how can we trace the problem? However, we cannot really reproduce the problem on purpose. It just happens occasionally. 2) Since our cephfs is now for production usage, is there any comment for us to improve the stability? We have 100+ computing nodes requiring a shared file-system containing tens of millions of files and I wonder if the MDS server (only one) could handle them well. Should we use ceph-fuse mount or ceph mount? Should we use only 3~5 servers mounting cephfs and then share the mountpoint to other nodes with NFS, in order to mitigate the loading of MDS server? What is a proper cluster structure using cephfs? Any advice or comment will be appreciated. Thank you. Best Regards, FaHui |
Best Regards,
FaHui