Dear CephFS gurus
We are seeing a CephFS performance issue that we are trying to
understand.
Our infrastructure is the following:
- We use CephFS (Ceph 9.2.0).
- We have 3 mons and 8 storage servers supporting 8 OSDs each.
- We use SSDs for journals (2 SSDs per storage server, each
serving 4 OSDs).
- We have one active MDS and one standby-replay MDS.
- We use the ceph-fuse client to mount CephFS on our MPI
infrastructure (for now, we cannot run the most recent kernels
and therefore cannot use the kernel client).
- Our (InfiniBand) MPI infrastructure consists of 40 hosts x 8
cores.
The current use case where we are seeing issues is the following:
- We have a user running an MPI application (using 280 cores)
which is mostly CPU-intensive.
- The only I/O the user application does is the manipulation of 8
log files per MPI rank, which in total gives 280 x 8 = 2240 log
files.
- All log files are under the same directory. The majority of the
log files should be updated very frequently, in general once
every 5 minutes or so, but it could be more frequent than that.
- The issue is that, although the application has been running for
quite some time, some of the log files are only updated at the
very beginning, and the application seems to be blocked.
- Inspecting several of the running processes with lsof and strace,
it seems that the MPI ranks are just waiting on each other. This
suggests that one or more ranks are stalled and preventing the
application from proceeding (see the commands sketched after this
list).
- Once we switched to a pure NFS setup, everything seemed to be
fine and the application progressed much further.
- While the user is running under CephFS, we do not see any
errors or warnings in the MDS or OSD logs.
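For reference, the kind of inspection we did on a few of the blocked
ranks looked roughly like the following (the PIDs and the CephFS
mount point are placeholders):

    # pick a few MPI rank PIDs on a compute node
    lsof -p <pid> | grep /cephfs    # files open under the CephFS mount
    strace -T -f -p <pid>           # -T shows time spent per syscall;
                                    # the stuck ranks appear to just
                                    # sit in wait-type calls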
We are trying to understand why CephFS is not able to cope with the
user application, and we have come up with two hypotheses:
a./ It might be that CephFS is not the most appropriate
infrastructure to support the frequent manipulation of so many
small files under the same directory.
b./ However, it could also be a locking or timing issue. We have
seen cases (outside of this use case) where a simple listing of a
directory with many files just hangs forever. We are wondering
whether something similar might be happening here and how we can
prevent it. We suspect the kernel client does not have this problem
because it uses the kernel cache for inodes, and we were wondering
whether there is some cache tuning we can do with the FUSE client
(a rough sketch of what we have in mind follows below).
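To make (b) concrete, these are the kind of cache knobs we were
thinking about. The values below are just the defaults as we
understand them for our version, so please correct us if the option
names or numbers are off:

    [client]
        # number of inodes ceph-fuse keeps in its metadata cache
        client cache size = 16384
        # size in bytes of the ceph-fuse object (data) cache
        client oc size = 209715200

    [mds]
        # number of inodes the MDS keeps in its cache
        mds cache size = 100000

Would bumping the client and MDS inode caches (given the 2240 files
in one directory) be the right direction, or is there a better set
of options to look at?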
Any suggestions / comments on either a./ or b./ would be
appreciated :-)
Cheers
Goncalo
--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW 2006
T: +61 2 93511937