On Mon, Apr 18, 2016 at 9:57 AM, Goncalo Borges <goncalo.borges@xxxxxxxxxxxxx> wrote:
> Dear CephFS gurus,
>
> We are seeing an issue regarding CephFS performance which we are trying to
> understand.
>
> Our infrastructure is the following:
>
> - We use CephFS (9.2.0).
> - We have 3 mons and 8 storage servers supporting 8 OSDs each.
> - We use SSDs for journals (2 SSDs per storage server, each serving 4 OSDs).
> - We have one main MDS and one standby-replay MDS.
> - We are using the ceph-fuse client to mount CephFS in an MPI infrastructure
> (for now, we cannot use the most recent kernels and therefore cannot use the
> kernel module).
> - Our (InfiniBand) MPI infrastructure consists of 40 hosts x 8 cores.
>
> The current use case where we are seeing issues is the following:
>
> - We have a user running an MPI application (using 280 cores) which is
> mostly CPU intensive.
> - The only I/O the user application does is the manipulation of 8 log
> files per MPI rank, which in total gives 280 x 8 = 2240 log files.
> - All log files are under the same directory. The majority of the log files
> should be updated frequently: in general, once every 5 minutes or so,
> but it could be more frequent than that.
> - The issue is that, although the application has been running for quite some
> time, some of the log files are only updated at the very beginning, and the
> application seems to be blocked.
> - Exploring lsof and strace (on several running processes), it seems that
> MPI instances are just waiting on each other. This points to the fact that
> there might be one or more MPI instances delaying and preventing the
> application from proceeding.
> - Once we switched to a pure NFS setup, everything seems fine and the
> application progresses much further.
> - When the user is running under CephFS, we really do not see any errors or
> warnings in the MDS or OSDs.
>
> We are trying to understand why CephFS is not able to cope with the user
> application, and we have come up with two hypotheses:
>
> a./ It might be that CephFS is not the most appropriate infrastructure to
> support the frequent manipulation of so many small files under the same
> directory.
>
> b./ However, it could also be a locking issue or a timing issue. We have
> seen cases (outside of this use case) where a simple list command on a
> directory with many files just hangs forever. We are wondering whether
> something similar might be happening and how we can actually prevent this.
> We think that the kernel client doesn't have this problem because it uses
> the kernel cache for inodes, and we were wondering if there is some cache
> tuning we can do with the FUSE client.

Please run 'ceph daemon mds.xxx dump_ops_in_flight' a few times while
running the MPI application, save the outputs and send them to us.
Hopefully, they will give us hints about where the application blocks.

Regards
Yan, Zheng

>
> Any suggestions / comments on either a./ or b./ would be appreciated :-)
>
> Cheers
> Goncalo
>
>
> --
> Goncalo Borges
> Research Computing
> ARC Centre of Excellence for Particle Physics at the Terascale
> School of Physics A28 | University of Sydney, NSW 2006
> T: +61 2 93511937
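
For anyone wanting to automate Zheng's suggestion, below is a minimal sketch (not from the
original thread) that polls the active MDS admin socket with 'dump_ops_in_flight' at a fixed
interval and saves each dump to a timestamped file. The MDS name ("mds.xxx" here, as in the
thread), the sampling interval, sample count and output directory are placeholders to adjust
for your own cluster.

    #!/usr/bin/env python3
    """Sketch: periodically capture 'dump_ops_in_flight' from an MDS while a job runs.

    Assumptions (adjust for your cluster):
      - run on the node hosting the active MDS, with access to its admin socket
      - MDS_NAME is a placeholder for the real daemon name (the 'mds.xxx' in the thread)
    """
    import subprocess
    import time
    from datetime import datetime
    from pathlib import Path

    MDS_NAME = "mds.xxx"       # placeholder: substitute the actual MDS id
    INTERVAL_SECONDS = 30      # how often to sample while the MPI job runs
    SAMPLES = 10               # how many dumps to collect
    OUT_DIR = Path("ops_in_flight_dumps")

    def main() -> None:
        OUT_DIR.mkdir(exist_ok=True)
        for i in range(SAMPLES):
            stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
            # Same command Zheng asks for, read via the MDS admin socket.
            result = subprocess.run(
                ["ceph", "daemon", MDS_NAME, "dump_ops_in_flight"],
                capture_output=True, text=True,
            )
            out_file = OUT_DIR / f"dump_ops_in_flight-{stamp}.json"
            out_file.write_text(result.stdout if result.returncode == 0 else result.stderr)
            print(f"[{i + 1}/{SAMPLES}] wrote {out_file}")
            time.sleep(INTERVAL_SECONDS)

    if __name__ == "__main__":
        main()

Run it while the MPI application is in flight, then attach the resulting files to a reply.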