On Mon, Apr 18, 2016 at 9:57 AM, Goncalo Borges <goncalo.borges@xxxxxxxxxxxxx> wrote:
> Dear CephFS gurus,
>
> We are seeing an issue regarding CephFS performance which we are trying to
> understand.
>
> Our infrastructure is the following:
>
> - We use CephFS (9.2.0).
> - We have 3 mons and 8 storage servers supporting 8 OSDs each.
> - We use SSDs for journals (2 SSDs per storage server, each serving 4 OSDs).
> - We have one main MDS and one standby-replay MDS.
> - We are using the ceph-fuse client to mount CephFS in an MPI infrastructure
> (for now, we cannot use the most recent kernels and therefore cannot use the
> kernel module).
> - Our (InfiniBand) MPI infrastructure consists of 40 hosts x 8 cores.
>
> The current use case where we are seeing issues is the following:
>
> - We have a user running an MPI application (using 280 cores) which is
> mostly CPU intensive.
> - The only I/O the user application does is the manipulation of 8 log
> files per MPI rank, which in total gives 280 x 8 = 2240 log files.
> - All log files are under the same directory. The majority of the log files
> should be updated frequently: in general, once every 5 minutes or so,
> but it could be more frequent than that.
> - The issue is that, although the application has been running for quite some
> time, some of the log files are only updated at the very beginning, and the
> application seems to be blocked.
> - Exploring lsof and strace (on several running processes), it seems that
> MPI instances are just waiting on each other. This points to the fact that
> there might be one or more MPI instances delaying and preventing the
> application from proceeding.
> - Once we switched to a pure NFS setup, everything seems fine and the
> application progresses much further.
> - When the user is running under CephFS, we really do not see any errors or
> warnings in the MDS or OSDs.
>
> We are trying to understand why CephFS is not able to cope with the user
> application, and we have come up with two hypotheses:
>
> a./ It might be that CephFS is not the most appropriate infrastructure to
> support the frequent manipulation of so many small files under the same
> directory.
>
> b./ However, it could also be a locking issue or a timing issue. We have
> seen cases (outside of this use case) where a simple list command on a
> directory with many files just hangs forever. We are wondering whether
> something similar might be happening and how we can actually prevent this.
> We think that the kernel client doesn't have this problem because it uses
> the kernel cache for inodes, and we were wondering if there is some cache
> tuning we can do with the FUSE client.

Please run 'ceph daemon mds.xxx dump_ops_in_flight' a few times while
running the MPI application, save the outputs and send them to us.
Hopefully, they will give us hints about where the application blocks.

Regards
Yan, Zheng

>
> Any suggestions / comments on either a./ or b./ would be appreciated :-)
>
> Cheers
> Goncalo
>
>
> --
> Goncalo Borges
> Research Computing
> ARC Centre of Excellence for Particle Physics at the Terascale
> School of Physics A28 | University of Sydney, NSW 2006
> T: +61 2 93511937
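
For anyone wanting to automate Zheng's suggestion, below is a minimal sketch (not from the
original thread) that polls the active MDS admin socket with 'dump_ops_in_flight' at a fixed
interval and saves each dump to a timestamped file. The MDS name ("mds.xxx" here, as in the
thread), the sampling interval, sample count and output directory are placeholders to adjust
for your own cluster.

    #!/usr/bin/env python3
    """Sketch: periodically capture 'dump_ops_in_flight' from an MDS while a job runs.

    Assumptions (adjust for your cluster):
      - run on the node hosting the active MDS, with access to its admin socket
      - MDS_NAME is a placeholder for the real daemon name (the 'mds.xxx' in the thread)
    """
    import subprocess
    import time
    from datetime import datetime
    from pathlib import Path

    MDS_NAME = "mds.xxx"       # placeholder: substitute the actual MDS id
    INTERVAL_SECONDS = 30      # how often to sample while the MPI job runs
    SAMPLES = 10               # how many dumps to collect
    OUT_DIR = Path("ops_in_flight_dumps")

    def main() -> None:
        OUT_DIR.mkdir(exist_ok=True)
        for i in range(SAMPLES):
            stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
            # Same command Zheng asks for, read via the MDS admin socket.
            result = subprocess.run(
                ["ceph", "daemon", MDS_NAME, "dump_ops_in_flight"],
                capture_output=True, text=True,
            )
            out_file = OUT_DIR / f"dump_ops_in_flight-{stamp}.json"
            out_file.write_text(result.stdout if result.returncode == 0 else result.stderr)
            print(f"[{i + 1}/{SAMPLES}] wrote {out_file}")
            time.sleep(INTERVAL_SECONDS)

    if __name__ == "__main__":
        main()

Run it while the MPI application is in flight, then attach the resulting files to a reply.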