CephFS: Issues handling thousands of files under the same dir (?)

Dear CephFS gurus

We are seeing a CephFS performance issue that we are trying to understand.

Our infrastructure is the following:
- We use CephFS (Ceph 9.2.0).
- We have 3 mons and 8 storage servers, each supporting 8 OSDs.
- We use SSDs for journals (2 SSDs per storage server, each serving 4 OSDs).
- We have one active MDS and one standby-replay MDS.
- We mount CephFS with the ceph-fuse client on our MPI cluster (for now we cannot run the most recent kernels, and therefore cannot use the kernel client); an example mount invocation follows this list.
- Our (InfiniBand) MPI infrastructure consists of 40 hosts x 8 cores.
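
For reference, we mount roughly like this on each compute node (the mon address and mount point below are placeholders, not our real hosts):

    # mount CephFS via FUSE, pointing at one of our monitors
    ceph-fuse -m mon1.example.com:6789 /cephfs
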
The current use case where we are seeing issues is the following:
- We have a user running an MPI application (using 280 cores) that is mostly CPU intensive.
- The only I/O the application performs is the manipulation of 8 log files per MPI rank, which gives 280 x 8 = 2240 log files in total (a sketch of this I/O pattern follows this list).
- All log files live under the same directory. Most of them should be updated very frequently, in general once every 5 minutes or so, but it can be more often than that.
- The issue is that, although the application has been running for quite some time, some of the log files were only updated at the very beginning, and the application seems to be blocked.
- Exploring lsof and strace on several running processes, it seems that the MPI ranks are just waiting on each other, which suggests that one or more ranks are stalled and preventing the application from proceeding.
- Once we switched to a pure NFS setup, everything seemed fine and the application progressed much further.
- While the user runs on CephFS, we see no errors or warnings at all from the MDS or OSDs.
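
To make the pattern concrete, the application's I/O is essentially equivalent to the sketch below (the file names, the open/append/close cycle and the 5-minute period are our reconstruction of what the user described, not the actual code):

    /* build with: mpicc -o logger logger.c */
    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int cycle = 0; cycle < 12; cycle++) {   /* ~1 hour in this sketch */
            for (int i = 0; i < 8; i++) {
                char path[256];
                /* 280 ranks x 8 files = 2240 files, all in one shared directory */
                snprintf(path, sizeof(path), "logs/rank%03d.%d.log", rank, i);
                FILE *f = fopen(path, "a");          /* open, append, close */
                if (f) {
                    fprintf(f, "rank %d, cycle %d\n", rank, cycle);
                    fclose(f);
                }
            }
            sleep(300);                              /* roughly every 5 minutes */
        }

        MPI_Finalize();
        return 0;
    }
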
We are trying to understand why CephFS is not able to cope with this application, and we have come up with two hypotheses:
a./ It might be that CephFS is not the most appropriate infrastructure to support the frequent manipulation of so many small files under the same directory.
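
If a./ is the problem, one thing we considered (but have not tried) is MDS directory fragmentation, which as far as we understand is still experimental in the 9.x series and would be toggled with something like the following in ceph.conf (option name from the docs; please correct us if this is wrong):

    [mds]
        # allow the MDS to split large directories into fragments
        # (reportedly experimental in Infernalis, so we have not enabled it)
        mds bal frag = true
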

b./ However, it could also be a locking or timing issue. We have seen cases (outside of this use case) where a simple listing of a directory with many files just hangs forever. We are wondering whether something similar might be happening here, and how we can prevent it. We suspect the kernel client does not have this problem because it uses the kernel cache for inodes, so we were wondering whether there is some cache tuning we can do on the FUSE client; a sketch of what we had in mind follows.
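For b./, the knobs we were planning to experiment with in the [client] section of ceph.conf are the ones below (the values are guesses on our part, not recommendations; the defaults in the comments are from the 9.2.0 docs):

    [client]
        # number of inodes kept in the ceph-fuse metadata cache (default 16384)
        client cache size = 65536
        # size of the client object cacher for file data, in bytes (default 200 MB)
        client oc size = 209715200

We would then watch the effect through the client admin socket, e.g. 'ceph daemon /var/run/ceph/ceph-client.admin.asok perf dump', assuming the socket is enabled on our nodes.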

Any suggestions / comments on either a./ or b./ would be appreciated :-)

Cheers
Goncalo
 

-- 
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW  2006
T: +61 2 93511937
