Dear CephFS gurus
We are seeing a CephFS performance issue that we are trying to
understand.
Our infrastructure is the following:
- We use CephFS (Ceph 9.2.0).
- We have 3 mons and 8 storage servers supporting 8 OSDs each.
- We use SSDs for journals (2 SSDs per storage server, each
serving 4 OSDs).
- We have one active MDS and one standby-replay MDS.
- We use the ceph-fuse client to mount CephFS on our MPI
infrastructure (for now, we cannot run the most recent kernels
and therefore cannot use the kernel client).
- Our (InfiniBand) MPI infrastructure consists of 40 hosts x 8
cores.
The current use case where we are seeing issues is the following:
- We have a user running an MPI application (using 280 cores)
which is mostly CPU-intensive.
- The only I/O the user application does is the manipulation of 8
log files per MPI rank, which in total gives 280 x 8 = 2240 log
files.
- All log files are under the same directory. The majority of the
log files should be updated very frequently, in general once
every 5 minutes or so, but it could be more frequent than that.
- The issue is that, although the application has been running for
quite some time, some of the log files are only updated at the
very beginning, and the application seems to be blocked.
- Inspecting several of the running processes with lsof and strace,
it seems that the MPI ranks are just waiting on each other. This
suggests that one or more ranks are stalled and preventing the
application from proceeding (see the commands sketched after this
list).
- Once we switched to a pure NFS setup, everything seemed to be
fine and the application progressed much further.
- While the user is running under CephFS, we do not see any
errors or warnings in the MDS or OSD logs.
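For reference, the kind of inspection we did on a few of the blocked
ranks looked roughly like the following (the PIDs and the CephFS
mount point are placeholders):

    # pick a few MPI rank PIDs on a compute node
    lsof -p <pid> | grep /cephfs    # files open under the CephFS mount
    strace -T -f -p <pid>           # -T shows time spent per syscall;
                                    # the stuck ranks appear to just
                                    # sit in wait-type calls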
We are trying to understand why CephFS is not able to cope with the
user application, and we have come up with two hypotheses:
a./ It might be that CephFS is not the most appropriate
infrastructure to support the frequent manipulation of so many
small files under the same directory.
b./ However, it could also be a locking or timing issue. We have
seen cases (outside of this use case) where a simple listing of a
directory with many files just hangs forever. We are wondering
whether something similar might be happening here and how we can
prevent it. We suspect the kernel client does not have this problem
because it uses the kernel cache for inodes, and we were wondering
whether there is some cache tuning we can do with the FUSE client
(a rough sketch of what we have in mind follows below).
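To make (b) concrete, these are the kind of cache knobs we were
thinking about. The values below are just the defaults as we
understand them for our version, so please correct us if the option
names or numbers are off:

    [client]
        # number of inodes ceph-fuse keeps in its metadata cache
        client cache size = 16384
        # size in bytes of the ceph-fuse object (data) cache
        client oc size = 209715200

    [mds]
        # number of inodes the MDS keeps in its cache
        mds cache size = 100000

Would bumping the client and MDS inode caches (given the 2240 files
in one directory) be the right direction, or is there a better set
of options to look at?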
Any suggestions / comments on either a./ or b./ would be
appreciated :-)
Cheers
Goncalo
--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW 2006
T: +61 2 93511937