Re: Disabling POSIX locking semantics for CephFS

Hi,

On 05/04/2016 09:15 AM, Yan, Zheng wrote:
On Wed, May 4, 2016 at 3:39 AM, Burkhard Linke
<Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
Hi,

On 03.05.2016 18:39, Gregory Farnum wrote:
On Tue, May 3, 2016 at 9:30 AM, Burkhard Linke
<Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
Hi,

we have a number of legacy applications that do not cope well with the POSIX
locking semantics in CephFS because they lack locking support themselves
(e.g. no flock syscalls). We are able to fix some of these applications, but
others are binary only.

Is it possible to disable POSIX locking completely in CephFS (either in the
kernel client or ceph-fuse)?
I'm confused. CephFS supports all of these — although some versions of
FUSE don't; you need a new-ish kernel.

So are you saying that
1) in your setup, it doesn't support both fcntl and flock,
2) that some of your applications don't do well under that scenario?

I don't really see how it's safe for you to just disable the
underlying file locking in an application which depends on it. You may
need to upgrade enough that all file locks are supported.

The application in question does a binary search in a large data file (~75
GB) stored on CephFS. It uses open and mmap without any further locking
(neither fcntl nor flock). Performance was very poor with CephFS (Ubuntu
Trusty with the 4.4 backport kernel from Xenial, and ceph-fuse) compared to
the same application on NFS-based storage. I haven't had the time to dig
into the kernel implementation yet, but I assume the root cause is the
locking of pages accessed via the memory-mapped file. Adding a simple flock
syscall to mark the data file globally as shared solved the problem for us,
reducing the overall runtime from nearly 2 hours to 5 minutes (and thus
comparable to the NFS control case). The application runs on our HPC
cluster, so several hundred instances may access the same data file at once.
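
For illustration, a minimal sketch of the workaround described above: take a
shared (advisory) flock on the data file before mmap'ing it. The file path is
hypothetical; the actual application is binary only and was patched in a
similar way.

/* flock-before-mmap sketch; path is hypothetical */
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/cephfs/data/index.bin";   /* hypothetical */

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Advisory shared lock: marks the file as shared, read-only access. */
    if (flock(fd, LOCK_SH) != 0) { perror("flock"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    void *map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    /* ... binary search over the mapped data ... */

    munmap(map, (size_t)st.st_size);
    flock(fd, LOCK_UN);
    close(fd);
    return 0;
}

The lock is only advisory; the point is that all instances declare shared,
read-mostly access instead of leaving the client to assume the worst case.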

We have other applications that were written without locking support and do
not perform well with CephFS. There was a thread in February with a short
discussion of CephFS mmap performance
(http://article.gmane.org/gmane.comp.file-systems.ceph.user/27501). As
pointed out in that thread, the problem is not only related to mmap itself,
but also to the need to implement proper invalidation. We cannot fix this
for all our applications due to the lack of manpower and, in some cases, the
lack of source code. We either have to find a way to make them work with
CephFS, or use a different setup, e.g. an extra NFS-based mount point
re-exporting CephFS. I would like to avoid the latter solution...
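
For the binary-only programs, one conceivable (untested) workaround is an
LD_PRELOAD shim that takes a shared flock whenever the data file is opened,
so the binary itself stays unchanged. This is only a sketch under stated
assumptions: FLOCK_TARGET is a hypothetical environment variable naming the
data file, and a real shim would also have to cover open64()/openat().

/* flockshim.c - hypothetical LD_PRELOAD interposer, sketch only */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>
#include <sys/file.h>

int open(const char *path, int flags, ...)
{
    static int (*real_open)(const char *, int, ...) = NULL;
    if (!real_open)
        real_open = (int (*)(const char *, int, ...)) dlsym(RTLD_NEXT, "open");

    mode_t mode = 0;
    if (flags & O_CREAT) {
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, mode_t);
        va_end(ap);
    }

    int fd = real_open(path, flags, mode);

    /* FLOCK_TARGET: hypothetical env var naming the file to lock shared. */
    const char *target = getenv("FLOCK_TARGET");
    if (fd >= 0 && target && strcmp(path, target) == 0)
        flock(fd, LOCK_SH);

    return fd;
}

Built with something like "gcc -shared -fPIC -o flockshim.so flockshim.c -ldl"
and run via "LD_PRELOAD=./flockshim.so FLOCK_TARGET=/path/to/datafile ./app".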

Disabling the POSIX semantics and falling back to more NFS-like semantics
without guarantees would be a setback, but probably the easier way (if it is
possible at all). Most data accessed by these applications is read only, so
complex locking is not necessary in those cases.

See http://tracker.ceph.com/issues/15502. Maybe it's related to this issue.

We are using Ceph release 0.94.6, so the performance problems are probably not related to that issue. The page cache also stays populated after an application terminates:

# dd if=test of=/dev/null
20971520+0 records in
20971520+0 records out
10737418240 bytes (11 GB) copied, 109.008 s, 98.5 MB/s
# dd if=test of=/dev/null
20971520+0 records in
20971520+0 records out
10737418240 bytes (11 GB) copied, 9.24535 s, 1.2 GB/s


How does CephFS handle locking when an application uses no explicit locking at all (neither flock nor fcntl)? And what is the default behaviour for mmap'ed access in that case?

Regards,
Burkhard
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



