On Thu, Jul 20, 2017 at 9:19 PM, Andras Pataki
<apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
> We are having some difficulties with CephFS access to the same file
> from multiple nodes concurrently. After debugging some large-ish
> applications with noticeable performance problems using CephFS (with
> the fuse client), I have a small test program to reproduce the
> problem.
>
> The core of the problem boils down to the following operations being
> run on the same file on multiple nodes (in a loop in the test
> program):
>
>     int fd = open(filename, mode);   /* mode is O_RDONLY or O_RDWR */
>     read(fd, buffer, 100);
>     close(fd);
> Here are some results on our cluster:
>
>     One node,  mode=read-only:   7000 opens/second
>     One node,  mode=read-write:  7000 opens/second
>     Two nodes, mode=read-only:   7000 opens/second/node
>     Two nodes, mode=read-write:  around 0.5 opens/second/node (!!!)
>     Two nodes, one read-only, one read-write:
>                                  around 0.5 opens/second/node (!!!)
>     Two nodes, mode=read-write, with the 'read(fd, buffer, 100)'
>     line removed from the code:  500 opens/second/node
> So there seems to be some problem with opening the same file
> read/write and reading from it on multiple nodes: that access pattern
> is three orders of magnitude slower than the other parallel access
> patterns to the same file. The roughly one-second time to open files
> almost looks like a timeout firing somewhere. I have some suspicion
> that this has to do with capability management between the fuse
> client and the MDS, but I don't know enough about that protocol to
> make an educated assessment.
You're pretty much spot on. Things happening at 0.5 per second are
characteristic of a particular class of bug where we are not flushing
the journal soon enough, and are instead waiting for the next periodic
(every five seconds) flush. Hence there is an average 2.5 second delay,
and hence operations complete at approximately half an operation per
second.
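To make that arithmetic concrete, here is a tiny helper (purely illustrative; only the 5-second flush period comes from the explanation above):

```c
#include <assert.h>

/* If every operation stalls until the next periodic journal flush, and
 * flushes happen every `period` seconds, a request arriving at a
 * uniformly random point in the cycle waits period/2 seconds on
 * average, so sustained throughput is 1/(period/2) operations/sec. */
double stalled_rate(double period)
{
    double avg_wait = period / 2.0; /* expected wait for the next flush */
    return 1.0 / avg_wait;
}
```

With the 5-second flush period, stalled_rate(5.0) gives 0.4 operations per second, which lines up with the observed "around 0.5 opens/second/node".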
> [An aside: how does this become a problem, i.e. why open a file
> read/write and then only read from it? It turns out gfortran-compiled
> code does this by default if the user doesn't explicitly say
> otherwise.]
> All the nodes in this test are very lightly loaded, so there does not
> seem to be any noticeable performance bottleneck (network, CPU,
> etc.). The code to reproduce the problem is attached. Simply compile
> it, create a test file with a few bytes of data in it, and run the
> test code on two separate nodes against the same file.
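The attachment doesn't survive the archive, but the core loop presumably looks something like the following minimal sketch (the function name, iteration count, and error handling are illustrative, not from the original attachment):

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime() */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Repeatedly open/read/close `path` and return the achieved
 * opens-per-second rate.  `flags` is O_RDONLY or O_RDWR, matching the
 * read-only / read-write modes in the results above. */
double measure_open_rate(const char *path, int flags, int iters)
{
    char buffer[100];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++) {
        int fd = open(path, flags);
        if (fd < 0) {
            perror("open");
            return -1.0;
        }
        if (read(fd, buffer, sizeof buffer) < 0) {
            perror("read");
            close(fd);
            return -1.0;
        }
        close(fd);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed = (t1.tv_sec - t0.tv_sec)
                   + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return iters / elapsed;
}
```

Wrapped in a small main that calls measure_open_rate(argv[1], O_RDWR, 1000) and prints the result, running it on two nodes against the same CephFS file should reproduce the collapse from ~7000 to ~0.5 opens/second.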
> We are running ceph 10.2.9 on the servers, and we use the 10.2.9
> fuse client on the client nodes.
> Any input/help would be greatly appreciated.
If you have a test/staging environment, it would be great if you could
re-test this on the 12.1.1 release candidate. There have been MDS
fixes for similar slowdowns that showed up in multi-MDS testing, so
it's possible that the issue you're seeing here was fixed along the
way.
John
> Andras
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com