On 10/4/18 7:04 AM, jesper@xxxxxxxx wrote:
Hi All.
First, thanks for the good discussion and strong answers I've gotten so far.
Current cluster setup is 4 OSD hosts x 10 x 12TB 7.2K RPM drives, all on
10GbitE, with metadata on rotating drives - 3x replication - 256GB memory and
32+ cores in each OSD host. The drives sit behind a PERC controller, each disk as a single-disk RAID0 with BBWC.
Planned changes:
- get 1-2 more OSD hosts
- experiment with EC pools for CephFS (a rough command sketch follows below)
- move the MDS onto a separate host and the metadata pool onto SSDs
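A minimal sketch of how such an EC data pool could be set up (pool names, PG counts and k/m values are placeholders; k+m must not exceed the number of hosts when the failure domain is host, and allow_ec_overwrites requires BlueStore):
# create an EC profile and a CephFS data pool using it
ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
ceph osd pool create cephfs_data_ec 256 256 erasure ec42
ceph osd pool set cephfs_data_ec allow_ec_overwrites true
ceph fs add_data_pool cephfs cephfs_data_ec
# point an (empty) directory at the new data pool
setfattr -n ceph.dir.layout.pool -v cephfs_data_ec /ceph/ec-test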
I'm still struggling to get "non-cached" performance up to "hardware"
speed - whatever that means. I run an "fio" benchmark using 10GB files, 16
threads and a 4M block size, at which I can "almost" sustain a full
10GbitE NIC. In this configuration I would have expected it to be "way
above" 10Gbit speed, thus the NIC not "almost" filled but fully
filled - could that be the metadata activity? But on big files and
reads that should not be much - right?
The above is actually OK for production, so it's not a big issue - just
information.
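For reference, the test above was roughly along these lines (the exact invocation may have differed; flags are illustrative):
fio --name=cephfs-read --directory=/ceph --rw=read --bs=4M --size=10G --numjobs=16 --direct=1 --ioengine=libaio --group_reporting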
Single-threaded performance is still struggling.
Cold HDD (read from disk on the NFS-server end) / NFS performance:
jk@zebra01:~$ pipebench < /nfs/16GB.file > /dev/null
Summary:
Piped 15.86 GB in 00h00m27.53s: 589.88 MB/second
Local page cache (just to show it isn't the profiling tool that is the
limiting factor):
jk@zebra03:~$ pipebench < /nfs/16GB.file > /dev/null
Summary:
Piped 29.24 GB in 00h00m09.15s: 3.19 GB/second
jk@zebra03:~$
Now from the Ceph system:
jk@zebra01:~$ pipebench < /ceph/bigfile.file > /dev/null
Summary:
Piped 36.79 GB in 00h03m47.66s: 165.49 MB/second
Can block/stripe-size be tuned? Does it make sense?
Does read-ahead on the CephFS kernel-client need tuning?
What performance are other people seeing?
Other thoughts - recommendations?
On some of the shares we're storing pretty large files (GB size) and
need the backup to move them to tape - so we would prefer to be able to
keep an LTO6 drive's write speed saturated with a single thread.
40-ish 7.2K RPM drives should add up to more than the above... right?
This is the only load currently being put on the cluster, plus ~100MB/s of
recovery traffic.
The problem with single-threaded performance in Ceph is that it reads
the spindles serially, so you are practically reading one drive at a
time and see a single disk's performance, minus all the overheads
from Ceph, the network, the MDS, etc.
So you do not get the combined performance of the drives, only one drive
at a time. The trick for Ceph performance is to get more spindles
working for you at the same time.
There are ways to get more performance out of a single thread:
- faster components in the path, i.e. faster disks/network/CPU/memory
- larger pre-fetching/read-ahead: with a large enough read-ahead, more
OSDs will participate in a read simultaneously. [1] shows a table of
benchmarks with different read-ahead sizes (a mount example follows after
this list).
- erasure coding: while erasure coding does add latency vs. replicated
pools, it gets more spindles involved in reading in parallel, so for
large sequential loads it can be a benefit.
- some sort of extra caching scheme; I have not looked at cachefiles,
but it may provide some benefit.
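As a concrete sketch of the read-ahead point (monitor address, paths and the value are placeholders; rasize is the maximum read-ahead in bytes):
# mount CephFS with a larger maximum read-ahead, here 128 MiB
mount -t ceph 10.0.0.1:6789:/ /ceph -o name=admin,secretfile=/etc/ceph/admin.secret,rasize=134217728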
You can also play with the different CephFS client implementations:
there is a FUSE client, where you can play with different cache
solutions, but generally the kernel client is faster.
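For comparison, the two clients are mounted roughly like this (addresses and paths are placeholders; the FUSE client also takes client-side options such as client_readahead_max_bytes, but check the documentation for your release):
# kernel client (usually the faster option)
mount -t ceph 10.0.0.1:6789:/ /ceph -o name=admin,secretfile=/etc/ceph/admin.secret
# FUSE client, handy for experimenting with client-side caching
ceph-fuse -m 10.0.0.1:6789 /ceph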
In RBD there is a fancy striping feature, via --stripe-unit and
--stripe-count, which would get more spindles running; perhaps consider
using RBD instead of CephFS if it fits the workload.
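A rough example of creating such a striped image (names and sizes are illustrative; stripe-unit must not exceed the object size, and the kernel RBD client may not support fancy striping, so librbd or rbd-nbd may be needed):
# create an image striped 64K at a time across 16 objects
rbd create rbdpool/backupvol --size 10T --object-size 4M --stripe-unit 64K --stripe-count 16
rbd-nbd map rbdpool/backupvol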
[1]
https://tracker.ceph.com/projects/ceph/wiki/Kernel_client_read_ahead_optimization
good luck
Ronny Aasen