On 10/4/18 7:04 AM, jesper@xxxxxxxx wrote:
Hi All.
First, thanks for the good discussion and strong answers I've gotten so far.
Current cluster setup is 4 OSD hosts x 10 x 12TB 7.2K RPM drives, all on
10GbitE, with metadata on rotating drives - 3x replication - 256GB memory and
32+ cores in each OSD host. The drives sit behind a PERC controller, each disk as a single-disk RAID0 with BBWC.
Planned changes:
- get 1-2 more OSD hosts
- experiment with EC pools for CephFS (a rough command sketch follows below)
- move the MDS onto a separate host and the metadata pool onto SSDs
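A minimal sketch of how such an EC data pool could be set up (pool names, PG counts and k/m values are placeholders; k+m must not exceed the number of hosts when the failure domain is host, and allow_ec_overwrites requires BlueStore):
# create an EC profile and a CephFS data pool using it
ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
ceph osd pool create cephfs_data_ec 256 256 erasure ec42
ceph osd pool set cephfs_data_ec allow_ec_overwrites true
ceph fs add_data_pool cephfs cephfs_data_ec
# point an (empty) directory at the new data pool
setfattr -n ceph.dir.layout.pool -v cephfs_data_ec /ceph/ec-test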
I'm still struggling to get "non-cached" performance up to "hardware"
speed - whatever that means. I run an "fio" benchmark using 10GB files, 16
threads and a 4M block size, at which I can "almost" sustain a full
10GbitE NIC. In this configuration I would have expected it to be "way
above" 10Gbit speed, thus the NIC not "almost" filled but fully
filled - could that be the metadata activity? But on big files and
reads that should not be much - right?
The above is actually OK for production, so it's not a big issue - just
information.
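For reference, the test above was roughly along these lines (the exact invocation may have differed; flags are illustrative):
fio --name=cephfs-read --directory=/ceph --rw=read --bs=4M --size=10G --numjobs=16 --direct=1 --ioengine=libaio --group_reporting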
Single-threaded performance is still struggling.
Cold HDD (read from disk on the NFS-server end) / NFS performance:
jk@zebra01:~$ pipebench < /nfs/16GB.file > /dev/null
Summary:
Piped 15.86 GB in 00h00m27.53s: 589.88 MB/second
Local page cache (just to show it isn't the profiling tool that is the
limiting factor):
jk@zebra03:~$ pipebench < /nfs/16GB.file > /dev/null
Summary:
Piped 29.24 GB in 00h00m09.15s: 3.19 GB/second
jk@zebra03:~$
Now from the Ceph system:
jk@zebra01:~$ pipebench < /ceph/bigfile.file > /dev/null
Summary:
Piped 36.79 GB in 00h03m47.66s: 165.49 MB/second
Can block/stripe-size be tuned? Does it make sense?
Does read-ahead on the CephFS kernel-client need tuning?
What performance are other people seeing?
Other thoughts - recommendations?
On some of the shares we're storing pretty large files (GB size) and
need the backup to move them to tape - so we would prefer to be able to
keep an LTO6 drive's write speed saturated with a single thread.
40-ish 7.2K RPM drives should add up to more than the above... right?
This is the only load currently being put on the cluster, plus ~100MB/s of
recovery traffic.
The problem with single-threaded performance in Ceph is that it reads
the spindles serially, so you are practically reading one drive at a
time and see a single disk's performance, minus all the overheads
from Ceph, the network, the MDS, etc.
So you do not get the combined performance of the drives, only one drive
at a time. The trick for Ceph performance is to get more spindles
working for you at the same time.
There are ways to get more performance out of a single thread:
- faster components in the path, i.e. faster disks/network/CPU/memory
- larger pre-fetching/read-ahead: with a large enough read-ahead, more
OSDs will participate in a read simultaneously. [1] shows a table of
benchmarks with different read-ahead sizes (a mount example follows after
this list).
- erasure coding: while erasure coding does add latency vs. replicated
pools, it gets more spindles involved in reading in parallel, so for
large sequential loads it can be a benefit.
- some sort of extra caching scheme; I have not looked at cachefiles,
but it may provide some benefit.
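As a concrete sketch of the read-ahead point (monitor address, paths and the value are placeholders; rasize is the maximum read-ahead in bytes):
# mount CephFS with a larger maximum read-ahead, here 128 MiB
mount -t ceph 10.0.0.1:6789:/ /ceph -o name=admin,secretfile=/etc/ceph/admin.secret,rasize=134217728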
You can also play with the different CephFS client implementations:
there is a FUSE client, where you can play with different cache
solutions, but generally the kernel client is faster.
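For comparison, the two clients are mounted roughly like this (addresses and paths are placeholders; the FUSE client also takes client-side options such as client_readahead_max_bytes, but check the documentation for your release):
# kernel client (usually the faster option)
mount -t ceph 10.0.0.1:6789:/ /ceph -o name=admin,secretfile=/etc/ceph/admin.secret
# FUSE client, handy for experimenting with client-side caching
ceph-fuse -m 10.0.0.1:6789 /ceph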
In RBD there is a fancy striping feature, via --stripe-unit and
--stripe-count, which would get more spindles running; perhaps consider
using RBD instead of CephFS if it fits the workload.
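A rough example of creating such a striped image (names and sizes are illustrative; stripe-unit must not exceed the object size, and the kernel RBD client may not support fancy striping, so librbd or rbd-nbd may be needed):
# create an image striped 64K at a time across 16 objects
rbd create rbdpool/backupvol --size 10T --object-size 4M --stripe-unit 64K --stripe-count 16
rbd-nbd map rbdpool/backupvol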
[1]
https://tracker.ceph.com/projects/ceph/wiki/Kernel_client_read_ahead_optimization
good luck
Ronny Aasen