Hello,

On Mon, 04 Sep 2017 15:27:34 +0000 c.monty@xxxxxx wrote:

> Hello!
>
> I'm validating IO performance of CephFS vs. NFS.
>
Well, at this point you seem to be comparing apples to bananas.
You're telling us results, but your mail lacks the crucial information
required to give you a qualified answer.

> Therefore I have mounted the relevant filesystems on the same client.
Kernel client for Ceph? Which Ceph version, anyway?

> Then I start fio with the following parameters:
Full fio command line; this is relevant, especially considering the
size of the file being worked on.

> action = randwrite randrw
> blocksize = 4k 128k 8m
> rwmixread = 70 50 30
> 32 jobs run in parallel
>
You really need to tell us more about your setup: server HW
(CPU/RAM/controller/HDD types, etc.) and network.

> The NFS share is striping over 5 virtual disks with a 4+1 RAID5
> configuration; each disk has ~8TB.
OK, as mentioned above, HW details: which 8TB drives? Single host?
What server/chassis is this? CPU/RAM? HW cache of the RAID controller
(which model?)
A RAID5, really?
Why don't you do your tests on something you'd actually deploy in
production? Read: RAID6.

> The CephFS is configured on 2 MDS servers (1 up:active, 1 up:standby);
> each MDS has 47 OSDs where 1 OSD is represented by a single 8TB disk.
Again HW, especially CPU and RAM.
MDS servers don't "have" OSDs; usually they are separate from the OSD
servers, which do have the OSDs.
From what we can guess, you have two combined MON/OSD/MDS servers, each
with 47 OSDs in them, correct?
No SSD journals, all on HDDs? Filestore or Bluestore?
Replication is 2 then? Again, something you wouldn't do in production,
so your results are flawed.

> (The disks of RAID5 and OSD are identical.)
>
The results would be more interesting/telling if they included IOPS and
service times.
Comparing bandwidth in this case is sufficiently indicative, but not
really something you'll be looking for in most use cases.

As for the results, they are not surprising per se.
Reasons for NFS being faster than CephFS in your setup likely include
all of the below, none of which are really FS specific:

1. Latency
Since Ceph (RADOS) needs to replicate each write to 2 (assumption) OSDs
on both (assumption) OSD servers, the network latencies and Ceph OSD
code overhead tend to severely impact IOPS for small I/Os.
You simply can't compare a distributed storage system to something
running on a single node.

2. HW RAID cache
Assuming you have something like a 2GB HW cache on that RAID controller
for your NFS tests, that will massively benefit small IOPS and thus
improve things for these tests.

3. Size of the fio test file
If your test file is small, the chances increase that Ceph will not take
full advantage of all the OSDs present.
This could be improved by doing fancy striping, something you need to
decide on before creating pools and base on your use case (see the
sketches below).
OTOH, a smallish test file may fit entirely into the HW caches above.
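To illustrate points 2 and 3: one of the 128k mixed runs could have been
produced by something roughly like the invocation below. This is a
guessed reconstruction, not the command line that was actually used; the
directory, size=, ioengine= and direct= values are assumptions on my
part, and they are exactly the knobs that decide whether the HW cache or
only a handful of OSDs get exercised.

# Guessed reconstruction only -- NOT the posted command line.
# Matches the stated parameters (randrw, rwmixread=30, bs=128k, 32 jobs,
# ~60s runs); directory, size, ioengine and direct are assumptions.
fio --name=cephfs-randrw30-128k --directory=/mnt/cephfs \
    --rw=randrw --rwmixread=30 --bs=128k --numjobs=32 \
    --size=2g --runtime=60 --time_based \
    --ioengine=libaio --direct=1 --group_reporting

And if you do look into striping (point 3), CephFS exposes per-directory
file layouts as virtual xattrs. A minimal sketch, with purely
illustrative values and a made-up mount point; the layout only applies
to files created after the change:

# Illustrative values only; affects files created in this directory
# afterwards, not existing ones.
setfattr -n ceph.dir.layout.stripe_unit  -v 1048576 /mnt/cephfs/fio-test
setfattr -n ceph.dir.layout.stripe_count -v 8       /mnt/cephfs/fio-test
setfattr -n ceph.dir.layout.object_size  -v 8388608 /mnt/cephfs/fio-test
getfattr -n ceph.dir.layout /mnt/cephfs/fio-test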
Christian

> What I can see is that the IO performance of blocksize 8m is slightly
> better with CephFS, but worse (by a factor of 4-10) with blocksize
> 4k / 128k.
> Here the stats for randrw with mix 30:
> ld9930:/home # tail -n 3 ld9930-fio-test-cephfs-randrw30-8m
> Run status group 0 (all jobs):
>    READ: bw=335MiB/s (351MB/s), 335MiB/s-335MiB/s (351MB/s-351MB/s), io=19.7GiB (21.2GB), run=60099-60099msec
>   WRITE: bw=753MiB/s (789MB/s), 753MiB/s-753MiB/s (789MB/s-789MB/s), io=44.2GiB (47.5GB), run=60099-60099msec
>
> ld9930:/home # tail -n 3 ld9930-fio-test-nfs-randrw30-8m
> Run status group 0 (all jobs):
>    READ: bw=324MiB/s (340MB/s), 324MiB/s-324MiB/s (340MB/s-340MB/s), io=19.0GiB (20.5GB), run=60052-60052msec
>   WRITE: bw=725MiB/s (760MB/s), 725MiB/s-725MiB/s (760MB/s-760MB/s), io=42.6GiB (45.7GB), run=60052-60052msec
>
> ld9930:/home # tail -n 3 ld9930-fio-test-nfs-randrw30-128k
> Run status group 0 (all jobs):
>    READ: bw=287MiB/s (301MB/s), 287MiB/s-287MiB/s (301MB/s-301MB/s), io=16.9GiB (18.7GB), run=60006-60006msec
>   WRITE: bw=667MiB/s (700MB/s), 667MiB/s-667MiB/s (700MB/s-700MB/s), io=39.1GiB (41.1GB), run=60006-60006msec
>
> ld9930:/home # tail -n 3 ld9930-fio-test-cephfs-randrw30-128k
> Run status group 0 (all jobs):
>    READ: bw=69.2MiB/s (72.6MB/s), 69.2MiB/s-69.2MiB/s (72.6MB/s-72.6MB/s), io=4172MiB (4375MB), run=60310-60310msec
>   WRITE: bw=161MiB/s (169MB/s), 161MiB/s-161MiB/s (169MB/s-169MB/s), io=9732MiB (10.3GB), run=60310-60310msec
>
> ld9930:/home # tail -n 3 ld9930-fio-test-cephfs-randrw30-4k
> Run status group 0 (all jobs):
>    READ: bw=5631KiB/s (5766kB/s), 5631KiB/s-5631KiB/s (5766kB/s-5766kB/s), io=330MiB (346MB), run=60043-60043msec
>   WRITE: bw=12.8MiB/s (13.4MB/s), 12.8MiB/s-12.8MiB/s (13.4MB/s-13.4MB/s), io=767MiB (804MB), run=60043-60043msec
>
> ld9930:/home # tail -n 3 ld9930-fio-test-nfs-randrw30-4k
> Run status group 0 (all jobs):
>    READ: bw=77.2MiB/s (80.8MB/s), 77.2MiB/s-77.2MiB/s (80.8MB/s-80.8MB/s), io=4621MiB (4846MB), run=60004-60004msec
>   WRITE: bw=180MiB/s (188MB/s), 180MiB/s-180MiB/s (188MB/s-188MB/s), io=10.6GiB (11.4GB), run=60004-60004msec
>
>
> This implies that for good IO performance only data with blocksize > 128k
> (I guess > 1M) should be used.
> Can anybody confirm this?
>
> THX
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Rakuten Communications

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com