Hello,

On Mon, 04 Sep 2017 15:27:34 +0000 c.monty@xxxxxx wrote:

> Hello!
>
> I'm validating IO performance of CephFS vs. NFS.
>
Well, at this point you seem to be comparing apples to bananas.
You're telling us results, but your mail lacks the crucial information
required to give you a qualified answer.

> Therefore I have mounted the relevant filesystems on the same client.
Kernel client for Ceph? Which Ceph version, anyway?

> Then I start fio with the following parameters:
Full fio command line; this is relevant, especially considering the
size of the file being worked on.

> action = randwrite randrw
> blocksize = 4k 128k 8m
> rwmixread = 70 50 30
> 32 jobs run in parallel
>
You really need to tell us more about your setup: server HW
(CPU/RAM/controller/HDD types, etc.) and network.

> The NFS share is striping over 5 virtual disks with a 4+1 RAID5
> configuration; each disk has ~8TB.
OK, as mentioned above, HW details: which 8TB drives? Single host?
What server/chassis is this? CPU/RAM? HW cache of the RAID controller
(which model?)
A RAID5, really?
Why don't you do your tests on something you'd actually deploy in
production? Read: RAID6.

> The CephFS is configured on 2 MDS servers (1 up:active, 1 up:standby);
> each MDS has 47 OSDs where 1 OSD is represented by a single 8TB disk.
Again HW, especially CPU and RAM.
MDS servers don't "have" OSDs; usually they are separate from the OSD
servers, which do have the OSDs.
From what we can guess, you have two combined MON/OSD/MDS servers, each
with 47 OSDs in them, correct?
No SSD journals, all on HDDs? Filestore or Bluestore?
Replication is 2 then? Again, something you wouldn't do in production,
so your results are flawed.

> (The disks of RAID5 and OSD are identical.)
>
The results would be more interesting/telling if they included IOPS and
service times.
Comparing bandwidth in this case is sufficiently indicative, but not
really something you'll be looking for in most use cases.

As for the results, they are not surprising per se.
Reasons for NFS being faster than CephFS in your setup likely include
all of the below, none of which are really FS specific:

1. Latency
Since Ceph (RADOS) needs to replicate each write to 2 (assumption) OSDs
on both (assumption) OSD servers, the network latencies and Ceph OSD
code overhead tend to severely impact IOPS for small I/Os.
You simply can't compare a distributed storage system to something
running on a single node.

2. HW RAID cache
Assuming you have something like a 2GB HW cache on that RAID controller
for your NFS tests, that will massively benefit small IOPS and thus
improve things for these tests.

3. Size of the fio test file
If your test file is small, the chances increase that Ceph will not take
full advantage of all the OSDs present.
This could be improved by doing fancy striping, something you need to
decide on before creating pools and base on your use case (see the
sketches below).
OTOH, a smallish test file may fit entirely into the HW caches above.
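To illustrate points 2 and 3: one of the 128k mixed runs could have been
produced by something roughly like the invocation below. This is a
guessed reconstruction, not the command line that was actually used; the
directory, size=, ioengine= and direct= values are assumptions on my
part, and they are exactly the knobs that decide whether the HW cache or
only a handful of OSDs get exercised.

# Guessed reconstruction only -- NOT the posted command line.
# Matches the stated parameters (randrw, rwmixread=30, bs=128k, 32 jobs,
# ~60s runs); directory, size, ioengine and direct are assumptions.
fio --name=cephfs-randrw30-128k --directory=/mnt/cephfs \
    --rw=randrw --rwmixread=30 --bs=128k --numjobs=32 \
    --size=2g --runtime=60 --time_based \
    --ioengine=libaio --direct=1 --group_reporting

And if you do look into striping (point 3), CephFS exposes per-directory
file layouts as virtual xattrs. A minimal sketch, with purely
illustrative values and a made-up mount point; the layout only applies
to files created after the change:

# Illustrative values only; affects files created in this directory
# afterwards, not existing ones.
setfattr -n ceph.dir.layout.stripe_unit  -v 1048576 /mnt/cephfs/fio-test
setfattr -n ceph.dir.layout.stripe_count -v 8       /mnt/cephfs/fio-test
setfattr -n ceph.dir.layout.object_size  -v 8388608 /mnt/cephfs/fio-test
getfattr -n ceph.dir.layout /mnt/cephfs/fio-test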
Christian

> What I can see is that the IO performance of blocksize 8m is slightly
> better with CephFS, but worse (by a factor of 4-10) with blocksize
> 4k / 128k.
> Here the stats for randrw with mix 30:
> ld9930:/home # tail -n 3 ld9930-fio-test-cephfs-randrw30-8m
> Run status group 0 (all jobs):
>    READ: bw=335MiB/s (351MB/s), 335MiB/s-335MiB/s (351MB/s-351MB/s), io=19.7GiB (21.2GB), run=60099-60099msec
>   WRITE: bw=753MiB/s (789MB/s), 753MiB/s-753MiB/s (789MB/s-789MB/s), io=44.2GiB (47.5GB), run=60099-60099msec
>
> ld9930:/home # tail -n 3 ld9930-fio-test-nfs-randrw30-8m
> Run status group 0 (all jobs):
>    READ: bw=324MiB/s (340MB/s), 324MiB/s-324MiB/s (340MB/s-340MB/s), io=19.0GiB (20.5GB), run=60052-60052msec
>   WRITE: bw=725MiB/s (760MB/s), 725MiB/s-725MiB/s (760MB/s-760MB/s), io=42.6GiB (45.7GB), run=60052-60052msec
>
> ld9930:/home # tail -n 3 ld9930-fio-test-nfs-randrw30-128k
> Run status group 0 (all jobs):
>    READ: bw=287MiB/s (301MB/s), 287MiB/s-287MiB/s (301MB/s-301MB/s), io=16.9GiB (18.7GB), run=60006-60006msec
>   WRITE: bw=667MiB/s (700MB/s), 667MiB/s-667MiB/s (700MB/s-700MB/s), io=39.1GiB (41.1GB), run=60006-60006msec
>
> ld9930:/home # tail -n 3 ld9930-fio-test-cephfs-randrw30-128k
> Run status group 0 (all jobs):
>    READ: bw=69.2MiB/s (72.6MB/s), 69.2MiB/s-69.2MiB/s (72.6MB/s-72.6MB/s), io=4172MiB (4375MB), run=60310-60310msec
>   WRITE: bw=161MiB/s (169MB/s), 161MiB/s-161MiB/s (169MB/s-169MB/s), io=9732MiB (10.3GB), run=60310-60310msec
>
> ld9930:/home # tail -n 3 ld9930-fio-test-cephfs-randrw30-4k
> Run status group 0 (all jobs):
>    READ: bw=5631KiB/s (5766kB/s), 5631KiB/s-5631KiB/s (5766kB/s-5766kB/s), io=330MiB (346MB), run=60043-60043msec
>   WRITE: bw=12.8MiB/s (13.4MB/s), 12.8MiB/s-12.8MiB/s (13.4MB/s-13.4MB/s), io=767MiB (804MB), run=60043-60043msec
>
> ld9930:/home # tail -n 3 ld9930-fio-test-nfs-randrw30-4k
> Run status group 0 (all jobs):
>    READ: bw=77.2MiB/s (80.8MB/s), 77.2MiB/s-77.2MiB/s (80.8MB/s-80.8MB/s), io=4621MiB (4846MB), run=60004-60004msec
>   WRITE: bw=180MiB/s (188MB/s), 180MiB/s-180MiB/s (188MB/s-188MB/s), io=10.6GiB (11.4GB), run=60004-60004msec
>
>
> This implies that for good IO performance only data with blocksize > 128k
> (I guess > 1M) should be used.
> Can anybody confirm this?
>
> THX
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Rakuten Communications

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com