Re: Question on Sequential Write performance at 4K blocksize

Hello,

On Wed, 13 Jul 2016 18:15:10 +0000 EP Komarla wrote:

> Hi All,
> 
> Have a question on the performance of sequential write @ 4K block sizes.
> 
Which version of Ceph?
Any significant ceph.conf modifications?
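
If you're not sure what to paste for that, something along these lines is usually enough (just a sketch; run the daemon command on the node that actually hosts osd.0 and adjust the OSD id):
---
ceph -v
ceph daemon osd.0 config show | grep -E 'journal|filestore|rbd'
---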

> Here is my configuration:
> 
> Ceph Cluster: 6 Nodes. Each node with :-
> 20x HDDs (OSDs) - 10K RPM 1.2 TB SAS disks
> SSDs - 4x - Intel S3710, 400GB; for OSD journals shared across 20 HDDs (i.e., SSD journal ratio 1:5)
> 
> Network:
> - Client network - 10Gbps
> - Cluster network - 10Gbps
> - Each node with dual NIC - Intel 82599 ES - driver version 4.0.1
> 
> Traffic generators:
> 2 client servers - running on dual Intel sockets with 16 physical cores (32 cores with hyper-threading enabled)
> 
Are you mounting a RBD image on those servers via the kernel interface and
if so which kernel version?
Are you running the test inside a VM on those servers, or are you using
the RBD ioengine with fio?
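
For reference, fio can also talk to librbd directly and take the kernel client and any VM layer out of the picture, assuming your fio build has rbd support; the pool and image names below are just placeholders:
---
# create a throwaway test image first, e.g.:
#   rbd create --size 10240 rbd/fio_test
fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=fio_test \
    --rw=write --blocksize=4K --iodepth=32 --numjobs=1 --direct=1 \
    --name=rbd_write_test
---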

> Test program:
> FIO - sequential read/write; random read/write

Exact fio command line please.


> Blocksizes - 4k, 32k, 256k...
> FIO - Number of jobs = 32; IO depth = 64
> Runtime = 10 minutes; Ramptime = 5 minutes
> Filesize = 4096g (4TB)
> 
> I observe that my sequential write performance at 4K block size is very low - I am getting around 6MB/sec bandwidth.  The performance improves significantly at larger block sizes (shown below)
> 
This is to some extent expected and normal.
You can see this behavior on local storage as well, just not as
pronounced.

Your main enemy here is latency: each write potentially needs to be sent
to the storage servers (replication!) and then ACKed back to the client.

If your fio command line uses synchronous/direct writes (i.e. direct=1),
things will be at their worst.

Small IOPS also stress your CPUs; look at atop on your storage nodes
during a 4KB fio run.
That might also reveal other issues (such as overloaded HDDs/SSDs).
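
Something as quick as this on each storage node while the 4K run is active usually shows whether you are CPU- or disk-bound (iostat is part of the sysstat package):
---
atop 2          # watch the CPU lines and the DSK lines for the journal SSDs
iostat -x 2     # %util close to 100 on the SSDs/HDDs means they are saturated
---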

RBD caching (is it enabled on your clients?) can help with non-direct
writes.
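
If it is not enabled, a minimal client-side ceph.conf snippet along these lines is the usual starting point (a sketch; check the documentation for your release, and note that the kernel RBD client does not use this cache at all):
---
[client]
rbd cache = true
rbd cache writethrough until flush = true
---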

That all being said, if I run this fio inside a VM (with RBD caching
enabled) against a cluster here with 4 nodes connected by QDR (40Gb/s)
Infiniband, 4x100GB DC S3700 and 8x plain SATA HDDs, I get:
---
# fio --size=4G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 \
    --rw=write --name=fiojob --blocksize=4K --iodepth=32

  write: io=4096.0MB, bw=134274KB/s, iops=33568 , runt= 31237msec

Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=134273KB/s, minb=134273KB/s, maxb=134273KB/s, mint=31237msec, maxt=31237msec
---

And with buffered I/O (direct=0) I get:
---
  write: io=4096.0MB, bw=359194KB/s, iops=89798 , runt= 11677msec


Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=359193KB/s, minb=359193KB/s, maxb=359193KB/s, mint=11677msec, maxt=11677msec
---

Increasing numjobs of course reduces the per-job performance, so numjobs=2
will give half the speed per individual job.

So something is fishy with your setup, unless the 5.6MB/s below is the
result PER JOB, which would make it about 180MB/s with 32 jobs (or even
360MB/s with 64 jobs) and a pretty decent, expected result.

Christian

> FIO - Sequential Write test
> 
> Block Size    Sequential Write Bandwidth (KB/s)
> 4K            5694
> 32K           141020
> 256K          747421
> 1024K         602236
> 4096K         683029
> 
> Here are my questions:
> - Why is the sequential write performance at 4K block size so low? Is this in line with what others see?
> - Is it because of the small number of clients, i.e., traffic generators? I am planning to increase the number of clients to 4 servers.
> - There is a later version of the NIC driver from Intel, v4.3.15 - do you think upgrading to it will improve performance?
> 
> Any thoughts or pointers will be helpful.
> 
> Thanks,
> 
> - epk
> 
> Legal Disclaimer:
> The information contained in this message may be privileged and confidential. It is intended to be read only by the individual or entity to whom it is addressed or by their designee. If the reader of this message is not the intended recipient, you are on notice that any distribution of this message, in any form, is strictly prohibited. If you have received this message in error, please immediately notify the sender and delete or destroy any copy of this message!


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


