Re: Ceph-fuse huge performance gap between different block sizes

On Fri, 25 Mar 2016 09:17:08 +0000 Zhang Qiang wrote:

> Hi Christian, Thanks for your reply, here're the test specs:
> >>>
> [global]
> ioengine=libaio
> runtime=90
> direct=1
There it is.

You do understand what that flag does and what latencies are, right?

You're basically telling the I/O stack to not acknowledge the write until
it has reached the actual storage.

So your 4KB block has to traverse all the way from your client to the
storage node holding the primary PG, be written to the journal there (and
likewise on any replicas), and only THEN does the primary OSD send an ACK
back to the client that the write is done.
That amounts to about 3300 IOPS in your case, which is not actually bad:
roughly a 0.3ms round trip per write.
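A quick back-of-the-envelope check of those figures, using the aggregate 4k throughput from the fio results quoted below (13245 KB/s):

```python
# Sanity-check the IOPS and latency figures from the 4k fio job.
aggrb_kb_s = 13245   # aggregate bandwidth reported by fio for the 4k job
block_kb = 4         # block size of that job

iops = aggrb_kb_s / block_kb   # acknowledged writes per second
latency_ms = 1000 / iops       # effective time per write, treated serially

print(f"{iops:.0f} IOPS, ~{latency_ms:.2f} ms per write")
# prints: 3311 IOPS, ~0.30 ms per write
```

Note this treats the pipeline as serial; with iodepth=16 and 20 jobs the
per-request latency is actually higher, but the effective rate comes out
the same.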

As I said before, that's the sum of all your network latencies and Ceph
code overhead. 

If you have RBD caching (look it up) enabled on your fuse client machine
AND set direct=0, those writes can ideally be coalesced into perfect 4MB
Ceph block sized operations.
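As a hedged sketch of that suggestion: the options below are real Ceph client settings, but "rbd cache" governs RBD clients, and ceph-fuse (CephFS) has its own client-side caching, so verify which knobs your client version actually honors before relying on this.

```ini
# Sketch only: client-side write caching in ceph.conf.
# "rbd cache" applies to RBD clients; if testing through ceph-fuse,
# check your version's own client cache settings instead.
[client]
rbd cache = true
rbd cache writethrough until flush = true
```

Combined with direct=0 in the fio job file, small sequential writes can
then be absorbed by the cache and flushed out in larger chunks.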

Christian

> group_reporting
> iodepth=16
> ramp_time=5
> size=1G
> 
> [seq_w_4k_20]
> bs=4k
> filename=seq_w_4k_20
> rw=write
> numjobs=20
> 
> [seq_w_1m_20]
> bs=1m
> filename=seq_w_1m_20
> rw=write
> numjobs=20
> <<<<
> 
> Test results: 4k -  aggrb=13245KB/s, 1m - aggrb=1102.6MB/s
> 
> Mount options:  ceph-fuse /ceph -m 10.3.138.36:6789
> 
> Ceph configurations:
> >>>>
> filestore_xattr_use_omap = true
> auth cluster required = cephx
> auth service required = cephx
> auth client required = cephx
> osd journal size = 128
> osd pool default size = 2
> osd pool default min size = 1
> osd pool default pg num = 512
> osd pool default pgp num = 512
> osd crush chooseleaf type = 1
> <<<<
> 
> Other configurations are all default.
> 
> Status:
>      health HEALTH_OK
>      monmap e5: 5 mons at {1=10.3.138.37:6789/0,2=10.3.138.39:6789/0,3=10.3.138.40:6789/0,4=10.3.138.59:6789/0,GGZ-YG-S0311-PLATFORM-138=10.3.138.36:6789/0}
>             election epoch 28, quorum 0,1,2,3,4 GGZ-YG-S0311-PLATFORM-138,1,2,3,4
>      mdsmap e55: 1/1/1 up {0=1=up:active}
>      osdmap e1290: 20 osds: 20 up, 20 in
>       pgmap v7180: 1000 pgs, 2 pools, 14925 MB data, 3851 objects
>             37827 MB used, 20837 GB / 21991 GB avail
>                 1000 active+clean
> 
> On Fri, 25 Mar 2016 at 16:44 Christian Balzer <chibi@xxxxxxx> wrote:
> 
> >
> > Hello,
> >
> > On Fri, 25 Mar 2016 08:11:27 +0000 Zhang Qiang wrote:
> >
> > > Hi all,
> > >
> > > According to fio,
> > Exact fio command please.
> >
> > > with 4k block size, the sequential write performance of
> > > my ceph-fuse mount
> >
> > Exact mount options, ceph config (RBD cache) please.
> >
> > > is just about 20+ M/s, only 200 Mb of 1 Gb full
> > > duplex NIC outgoing bandwidth was used for maximum. But for 1M block
> > > size the performance could achieve as high as 1000 M/s, approaching
> > > the limit of the NIC bandwidth. Why do the performance stats differ so
> > > much for different block sizes?
> > That's exactly why.
> > You can see that with local attached storage as well, many small
> > requests are slower than large (essentially sequential) writes.
> > Network attached storage in general (latency) and thus Ceph as well
> > (plus code overhead) amplify that.
> >
> > > Can I configure the ceph-fuse mount's block size
> > > for maximum performance?
> > >
> > Very little to do with that if you're using sync writes (hence the
> > request for the exact fio command line, please); if not, RBD cache
> > could/should help.
> >
> > Christian
> >
> > > Basic information about the cluster: 20 OSDs on separate PCIe hard
> > > disks distributed across 2 servers, each with write performance
> > > about 300 M/s; 5 MONs; 1 MDS. Ceph version 0.94.6
> > > (e832001feaf8c176593e0325c8298e3f16dfb403).
> > >
> > > Thanks :)
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> >


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


