Re: cephfs, low performances

On Fri, Dec 18, 2015 at 11:16 AM, Christian Balzer <chibi@xxxxxxx> wrote:
>
> Hello,
>
> On Fri, 18 Dec 2015 03:36:12 +0100 Francois Lafont wrote:
>
>> Hi,
>>
>> I have a Ceph cluster that is currently unused, and I get (to my mind) very
>> low performance. I'm not an expert in benchmarks; here is an example of a
>> quick bench:
>>
>> ---------------------------------------------------------------
>> # fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
>>       --name=readwrite --filename=rw.data --bs=4k --iodepth=64 \
>>       --size=300MB --readwrite=randrw --rwmixread=50
>> readwrite: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
>> fio-2.1.3
>> Starting 1 process
>> readwrite: Laying out IO file(s) (1 file(s) / 300MB)
>> Jobs: 1 (f=1): [m] [100.0% done] [2264KB/2128KB/0KB /s] [566/532/0 iops] [eta 00m:00s]
>> readwrite: (groupid=0, jobs=1): err= 0: pid=3783: Fri Dec 18 02:01:13 2015
>>   read : io=153640KB, bw=2302.9KB/s, iops=575, runt= 66719msec
>>   write: io=153560KB, bw=2301.7KB/s, iops=575, runt= 66719msec
>>   cpu          : usr=0.77%, sys=3.07%, ctx=115432, majf=0, minf=604
>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
>>      issued    : total=r=38410/w=38390/d=0, short=r=0/w=0/d=0
>>
>> Run status group 0 (all jobs):
>>    READ: io=153640KB, aggrb=2302KB/s, minb=2302KB/s, maxb=2302KB/s, mint=66719msec, maxt=66719msec
>>   WRITE: io=153560KB, aggrb=2301KB/s, minb=2301KB/s, maxb=2301KB/s, mint=66719msec, maxt=66719msec
>> ---------------------------------------------------------------
>>


In this case fio is testing AIO performance. CephFS does not handle AIO
properly: the AIO requests are actually executed as synchronous IO, so the
iodepth=64 in your job is effectively an iodepth of 1. That's why CephFS looks
so slow in this test.
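
A quick way to see this locally (just a sketch, not something I have run on
your cluster): repeat the workload with a synchronous engine, which by
definition keeps one IO in flight per process. If the numbers come out close
to your libaio/iodepth=64 run, that is consistent with the queue depth not
actually being honoured:

# fio --randrepeat=1 --ioengine=psync --direct=1 --gtod_reduce=1 \
      --name=readwrite --filename=rw.data --bs=4k \
      --size=300MB --readwrite=randrw --rwmixread=50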

Regards
Yan, Zheng


>> It seems very bad to me.
> Indeed.
> First, let me state that I don't use CephFS and have no clue how it
> influences things or how it can/should be tuned.
>
> That being said, the fio job above, run in a VM (RBD), gives me 440 IOPS
> against a single OSD storage server (replica 1) with 4 crappy HDDs and
> on-disk journals on my test cluster (1Gb/s links).
> So yeah, given your configuration that's bad.
>
> In comparison, I get 3000 IOPS against a production cluster (so not idle)
> with 4 storage nodes, each with 4 100GB DC S3700s for journals and OS, 8
> SATA HDDs, and Infiniband (IPoIB) connectivity for everything.
>
> All of this is with 0.80.x (Firefly) on Debian Jessie.
>
>
>> Can I hope for better results with my setup (explained below)? During the
>> bench I don't see any particular symptoms (no CPU stuck at 100%, etc.). If
>> you have advice on improving the performance and/or making smarter
>> benchmarks, I'm really interested.
>>
> You want to run atop on all your nodes and look at everything from disk to
> network utilization.
> There might be nothing obvious going on, but it needs to be ruled out.
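
For example, something along these lines on each node while the fio job is
running (assuming atop and sysstat are installed; intervals and counts are
arbitrary):

# atop 2            # interactive view: per-disk busy%, network and CPU every 2s
# iostat -x 2 30    # per-device utilisation and await, 30 samples at 2s intervals
# sar -n DEV 2 30   # per-NIC throughput, 30 samples at 2s intervals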
>
>> Thanks in advance for your help. Here is my conf...
>>
>> I use Ubuntu 14.04 with the 3.13 kernel on each server (the same on the
>> Ceph client where I run my bench), and I use Ceph 9.2.0 (Infernalis).
>
> I seem to recall that this particular kernel has issues; you might want to
> scour the archives here.
>
>> On the client, cephfs is mounted via ceph-fuse with this in /etc/fstab:
>>
>> id=cephfs,keyring=/etc/ceph/ceph.client.cephfs.keyring,client_mountpoint=/    /mnt/cephfs    fuse.ceph    noatime,defaults,_netdev    0    0
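
For reference, that fstab entry should be roughly equivalent to mounting by
hand with something like the following (a sketch based on the id=, keyring=
and client_mountpoint= options above):

# ceph-fuse --id cephfs -k /etc/ceph/ceph.client.cephfs.keyring -r / /mnt/cephfs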
>>
>> I have 5 cluster node servers ("Supermicro Motherboard X10SLM+-LN4 S1150"),
>> each with one 1GbE port for the Ceph public network and one 10GbE port for
>> the Ceph private network:
>>
> For the sake of latency (which becomes the biggest issue when you're not
> exhausting CPU/disk), you'd be better off with everything on 10GbE, unless
> you need the 1GbE to connect to clients that have no 10Gb/s ports.
>
>> - 1 x Intel Xeon E3-1265Lv3
>> - 1 SSD DC S3710 Series 200GB (with partitions for the OS, the 3 OSD
>> journals and, on ceph01, ceph02 and ceph03 only, an additional partition
>> for the working directory of a monitor)
> The 200GB DC S3700 would have been faster, but that's a moot point and not
> your bottleneck for sure.
>
>> - 3 x 4TB Western Digital (WD) SATA 7200rpm HDDs
>> - 32GB RAM
>> - no RAID controller
>
> Which controller are you using?
> I recently came across an Adaptec SATA3 HBA that delivered only 176 MB/s
> writes with 200GB DC S3700s, as opposed to 280MB/s when used with Intel
> onboard SATA-3 ports or an LSI 9211-4i HBA.
>
> Regards,
>
> Christian
>
>> - Each partition uses XFS with the noatime option, except the OS partition,
>> which is EXT4.
>>
>> Here is my ceph.conf:
>>
>> ---------------------------------------------------------------
>> [global]
>>   fsid                           = xxxxxxxxxxxxxxxxxxxxxxxxxxxx
>>   cluster network                = 192.168.22.0/24
>>   public network                 = 10.0.2.0/24
>>   auth cluster required          = cephx
>>   auth service required          = cephx
>>   auth client required           = cephx
>>   filestore xattr use omap       = true
>>   osd pool default size          = 3
>>   osd pool default min size      = 1
>>   osd pool default pg num        = 64
>>   osd pool default pgp num       = 64
>>   osd crush chooseleaf type      = 1
>>   osd journal size               = 0
>>   osd max backfills              = 1
>>   osd recovery max active        = 1
>>   osd client op priority         = 63
>>   osd recovery op priority       = 1
>>   osd op threads                 = 4
>>   mds cache size                 = 1000000
>>   osd scrub begin hour           = 3
>>   osd scrub end hour             = 5
>>   mon allow pool delete          = false
>>   mon osd down out subtree limit = host
>>   mon osd min down reporters     = 4
>>
>> [mon.ceph01]
>>   host     = ceph01
>>   mon addr = 10.0.2.101
>>
>> [mon.ceph02]
>>   host     = ceph02
>>   mon addr = 10.0.2.102
>>
>> [mon.ceph03]
>>   host     = ceph03
>>   mon addr = 10.0.2.103
>> ---------------------------------------------------------------
>>
>> The MDS daemons are in active/standby mode.
>>
>
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


