Re: Write IO Problem

Hi,

First of all, thank you for your detailed answer.

My Ceph version is Hammer; sorry, I should have mentioned that.

Yes, we have 2 Intel 320 for the OS. The thought process behind this was that the OS disk is not that important, and they were cheap but still SSDs (low power consumption).

The plan was to put the cluster into production if it works well, but in that case we will have to replace the SSDs.

We chose BTRFS because it is stable, and you can often read that it performs better than XFS.
(What speaks against BTRFS besides fragmentation, which is irrelevant for SSDs?)

I'm sorry that I used different benchmarks, but they were all far away from what I would expect.

I could narrow it down to the SSDs performing very badly with direct and dsync.

What I don't understand is how the internal benchmark can give good values when each OSD only gives ~200 IOPS with direct,dsync.
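For reference, the kind of single-disk test I mean is roughly this (the path is just an example, pointed at a scratch file on one of the SSDs):

  dd if=/dev/zero of=/mnt/ssd/testfile bs=4k count=100000 oflag=direct,dsync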



Hello,

If you had used "performance" or "slow" in your subject, future generations would be able to find this thread and what it is about more easily. ^_-

Also, check the various "SSD" + "performance" threads in the ML archives.

On Fri, 20 Mar 2015 14:13:19 +0000 Rottmann Jonas wrote:

> Hi,
> 
> We have a huge write IO problem in our pre-production Ceph cluster.
> First, our hardware:
> 
You're not telling us your Ceph version, but from the tunables below I suppose it is Firefly?
If you have the time, it would definitely be advisable to wait for Hammer with an all SSD cluster.
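(For the record, the running version is easy to confirm with something like:

  ceph --version            # version of the local binaries
  ceph tell osd.0 version   # what a running OSD actually reports

the latter is the one that matters on the OSD nodes.)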

> 4 OSD Nodes with:
> 
> Supermicro X10 Board
> 32GB DDR4 RAM
> 2x Intel Xeon E5-2620
> LSI SAS 9300-8i Host Bus Adapter
> Intel Corporation 82599EB 10-Gigabit
> 2x Intel SSDSA2CT040G3 in software raid 1 for system
> 
Nobody really knows what those inane Intel product codes are without looking them up. 
So you have 2 Intel 320 40GB consumer SSDs that are EOL'ed for the OS.
In a very modern, up to date system otherwise...

When you say "pre-production" cluster up there, does that mean that this is purely a test bed, or are you planning to turn this into production eventually?

> Disks:
> 2x Samsung EVO 840 1TB
> 
Unless you're planning to do _very_ few writes, these will wear out in no time.
With small writes (4KB) you can see up to 12x write amplification with Ceph.
Consider investing in data center level SSDs like the 845 DC PRO or comparable Intel (S3610, S3700).
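Roughly where that 12x comes from, assuming replication 3 and the journal on the same device: 3 (replicas) x 2 (journal write plus filestore write) = 6x as a baseline, and filesystem metadata/journaling plus small-write padding can push the effective factor towards 10-12x for 4KB IOPS.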


> So 8 SSDs in total as OSDs, formatted with btrfs (via ceph-disk, only
> added nodiratime)
> 
Why BTRFS?
As in, what made you feel that this was a good, safe choice?
I guess with SSDs for backing storage you won't at least have to worry about the massive fragmentation of BTRFS with Ceph...

> Benchmarking one disk alone gives good values:
> 
> dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
> 1073741824 Bytes (1,1 GB) kopiert, 2,53986 s, 423 MB/s
> 
> Fio 8k libaio depth=32:
> write: io=488184KB, bw=52782KB/s, iops=5068 , runt=  9249msec
>
And this is where you start comparing apples to oranges.
That fio run used 8KB blocks at a queue depth of 32, while the dd wrote 1MB blocks sequentially.
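If you want a single-disk number that is actually comparable to that dd, something along these lines (a sketch, flags from memory) mimics it with fio:

  fio --name=seqwrite --filename=tempfile --rw=write --bs=1M --size=1G \
      --ioengine=libaio --iodepth=1 --end_fsync=1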
 
> Here our ceph.conf (pretty much standard):
> 
> [global]
> fsid = 89191a54-740a-46c7-a325-0899ab32fd1d
> mon initial members = cephasp41,ceph-monitor41
> mon host = 172.30.10.15,172.30.10.19
> public network = 172.30.10.0/24
> cluster network = 172.30.10.0/24
> auth cluster required = cephx
> auth service required = cephx
> auth client required = cephx
> 
> #Default is 1GB, which is fine for us
> #osd journal size = {n}
> 
> #Only needed if ext4 comes to play
> #filestore xattr use omap = true
> 
> osd pool default size = 3  # Write an object n times.
> osd pool default min size = 2 # Allow writing n copy in a degraded state.
> 
Normally I'd say a replication of 2 is sufficient with SSDs, but given your choice of SSDs I'll refrain from that.

> #Set individual per pool by a formula
> #osd pool default pg num = {n}
> #osd pool default pgp num = {n}
> #osd crush chooseleaf type = {n}
> 
> 
> When I benchmark the cluster with "rbd bench-write rbd/fio" I get pretty
> good results:
> elapsed:    18  ops:   262144  ops/sec: 14466.30  bytes/sec: 59253946.11
> 
Apples and oranges time again: this time you're testing with 4K blocks and
16 threads (the defaults for this test).
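You can make (and vary) those settings explicit, something along these lines (check "rbd help bench-write" on your version for the exact flags):

  rbd bench-write rbd/fio --io-size 4096 --io-threads 16 --io-total 1073741824

Changing --io-size to 8192 or 524288 would let you match your fio runs.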

Incidentally, I get this from a 3 node cluster (replication 3) with 8 OSDs per node (SATA disk, journals on 4 Intel DC S3700 100GB) and Infiniband
(4QDR) interconnect:
elapsed:     7  ops:   246724  ops/sec: 31157.87  bytes/sec: 135599456.06

> If I for example bench i.e. with fio with rbd engine, I get very poor
> results:
> 
> [global]
> ioengine=rbd
> clientname=admin
> pool=rbd
> rbdname=fio
> invalidate=0    # mandatory
> rw=randwrite
> bs=512k
> 
> [rbd_iodepth32]
> iodepth=32
> 
> RESULTS:
> write: io=2048.0MB, bw=53896KB/s, iops=105, runt= 38911msec
>
Total apples and oranges time: now you're using 512KB blocks (which of course will reduce IOPS) and an iodepth of 32.
The bandwidth is still about the same as before, and if you multiply 105 x 128 (512KB/4KB, to compensate for the block size difference) you wind up with 13440, close to what you've seen with the rbd bench.
Also from where are you benching?
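For an apples-to-apples number against the rbd bench-write above, a job file along these lines (a sketch based on yours) would do:

[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio
invalidate=0    # mandatory
rw=randwrite
bs=4k

[rbd_iodepth16]
iodepth=16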
 
> Also if I mount the rbd with kernel as rbd0, format it with ext4 and 
> then do a dd on it, it's not that good: "dd if=/dev/zero of=tempfile 
> bs=1M count=1024 conv=fdatasync,notrunc" RESULT:
> 1073741824 Bytes (1,1 GB) kopiert, 12,6152 s, 85,1 MB/s
> 
Mounting it where? 
Same system that you did the other tests from?

Did you format it without lazy init, or wait until the lazy init had finished before doing the test?
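If in doubt, formatting without lazy init takes that variable out of the picture, e.g.:

  mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/rbd0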

> I also tried presenting an rbd image with tgtd, mounting it on VMware
> ESXi and testing it in a VM; there I got only around 50 IOPS with
> 4K, and sequential writes of 25 MB/s. With NFS the sequential read
> values are good (400 MB/s) but writes only reach 25 MB/s.
>
Can't really comment on that; many things could cause this and I'm not an expert in either.
 
> What I tried tweaking so far:
>
I don't think that whatever you're seeing (aside from the apples and oranges bit) is caused by anything you tried tweaking below.

Regards,

Christian 
> Intel NIC optimazitions:
> /etc/sysctl.conf
> 
> # Increase system file descriptor limit
> fs.file-max = 65535
> 
> # Increase system IP port range to allow for more concurrent connections
> net.ipv4.ip_local_port_range = 1024 65000
> 
> # -- 10gbe tuning from Intel ixgb driver README -- #
> 
> # turn off selective ACK and timestamps
> net.ipv4.tcp_sack = 0
> net.ipv4.tcp_timestamps = 0
> 
> # memory allocation min/pressure/max.
> # read buffer, write buffer, and buffer space
> net.ipv4.tcp_rmem = 10000000 10000000 10000000
> net.ipv4.tcp_wmem = 10000000 10000000 10000000
> net.ipv4.tcp_mem = 10000000 10000000 10000000
> 
> net.core.rmem_max = 524287
> net.core.wmem_max = 524287
> net.core.rmem_default = 524287
> net.core.wmem_default = 524287
> net.core.optmem_max = 524287
> net.core.netdev_max_backlog = 300000
> 
> AND
> 
> setpci -v -d 8086:10fb e6.b=2e
> 
> 
> Setting tunables to firefly:
>             ceph osd crush tunables firefly
> 
> Setting scheduler to noop:
>             This basically stopped IO on the cluster, and I had to
> revert it and restart some of the OSDs that had stuck requests
> 
> And I tried moving the monitor from a VM to the hardware where the
> OSDs run.
> 
> 
> Any suggestions where to look, or what could cause this problem?
> (Because I can't believe you're losing that much performance through
> Ceph replication.)
> 
> Thanks in advance.
> 
> If you need any info please tell me.
> 
> Mit freundlichen Grüßen/Kind regards
> Jonas Rottmann
> Systems Engineer
> 
> FIS-ASP Application Service Providing und IT-Outsourcing GmbH 
> Röthleiner Weg 4
> D-97506 Grafenrheinfeld
> Phone: +49 (9723) 9188-568
> Fax: +49 (9723) 9188-600
> 
> email: j.rottmann@xxxxxxxxxx  web: www.fis-asp.de
> 
> Managing Director: Robert Schuhmann
> Registered at Schweinfurt District Court, HRB 3865


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




