Re: Write IO Problem

Hello,

On Tue, 24 Mar 2015 07:43:00 +0000 Rottmann Jonas wrote:

> Hi,
> 
> First of all, thank you for your detailed answer.
> 
> My Ceph version is Hammer, sorry, I should have mentioned that.
> 
> Yes, we have 2 Intel 320 for the OS; the thinking behind this was
> that the OS disk is not that important, and they were cheap but
> still SSDs (power consumption).
> 
Fair enough.

> The plan was to put the cluster into production if it works well, but
> in that case we will have to replace the SSDs.
>
Those are definitely not durable enough for anything resembling normal
Ceph usage, never mind what other issues they may have in terms of speed.
 
Let's assume that your initial fio with 8KB blocks on the actual SSD means
that it can do about 10000 4KB write IOPS. With the journal on that same
SSD, that means you really only have about 5000 IOPS left per OSD.
8 OSDs would make 40000 IOPS (ignoring all the other overhead and
latency), and replication of 3 leaves only about 13300 IOPS in the end.
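
To spell that arithmetic out (all of these are rough estimates, not
measurements):

    10000 IOPS per SSD / 2 (journal and data on the same device) = ~5000 per OSD
     5000 per OSD * 8 OSDs                                       = ~40000 aggregate
    40000 aggregate / 3 (replication)                            = ~13300 client IOPS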

> We chose BTRFS because it is stable, and you can often read that it is
> more performant than XFS. (What speaks against BTRFS besides
> fragmentation, which is irrelevant for SSDs?)
>
There are many BTRFS threads in the ML archives. Off the top of my head,
regressions in certain kernels that affect stability in general come to
mind, and snapshot problems in particular with pretty much any kernel that
was discussed.
 
> I'm sorry that I used different benchmarks, but they were all far, far
> away from what I would expect.
> 
Again, have a look at the various SSD threads, what _did_ you expect?

But what your cluster is _actually_ capable of in the worst case is best
seen with "rados bench", no caching, just raw ability. 

Any improvements over that by RBD cache are just the icing on the cake,
don't take them for granted.

When doing the various benchmarks, keep an eye on all your storage nodes
with atop or the like.

It wouldn't surprise me (as Alexandre suggested) if you're running out of
CPU power as well.
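
If you don't want to juggle a terminal per node, even something as crude
as this (hostnames are made up, substitute your OSD nodes) will show
whether the CPUs are pegged during a benchmark run:

    for h in osd-node1 osd-node2 osd-node3 osd-node4; do
        ssh "$h" 'vmstat 5 12' > vmstat-"$h".log &
    done
    wait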

The best I got out of a single node, 8 SSD OSD "cluster" was about 4500
IOPS (4KB) with "rados bench" and the machine was totally CPU bound at
that point (Firefly).
http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043949.html

> I could narrow it down to the problem of the SSDs performing very badly
> with direct and dsync.
> 
> What I don't understand is: how can the internal benchmark give good
> values when each OSD only gives ~200 IOPS with direct,dsync?
> 
And once more, the fio with 512KB blocks very much matches the "rbd
bench-write" results you saw when adjusting for the different block sizes.

Christian
> 
> 
> Hello,
> 
> If you had used "performance" or "slow" in your subject, future
> generations would be able to find this thread and what it is about more
> easily. ^_-
> 
> Also, check the various "SSD" + "performance" threads in the ML archives.
> 
> On Fri, 20 Mar 2015 14:13:19 +0000 Rottmann Jonas wrote:
> 
> > Hi,
> > 
> > We have a huge write IO problem in our pre-production Ceph cluster.
> > First, our hardware:
> > 
> You're not telling us your Ceph version, but from the tunables below I
> suppose it is Firefly? If you have the time, it would definitely be
> advisable to wait for Hammer with an all SSD cluster.
> 
> > 4 OSD Nodes with:
> > 
> > Supermicro X10 Board
> > 32GB DDR4 RAM
> > 2x Intel Xeon E5-2620
> > LSI SAS 9300-8i Host Bus Adapter
> > Intel Corporation 82599EB 10-Gigabit
> > 2x Intel SSDSA2CT040G3 in software raid 1 for system
> > 
> Nobody really knows what those inane Intel product codes are without
> looking them up. So you have 2 Intel 320 40GB consumer SSDs that are
> EOL'ed for the OS. In a very modern, up to date system otherwise...
> 
> When you say "pre-production" cluster up there, does that mean that this
> is purely a test bed, or are you planning to turn this into production
> eventually?
> 
> > Disks:
> > 2x Samsung EVO 840 1TB
> > 
> Unless you're planning to do _very_ few writes, these will wear out
> in no time. With small IOPS (4KB) you can see up to 12x write
> amplification with Ceph. Consider investing in data center level SSDs
> like the 845 DC PRO or comparable Intel (S3610, S3700).
> 
> 
> > So 8 SSDs in total as OSDs, formatted with btrfs (with ceph-disk, only
> > nodiratime added)
> > 
> Why BTRFS?
> As in, what made you feel that this was a good, safe choice?
> I guess with SSDs for backing storage you at least won't have to worry
> about the massive fragmentation of BTRFS with Ceph...
> 
> > Benchmarking one disk alone gives good values:
> > 
> > dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
> > 1073741824 bytes (1.1 GB) copied, 2.53986 s, 423 MB/s
> > 
> > Fio 8k libaio depth=32:
> > write: io=488184KB, bw=52782KB/s, iops=5068 , runt=  9249msec
> >
> And this is where you start comparing apples to oranges.
> That fio was with 8KB blocks and 32 threads.
>  
> > Here is our ceph.conf (pretty much standard):
> > 
> > [global]
> > fsid = 89191a54-740a-46c7-a325-0899ab32fd1d
> > mon initial members = cephasp41,ceph-monitor41
> > mon host = 172.30.10.15,172.30.10.19
> > public network = 172.30.10.0/24
> > cluster network = 172.30.10.0/24
> > auth cluster required = cephx
> > auth service required = cephx
> > auth client required = cephx
> > 
> > #Default is 1GB, which is fine for us
> > #osd journal size = {n}
> > 
> > #Only needed if ext4 comes to play
> > #filestore xattr use omap = true
> > 
> > osd pool default size = 3     # Write an object n times.
> > osd pool default min size = 2 # Allow writing n copies in a degraded state.
> > 
> Normally I'd say a replication of 2 is sufficient with SSDs, but given
> your choice of SSDs I'll refrain from that.
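> 
> (For the record, if you ever do want to switch an existing pool to
> replication 2, that is just something like the following, with "rbd" as a
> stand-in for your pool name:
> 
>     ceph osd pool set rbd size 2
>     ceph osd pool set rbd min_size 1
> 
> but with those EVOs I really wouldn't.)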
> 
> > #Set individual per pool by a formula
> > #osd pool default pg num = {n}
> > #osd pool default pgp num = {n}
> > #osd crush chooseleaf type = {n}
> > 
> > 
> > When I benchmark the cluster with "rbd bench-write rbd/fio" I get
> > pretty good results:
> > elapsed:    18  ops:   262144  ops/sec: 14466.30  bytes/sec: 59253946.11
> > 
> Apples and oranges time again, this time you're testing with 4K blocks and
> 16 threads (defaults for this test).
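> 
> To make the two runs directly comparable you can also pin those parameters
> explicitly, roughly like this (check rbd's help output for the exact
> option names in your version):
> 
>     rbd bench-write rbd/fio --io-size 4096 --io-threads 16 --io-total 1073741824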
> 
> Incidentally, I get this from a 3 node cluster (replication 3) with 8
> OSDs per node (SATA disks, journals on 4 Intel DC S3700 100GB) and
> Infiniband (4QDR) interconnect:
> elapsed:     7  ops:   246724  ops/sec: 31157.87  bytes/sec: 135599456.06
> 
> > If I bench with fio using the rbd engine, for example, I get very poor
> > results:
> > 
> > [global]
> > ioengine=rbd
> > clientname=admin
> > pool=rbd
> > rbdname=fio
> > invalidate=0    # mandatory
> > rw=randwrite
> > bs=512k
> > 
> > [rbd_iodepth32]
> > iodepth=32
> > 
> > RESULTS:
> > write: io=2048.0MB, bw=53896KB/s, iops=105, runt= 38911msec
> >
> Total apples and oranges time: now you're using 512KB blocks (which of
> course will reduce IOPS) and 32 threads. The bandwidth is still about
> the same as before, and if you multiply 105 x 128 (to convert to 4KB
> blocks) you wind up with 13440, close to what you've seen with the rbd
> bench. Also, from where are you benching?
> > Also, if I map the rbd with the kernel client as rbd0, format it with
> > ext4 and then do a dd on it, it's not that good:
> > "dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc"
> > RESULT:
> > 1073741824 bytes (1.1 GB) copied, 12.6152 s, 85.1 MB/s
> > 
> Mounting it where? 
> Same system that you did the other tests from?
> 
> Did you format it w/o lazy init, or wait until the lazy init finished
> before doing the test?
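> 
> For a quick test you can also skip the lazy init entirely when formatting,
> e.g.:
> 
>     mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/rbd0
> 
> otherwise ext4 keeps initializing inode tables in the background for a
> while and that steals IO from the benchmark.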
> 
> > I also tried presenting an rbd image with tgtd, mounting it on VMware
> > ESXi and testing it in a VM; there I got only around 50 IOPS with 4k,
> > and sequential writes of 25 MB/s. With NFS the sequential read values
> > are good (400 MB/s) but writes are only 25 MB/s.
> >
> Can't really comment on that; many things could cause this and I'm not
> an expert in either.
> > What I tried tweaking so far:
> >
> I don't think that whatever you're seeing (aside from the apples and
> oranges bit) is caused by anything you tried tweaking below.
> 
> Regards,
> 
> Christian 
> > Intel NIC optimizations:
> > /etc/sysctl.conf
> > 
> > # Increase system file descriptor limit
> > fs.file-max = 65535
> > 
> > # Increase system IP port range to allow for more concurrent connections
> > net.ipv4.ip_local_port_range = 1024 65000
> > 
> > # -- 10gbe tuning from Intel ixgb driver README -- #
> > 
> > # turn off selective ACK and timestamps
> > net.ipv4.tcp_sack = 0
> > net.ipv4.tcp_timestamps = 0
> > 
> > # memory allocation min/pressure/max.
> > # read buffer, write buffer, and buffer space
> > net.ipv4.tcp_rmem = 10000000 10000000 10000000
> > net.ipv4.tcp_wmem = 10000000 10000000 10000000
> > net.ipv4.tcp_mem = 10000000 10000000 10000000
> > 
> > net.core.rmem_max = 524287
> > net.core.wmem_max = 524287
> > net.core.rmem_default = 524287
> > net.core.wmem_default = 524287
> > net.core.optmem_max = 524287
> > net.core.netdev_max_backlog = 300000
> > 
> > AND
> > 
> > setpci -v -d 8086:10fb e6.b=2e
> > 
> > 
> > Setting tunables to firefly:
> >             ceph osd crush tunables firefly
> > 
> > Setting scheduler to noop:
> >             This basically stopped IO on the cluster, and I had to
> > revert it and restart some of the OSDs that had stuck requests.
> > 
> > And I tried moving the monitor from a VM to the hardware where the
> > OSDs run.
> > 
> > 
> > Any suggestions where to look, or what could cause that problem?
> > (Because I can't believe you're losing that much performance through
> > ceph replication.)
> > 
> > Thanks in advance.
> > 
> > If you need any info please tell me.
> > 
> > Mit freundlichen Grüßen/Kind regards
> > Jonas Rottmann
> > Systems Engineer
> > 
> > FIS-ASP Application Service Providing und IT-Outsourcing GmbH 
> > Röthleiner Weg 4
> > D-97506 Grafenrheinfeld
> > Phone: +49 (9723) 9188-568
> > Fax: +49 (9723) 9188-600
> > 
> > email: j.rottmann@xxxxxxxxxx <mailto:j.rottmann@xxxxxxxxxx>  web:
> > www.fis-asp.de
> > 
> > Geschäftsführer Robert Schuhmann
> > Registergericht Schweinfurt HRB 3865
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




