Re: Write IO Problem

On Tue, 24 Mar 2015 07:24:05 -0600 Robert LeBlanc wrote:

> I can not reproduce the snapshot issue with BTRFS on the 3.17 kernel. 

Good to know.

I shall give that a spin on one of my test cluster nodes then, once a
kernel over 3.16 actually shows up in Debian sid. ^o^

Christian

> My test cluster had 48 OSDs on BTRFS for four months without an issue
> since going to 3.17. The only concern I have is potential slowness over
> time. We are not using compression. We are going into production in one
> month and although we haven't had show-stopping issues with BTRFS, we are
> still going to start on XFS. Our plan is to build a cluster as a target
> for our backup system and we will put BTRFS on that to prove it in a
> production setting.
> 
> Robert LeBlanc
> 
> Sent from a mobile device please excuse any typos.
> On Mar 24, 2015 7:00 AM, "Christian Balzer" <chibi@xxxxxxx> wrote:
> 
> >
> > Hello,
> >
> > On Tue, 24 Mar 2015 07:43:00 +0000 Rottmann Jonas wrote:
> >
> > > Hi,
> > >
> > > First of all, thank you for your detailed answer.
> > >
> > > My Ceph version is Hammer, sorry, I should have mentioned that.
> > >
> > > Yes, we have 2 Intel 320s for the OS; the thinking behind this was
> > > that the OS disk is not that important, and they were cheap but still
> > > SSDs (low power consumption).
> > >
> > Fair enough.
> >
> > > The plan was to put the cluster into production if it works well, but
> > > in that case we will have to replace the SSDs.
> > >
> > Those are definitely not durable enough for anything resembling normal
> > Ceph usage, never mind what other issues they may have in terms of
> > speed.
> >
> > Let's assume that your initial fio with 8KB blocks on the actual SSD
> > means that it can do about 10000 4KB write IOPS. With the journal on
> > that same SSD that means you really only have 5000 IOPS left.
> > 8 OSDs would make 40000 IOPS (ignoring all the other overhead and
> > latency), replication of 3 means only 13400 IOPS in the end.
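> >
> > As a rough sanity check of that arithmetic (a minimal sketch, the 10000
> > IOPS figure being the assumption above, not a measurement):
> >
> >   # co-located journal halves usable write IOPS, 8 OSDs aggregate,
> >   # replication of 3 divides the client-visible result
> >   echo $(( 10000 / 2 * 8 / 3 ))    # ~13333 client write IOPS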
> >
> > > We chose BTRFS because it is stable, and you can often read that it
> > > is more performant than XFS. (What speaks against BTRFS besides
> > > fragmentation, which is irrelevant for SSDs?)
> > >
> > There are many BTRFS threads in the ML archives; off the top of my head,
> > regressions in certain kernels that affect stability in general come to
> > mind, and snapshot problems in particular with pretty much any kernel
> > that was discussed.
> >
> > > I'm sorry that I used different benchmarks, but they were all far, far
> > > away from what I would expect.
> > >
> > Again, have a look at the various SSD threads; what _did_ you expect?
> >
> > But what your cluster is _actually_ capable of in the worst case is
> > best seen with "rados bench", no caching, just raw ability.
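> >
> > Something along these lines gives you that baseline for 4KB writes
> > (pool name "rbd" is just an assumption, adjust to your setup):
> >
> >   # 60s of 4KB writes with 16 concurrent ops; --no-cleanup keeps the
> >   # objects around for a follow-up read bench, remove them afterwards
> >   rados bench -p rbd 60 write -b 4096 -t 16 --no-cleanup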
> >
> > Any improvements over that by RBD cache are just the icing on the cake,
> > don't take them for granted.
> >
> > When doing the various benchmarks, keep an eye on all your storage
> > nodes with atop or the like.
> >
> > It wouldn't surprise me (as Alexandre suggested) if you're running
> > out of CPU power as well.
> >
> > The best I got out of a single node, 8 SSD OSD "cluster" was about 4500
> > IOPS (4KB) with "rados bench" and the machine was totally CPU bound at
> > that point (Firefly).
> >
> > http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043949.html
> >
> > > I could narrow it down to the problem of the SSDs performing very
> > > badly with direct and dsync.
> > >
> > > What I don't understand is how the internal benchmark can give good
> > > values when each OSD only gives ~200 IOPS with direct,dsync?
> > >
> > And once more, the fio with 512KB blocks very much matches the "rbd
> > bench-write" results you saw when adjusting for the different block
> > sizes.
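> >
> > If you want to see that journal-style behaviour on the bare SSD, the
> > usual check is a single-threaded sync write test straight against the
> > device (destructive, so use a spare SSD or partition; the device name
> > is only an example):
> >
> >   fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
> >       --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 \
> >       --time_based --group_reporting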
> >
> > Christian
> > >
> > >
> > > Hello,
> > >
> > > If you had used "performance" or "slow" in your subject, future
> > > generations would be able to find this thread and what it is about
> > > more easily. ^_-
> > >
> > > Also, check the various "SSD" + "performance" threads in the ML
> > > archives.
> > >
> > > On Fri, 20 Mar 2015 14:13:19 +0000 Rottmann Jonas wrote:
> > >
> > > > Hi,
> > > >
> > > > We have a huge write IO problem in our preproduction Ceph cluster.
> > > > First, our hardware:
> > > >
> > > You're not telling us your Ceph version, but from the tunables below
> > > I suppose it is Firefly? If you have the time, it would definitely be
> > > advisable to wait for Hammer with an all SSD cluster.
> > >
> > > > 4 OSD Nodes with:
> > > >
> > > > Supermicro X10 Board
> > > > 32GB DDR4 RAM
> > > > 2x Intel Xeon E5-2620
> > > > LSI SAS 9300-8i Host Bus Adapter
> > > > Intel Corporation 82599EB 10-Gigabit
> > > > 2x Intel SSDSA2CT040G3 in software raid 1 for system
> > > >
> > > Nobody really knows what those inane Intel product codes are without
> > > looking them up. So you have 2 Intel 320 40GB consumer SSDs that are
> > > EOL'ed for the OS. In a very modern, up to date system otherwise...
> > >
> > > When you say "pre-production" cluster up there, does that mean that
> > > this is purely a test bed, or are you planning to turn this into
> > > production eventually?
> > >
> > > > Disks:
> > > > 2x Samsung EVO 840 1TB
> > > >
> > > Unless you're planning to do _very_ few writes, these will wear out
> > > in no time. With small IOPS (4KB) you can see up to 12x write
> > > amplification with Ceph. Consider investing in data center grade SSDs
> > > like the 845 DC PRO or comparable Intel ones (S3610, S3700).
> > >
> > >
> > > > So 8 SSDs in total as OSDs, formatted with btrfs (via ceph-disk,
> > > > only added nodiratime)
> > > >
> > > Why BTRFS?
> > > As in, what made you feel that this was a good, safe choice?
> > > I guess with SSDs for backing storage you won't at least have to
> > > worry about the massive fragmentation of BTRFS with Ceph...
> > >
> > > > Benchmarking one disk alone gives good values:
> > > >
> > > > dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
> > > > 1073741824 bytes (1.1 GB) copied, 2.53986 s, 423 MB/s
> > > >
> > > > Fio 8k libaio depth=32:
> > > > write: io=488184KB, bw=52782KB/s, iops=5068 , runt=  9249msec
> > > >
> > > And this is where you start comparing apples to oranges.
> > > That fio was with 8KB blocks and an iodepth of 32.
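> > >
> > > (The exact fio invocation wasn't shown; presumably something roughly
> > > like the following, where the target file, size and randwrite pattern
> > > are guesses on my part:)
> > >
> > >   fio --name=8k-test --filename=./fio-testfile --size=1G \
> > >       --ioengine=libaio --direct=1 --rw=randwrite --bs=8k \
> > >       --iodepth=32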
> > >
> > > > Here our ceph.conf (pretty much standard):
> > > >
> > > > [global]
> > > > fsid = 89191a54-740a-46c7-a325-0899ab32fd1d
> > > > mon initial members = cephasp41,ceph-monitor41
> > > > mon host = 172.30.10.15,172.30.10.19
> > > > public network = 172.30.10.0/24
> > > > cluster network = 172.30.10.0/24
> > > > auth cluster required = cephx
> > > > auth service required = cephx
> > > > auth client required = cephx
> > > >
> > > > #Default is 1GB, which is fine for us
> > > > #osd journal size = {n}
> > > >
> > > > #Only needed if ext4 comes to play
> > > > #filestore xattr use omap = true
> > > >
> > > > osd pool default size = 3  # Write an object n times.
> > > > osd pool default min size = 2 # Allow writing n copy in a degraded
> > > > state.
> > > >
> > > Normally I'd say a replication of 2 is sufficient with SSDs, but
> > > given your choice of SSDs I'll refrain from that.
> > >
> > > > #Set individual per pool by a formula
> > > > #osd pool default pg num = {n}
> > > > #osd pool default pgp num = {n}
> > > > #osd crush chooseleaf type = {n}
> > > >
> > > >
> > > > When I benchmark the cluster with "rbd bench-write rbd/fio" I get
> > > > pretty good results:
> > > > elapsed:    18  ops:   262144  ops/sec: 14466.30  bytes/sec: 59253946.11
> > > >
> > > Apples and oranges time again, this time you're testing with 4K
> > > blocks and 16 threads (the defaults for this test).
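> > >
> > > Spelled out with those defaults made explicit, that is roughly
> > > equivalent to:
> > >
> > >   rbd bench-write rbd/fio --io-size 4096 --io-threads 16 \
> > >       --io-total 1073741824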
> > >
> > > Incidentally, I get this from a 3 node cluster (replication 3) with 8
> > > OSDs per node (SATA disks, journals on 4 Intel DC S3700 100GB) and
> > > Infiniband (4x QDR) interconnect:
> > > elapsed:     7  ops:   246724  ops/sec: 31157.87  bytes/sec: 135599456.06
> > >
> > > > If I, for example, bench with fio using the rbd engine, I get very
> > > > poor results:
> > > >
> > > > [global]
> > > > ioengine=rbd
> > > > clientname=admin
> > > > pool=rbd
> > > > rbdname=fio
> > > > invalidate=0    # mandatory
> > > > rw=randwrite
> > > > bs=512k
> > > >
> > > > [rbd_iodepth32]
> > > > iodepth=32
> > > >
> > > > RESULTS:
> > > > write: io=2048.0MB, bw=53896KB/s, iops=105, runt= 38911msec
> > > >
> > > Total apples and oranges time: now you're using 512KB blocks (which
> > > of course will reduce IOPS) and an iodepth of 32. The bandwidth is
> > > still about the same as before, and if you multiply 105 x 128 (to
> > > compensate for the 4KB blocks) you wind up with 13440, close to what
> > > you've seen with the rbd bench. Also, from where are you benching?
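> > >
> > > For an apples-to-apples comparison with the rbd bench above, a 4KB
> > > variant of your rbd-engine job (only bs and iodepth changed) would be
> > > more telling, e.g.:
> > >
> > >   [global]
> > >   ioengine=rbd
> > >   clientname=admin
> > >   pool=rbd
> > >   rbdname=fio
> > >   invalidate=0    # mandatory
> > >   rw=randwrite
> > >   bs=4k
> > >
> > >   [rbd_iodepth16]
> > >   iodepth=16
> > >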
> > > > Also, if I map the rbd with the kernel client as rbd0, format it
> > > > with ext4 and then do a dd on it, it's not that good:
> > > > "dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc"
> > > > RESULT:
> > > > 1073741824 bytes (1.1 GB) copied, 12.6152 s, 85.1 MB/s
> > > >
> > > Mounting it where?
> > > Same system that you did the other tests from?
> > >
> > > Did you format it w/o lazy init, or wait until the lazy init
> > > finished before doing the test?
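> > >
> > > (If not, formatting without lazy init, roughly like this, takes the
> > > background inode table initialisation out of the picture; the device
> > > name is only an example:)
> > >
> > >   mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/rbd0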
> > >
> > > > I also tried presenting an rbd image with tgtd, mounting it on
> > > > VMware ESXi and testing it in a VM; there I got only around 50 IOPS
> > > > with 4k, and sequential writes of about 25 MB/s. With NFS the
> > > > sequential read values are good (400 MB/s) but writes are only
> > > > 25 MB/s.
> > > >
> > > Can't really comment on that; there are many things that could cause
> > > this and I'm not an expert in either.
> > > > What I tried tweaking so far:
> > > >
> > > I don't think that whatever you're seeing (aside from the apples and
> > > oranges bit) is caused by anything you tried tweaking below.
> > >
> > > Regards,
> > >
> > > Christian
> > > > Intel NIC optimizations:
> > > > /etc/sysctl.conf
> > > >
> > > > # Increase system file descriptor limit
> > > > fs.file-max = 65535
> > > >
> > > > # Increase system IP port range to allow for more concurrent
> > > > # connections
> > > > net.ipv4.ip_local_port_range = 1024 65000
> > > >
> > > > # -- 10gbe tuning from Intel ixgb driver README -- #
> > > >
> > > > # turn off selective ACK and timestamps
> > > > net.ipv4.tcp_sack = 0
> > > > net.ipv4.tcp_timestamps = 0
> > > >
> > > > # memory allocation min/pressure/max.
> > > > # read buffer, write buffer, and buffer space
> > > > net.ipv4.tcp_rmem = 10000000 10000000 10000000
> > > > net.ipv4.tcp_wmem = 10000000 10000000 10000000
> > > > net.ipv4.tcp_mem = 10000000 10000000 10000000
> > > >
> > > > net.core.rmem_max = 524287
> > > > net.core.wmem_max = 524287
> > > > net.core.rmem_default = 524287
> > > > net.core.wmem_default = 524287
> > > > net.core.optmem_max = 524287
> > > > net.core.netdev_max_backlog = 300000
> > > >
> > > > AND
> > > >
> > > > setpci -v -d 8086:10fb e6.b=2e
> > > >
> > > >
> > > > Setting tunables to firefly:
> > > >             ceph osd crush tunables firefly
> > > >
> > > > Setting the scheduler to noop:
> > > >             This basically stopped IO on the cluster, and I had to
> > > > revert it and restart some of the OSDs that had stuck requests.
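> > > >
> > > > (For reference, the scheduler change was along these lines, with a
> > > > generic device name:)
> > > >
> > > >             echo noop > /sys/block/sdX/queue/scheduler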
> > > >
> > > > And I tried moving the monitor from a VM to the hardware where the
> > > > OSDs run.
> > > >
> > > >
> > > > Any suggestions where to look, or what could cause that problem?
> > > > (Because I can't believe you're losing that much performance through
> > > > Ceph replication.)
> > > >
> > > > Thanks in advance.
> > > >
> > > > If you need any info please tell me.
> > > >
> > > > Mit freundlichen Grüßen/Kind regards
> > > > Jonas Rottmann
> > > > Systems Engineer
> > > >
> > > > FIS-ASP Application Service Providing und IT-Outsourcing GmbH
> > > > Röthleiner Weg 4
> > > > D-97506 Grafenrheinfeld
> > > > Phone: +49 (9723) 9188-568
> > > > Fax: +49 (9723) 9188-600
> > > >
> > > > email: j.rottmann@xxxxxxxxxx <mailto:j.rottmann@xxxxxxxxxx>  web:
> > > > www.fis-asp.de
> > > >
> > > > Geschäftsführer Robert Schuhmann
> > > > Registergericht Schweinfurt HRB 3865
> > >
> > >
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




