On Tue, 24 Mar 2015 07:24:05 -0600 Robert LeBlanc wrote:

> I cannot reproduce the snapshot issue with BTRFS on the 3.17 kernel.

Good to know.
I shall give that a spin on one of my test cluster nodes then, once a
kernel over 3.16 actually shows up in Debian sid. ^o^

Christian

> My test cluster had 48 OSDs on BTRFS for four months without an issue
> since going to 3.17. The only concern I have is potential slowness over
> time. We are not using compression. We are going into production in one
> month and although we haven't had show-stopping issues with BTRFS, we
> are still going to start on XFS. Our plan is to build a cluster as a
> target for our backup system and we will put BTRFS on that to prove it
> in a production setting.
>
> Robert LeBlanc
>
> Sent from a mobile device, please excuse any typos.
>
> On Mar 24, 2015 7:00 AM, "Christian Balzer" <chibi@xxxxxxx> wrote:
>
> > Hello,
> >
> > On Tue, 24 Mar 2015 07:43:00 +0000 Rottmann Jonas wrote:
> >
> > > Hi,
> > >
> > > First of all, thank you for your detailed answer.
> > >
> > > My Ceph version is Hammer, sorry, I should have mentioned that.
> > >
> > > Yes, we have 2 Intel 320 for the OS; the thinking behind this was
> > > that the OS disk is not that important, and they were cheap but
> > > still SSDs (power consumption).
> > >
> > Fair enough.
> >
> > > The plan was to put the cluster into production if it works well,
> > > but that way we will have to replace the SSDs.
> > >
> > Those are definitely not durable enough for anything resembling
> > normal Ceph usage, never mind what other issues they may have in
> > terms of speed.
> >
> > Let's assume that your initial fio with 8KB blocks on the actual SSD
> > means that it can do about 10000 4KB write IOPS. With the journal on
> > that same SSD that means you really only have 5000 IOPS left.
> > 8 OSDs would make 40000 IOPS (ignoring all the other overhead and
> > latency), and replication of 3 means only about 13400 IOPS in the end.
> >
> > > We chose BTRFS because it is stable, and you can often read that it
> > > is more performant than XFS. (What speaks against BTRFS besides
> > > fragmentation, which is irrelevant for SSDs?)
> > >
> > There are many BTRFS threads in the ML archives; off the top of my
> > head, regressions in certain kernels that affect stability in general
> > come to mind, and snapshots in particular with pretty much any kernel
> > that was discussed.
> >
> > > I'm sorry that I used different benchmarks, but they all were far,
> > > far away from what I would expect.
> > >
> > Again, have a look at the various SSD threads; what _did_ you expect?
> >
> > But what your cluster is _actually_ capable of in the worst case is
> > best seen with "rados bench": no caching, just raw ability.
> >
> > Any improvements over that by RBD cache are just the icing on the
> > cake, don't take them for granted.
> >
> > When doing the various benchmarks, keep an eye on all your storage
> > nodes with atop or the like.
> >
> > It wouldn't surprise me (as Alexandre suggested) if you're running
> > out of CPU power as well.
> >
> > The best I got out of a single node, 8 SSD OSD "cluster" was about
> > 4500 IOPS (4KB) with "rados bench" and the machine was totally CPU
> > bound at that point (Firefly).
> >
> > http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043949.html
> >
> > > I could narrow it down to the problem of the SSDs performing very
> > > badly with direct and dsync.
> > >
> > > What I don't understand is how it can be that the internal benchmark
> > > gives good values, when each OSD only gives ~200 IOPS with
> > > direct,dsync?
> > >
> > And once more, the fio with 512KB blocks very much matches the "rbd
> > bench-write" results you saw when adjusting for the different block
> > sizes.
> >
> > Christian
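For reference on the direct/dsync point: the usual way to check how an SSD
copes with the kind of small synchronous writes the OSD journal relies on
is a test along these lines. This is only a sketch; the target path, count
and runtime are placeholders, so point it at a scratch file on the SSD in
question rather than anything holding data.

  # 4KB writes with O_DIRECT and O_DSYNC against a scratch file on the SSD
  dd if=/dev/zero of=/mnt/ssd-scratch/dsync-test bs=4k count=10000 oflag=direct,dsync

  # roughly equivalent fio run, if fio output is preferred
  fio --name=dsync-test --filename=/mnt/ssd-scratch/dsync-test --size=1G \
      --rw=write --bs=4k --direct=1 --sync=1 --iodepth=1 --numjobs=1 \
      --runtime=60 --time_based --group_reporting

A drive that only manages a few hundred IOPS here will show exactly the
behaviour described above, no matter how well it does in cached or
large-block benchmarks.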
> > > Hello,
> > >
> > > If you had used "performance" or "slow" in your subject, future
> > > generations would be able to find this thread and what it is about
> > > more easily. ^_-
> > >
> > > Also, check the various "SSD" + "performance" threads in the ML
> > > archives.
> > >
> > > On Fri, 20 Mar 2015 14:13:19 +0000 Rottmann Jonas wrote:
> > >
> > > > Hi,
> > > >
> > > > We have a huge write IO problem in our pre-production Ceph cluster.
> > > > First, our hardware:
> > > >
> > > You're not telling us your Ceph version, but from the tunables below
> > > I suppose it is Firefly? If you have the time, it would definitely
> > > be advisable to wait for Hammer with an all-SSD cluster.
> > >
> > > > 4 OSD nodes with:
> > > >
> > > > Supermicro X10 board
> > > > 32GB DDR4 RAM
> > > > 2x Intel Xeon E5-2620
> > > > LSI SAS 9300-8i host bus adapter
> > > > Intel Corporation 82599EB 10-Gigabit
> > > > 2x Intel SSDSA2CT040G3 in software RAID 1 for the system
> > > >
> > > Nobody really knows what those inane Intel product codes are without
> > > looking them up. So you have 2 Intel 320 40GB consumer SSDs that are
> > > EOL'ed for the OS. In a very modern, up to date system otherwise...
> > >
> > > When you say "pre-production" cluster up there, does that mean that
> > > this is purely a test bed, or are you planning to turn this into
> > > production eventually?
> > >
> > > > Disks:
> > > > 2x Samsung EVO 840 1TB
> > > >
> > > Unless you're planning to do _very_ few writes, these will wear
> > > out in no time. With small IOPS (4KB) you can see up to 12x write
> > > amplification with Ceph. Consider investing in data center level
> > > SSDs like the 845 DC PRO or comparable Intel (S3610, S3700).
> > >
> > > > So, combined, 8 SSDs as OSDs, formatted with BTRFS (via ceph-disk,
> > > > only added nodiratime).
> > > >
> > > Why BTRFS?
> > > As in, what made you feel that this was a good, safe choice?
> > > I guess with SSDs for backing storage you at least won't have to
> > > worry about the massive fragmentation of BTRFS with Ceph...
> > >
> > > > Benchmarking one disk alone gives good values:
> > > >
> > > > dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
> > > > 1073741824 bytes (1.1 GB) copied, 2.53986 s, 423 MB/s
> > > >
> > > > fio 8k libaio depth=32:
> > > > write: io=488184KB, bw=52782KB/s, iops=5068, runt= 9249msec
> > > >
> > > And this is where you start comparing apples to oranges.
> > > That fio was with 8KB blocks and 32 threads.
> > >
> > > > Here is our ceph.conf (pretty much standard):
> > > >
> > > > [global]
> > > > fsid = 89191a54-740a-46c7-a325-0899ab32fd1d
> > > > mon initial members = cephasp41,ceph-monitor41
> > > > mon host = 172.30.10.15,172.30.10.19
> > > > public network = 172.30.10.0/24
> > > > cluster network = 172.30.10.0/24
> > > > auth cluster required = cephx
> > > > auth service required = cephx
> > > > auth client required = cephx
> > > >
> > > > #Default is 1GB, which is fine for us
> > > > #osd journal size = {n}
> > > >
> > > > #Only needed if ext4 comes into play
> > > > #filestore xattr use omap = true
> > > >
> > > > osd pool default size = 3     # Write an object n times.
> > > > osd pool default min size = 2 # Allow writing n copies in a degraded state.
> > > >
> > > Normally I'd say a replication of 2 is sufficient with SSDs, but
> > > given your choice of SSDs I'll refrain from that.
> > >
> > > > #Set individually per pool by a formula
> > > > #osd pool default pg num = {n}
> > > > #osd pool default pgp num = {n}
> > > > #osd crush chooseleaf type = {n}
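(As an aside on the pg num formula those comments refer to: the usual rule
of thumb from the Ceph documentation is total PGs ~= (number of OSDs x 100)
/ replica count, rounded to the nearest power of two. A sketch for this
particular 8-OSD, size 3 setup, purely as an illustration and not taken
from the original posting:

  # (8 OSDs * 100) / 3 replicas ~= 267 -> round to a power of two
  osd pool default pg num = 256
  osd pool default pgp num = 256

With several pools sharing the cluster, the per-pool values would be
scaled down accordingly.)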
> > > > When I benchmark the cluster with "rbd bench-write rbd/fio" I get
> > > > pretty good results:
> > > >   elapsed: 18  ops: 262144  ops/sec: 14466.30  bytes/sec: 59253946.11
> > > >
> > > Apples and oranges time again, this time you're testing with 4K
> > > blocks and 16 threads (the defaults for this test).
> > >
> > > Incidentally, I get this from a 3 node cluster (replication 3) with
> > > 8 OSDs per node (SATA disks, journals on 4 Intel DC S3700 100GB) and
> > > Infiniband (4x QDR) interconnect:
> > >   elapsed: 7  ops: 246724  ops/sec: 31157.87  bytes/sec: 135599456.06
> > >
> > > > If I bench for example with fio and the rbd engine, I get very
> > > > poor results:
> > > >
> > > > [global]
> > > > ioengine=rbd
> > > > clientname=admin
> > > > pool=rbd
> > > > rbdname=fio
> > > > invalidate=0 # mandatory
> > > > rw=randwrite
> > > > bs=512k
> > > >
> > > > [rbd_iodepth32]
> > > > iodepth=32
> > > >
> > > > RESULTS:
> > > > write: io=2048.0MB, bw=53896KB/s, iops=105, runt= 38911msec
> > > >
> > > Total apples and oranges time, now you have 512KB blocks (which of
> > > course will reduce IOPS) and 32 threads. The bandwidth is still
> > > about the same as before, and if you multiply 105 x 128 (to
> > > compensate for the 4KB blocks) you wind up with 13440, close to what
> > > you've seen with the rbd bench. Also, from where are you benching?
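To put that on a like-for-like footing with the rbd bench-write run above,
the same fio job can simply be pointed at 4KB blocks and 16 in-flight
requests. A minimal variant of the quoted job file (pool and image names
kept as in the original; results will of course differ):

  [global]
  ioengine=rbd
  clientname=admin
  pool=rbd
  rbdname=fio
  invalidate=0 # mandatory
  rw=randwrite
  # 4KB blocks to match the rbd bench-write default
  bs=4k

  [rbd_iodepth16]
  # 16 concurrent ops, again matching rbd bench-write's default
  iodepth=16

The resulting IOPS figure is then directly comparable with the ops/sec
number from rbd bench-write, instead of differing by the block-size factor
discussed above.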
> > > > Also, if I map the rbd with the kernel client as rbd0, format it
> > > > with ext4 and then do a dd on it, it's not that good:
> > > > "dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc"
> > > > RESULT:
> > > > 1073741824 bytes (1.1 GB) copied, 12.6152 s, 85.1 MB/s
> > > >
> > > Mounting it where?
> > > Same system that you did the other tests from?
> > >
> > > Did you format it w/o lazy init, or wait until the lazy init
> > > finished before doing the test?
> > >
> > > > I also tried presenting an rbd image with tgtd, mounting it on
> > > > VMware ESXi and testing it in a VM; there I got only around 50
> > > > IOPS with 4K, and sequential writes of 25 MByte/s. With NFS the
> > > > sequential read values are good (400 MByte/s) but writes are only
> > > > 25 MByte/s.
> > > >
> > > Can't really comment on that; many things could cause this, and
> > > I'm not an expert in either.
> > >
> > > > What I tried tweaking so far:
> > > >
> > > I don't think that whatever you're seeing (aside from the apples and
> > > oranges bit) is caused by anything you tried tweaking below.
> > >
> > > Regards,
> > >
> > > Christian
> > >
> > > > Intel NIC optimizations:
> > > > /etc/sysctl.conf
> > > >
> > > > # Increase system file descriptor limit
> > > > fs.file-max = 65535
> > > >
> > > > # Increase system IP port range to allow for more concurrent
> > > > # connections
> > > > net.ipv4.ip_local_port_range = 1024 65000
> > > >
> > > > # -- 10gbe tuning from Intel ixgb driver README -- #
> > > >
> > > > # turn off selective ACK and timestamps
> > > > net.ipv4.tcp_sack = 0
> > > > net.ipv4.tcp_timestamps = 0
> > > >
> > > > # memory allocation min/pressure/max.
> > > > # read buffer, write buffer, and buffer space
> > > > net.ipv4.tcp_rmem = 10000000 10000000 10000000
> > > > net.ipv4.tcp_wmem = 10000000 10000000 10000000
> > > > net.ipv4.tcp_mem = 10000000 10000000 10000000
> > > >
> > > > net.core.rmem_max = 524287
> > > > net.core.wmem_max = 524287
> > > > net.core.rmem_default = 524287
> > > > net.core.wmem_default = 524287
> > > > net.core.optmem_max = 524287
> > > > net.core.netdev_max_backlog = 300000
> > > >
> > > > AND
> > > >
> > > > setpci -v -d 8086:10fb e6.b=2e
> > > >
> > > > Setting tunables to firefly:
> > > > ceph osd crush tunables firefly
> > > >
> > > > Setting the scheduler to noop:
> > > > This basically stopped IO on the cluster, and I had to revert it
> > > > and restart some of the OSDs that had stuck requests.
> > > >
> > > > And I tried moving the monitor from a VM to the hardware where the
> > > > OSDs run.
> > > >
> > > > Any suggestions where to look, or what could cause that problem?
> > > > (Because I can't believe you're losing that much performance
> > > > through Ceph replication.)
> > > >
> > > > Thanks in advance.
> > > >
> > > > If you need any info please tell me.
> > > >
> > > > Mit freundlichen Grüßen/Kind regards
> > > > Jonas Rottmann
> > > > Systems Engineer
> > > >
> > > > FIS-ASP Application Service Providing und IT-Outsourcing GmbH
> > > > Röthleiner Weg 4
> > > > D-97506 Grafenrheinfeld
> > > > Phone: +49 (9723) 9188-568
> > > > Fax: +49 (9723) 9188-600
> > > >
> > > > email: j.rottmann@xxxxxxxxxx <mailto:j.rottmann@xxxxxxxxxx>
> > > > web: www.fis-asp.de
> > > >
> > > > Geschäftsführer (Managing Director) Robert Schuhmann
> > > > Registergericht (Commercial Register) Schweinfurt HRB 3865
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com