Hi Christian,

> -----Original Message-----
> From: Christian Balzer [mailto:chibi@xxxxxxx]
> Sent: 07 July 2016 12:57
> To: ceph-users@xxxxxxxxxxxxxx
> Cc: Nick Fisk <nick@xxxxxxxxxx>
> Subject: Re: multiple journals on SSD
>
> Hello Nick,
>
> On Thu, 7 Jul 2016 09:45:58 +0100 Nick Fisk wrote:
>
> > Just to add, if you really want to go with lots of HDDs per journal,
> > then go NVMe. They are not a lot more expensive than the equivalent
> > SATA-based 3700s, but the latency is low, low, low. Here is an example
> > of a node I have just commissioned with 12 HDDs to one P3700:
> >
> > Device:   rrqm/s  wrqm/s    r/s     w/s    rkB/s     wkB/s  avgrq-sz avgqu-sz  await r_await w_await svctm %util
> > sdb         0.00    0.00  68.00    0.00  8210.00      0.00    241.47     0.26   3.85    3.85    0.00  2.09 14.20
> > sdd         2.50    0.00 198.50   22.00 24938.00   9422.00    311.66     4.34  27.80    6.21  222.64  2.45 54.00
> > sdc         0.00    0.00  63.00    0.00  7760.00      0.00    246.35     0.15   2.16    2.16    0.00  1.56  9.80
> > sda         0.00    0.00  61.50   47.00  7600.00  22424.00    553.44     2.77  25.57    2.63   55.57  3.82 41.40
> > nvme0n1     0.00   22.50   2.00 2605.00     8.00 139638.00    107.13     0.14   0.05    0.00    0.05  0.03  6.60
> > sdg         0.00    0.00  61.00   28.00  6230.00  12696.00    425.30     3.66  74.79    5.84  225.00  3.87 34.40
> > sdf         0.00    0.00  34.50   47.00  4108.00  21702.00    633.37     3.56  43.75    1.51   74.77  2.85 23.20
> > sdh         0.00    0.00  75.00   15.50  9180.00   4984.00    313.02     0.45  12.55    3.28   57.42  3.51 31.80
> > sdi         1.50    0.50 142.00   48.50 18102.00  21924.00    420.22     3.60  18.92    4.99   59.71  2.70 51.40
> > sdj         0.50    0.00  74.50    5.00  9362.00   1832.00    281.61     0.33   4.10    3.33   15.60  2.44 19.40
> > sdk         0.00    0.00  54.00    0.00  6420.00      0.00    237.78     0.12   2.30    2.30    0.00  1.70  9.20
> > sdl         0.00    0.00  21.00    1.50  2286.00     16.00    204.62     0.32  18.13   13.81   78.67  6.67 15.00
> > sde         0.00    0.00  98.00    0.00 12304.00      0.00    251.10     0.30   3.10    3.10    0.00  2.08 20.40
>
> Is that a live sample from iostat or the initial/one-shot summary?

First of all, apologies for the formatting; that looked really ugly above, fixed now. Iostat had been running for a while and I just copied one of the sections, so yes, a live sample.

> > 50us latency at 2605 IOPS!!!
>
> At less than 5% IOPS or 14% bandwidth capacity, and running more than twice as slow as the spec sheet says. ^o^
> Fast, very much so.
> But not mindnumbingly so.
>
> The real question here is, how much of that latency improvement do you see in the Ceph clients, the VMs?
>
> I'd venture not so much, given that most latency happens in Ceph.

Admittedly not much, but it's very hard to tell as it's only 1/5th of the cluster. Looking at graphs in Graphite, I can see the filestore journal latency is massively lower. The subop latency is somewhere between 1/2 and 3/4 of the older nodes. At higher queue depths the NVMe device is always showing at least 1ms lower latency, so it must be having a positive effect. My new cluster, which should be going live in a couple of weeks, will be made up of just these node types, so I will have a better idea then. They will also have 4x 3.9GHz CPUs, which go a long way towards reducing latency as well. I'm aiming for ~1ms at the client for a 4kB write (rough fio sketch below).

> That all said, I'd go for a similar setup as well, if I had a dozen storage nodes or more.
> But at my current cluster sizes that's too many eggs in one basket for me.

Yeah, I'm only at 5 nodes, but I decided that having a cold spare on hand justified the risk for the intended use (backups).

> My "largest" cluster is now up to node 5, going from 4 journal SSDs for 8 HDDs to 2 journal SSDs for 12 HDDs.

Woo-Woo!
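For the record, this is roughly how I check that client-side 4kB write latency: a minimal fio sketch using the librbd engine (needs fio built with rbd support). The pool, image and client names below are just placeholders; the average completion latency ("clat") in the output is the number I watch.

---
# single-threaded 4k random writes against an RBD image, names are placeholders
fio --name=client-4k-lat --ioengine=rbd --clientname=admin --pool=rbd \
    --rbdname=testimg --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
    --time_based --runtime=60
---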
> > Compared to one of the other nodes with 2x 100GB S3700's, 6 disks each:
>
> Well, that's not really fair, is it?
>
> Those SSDs have 5 times lower bandwidth, triple the write latency and the SATA bus instead of the PCIe zipway when compared to
> the smallest P3700.
>
> And 6 disks are a bit much for that SSD, 4 would be pushing it.
> Whereas 12 HDDs for the P model are a good match, overkill really.

No, good point, but it demonstrates the change in my mindset from 2 years ago that I think most newbies to Ceph also go through. Back then I was like "SSD, wow, fast, they will never be a problem", then I started to understand the effects of latency serialisation. The S3700's have never been above 50% utilisation as my workload is lots of very small IOs, but I quite regularly see their latency above 1ms. I guess my point is that it's not just a case of trying to make sure MB/s = # disks; there are other important factors. Ceph itself adds latency, so try and eliminate it everywhere else that you can.

> Incidentally the NVMes also are 5 times more power hungry than the SSDs, must be the PCIe stuff.
>
> Christian
>
> > Device:   rrqm/s  wrqm/s    r/s     w/s    rkB/s     wkB/s  avgrq-sz avgqu-sz  await r_await w_await svctm %util
> > sda         0.00   30.50   0.00  894.50     0.00  50082.00    111.98     0.36   0.41    0.00    0.41  0.20 17.80
> > sdb         0.00    9.00   0.00  551.00     0.00  32044.00    116.31     0.23   0.42    0.00    0.42  0.19 10.40
> > sdc         0.00    2.00   6.50   17.50   278.00   8422.00    725.00     1.08  44.92   18.46   54.74  8.08 19.40
> > sdd         0.00    0.00   0.00    0.00     0.00      0.00      0.00     0.00   0.00    0.00    0.00  0.00  0.00
> > sde         0.00    2.50  27.50   21.50  2112.00   9866.00    488.90     0.59  12.04    6.91   18.60  6.53 32.00
> > sdf         0.50    0.00  50.50    0.00  6170.00      0.00    244.36     0.18   4.63    4.63    0.00  2.10 10.60
> > md1         0.00    0.00   0.00    0.00     0.00      0.00      0.00     0.00   0.00    0.00    0.00  0.00  0.00
> > md0         0.00    0.00   0.00    0.00     0.00      0.00      0.00     0.00   0.00    0.00    0.00  0.00  0.00
> > sdg         0.00    1.50  32.00  386.50  3970.00  12188.00     77.22     0.15   0.35    0.50    0.34  0.15  6.40
> > sdh         0.00    0.00   6.00    0.00    34.00      0.00     11.33     0.07  12.67   12.67    0.00 11.00  6.60
> > sdi         0.00    0.50   1.50   19.50     6.00   8862.00    844.57     0.96  45.71   33.33   46.67  6.57 13.80
> > sdj         0.00    0.00  67.00    0.00  8214.00      0.00    245.19     0.17   2.51    2.51    0.00  1.88 12.60
> > sdk         1.50    2.50  61.00   48.00  6216.00  21020.00    499.74     2.01  18.46   11.41   27.42  5.05 55.00
> > sdm         0.00    0.00  30.50    0.00  3576.00      0.00    234.49     0.07   2.43    2.43    0.00  1.90  5.80
> > sdl         0.00    4.50  25.00   23.50  2092.00  12648.00    607.84     1.36  19.42    5.60   34.13  4.04 19.60
> > sdn         0.50    0.00  23.00    0.00  2670.00      0.00    232.17     0.07   2.96    2.96    0.00  2.43  5.60
> >
> > Pretty much 10x the latency. I'm seriously impressed with these NVMe things.
> >
> > > -----Original Message-----
> > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Christian Balzer
> > > Sent: 07 July 2016 03:23
> > > To: ceph-users@xxxxxxxxxxxxxx
> > > Subject: Re: multiple journals on SSD
> > >
> > > Hello,
> > >
> > > I have a multitude of problems with the benchmarks and conclusions here, more below.
> > >
> > > But firstly, to address the question of the OP: definitely not filesystem-based journals.
> > > Another layer of overhead and delays, something I'd be willing to ignore if we're talking about a full SSD as OSD with an inline journal, but not with journal SSDs.
> > > Similar with LVM, though with a lower impact.
> > >
> > > Partitions really are your best bet.
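For anyone finding this thread later, a rough sketch of what the partition route can look like with sgdisk. The device name, partition numbers and the 20G size are placeholders only, and the typecode is, as far as I know, the GPT type GUID that ceph-disk uses to tag journal partitions:

---
# one GPT partition per journal; device, sizes and numbers are examples only
sgdisk --new=1:0:+20G --change-name=1:"ceph journal" \
       --typecode=1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdd
sgdisk --new=2:0:+20G --change-name=2:"ceph journal" \
       --typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdd
# ask the kernel to pick up the new partitions without a reboot
partprobe /dev/sdd    # or: partx -u /dev/sdd
---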
> > > On Wed, 6 Jul 2016 18:20:43 +0300 George Shuklin wrote:
> > >
> > > > Yes.
> > > >
> > > > On my lab (not production yet) with 9 7200 RPM SATA drives (OSDs) and one
> > > > INTEL SSDSC2BB800G4 (800GB, 9 journals)
> > >
> > > First and foremost, a DC 3510 with 1 DWPD endurance is not my idea of a good journal device, even if it had the performance.
> > > If you search in the ML archives there is at least one case where somebody lost a full storage node precisely because their DC S3500s were worn out:
> > > https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg28083.html
> > >
> > > Unless you have a read-mostly cluster, a 400GB DC S3610 (same or lower price) would be a better deal, at 50% more endurance and only slightly lower sequential write speed.
> > >
> > > And depending on your expected write volume (which you should know/estimate as closely as possible before buying HW), a 400GB DC S3710 might be the best deal when it comes to TBW/$.
> > > It's 30% more expensive than your 3510, but has the same speed and an endurance that's 5 times greater.
> > >
> > > > during random write I got ~90% utilization of the 9 HDDs with ~5% utilization of the SSD (2.4k IOPS). With linear writing it was somehow worse: I got 250Mb/s on the SSD, which translated to 240Mb/s across all OSDs combined.
> > >
> > > This test shows us a lot of things, mostly the failings of filestore, but only partially whether an SSD is a good fit for journals or not.
> > >
> > > How are you measuring these things on the storage node, iostat, atop?
> > > At 250MB/s (Mb would be mega-bit) your 800GB DC S3500 should register about/over 50% utilization, given that its top speed is 460MB/s.
> > >
> > > With Intel DC SSDs you can pretty much take the sequential write speed from their specifications page and roughly expect that to be the speed of your journal.
> > >
> > > For example, a 100GB DC S3700 (200MB/s) doing journaling for 2 plain SATA HDDs will give us this when running "ceph tell osd.nn bench" in parallel against 2 OSDs that share a journal SSD:
> > > ---
> > > Device:   rrqm/s  wrqm/s   r/s     w/s   rkB/s      wkB/s  avgrq-sz avgqu-sz  await r_await w_await svctm  %util
> > > sdd         0.00    2.00  0.00  409.50    0.00  191370.75    934.66   146.52 356.46    0.00  356.46  2.44 100.00
> > > sdl         0.00   85.50  0.50  120.50    2.00   49614.00    820.10     2.25  18.51    0.00   18.59  8.20  99.20
> > > sdk         0.00   89.50  1.50  119.00    6.00   49348.00    819.15     2.04  16.91    0.00   17.13  8.23  99.20
> > > ---
> > >
> > > Where sdd is the journal SSD and sdl/sdk are the OSD HDDs.
> > > And the SSD is nearly at 200MB/s (and 100%).
> > >
> > > For the record, that bench command is good for testing, but the result:
> > > ---
> > > # ceph tell osd.30 bench
> > > {
> > >     "bytes_written": 1073741824,
> > >     "blocksize": 4194304,
> > >     "bytes_per_sec": 100960114.000000
> > > }
> > > ---
> > > should be taken with a grain of salt; realistically those OSDs can do about 50MB/s sustained.
> > >
> > > On another cluster I have 200GB DC S3700s (360MB/s), holding 3 journals for 4-disk RAID10 (4GB HW cache Areca controller) OSDs.
> > > Thus the results are more impressive:
> > > ---
> > > Device:   rrqm/s  wrqm/s   r/s     w/s   rkB/s      wkB/s  avgrq-sz avgqu-sz  await r_await w_await svctm  %util
> > > sda         0.00  381.00  0.00  485.00    0.00  200374.00    826.28     3.16   6.49    0.00    6.49  1.53  74.20
> > > sdb         0.00  350.50  1.00  429.00    4.00  177692.00    826.49     2.78   6.46    4.00    6.46  1.53  65.60
> > > sdg         0.00    1.00  0.00  795.00    0.00  375514.50    944.69   143.68 180.43    0.00  180.43  1.26 100.00
> > > ---
> > >
> > > Where sda/sdb are the OSD RAIDs and sdg is the journal SSD.
> > > Again, a near perfect match to the Intel specifications and also an example where the journal is the bottleneck (never mind that this cluster is all about IOPS, not throughput).
> > >
> > > As for the endurance mentioned above, these 200GB DC 3700s are/were overkill:
> > > ---
> > > 233 Media_Wearout_Indicator 0x0032   098   098   000    Old_age   Always       -       0
> > > 241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       4818100
> > > 242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       84403
> > > ---
> > >
> > > Again, this cluster is all about (small) IOPS; it only sees about 5MB/s sustained I/O.
> > > So a 3610 might have been a better fit, but not only didn't they exist back then, it would have had to be the 400GB model to match the speed, which is more expensive.
> > > A DC S3510 would be down 20% in terms of wearout (assuming the same size) and of course significantly slower.
> > > With a 480GB 3510 (similar speed) it would still be about 10% worn out and thus still no match for the expected lifetime of this cluster.
> > >
> > > The numbers above do correlate nicely with dd or fio tests (with 4MB blocks) from VMs against the same clusters.
> > >
> > > > Obviously, it sucked with cold randread too (as expected).
> > >
> > > Reads never touch the journal SSDs.
> > >
> > > > Just for comparison, my baseline benchmark (fio/librbd, 4k, iodepth=32, randwrite) for a single OSD in a pool with size=1:
> > > >
> > > > Intel 53x and Pro 2500 Series SSDs - 600 IOPS
> > > > Intel 730
> > > Consumer models, avoid.
> > >
> > > > and DC S35x0/3610/3700 Series SSDs - 6605 IOPS
> > > Again, what you're comparing here is only part of the picture.
> > > With tests as shown above you'd see significant differences.
> > >
> > > > Samsung SSD 840 Series - 739 IOPS
> > > Also a consumer model, with impressive and unpredicted deaths reported.
> > >
> > > Christian
> > >
> > > > EDGE Boost Pro Plus 7mm - 1000 IOPS
> > > >
> > > > (so the 3500 is the clear winner)
> > > >
> > > > On 07/06/2016 03:22 PM, Alwin Antreich wrote:
> > > > > Hi George,
> > > > >
> > > > > interesting result for your benchmark. Could you please supply some more numbers? We didn't get that good a result in our tests.
> > > > >
> > > > > Thanks.
> > > > >
> > > > > Cheers,
> > > > > Alwin
> > > > >
> > > > > On 07/06/2016 02:03 PM, George Shuklin wrote:
> > > > >> Hello.
> > > > >>
> > > > >> I've been testing an Intel 3500 as a journal store for a few HDD-based OSDs. I stumbled on issues with multiple partitions (>4) and UDEV (sda5, sda6, etc. sometimes do not appear after partition creation). And I'm thinking that partitioning is not that useful for OSD management, because Linux does not allow rereading the partition table while it contains used volumes.
> > > > >>
> > > > >> So my question: how do you store many journals on an SSD?
> > > > >> My initial thoughts:
> > > > >>
> > > > >> 1) filesystem with file-based journals
> > > > >> 2) LVM with volumes
> > > > >>
> > > > >> Anything else? Best practice?
> > > > >>
> > > > >> P.S. I've done benchmarking: the 3500 can support up to 16 10k-RPM HDDs.
> > >
> > > --
> > > Christian Balzer        Network/Systems Engineer
> > > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > > http://www.gol.com/
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com