Hi Nick,

What size NVMe drives are you running per 12 disks? In my current setup I have 4x P3700 per 36 disks, but I feel like I could get by with 2… Just looking for community experience :-)

Cheers,
Zoltan

> On 07 Jul 2016, at 10:45, Nick Fisk <nick@xxxxxxxxxx> wrote:
>
> Just to add: if you really want to go with lots of HDDs per journal device, then go
> NVMe. They are not a lot more expensive than the equivalent SATA-based 3700s,
> but the latency is low, low, low. Here is an example of a node I have just
> commissioned with 12 HDDs to one P3700:
>
> Device:   rrqm/s  wrqm/s    r/s      w/s      rkB/s      wkB/s   avgrq-sz  avgqu-sz   await  r_await  w_await  svctm  %util
> sdb         0.00    0.00   68.00     0.00    8210.00       0.00    241.47      0.26    3.85     3.85     0.00   2.09  14.20
> sdd         2.50    0.00  198.50    22.00   24938.00    9422.00    311.66      4.34   27.80     6.21   222.64   2.45  54.00
> sdc         0.00    0.00   63.00     0.00    7760.00       0.00    246.35      0.15    2.16     2.16     0.00   1.56   9.80
> sda         0.00    0.00   61.50    47.00    7600.00   22424.00    553.44      2.77   25.57     2.63    55.57   3.82  41.40
> nvme0n1     0.00   22.50    2.00  2605.00       8.00  139638.00    107.13      0.14    0.05     0.00     0.05   0.03   6.60
> sdg         0.00    0.00   61.00    28.00    6230.00   12696.00    425.30      3.66   74.79     5.84   225.00   3.87  34.40
> sdf         0.00    0.00   34.50    47.00    4108.00   21702.00    633.37      3.56   43.75     1.51    74.77   2.85  23.20
> sdh         0.00    0.00   75.00    15.50    9180.00    4984.00    313.02      0.45   12.55     3.28    57.42   3.51  31.80
> sdi         1.50    0.50  142.00    48.50   18102.00   21924.00    420.22      3.60   18.92     4.99    59.71   2.70  51.40
> sdj         0.50    0.00   74.50     5.00    9362.00    1832.00    281.61      0.33    4.10     3.33    15.60   2.44  19.40
> sdk         0.00    0.00   54.00     0.00    6420.00       0.00    237.78      0.12    2.30     2.30     0.00   1.70   9.20
> sdl         0.00    0.00   21.00     1.50    2286.00      16.00    204.62      0.32   18.13    13.81    78.67   6.67  15.00
> sde         0.00    0.00   98.00     0.00   12304.00       0.00    251.10      0.30    3.10     3.10     0.00   2.08  20.40
>
> 50us latency at 2605 IOPS!!!
>
> Compared to one of the other nodes with two 100GB S3700s, 6 disks each:
>
> Device:   rrqm/s  wrqm/s    r/s      w/s      rkB/s      wkB/s   avgrq-sz  avgqu-sz   await  r_await  w_await  svctm  %util
> sda         0.00   30.50    0.00   894.50       0.00   50082.00    111.98      0.36    0.41     0.00     0.41   0.20  17.80
> sdb         0.00    9.00    0.00   551.00       0.00   32044.00    116.31      0.23    0.42     0.00     0.42   0.19  10.40
> sdc         0.00    2.00    6.50    17.50     278.00    8422.00    725.00      1.08   44.92    18.46    54.74   8.08  19.40
> sdd         0.00    0.00    0.00     0.00       0.00       0.00      0.00      0.00    0.00     0.00     0.00   0.00   0.00
> sde         0.00    2.50   27.50    21.50    2112.00    9866.00    488.90      0.59   12.04     6.91    18.60   6.53  32.00
> sdf         0.50    0.00   50.50     0.00    6170.00       0.00    244.36      0.18    4.63     4.63     0.00   2.10  10.60
> md1         0.00    0.00    0.00     0.00       0.00       0.00      0.00      0.00    0.00     0.00     0.00   0.00   0.00
> md0         0.00    0.00    0.00     0.00       0.00       0.00      0.00      0.00    0.00     0.00     0.00   0.00   0.00
> sdg         0.00    1.50   32.00   386.50    3970.00   12188.00     77.22      0.15    0.35     0.50     0.34   0.15   6.40
> sdh         0.00    0.00    6.00     0.00      34.00       0.00     11.33      0.07   12.67    12.67     0.00  11.00   6.60
> sdi         0.00    0.50    1.50    19.50       6.00    8862.00    844.57      0.96   45.71    33.33    46.67   6.57  13.80
> sdj         0.00    0.00   67.00     0.00    8214.00       0.00    245.19      0.17    2.51     2.51     0.00   1.88  12.60
> sdk         1.50    2.50   61.00    48.00    6216.00   21020.00    499.74      2.01   18.46    11.41    27.42   5.05  55.00
> sdm         0.00    0.00   30.50     0.00    3576.00       0.00    234.49      0.07    2.43     2.43     0.00   1.90   5.80
> sdl         0.00    4.50   25.00    23.50    2092.00   12648.00    607.84      1.36   19.42     5.60    34.13   4.04  19.60
> sdn         0.50    0.00   23.00     0.00    2670.00       0.00    232.17      0.07    2.96     2.96     0.00   2.43   5.60
>
> Pretty much 10x the latency. I'm seriously impressed with these NVMe things.
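>
> (For anyone wanting to pull the same numbers on their own nodes: the above is
> plain iostat extended output. Something along these lines should do it; the
> interval, count and device names below are examples only:)
>
> # iostat is in the sysstat package; sample extended per-device stats in kB
> # every 2 seconds, 5 times, for the journal device and a few of its OSD HDDs.
> iostat -xk nvme0n1 sda sdb sdc 2 5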
>
>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
>> Christian Balzer
>> Sent: 07 July 2016 03:23
>> To: ceph-users@xxxxxxxxxxxxxx
>> Subject: Re: multiple journals on SSD
>>
>> Hello,
>>
>> I have a multitude of problems with the benchmarks and conclusions here,
>> more below.
>>
>> But first, to address the question of the OP: definitely not filesystem-based
>> journals. That's another layer of overhead and delays, something I'd be willing
>> to ignore if we were talking about a full SSD as OSD with an inline journal,
>> but not with journal SSDs.
>> Similar with LVM, though with a lower impact.
>>
>> Partitions really are your best bet.
>>
>> On Wed, 6 Jul 2016 18:20:43 +0300 George Shuklin wrote:
>>
>>> Yes.
>>>
>>> On my lab (not production yet) with 9 7200 SATA (OSD) and one INTEL
>>> SSDSC2BB800G4 (800G, 9 journals)
>>
>> First and foremost, a DC 3510 with 1 DWPD endurance is not my idea of a good
>> journal device, even if it had the performance.
>> If you search the ML archives there is at least one case where somebody
>> lost a full storage node precisely because their DC S3500s were worn out:
>> https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg28083.html
>>
>> Unless you have a read-mostly cluster, a 400GB DC S3610 (same or lower
>> price) would be a better deal, at 50% more endurance and only slightly lower
>> sequential write speed.
>>
>> And depending on your expected write volume (which you should know/estimate
>> as closely as possible before buying HW), a 400GB DC S3710 might be the best
>> deal when it comes to TBW/$.
>> It's 30% more expensive than your 3510, but has the same speed and an
>> endurance that's 5 times greater.
>>
>>> during random write I got ~90% utilization of the 9 HDDs with ~5%
>>> utilization of the SSD (2.4k IOPS). With linear writing it's somehow worse:
>>> I got 250Mb/s on the SSD, which translated to 240Mb for all OSDs combined.
>>>
>> This test shows us a lot of things, mostly the failings of filestore.
>> But only partially whether an SSD is a good fit for journals or not.
>>
>> How are you measuring these things on the storage node: iostat, atop?
>> At 250MB/s (Mb would be megabit) your 800GB DC S3500 should register
>> about/over 50% utilization, given that its top speed is 460MB/s.
>>
>> With Intel DC SSDs you can pretty much take the sequential write speed from
>> their specifications page and roughly expect that to be the speed of your
>> journal.
>>
>> For example, a 100GB DC S3700 (200MB/s) doing journaling for 2 plain SATA
>> HDDs will give us this when running "ceph tell osd.nn bench" in parallel
>> against 2 OSDs that share a journal SSD:
>> ---
>> Device:  rrqm/s  wrqm/s   r/s     w/s    rkB/s      wkB/s   avgrq-sz  avgqu-sz   await  r_await  w_await  svctm  %util
>> sdd        0.00    2.00  0.00  409.50    0.00  191370.75     934.66    146.52  356.46     0.00   356.46   2.44  100.00
>> sdl        0.00   85.50  0.50  120.50    2.00   49614.00     820.10      2.25   18.51     0.00    18.59   8.20   99.20
>> sdk        0.00   89.50  1.50  119.00    6.00   49348.00     819.15      2.04   16.91     0.00    17.13   8.23   99.20
>> ---
>>
>> Where sdd is the journal SSD and sdl/sdk are the OSD HDDs.
>> And the SSD is nearly at 200MB/s (and 100%).
>> For the record, that bench command is good for testing, but the result:
>> ---
>> # ceph tell osd.30 bench
>> {
>>     "bytes_written": 1073741824,
>>     "blocksize": 4194304,
>>     "bytes_per_sec": 100960114.000000
>> }
>> ---
>> should be taken with a grain of salt; realistically those OSDs can do about
>> 50MB/s sustained.
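>>
>> (If you want to reproduce this, a minimal sketch; the OSD ids below are
>> examples only, pick two OSDs that actually share one journal SSD, and adjust
>> the device names for your node:)
>> ---
>> # Fire the default 1GB bench at both OSDs at the same time, so the shared
>> # journal SSD sees their combined write stream, then wait for both to finish.
>> ceph tell osd.30 bench &
>> ceph tell osd.31 bench &
>> wait
>> # In a second terminal on the OSD node, watch the journal SSD and the HDDs:
>> iostat -xk sdd sdk sdl 2
>> ---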
>>
>> On another cluster I have 200GB DC S3700s (360MB/s), each holding 3 journals
>> for 4-disk RAID10 (4GB HW cache Areca controller) OSDs.
>> Thus the results are more impressive:
>> ---
>> Device:  rrqm/s  wrqm/s   r/s     w/s    rkB/s      wkB/s   avgrq-sz  avgqu-sz   await  r_await  w_await  svctm  %util
>> sda        0.00  381.00  0.00  485.00    0.00  200374.00     826.28      3.16    6.49     0.00     6.49   1.53   74.20
>> sdb        0.00  350.50  1.00  429.00    4.00  177692.00     826.49      2.78    6.46     4.00     6.46   1.53   65.60
>> sdg        0.00    1.00  0.00  795.00    0.00  375514.50     944.69    143.68  180.43     0.00   180.43   1.26  100.00
>> ---
>>
>> Where sda/sdb are the OSD RAIDs and sdg is the journal SSD.
>> Again, a near-perfect match to the Intel specifications and also an example
>> where the journal is the bottleneck (never mind that this cluster is all
>> about IOPS, not throughput).
>>
>> As for the endurance mentioned above, these 200GB DC 3700s are/were overkill:
>> ---
>> 233 Media_Wearout_Indicator 0x0032   098   098   000    Old_age   Always       -       0
>> 241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       4818100
>> 242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       84403
>> ---
>>
>> Again, this cluster is all about (small) IOPS; it only sees about 5MB/s
>> sustained I/O.
>> So a 3610 might have been a better fit, but not only didn't they exist back
>> then, it would have to be the 400GB model to match the speed, which is more
>> expensive.
>> A DC S3510 would be down 20% in terms of wearout (assuming the same size)
>> and of course significantly slower.
>> With a 480GB 3510 (similar speed) it would still be about 10% worn out and
>> thus still no match for the expected lifetime of this cluster.
>>
>> The numbers above do correlate nicely with dd or fio tests (with 4MB
>> blocks) from VMs against the same clusters.
>>
>>> Obviously, it sucked with cold randread too (as expected).
>>>
>> Reads never touch the journal SSDs.
>>
>>> Just for comparison, my baseline benchmark (fio/librbd, 4k,
>>> iodepth=32, randwrite) for a single OSD in a pool with size=1:
>>>
>>> Intel 53x and Pro 2500 Series SSDs - 600 IOPS
>>> Intel 730
>> Consumer models, avoid.
>>
>>> and DC S35x0/3610/3700 Series SSDs - 6605 IOPS
>> Again, what you're comparing here is only part of the picture.
>> With tests as shown above you'd see significant differences.
>>
>>> Samsung SSD 840 Series - 739 IOPS
>> Also a consumer model, with impressive and unpredicted deaths reported.
>>
>> Christian
>>
>>> EDGE Boost Pro Plus 7mm - 1000 IOPS
>>>
>>> (so the 3500 is the clear winner)
>>>
>>> On 07/06/2016 03:22 PM, Alwin Antreich wrote:
>>>> Hi George,
>>>>
>>>> Interesting result for your benchmark. Could you please supply some
>>>> more numbers? We didn't get that good a result in our tests.
>>>>
>>>> Thanks.
>>>>
>>>> Cheers,
>>>> Alwin
>>>>
>>>> On 07/06/2016 02:03 PM, George Shuklin wrote:
>>>>> Hello.
>>>>>
>>>>> I've been testing an Intel 3500 as a journal store for a few HDD-based
>>>>> OSDs. I stumbled on issues with multiple partitions (>4) and udev (sda5,
>>>>> sda6, etc. sometimes do not appear after partition creation). And I'm
>>>>> thinking that partitions are not that useful for OSD management, because
>>>>> Linux does not allow rereading the partition table while the disk
>>>>> contains in-use volumes.
>>>>>
>>>>> So my question: how do you store many journals on an SSD? My initial
>>>>> thoughts:
>>>>>
>>>>> 1) filesystem with file-based journals
>>>>> 2) LVM with volumes
>>>>>
>>>>> Anything else? Best practice?
>>>>>
>>>>> P.S. I've done benchmarking: the 3500 can support up to 16 10k-RPM HDDs.
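>>>>>
>>>>> (For reference, a sketch of the kind of 4k sync-write test commonly used
>>>>> to size journal SSDs; not my exact invocation, the device name and job
>>>>> count are placeholders, and the run overwrites the target device:)
>>>>>
>>>>> # One sync-write job per HDD whose journal the SSD would host.
>>>>> # WARNING: this writes directly to /dev/sdX and destroys its contents.
>>>>> fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
>>>>>     --rw=write --bs=4k --numjobs=16 --iodepth=1 --runtime=60 \
>>>>>     --time_based --group_reporting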
>>
>> --
>> Christian Balzer        Network/Systems Engineer
>> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
>> http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com