Hello,

I have a multitude of problems with the benchmarks and conclusions here, more below.
But first, to address the question of the OP: definitely not filesystem-based journals.
That's another layer of overhead and delays, something I'd be willing to ignore if we were talking about a full SSD as OSD with an inline journal, but not with dedicated journal SSDs.
The same goes for LVM, though with a lower impact.
Partitions really are your best bet.

On Wed, 6 Jul 2016 18:20:43 +0300 George Shuklin wrote:

> Yes.
>
> On my lab (not production yet) with 9 7200 SATA (OSD) and one INTEL
> SSDSC2BB800G4 (800G, 9 journals)

First and foremost, a DC S3510 with 1 DWPD endurance is not my idea of a good journal device, even if it had the performance.
If you search the ML archives there is at least one case where somebody lost a full storage node precisely because their DC S3500s were worn out:
https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg28083.html

Unless you have a read-mostly cluster, a 400GB DC S3610 (same or lower price) would be a better deal, with 50% more endurance and only slightly lower sequential write speed.
And depending on your expected write volume (which you should know/estimate as closely as possible before buying HW), a 400GB DC S3710 might be the best deal when it comes to TBW/$.
It's 30% more expensive than your S3510, but has the same speed and an endurance that's 5 times greater.

> during random write I got ~90%
> utilization of 9 HDD with ~5% utilization of SSD (2.4k IOPS). With
> linear writing it somehow worse: I got 250Mb/s on SSD, which translated
> to 240Mb of all OSD combined.
>

This test shows us a lot of things, mostly the failings of filestore, but only partially whether an SSD is a good fit for journals or not.
How are you measuring these things on the storage node, iostat, atop?
At 250MB/s (Mb would be megabit) your 800GB DC S3500 should register about/over 50% utilization, given that its top speed is 460MB/s.
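To actually check that on the storage node, something along these lines works; this is only a sketch, /dev/sdX is a placeholder for your journal SSD and the speed numbers are the ones discussed above, not measured by me:

```shell
# Watch the journal SSD while the test runs (extended stats, 2s interval):
#   iostat -x 2 /dev/sdX
#
# Back-of-the-envelope check of the %util you should expect to see:
observed_mbs=250   # sequential write speed observed on the journal SSD
spec_mbs=460       # spec-sheet sequential write speed, 800GB DC S3500
echo "expected %util ~ $(( observed_mbs * 100 / spec_mbs ))%"
```

The %util column of iostat is exactly the number quoted above, so if it disagrees wildly with that estimate, something else is going on.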
With Intel DC SSDs you can pretty much take the sequential write speed from their specification page and expect roughly that to be the speed of your journal.
For example, a 100GB DC S3700 (200MB/s) doing journaling for 2 plain SATA HDDs will give us this when running "ceph tell osd.nn bench" in parallel against 2 OSDs that share a journal SSD:
---
Device:  rrqm/s  wrqm/s   r/s     w/s    rkB/s  wkB/s      avgrq-sz  avgqu-sz  await   r_await  w_await  svctm  %util
sdd      0.00    2.00     0.00    409.50 0.00   191370.75  934.66    146.52    356.46  0.00     356.46   2.44   100.00
sdl      0.00    85.50    0.50    120.50 2.00   49614.00   820.10    2.25      18.51   0.00     18.59    8.20   99.20
sdk      0.00    89.50    1.50    119.00 6.00   49348.00   819.15    2.04      16.91   0.00     17.13    8.23   99.20
---
Where sdd is the journal SSD and sdl/sdk are the OSD HDDs.
And the SSD is nearly at 200MB/s (and 100% utilization).

For the record, that bench command is good for testing, but the result:
---
# ceph tell osd.30 bench
{
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "bytes_per_sec": 100960114.000000
}
---
should be taken with a grain of salt; realistically those OSDs can do about 50MB/s sustained.

On another cluster I have 200GB DC S3700s (360MB/s), holding 3 journals for 4-disk RAID10 (4GB HW cache Areca controller) OSDs.
Thus the results are more impressive:
---
Device:  rrqm/s  wrqm/s   r/s     w/s    rkB/s  wkB/s      avgrq-sz  avgqu-sz  await   r_await  w_await  svctm  %util
sda      0.00    381.00   0.00    485.00 0.00   200374.00  826.28    3.16      6.49    0.00     6.49     1.53   74.20
sdb      0.00    350.50   1.00    429.00 4.00   177692.00  826.49    2.78      6.46    4.00     6.46     1.53   65.60
sdg      0.00    1.00     0.00    795.00 0.00   375514.50  944.69    143.68    180.43  0.00     180.43   1.26   100.00
---
Where sda/sdb are the OSD RAIDs and sdg is the journal SSD.
Again, a near perfect match to the Intel specifications, and also an example where the journal is the bottleneck (never mind that this cluster is all about IOPS, not throughput).
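The parallel bench above is nothing fancy; a minimal sketch of it, with osd IDs 10 and 11 as placeholders for two OSDs sharing one journal SSD:

```shell
# Fire "ceph tell osd.NN bench" at both OSDs at the same time, so the
# shared journal SSD sees the combined write load; watch it with
# "iostat -x 2" in another terminal. IDs 10 and 11 are placeholders.
for id in 10 11; do
    echo ceph tell osd.${id} bench &   # drop the 'echo' on a real cluster
done
wait   # wait for both background benches to finish
```

Benching only one OSD at a time will of course never saturate a shared journal, which is why the per-SSD totals matter here.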
As for the endurance mentioned above, these 200GB DC S3700s are/were overkill:
---
233 Media_Wearout_Indicator 0x0032   098   098   000    Old_age   Always       -       0
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       4818100
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       84403
---
Again, this cluster is all about (small) IOPS, it only sees about 5MB/s sustained I/O.
So a 3610 might have been a better fit, but not only did they not exist back then, it would have to be the 400GB model to match the speed, which is more expensive.
A DC S3510 would be down 20% in terms of wearout (assuming the same size) and of course significantly slower.
With a 480GB S3510 (similar speed) it would still be about 10% worn out and thus still no match for the expected lifetime of this cluster.

The numbers above do correlate nicely with dd or fio tests (with 4MB blocks) from VMs against the same clusters.

> Obviously, it sucked with cold randread too (as expected).
>

Reads never touch the journal SSDs.

> Just for comparison, my baseline benchmark (fio/librbd, 4k,
> iodepth=32, randwrite) for single OSD in the pool with size=1:
>
> Intel 53x and Pro 2500 Series SSDs - 600 IOPS
> Intel 730

Consumer models, avoid.

> and DC S35x0/3610/3700 Series SSDs - 6605 IOPS

Again, what you're comparing here is only part of the picture.
With tests as shown above you'd see significant differences.

> Samsung SSD 840 Series - 739 IOPS

Also a consumer model, with impressive and unpredictable deaths reported.

Christian

> EDGE Boost Pro Plus 7mm - 1000 IOPS
>
> (so 3500 is clear winner)
>
> On 07/06/2016 03:22 PM, Alwin Antreich wrote:
> > Hi George,
> >
> > interesting result for your benchmark. May you please supply some more
> > numbers? As we didn't get that good of a result on our tests.
> >
> > Thanks.
> >
> > Cheers,
> > Alwin
> >
> >
> > On 07/06/2016 02:03 PM, George Shuklin wrote:
> >> Hello.
> >>
> >> I've been testing Intel 3500 as journal store for few HDD-based OSD.
> >> I stumbled on issues with multiple partitions (>4) and UDEV (sda5,
> >> sda6, etc. sometimes do not appear after partition creation). And I'm
> >> thinking that partitions are not that useful for OSD management,
> >> because linux does not allow partition rereading while it contains used
> >> volumes.
> >>
> >> So my question: how do you store many journals on a SSD? My initial
> >> thoughts:
> >>
> >> 1) filesystem with file-based journals
> >> 2) LVM with volumes
> >>
> >> Anything else? Best practice?
> >>
> >> P.S. I've done benchmarking: a 3500 can support up to 16 10k-RPM HDDs.
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users@xxxxxxxxxxxxxx
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com