Re: multiple journals on SSD

Hello,

I have a number of problems with the benchmarks and conclusions here;
more below.

But first, to address the OP's question: definitely not filesystem-based
journals.
That adds another layer of overhead and delays, something I'd be willing
to ignore if we're talking about a full SSD as OSD with an inline journal,
but not with dedicated journal SSDs.
The same goes for LVM, though with a lower impact.

Partitions really are your best bet.
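Something like the following keeps the journals on plain GPT partitions. A
sketch only: the device name, journal count and partition size are made up,
and the function just prints the sgdisk commands for review instead of
running them.

```shell
# Print (not execute) the sgdisk commands to carve a journal SSD into
# one partition per OSD. Device, count and size are placeholders.
gen_journal_parts() {
    dev=$1; count=$2; size=$3
    i=1
    while [ "$i" -le "$count" ]; do
        # 45b0969e-9b03-4f30-b4c6-b4b80ceff106 is the GPT type GUID
        # ceph-disk expects on journal partitions
        echo "sgdisk --new=${i}:0:+${size}" \
             "--typecode=${i}:45b0969e-9b03-4f30-b4c6-b4b80ceff106 ${dev}"
        i=$((i + 1))
    done
}

gen_journal_parts /dev/sdd 9 20G   # review the output, then pipe to sh
```

Review the output and pipe it through sh once you're happy; that way the
dodgy-UDEV moment happens once, at deploy time, not during operation.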

On Wed, 6 Jul 2016 18:20:43 +0300 George Shuklin wrote:

> Yes.
> 
> On my lab (not production yet) with 9 7200 SATA (OSD) and one INTEL 
> SSDSC2BB800G4 (800G, 9 journals) 

First and foremost, a DC S3510 with 1 DWPD endurance is not my idea of a
good journal device, even if it had the performance.
If you search in the ML archives there is at least one case where somebody
lost a full storage node precisely because their DC S3500s were worn out:
https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg28083.html

Unless you have a read-mostly cluster, a 400GB DC S3610 (same or lower
price) would be a better deal, with 50% more endurance and only slightly
lower sequential write speed.

And depending on your expected write volume (which you should know or
estimate as closely as possible before buying HW), a 400GB DC S3710 might
be the best deal when it comes to TBW/$.
It's 30% more expensive than your 3510, but has the same speed and five
times the endurance.
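Back-of-the-envelope it looks like this. All prices and TBW ratings below
are placeholders picked to match the rough ratios above, not quotes;
substitute the real spec-sheet and reseller numbers before deciding.

```shell
# Compare drives by cost per terabyte written. Prices and TBW figures
# are placeholders only; plug in the real spec-sheet numbers.
cost_per_tbw() {
    model=$1; price_usd=$2; tbw=$3
    awk -v m="$model" -v p="$price_usd" -v e="$tbw" \
        'BEGIN { printf "%-10s %7.3f $/TBW\n", m, p / e }'
}

cost_per_tbw "800G-3510" 600 1450    # example figures only
cost_per_tbw "400G-3710" 780 7300    # ~30% pricier, ~5x the endurance
```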

> during random write I got ~90% 
> utilization of 9 HDD with ~5% utilization of SSD (2.4k IOPS). With 
> linear writing it somehow worse: I got 250Mb/s on SSD, which translated 
> to 240Mb of all OSD combined.
> 
This test shows us a lot of things, mostly the failings of filestore, but
only partially whether an SSD is a good fit for journals or not.

How are you measuring these things on the storage node: iostat, atop?
At 250MB/s (Mb would be megabit) your 800GB DC S3500 should register at
or above 50% utilization, given that its top speed is 460MB/s.
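A quick way to watch this is `iostat -x` plus a small filter; the
threshold and the column layout (%util as the last field, which holds for
common sysstat builds, but check yours) are assumptions here:

```shell
# Flag devices at or above a %util threshold in `iostat -x` output,
# e.g.:  iostat -x 5 | flag_busy 90
# Assumes %util is the last column, as in common sysstat versions.
flag_busy() {
    awk -v thr="${1:-90}" \
        'NF > 10 && $NF ~ /^[0-9.]+$/ && $NF + 0 >= thr { print $1, $NF }'
}
```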

With Intel DC SSDs you can pretty much take the sequential write speed
from their specifications page and roughly expect that to be the speed of
your journal.

For example a 100GB DC S3700 (200MB/s) doing journaling for 2 plain SATA
HDDs will give us this when running "ceph tell osd.nn bench" in
parallel against 2 OSDs that share a journal SSD:
---
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdd               0.00     2.00    0.00  409.50     0.00 191370.75   934.66   146.52  356.46    0.00  356.46   2.44 100.00
sdl               0.00    85.50    0.50  120.50     2.00 49614.00   820.10     2.25   18.51    0.00   18.59   8.20  99.20
sdk               0.00    89.50    1.50  119.00     6.00 49348.00   819.15     2.04   16.91    0.00   17.13   8.23  99.20
---

Where sdd is the journal SSD and sdl/sdk are the OSD HDDs.
And the SSD is nearly at 200MB/s (and 100%).
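Running those benches in parallel is easily scripted; a sketch (the OSD
IDs are of course yours to fill in — pass `echo` as the runner for a dry
run, `eval` to actually hit the OSDs):

```shell
# Run `ceph tell osd.N bench` against several OSDs at once so their
# shared journal SSD sees the combined write stream.
bench_parallel() {
    runner=$1; shift            # "echo" = dry run, "eval" = really run
    for osd in "$@"; do
        $runner ceph tell "osd.${osd}" bench &
    done
    wait                        # collect all background jobs
}

bench_parallel echo 30 31       # prints the two commands it would run
```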
For the record, that bench command is good for testing, but the result:
---
# ceph tell osd.30 bench 
{
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "bytes_per_sec": 100960114.000000
}
---
should be taken with a grain of salt, realistically those OSDs can do
about 50MB/s sustained.
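For what it's worth, the bytes_per_sec above converts like so (plain
arithmetic, nothing Ceph-specific):

```shell
# Convert the bench output's bytes_per_sec figure to MB/s.
to_mbs() { awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 * 1024) }'; }

to_mbs 100960114    # → 96.3, roughly double the ~50MB/s sustained rate
```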

On another cluster I have 200GB DC S3700s (360MB/s), holding 3 journals
for 4 disk RAID10 (4GB HW cache Areca controller) OSDs.
Thus the results are more impressive:
---
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00   381.00    0.00  485.00     0.00 200374.00   826.28     3.16    6.49    0.00    6.49   1.53  74.20
sdb               0.00   350.50    1.00  429.00     4.00 177692.00   826.49     2.78    6.46    4.00    6.46   1.53  65.60
sdg               0.00     1.00    0.00  795.00     0.00 375514.50   944.69   143.68  180.43    0.00  180.43   1.26 100.00
---

Where sda/sdb are the OSD RAIDs and sdg is the journal SSD.
Again, a near-perfect match to the Intel specifications, and also an
example where the journal is the bottleneck (never mind that this cluster
is all about IOPS, not throughput).

As for the endurance mentioned above, these 200GB DC 3700s are/were
overkill:
---
233 Media_Wearout_Indicator 0x0032   098   098   000    Old_age   Always       -       0
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       4818100
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       84403
---
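Those lines come from `smartctl -A`; a small filter makes them readable.
Treat it as a sketch: the 32MiB raw-value unit for attribute 241 is
Intel-specific, and other vendors number their attributes differently.

```shell
# Summarize Intel DC SSD wear from `smartctl -A /dev/sdX` output:
# attribute 233 is the wearout indicator (100 = new), attribute 241
# counts host writes in 32MiB units on these drives.
wear_report() {
    awk '$1 == 233 { printf "wearout: %d/100\n", $4 }
         $1 == 241 { printf "written: %.1f TiB\n", $10 * 32 / (1024 * 1024) }'
}
# usage: smartctl -A /dev/sdg | wear_report
```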

Again, this cluster is all about (small) IOPS, it only sees about 5MB/s
sustained I/O. 
So a 3610 might have been a better fit, but not only did they not exist
back then, it would also have to be the 400GB model to match the speed,
which is more expensive.
A DC S3510 (assuming the same size) would already be about 20% worn out by
now, and of course significantly slower.
With a 480GB 3510 (similar speed) it would still be about 10% worn out and
thus still no match for the expected lifetime of this cluster.

The numbers above do correlate nicely with dd or fio tests (with 4MB
blocks) from VMs against the same clusters.
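For reference, the kind of fio run meant here. A sketch: the filename,
size and iodepth are placeholders, so point it at something expendable
inside the VM before running it.

```shell
# 4MB sequential writes with direct I/O, comparable to the journal
# throughput numbers above. Printed for review; uncomment the eval to run.
FIO_CMD="fio --name=seqwrite --rw=write --bs=4M --size=1G --direct=1 \
--ioengine=libaio --iodepth=4 --filename=/tmp/fio.test"

echo "$FIO_CMD"
# eval "$FIO_CMD" && rm -f /tmp/fio.test
```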

> Obviously, it sucked with cold randread too (as expected).
> 
Reads never touch the journal SSDs.

> Just for comparison, my baseline benchmark (fio/librbd, 4k, 
> iodepth=32, randwrite) for single OSD in the pool with size=1:
> 
> Intel 53x and Pro 2500 Series SSDs - 600 IOPS
> Intel 730 
Consumer models, avoid.

> and DC S35x0/3610/3700 Series SSDs - 6605 IOPS
Again, what you're comparing here is only part of the picture.
With tests as shown above you'd see significant differences.

> Samsung SSD 840 Series - 739 IOPS
Also a consumer model, with impressively sudden and unpredictable deaths
reported.

Christian

> EDGE Boost Pro Plus 7mm - 1000 IOPS
> 
> (so 3500 is clear winner)
> 
> On 07/06/2016 03:22 PM, Alwin Antreich wrote:
> > Hi George,
> >
> > interesting result for your benchmark. May you please supply some more
> > numbers? As we didn't get that good of a result on our tests.
> >
> > Thanks.
> >
> > Cheers,
> > Alwin
> >
> >
> > On 07/06/2016 02:03 PM, George Shuklin wrote:
> >> Hello.
> >>
> >> I've been testing an Intel 3500 as the journal store for a few
> >> HDD-based OSDs. I stumbled on issues with multiple partitions (>4)
> >> and UDEV (sda5, sda6, etc. sometimes do not appear after partition
> >> creation). And I'm thinking that partitions are not that useful for
> >> OSD management, because Linux does not allow rereading the partition
> >> table while it contains in-use volumes.
> >>
> >> So my question: How you store many journals on SSD? My initial
> >> thoughts:
> >>
> >> 1)  filesystem with filebased journals
> >> 2) LVM with volumes
> >>
> >> Anything else? Best practice?
> >>
> >> P.S. I've done benchmarking: 3500 can support up to 16 10k-RPM HDD.
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users@xxxxxxxxxxxxxx
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/


