Re: multiple journals on SSD

Just to add: if you really want to go with lots of HDDs per journal device,
then go NVMe. They are not a lot more expensive than the equivalent
SATA-based DC S3700s, but the latency is low, low, low. Here is an example of
a node I have just commissioned with 12 HDDs journaling to one P3700:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00   68.00    0.00  8210.00      0.00   241.47     0.26    3.85    3.85    0.00   2.09  14.20
sdd               2.50     0.00  198.50   22.00 24938.00   9422.00   311.66     4.34   27.80    6.21  222.64   2.45  54.00
sdc               0.00     0.00   63.00    0.00  7760.00      0.00   246.35     0.15    2.16    2.16    0.00   1.56   9.80
sda               0.00     0.00   61.50   47.00  7600.00  22424.00   553.44     2.77   25.57    2.63   55.57   3.82  41.40
nvme0n1           0.00    22.50    2.00 2605.00     8.00 139638.00   107.13     0.14    0.05    0.00    0.05   0.03   6.60
sdg               0.00     0.00   61.00   28.00  6230.00  12696.00   425.30     3.66   74.79    5.84  225.00   3.87  34.40
sdf               0.00     0.00   34.50   47.00  4108.00  21702.00   633.37     3.56   43.75    1.51   74.77   2.85  23.20
sdh               0.00     0.00   75.00   15.50  9180.00   4984.00   313.02     0.45   12.55    3.28   57.42   3.51  31.80
sdi               1.50     0.50  142.00   48.50 18102.00  21924.00   420.22     3.60   18.92    4.99   59.71   2.70  51.40
sdj               0.50     0.00   74.50    5.00  9362.00   1832.00   281.61     0.33    4.10    3.33   15.60   2.44  19.40
sdk               0.00     0.00   54.00    0.00  6420.00      0.00   237.78     0.12    2.30    2.30    0.00   1.70   9.20
sdl               0.00     0.00   21.00    1.50  2286.00     16.00   204.62     0.32   18.13   13.81   78.67   6.67  15.00
sde               0.00     0.00   98.00    0.00 12304.00      0.00   251.10     0.30    3.10    3.10    0.00   2.08  20.40

50us write latency at 2605 IOPS!!!
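That figure comes straight out of the w/s and w_await columns. A quick sketch of pulling them for one device from `iostat -x` style output; the device name and the field positions (w/s is field 5, w_await field 12 in the sysstat layout shown above) are assumptions, so adjust them for your sysstat version:

```shell
# Extract write IOPS (w/s, field 5) and write latency in ms (w_await,
# field 12) for one device from an "iostat -x" data line.
line='nvme0n1  0.00  22.50  2.00  2605.00  8.00  139638.00  107.13  0.14  0.05  0.00  0.05  0.03  6.60'
w_iops=$(echo "$line" | awk '{print $5}')
w_await_ms=$(echo "$line" | awk '{print $12}')
echo "write IOPS: $w_iops, write await: ${w_await_ms} ms"
```

0.05 ms is where the 50us comes from.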

Compared to one of the other nodes with two 100GB S3700s, six disks each:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    30.50    0.00  894.50     0.00 50082.00   111.98     0.36    0.41    0.00    0.41   0.20  17.80
sdb               0.00     9.00    0.00  551.00     0.00 32044.00   116.31     0.23    0.42    0.00    0.42   0.19  10.40
sdc               0.00     2.00    6.50   17.50   278.00  8422.00   725.00     1.08   44.92   18.46   54.74   8.08  19.40
sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sde               0.00     2.50   27.50   21.50  2112.00  9866.00   488.90     0.59   12.04    6.91   18.60   6.53  32.00
sdf               0.50     0.00   50.50    0.00  6170.00     0.00   244.36     0.18    4.63    4.63    0.00   2.10  10.60
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdg               0.00     1.50   32.00  386.50  3970.00 12188.00    77.22     0.15    0.35    0.50    0.34   0.15   6.40
sdh               0.00     0.00    6.00    0.00    34.00     0.00    11.33     0.07   12.67   12.67    0.00  11.00   6.60
sdi               0.00     0.50    1.50   19.50     6.00  8862.00   844.57     0.96   45.71   33.33   46.67   6.57  13.80
sdj               0.00     0.00   67.00    0.00  8214.00     0.00   245.19     0.17    2.51    2.51    0.00   1.88  12.60
sdk               1.50     2.50   61.00   48.00  6216.00 21020.00   499.74     2.01   18.46   11.41   27.42   5.05  55.00
sdm               0.00     0.00   30.50    0.00  3576.00     0.00   234.49     0.07    2.43    2.43    0.00   1.90   5.80
sdl               0.00     4.50   25.00   23.50  2092.00 12648.00   607.84     1.36   19.42    5.60   34.13   4.04  19.60
sdn               0.50     0.00   23.00    0.00  2670.00     0.00   232.17     0.07    2.96    2.96    0.00   2.43   5.60

Pretty much 10x the latency. I'm seriously impressed with these NVMe things.


> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Christian Balzer
> Sent: 07 July 2016 03:23
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  multiple journals on SSD
> 
> 
> Hello,
> 
> I have a multitude of problems with the benchmarks and conclusions here,
> more below.
> 
> But firstly, to address the question of the OP: definitely not
> filesystem-based journals.
> Another layer of overhead and delays, something I'd be willing to ignore if
> we're talking about a full SSD as OSD with an inline journal, but not with
> journal SSDs.
> Similar with LVM, though with a lower impact.
> 
> Partitions really are your best bet.
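For completeness, partitioning a shared journal SSD can be sketched with sgdisk. The device path, the 20G journal size, and the four-partition count below are made-up values for illustration; 45b0969e-9b03-4f30-b4c6-b4b80ceff106 is the partition type GUID that ceph-disk/udev key on for journals. The loop only prints the commands (drop the echo to actually run them):

```shell
# Hypothetical journal SSD; adjust device, size and partition count.
DEV=/dev/nvme0n1
cmds=$(for i in 1 2 3 4; do
  # echo makes this a dry run; remove it to really partition the device
  echo "sgdisk --new=${i}:0:+20G --typecode=${i}:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --change-name=${i}:'ceph journal' ${DEV}"
done)
echo "$cmds"
```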
> 
> On Wed, 6 Jul 2016 18:20:43 +0300 George Shuklin wrote:
> 
> > Yes.
> >
> > On my lab (not production yet) with 9 7200 SATA (OSD) and one INTEL
> > SSDSC2BB800G4 (800G, 9 journals)
> 
> First and foremost, a DC 3510 with 1 DWPD endurance is not my idea of good
> journal device, even if it had the performance.
> If you search in the ML archives there is at least one case where somebody
> lost a full storage node precisely because their DC S3500s were worn out:
> https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg28083.html
> 
> Unless you have a read-mostly cluster, a 400GB DC S3610 (same or lower
> price) would be a better deal, at 50% more endurance and only slightly
> lower sequential write speed.
> 
> And depending on your expected write volume (which you should
> know/estimate as close as possible before buying HW), a 400GB DC S3710
> might be the best deal when it comes to TBW/$.
> It's 30% more expensive than your 3510, but has the same speed and an
> endurance that's 5 times greater.
> 
> > during random write I got ~90%
> > utilization of 9 HDDs with ~5% utilization of SSD (2.4k IOPS). With
> > linear writing it's somewhat worse: I got 250Mb/s on SSD, which
> > translated to 240Mb across all OSDs combined.
> >
> This test shows us a lot of things, mostly the failings of filestore.
> But only partially if a SSD is a good fit for journals or not.
> 
> How are you measuring these things on the storage node, iostat, atop?
> At 250MB/s (Mb would be mega-bit) your 800 GB DC S3500 should register
> about/over 50% utilization, given that its top speed is 460MB/s.
> 
> With Intel DC SSDs you can pretty much take the sequential write speed from
> their specifications page and roughly expect that to be the speed of your
> journal.
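That rule of thumb makes SSD-to-OSD sizing a one-liner. Both numbers below are rough figures from this thread (spec-sheet speed of the 100GB DC S3700, and the ~50MB/s sustained per-OSD write mentioned further down), not measurements:

```shell
# Rough sizing: how many filestore OSDs can share one journal SSD before
# the SSD's spec-sheet sequential write speed becomes the bottleneck.
ssd_spec_mbs=200       # 100GB DC S3700 sequential write, per Intel spec
osd_sustained_mbs=50   # rough per-OSD sustained write from this thread
max_osds=$((ssd_spec_mbs / osd_sustained_mbs))
echo "journals per SSD before it saturates: $max_osds"
```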
> 
> For example a 100GB DC S3700 (200MB/s) doing journaling for 2 plain SATA
> HDDs will give us this when running "ceph tell osd.nn bench" in parallel
> against 2 OSDs that share a journal SSD:
> ---
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdd               0.00     2.00    0.00  409.50     0.00 191370.75   934.66   146.52  356.46    0.00  356.46   2.44 100.00
> sdl               0.00    85.50    0.50  120.50     2.00  49614.00   820.10     2.25   18.51    0.00   18.59   8.20  99.20
> sdk               0.00    89.50    1.50  119.00     6.00  49348.00   819.15     2.04   16.91    0.00   17.13   8.23  99.20
> ---
> 
> Where sdd is the journal SSD and sdl/sdk are the OSD HDDs.
> And the SSD is nearly at 200MB/s (and 100%).
> For the record, that bench command is good for testing, but the result:
> ---
> # ceph tell osd.30 bench
> {
>     "bytes_written": 1073741824,
>     "blocksize": 4194304,
>     "bytes_per_sec": 100960114.000000
> }
> ---
> should be taken with a grain of salt; realistically those OSDs can do about
> 50MB/s sustained.
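The grain of salt is easy to put a number on: bytes_per_sec converts to roughly double the sustained figure. A throwaway conversion:

```shell
# Convert the "ceph tell osd.NN bench" bytes_per_sec figure to MB/s;
# about twice what these OSDs sustain in practice (~50MB/s).
bench_mbs=$(awk 'BEGIN { printf "%.1f", 100960114 / 1000000 }')
echo "bench result: ${bench_mbs} MB/s"
```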
> 
> On another cluster I have 200GB DC S3700s (360MB/s), holding 3 journals for
> 4-disk RAID10 (4GB HW cache Areca controller) OSDs.
> Thus the results are more impressive:
> ---
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00   381.00    0.00  485.00     0.00 200374.00   826.28     3.16    6.49    0.00    6.49   1.53  74.20
> sdb               0.00   350.50    1.00  429.00     4.00 177692.00   826.49     2.78    6.46    4.00    6.46   1.53  65.60
> sdg               0.00     1.00    0.00  795.00     0.00 375514.50   944.69   143.68  180.43    0.00  180.43   1.26 100.00
> ---
> 
> Where sda/sdb are the OSD RAIDs and sdg is the journal SSD.
> Again, a near perfect match to the Intel specifications, and also an example
> where the journal is the bottleneck (never mind that this cluster is all
> about IOPS, not throughput).
> 
> As for the endurance mentioned above, these 200GB DC 3700s are/were
> overkill:
> ---
> 233 Media_Wearout_Indicator 0x0032   098   098   000    Old_age   Always       -       0
> 241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       4818100
> 242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       84403
> 
> Again, this cluster is all about (small) IOPS; it only sees about 5MB/s
> sustained I/O.
> So a 3610 might have been a better fit, but not only didn't they exist back
> then, it would have to be the 400GB model to match the speed, which is more
> expensive.
> A DC S3510 would be down 20% in terms of wearout (assuming the same size)
> and of course significantly slower.
> With a 480GB 3510 (similar speed) it would still be about 10% worn out and
> thus still no match for the expected lifetime of this cluster.
> 
> The numbers above do correlate nicely with dd or fio tests (with 4MB
> blocks) from VMs against the same clusters.
> 
> > Obviously, it sucked with cold randread too (as expected).
> >
> Reads never touch the journal SSDs.
> 
> > Just for comparison, my baseline benchmark (fio/librbd, 4k,
> > iodepth=32, randwrite) for a single OSD in a pool with size=1:
> >
> > Intel 53x and Pro 2500 Series SSDs - 600 IOPS
> > Intel 730
> Consumer models, avoid.
> 
> > and DC S35x0/3610/3700 Series SSDs - 6605 IOPS
> Again, what you're comparing here is only part of the picture.
> With tests as shown above you'd see significant differences.
> 
> > Samsung SSD 840 Series - 739 IOPS
> Also a consumer model, with impressive and unpredictable deaths reported.
> 
> Christian
> 
> > EDGE Boost Pro Plus 7mm - 1000 IOPS
> >
> > (so 3500 is clear winner)
> >
> > On 07/06/2016 03:22 PM, Alwin Antreich wrote:
> > > Hi George,
> > >
> > > interesting result for your benchmark. May you please supply some
> > > more numbers? As we didn't get that good of a result on our tests.
> > >
> > > Thanks.
> > >
> > > Cheers,
> > > Alwin
> > >
> > >
> > > On 07/06/2016 02:03 PM, George Shuklin wrote:
> > >> Hello.
> > >>
> > >> I've been testing an Intel 3500 as a journal store for a few
> > >> HDD-based OSDs. I stumbled on issues with multiple partitions (>4)
> > >> and udev (sda5, sda6, etc. sometimes do not appear after partition
> > >> creation). And I'm thinking that partitioning is not that useful for
> > >> OSD management, because Linux does not allow rereading the partition
> > >> table while it contains in-use volumes.
> > >>
> > >> So my question: How you store many journals on SSD? My initial
> > >> thoughts:
> > >>
> > >> 1) filesystem with file-based journals
> > >> 2) LVM with volumes
> > >>
> > >> Anything else? Best practice?
> > >>
> > >> P.S. I've done benchmarking: a 3500 can support up to 16 10k-RPM HDDs.
> > >> _______________________________________________
> > >> ceph-users mailing list
> > >> ceph-users@xxxxxxxxxxxxxx
> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> 
> 
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> http://www.gol.com/



