Re: multiple journals on SSD

Christian Balzer <chibi@xxxxxxx> · Thu, 7 Jul 2016 20:56:40 +0900

Hello Nick,

On Thu, 7 Jul 2016 09:45:58 +0100 Nick Fisk wrote:

> Just to add if you really want to go with lots of HDD's to Journals then
> go NVME. They are not a lot more expensive than the equivalent SATA based
> 3700's, but the latency is low low low. Here is an example of a node I
> have just commissioned with 12 HDD's to one P3700
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdb               0.00     0.00   68.00    0.00  8210.00     0.00   241.47     0.26    3.85    3.85    0.00   2.09  14.20
> sdd               2.50     0.00  198.50   22.00 24938.00  9422.00   311.66     4.34   27.80    6.21  222.64   2.45  54.00
> sdc               0.00     0.00   63.00    0.00  7760.00     0.00   246.35     0.15    2.16    2.16    0.00   1.56   9.80
> sda               0.00     0.00   61.50   47.00  7600.00 22424.00   553.44     2.77   25.57    2.63   55.57   3.82  41.40
> nvme0n1           0.00    22.50    2.00 2605.00     8.00 139638.00   107.13     0.14    0.05    0.00    0.05   0.03   6.60
> sdg               0.00     0.00   61.00   28.00  6230.00 12696.00   425.30     3.66   74.79    5.84  225.00   3.87  34.40
> sdf               0.00     0.00   34.50   47.00  4108.00 21702.00   633.37     3.56   43.75    1.51   74.77   2.85  23.20
> sdh               0.00     0.00   75.00   15.50  9180.00  4984.00   313.02     0.45   12.55    3.28   57.42   3.51  31.80
> sdi               1.50     0.50  142.00   48.50 18102.00 21924.00   420.22     3.60   18.92    4.99   59.71   2.70  51.40
> sdj               0.50     0.00   74.50    5.00  9362.00  1832.00   281.61     0.33    4.10    3.33   15.60   2.44  19.40
> sdk               0.00     0.00   54.00    0.00  6420.00     0.00   237.78     0.12    2.30    2.30    0.00   1.70   9.20
> sdl               0.00     0.00   21.00    1.50  2286.00    16.00   204.62     0.32   18.13   13.81   78.67   6.67  15.00
> sde               0.00     0.00   98.00    0.00 12304.00     0.00   251.10     0.30    3.10    3.10    0.00   2.08  20.40
> 
Is that a live sample from iostat or the initial/one-shot summary?

> 50us latency at 2605 iops!!!
>
At less than 5% of its IOPS or 14% of its bandwidth capacity, it is running
more than twice as slow as the spec sheet says. ^o^
Fast, very much so. But not mind-numbingly so.
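
For reference, the percentages above can be back-calculated from the
nvme0n1 row of the quoted iostat output. A rough sketch in Python; the 5%
and 14% fractions are the ones stated above, and the "spec" numbers it
prints are only what those fractions imply, not values copied from a data
sheet:

```python
# Figures from the nvme0n1 row of the iostat output quoted above.
writes_per_s = 2605.00        # w/s
write_kb_per_s = 139638.00    # wkB/s

# Utilization fractions as stated in this mail (not data sheet values).
iops_fraction = 0.05          # "less than 5% IOPS"
bandwidth_fraction = 0.14     # "14% bandwidth capacity"

# Average size of a single write hitting the journal device.
avg_write_kb = write_kb_per_s / writes_per_s

# What the stated fractions imply about the device's rated limits.
implied_spec_mb_s = write_kb_per_s / 1000 / bandwidth_fraction
implied_spec_iops = writes_per_s / iops_fraction

print(f"average write size : {avg_write_kb:.1f} kB")
print(f"implied write spec : ~{implied_spec_mb_s:.0f} MB/s")
print(f"implied IOPS spec  : more than {implied_spec_iops:.0f}")
```

In other words, the device is handling writes of roughly 54 kB each and
is nowhere near saturated in this sample.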

The real question here is, how much of that latency improvement do you see
in the Ceph clients, VMs?

I'd venture not so much, given that most latency happens in Ceph.

That all said, I'd go for a similar setup as well, if I had a dozen
storage nodes or more. 
But at my current cluster sizes that's too many eggs in one basket for me.
My "largest" cluster is now up to node 5, going from 4 journal SSDs for 8
HDDs to 2 journal SSDs for 12 HDDs. Woo-Woo!

> Compared to one of the other nodes with 2 100GB S3700's, 6 disks each
> 
Well, that's not really fair, is it?

Those SSDs have 5 times lower bandwidth, triple the write latency and sit on
the SATA bus instead of the PCIe zipway when compared to the smallest P3700.

And 6 disks are a bit much for that SSD, 4 would be pushing it.
Whereas 12 HDDs for the P model are a good match, overkill really.
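
The reasoning behind those ratios is simple throughput arithmetic. A
minimal sketch, assuming the figures from the earlier mail quoted below
(about 50MB/s sustained per HDD OSD, 200MB/s sequential write for the 100GB
DC S3700) and reading "5 times lower bandwidth" in reverse for the smallest
P3700. The helper name is made up for illustration, and this looks at
sequential throughput only, not IOPS or endurance:

```python
def hdds_per_journal(ssd_seq_write_mb_s, hdd_sustained_mb_s=50.0):
    """Rough number of HDD OSDs one journal device can feed at full
    sequential speed before the journal becomes the bottleneck."""
    return ssd_seq_write_mb_s / hdd_sustained_mb_s

# 100GB DC S3700: 200MB/s sequential write (from the quoted mail).
s3700_100gb_mb_s = 200.0

# Smallest P3700: assumed to be 5x the S3700 figure, per the text above.
p3700_smallest_mb_s = 5 * s3700_100gb_mb_s

print(hdds_per_journal(s3700_100gb_mb_s))     # HDDs per 100GB S3700
print(hdds_per_journal(p3700_smallest_mb_s))  # HDDs per smallest P3700
```

Under these assumptions the S3700 tops out at 4 HDDs, while the P3700 has
headroom well beyond 12.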

Incidentally the NVMes also are 5 times more power hungry than the SSDs,
must be the PCIe stuff.

Christian

> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00    30.50    0.00  894.50     0.00 50082.00   111.98     0.36    0.41    0.00    0.41   0.20  17.80
> sdb               0.00     9.00    0.00  551.00     0.00 32044.00   116.31     0.23    0.42    0.00    0.42   0.19  10.40
> sdc               0.00     2.00    6.50   17.50   278.00  8422.00   725.00     1.08   44.92   18.46   54.74   8.08  19.40
> sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sde               0.00     2.50   27.50   21.50  2112.00  9866.00   488.90     0.59   12.04    6.91   18.60   6.53  32.00
> sdf               0.50     0.00   50.50    0.00  6170.00     0.00   244.36     0.18    4.63    4.63    0.00   2.10  10.60
> md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdg               0.00     1.50   32.00  386.50  3970.00 12188.00    77.22     0.15    0.35    0.50    0.34   0.15   6.40
> sdh               0.00     0.00    6.00    0.00    34.00     0.00    11.33     0.07   12.67   12.67    0.00  11.00   6.60
> sdi               0.00     0.50    1.50   19.50     6.00  8862.00   844.57     0.96   45.71   33.33   46.67   6.57  13.80
> sdj               0.00     0.00   67.00    0.00  8214.00     0.00   245.19     0.17    2.51    2.51    0.00   1.88  12.60
> sdk               1.50     2.50   61.00   48.00  6216.00 21020.00   499.74     2.01   18.46   11.41   27.42   5.05  55.00
> sdm               0.00     0.00   30.50    0.00  3576.00     0.00   234.49     0.07    2.43    2.43    0.00   1.90   5.80
> sdl               0.00     4.50   25.00   23.50  2092.00 12648.00   607.84     1.36   19.42    5.60   34.13   4.04  19.60
> sdn               0.50     0.00   23.00    0.00  2670.00     0.00   232.17     0.07    2.96    2.96    0.00   2.43   5.60
> 
> Pretty much 10x the latency. I'm seriously impressed with these NVME
> things.
> 
> 
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> > Of Christian Balzer
> > Sent: 07 July 2016 03:23
> > To: ceph-users@xxxxxxxxxxxxxx
> > Subject: Re:  multiple journals on SSD
> > 
> > 
> > Hello,
> > 
> > I have a multitude of problems with the benchmarks and conclusions here,
> > more below.
> > 
> > But firstly to address the question of the OP, definitely not filesystem
> > based journals.
> > Another layer of overhead and delays, something I'd be willing to ignore
> > if we're talking about a full SSD as OSD with an inline journal, but not
> > with journal SSDs.
> > Similar with LVM, though with a lower impact.
> > 
> > Partitions really are your best bet.
> > 
> > On Wed, 6 Jul 2016 18:20:43 +0300 George Shuklin wrote:
> > 
> > > Yes.
> > >
> > > On my lab (not production yet) with 9 7200 SATA (OSD) and one INTEL
> > > SSDSC2BB800G4 (800G, 9 journals)
> > 
> > First and foremost, a DC 3510 with 1 DWPD endurance is not my idea of
> > a good journal device, even if it had the performance.
> > If you search in the ML archives there is at least one case where
> > somebody lost a full storage node precisely because their DC S3500s
> > were worn out:
> > https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg28083.html
> > 
> > Unless you have a read-mostly cluster, a 400GB DC S3610 (same or lower
> > price) would be a better deal, at 50% more endurance and only slightly
> > lower sequential write speed.
> > 
> > And depending on your expected write volume (which you should
> > know/estimate as close as possible before buying HW), a 400GB DC S3710
> > might be the best deal when it comes to TBW/$.
> > It's 30% more expensive than your 3510, but has the same speed and an
> > endurance that's 5 times greater.
> > 
> > > during random write I got ~90%
> > > utilization of 9 HDD with ~5% utilization of SSD (2.4k IOPS). With
> > > linear writing it is somehow worse: I got 250Mb/s on SSD, which
> > > translated to 240Mb of all OSD combined.
> > >
> > This test shows us a lot of things, mostly the failings of filestore.
> > But only partially whether an SSD is a good fit for journals or not.
> > 
> > How are you measuring these things on the storage node, iostat, atop?
> > At 250MB/s (Mb would be mega-bit) your 800 GB DC S3500 should register
> > about/over 50% utilization, given that its top speed is 460MB/s.
> > 
> > With Intel DC SSDs you can pretty much take the sequential write speed
> > from their specifications page and roughly expect that to be the speed of
> > your journal.
> > 
> > For example a 100GB DC S3700 (200MB/s) doing journaling for 2 plain
> > SATA HDDs will give us this when running "ceph tell osd.nn bench" in
> > parallel against 2 OSDs that share a journal SSD:
> > ---
> > Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> > sdd               0.00     2.00    0.00  409.50     0.00 191370.75   934.66   146.52  356.46    0.00  356.46   2.44 100.00
> > sdl               0.00    85.50    0.50  120.50     2.00 49614.00   820.10     2.25   18.51    0.00   18.59   8.20  99.20
> > sdk               0.00    89.50    1.50  119.00     6.00 49348.00   819.15     2.04   16.91    0.00   17.13   8.23  99.20
> > ---
> > 
> > Where sdd is the journal SSD and sdl/sdk are the OSD HDDs.
> > And the SSD is nearly at 200MB/s (and 100%).
> > For the record, that bench command is good for testing, but the result:
> > ---
> > # ceph tell osd.30 bench
> > {
> >     "bytes_written": 1073741824,
> >     "blocksize": 4194304,
> >     "bytes_per_sec": 100960114.000000
> > }
> > ---
> > should be taken with a grain of salt, realistically those OSDs can do
> > about 50MB/s sustained.
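
To put the bench result quoted above into more familiar units, here is a
small Python sketch that parses that JSON output as pasted. It only
converts the numbers shown; the observation that the run is short is an
inference from those numbers and possibly just one reason the figure
overstates sustained throughput:

```python
import json

# Output of "ceph tell osd.30 bench" as pasted in the quoted mail.
bench_output = """
{
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "bytes_per_sec": 100960114.000000
}
"""

result = json.loads(bench_output)

mb_per_s = result["bytes_per_sec"] / 1e6        # decimal megabytes/s
mib_per_s = result["bytes_per_sec"] / 2**20     # binary mebibytes/s
duration_s = result["bytes_written"] / result["bytes_per_sec"]
write_count = result["bytes_written"] // result["blocksize"]

print(f"{mb_per_s:.1f} MB/s ({mib_per_s:.1f} MiB/s)")
print(f"{write_count} writes of 4MiB in about {duration_s:.1f} s")
```

So the bench writes 1GiB in 4MiB blocks and is over in roughly ten seconds,
which is a short burst compared to anything sustained.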
> > 
> > On another cluster I have 200GB DC S3700s (360MB/s), holding 3 journals
> > for 4 disk RAID10 (4GB HW cache Areca controller) OSDs.
> > Thus the results are more impressive:
> > ---
> > Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> > sda               0.00   381.00    0.00  485.00     0.00 200374.00   826.28     3.16    6.49    0.00    6.49   1.53  74.20
> > sdb               0.00   350.50    1.00  429.00     4.00 177692.00   826.49     2.78    6.46    4.00    6.46   1.53  65.60
> > sdg               0.00     1.00    0.00  795.00     0.00 375514.50   944.69   143.68  180.43    0.00  180.43   1.26 100.00
> > ---
> > 
> > Where sda/sdb are the OSD RAIDs and sdg is the journal SSD.
> > Again, a near perfect match to the Intel specifications and also an
> > example where the journal is the bottleneck (never mind that this cluster
> > is all about IOPS, not throughput).
> > 
> > As for the endurance mentioned above, these 200GB DC S3700s are/were
> > overkill:
> > ---
> > 233 Media_Wearout_Indicator 0x0032   098   098   000    Old_age   Always       -       0
> > 241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       4818100
> > 242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       84403
> > ---
> > 
> > Again, this cluster is all about (small) IOPS, it only sees about 5MB/s
> > sustained I/O.
> > So a 3610 might have been a better fit, but not only didn't they exist back
> > then, it would have to be the 400GB model to match the speed, which is more
> > expensive.
> > A DC S3510 would be down 20% in terms of wearout (assuming same size)
> > and of course significantly slower.
> > With a 480GB 3510 (similar speed) it would still be about 10% worn out
> > and thus still no match for the expected life time of this cluster.
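
The wearout estimates above follow from the SMART values with a bit of
arithmetic. A sketch in Python, under two labeled assumptions: the 10:1
endurance ratio between the S3700 and the S3510 is only what the 2% to 20%
step above implies, and wear is assumed to scale inversely with capacity.
Neither is taken from a data sheet here:

```python
# Raw values from the SMART output quoted above (200GB DC S3700).
host_writes_32mib = 4818100    # attribute 241, units of 32MiB
wearout_indicator = 98         # attribute 233, normalized value

# Total data written by the host so far.
bytes_written = host_writes_32mib * 32 * 2**20
tb_written = bytes_written / 1e12

# Share of rated endurance consumed on the S3700.
wear_used_pct = 100 - wearout_indicator

# ASSUMPTION: 10x lower endurance for the S3510 at the same size,
# as implied by the 2% -> 20% estimate in the text.
endurance_ratio = 10
s3510_same_size_pct = wear_used_pct * endurance_ratio

# ASSUMPTION: endurance scales linearly with capacity (200GB -> 480GB).
s3510_480gb_pct = s3510_same_size_pct * 200 / 480

print(f"host writes      : {tb_written:.1f} TB")
print(f"S3700 200GB wear : {wear_used_pct}%")
print(f"S3510 200GB wear : {s3510_same_size_pct}%")
print(f"S3510 480GB wear : {s3510_480gb_pct:.1f}%")
```

That gives a bit over 160TB written so far, and the 480GB case comes out at
around 8%, in line with the "about 10%" estimate above.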
> > 
> > The numbers above do correlate nicely with dd or fio tests (with 4MB
> > blocks) from VMs against the same clusters.
> > 
> > > Obviously, it sucked with cold randread too (as expected).
> > >
> > Reads never touch the journal SSDs.
> > 
> > > Just for comparison, my baseline benchmark (fio/librbd, 4k,
> > > iodepth=32, randwrite) for single OSD in the pool with size=1:
> > >
> > > Intel 53x and Pro 2500 Series SSDs - 600 IOPS Intel 730
> > Consumer models, avoid.
> > 
> > > and DC S35x0/3610/3700 Series SSDs - 6605 IOPS
> > Again, what you're comparing here is only part of the picture.
> > With tests as shown above you'd see significant differences.
> > 
> > > Samsung SSD 840 Series - 739 IOPS
> > Also consumer model, with impressive and unpredicted deaths reported.
> > 
> > Christian
> > 
> > > EDGE Boost Pro Plus 7mm - 1000 IOPS
> > >
> > > (so 3500 is clear winner)
> > >
> > > On 07/06/2016 03:22 PM, Alwin Antreich wrote:
> > > > Hi George,
> > > >
> > > > interesting result for your benchmark. May you please supply some
> > > > more numbers? As we didn't get that good of a result on our tests.
> > > >
> > > > Thanks.
> > > >
> > > > Cheers,
> > > > Alwin
> > > >
> > > >
> > > > On 07/06/2016 02:03 PM, George Shuklin wrote:
> > > >> Hello.
> > > >>
> > > >> I've been testing Intel 3500 as a journal store for a few HDD-based
> > > >> OSDs. I stumbled on issues with multiple partitions (>4) and UDEV
> > > >> (sda5, sda6, etc. sometimes do not appear after partition creation).
> > > >> And I'm thinking that partitions are not that useful for OSD
> > > >> management, because Linux does not allow partition rereading when it
> > > >> contains used volumes.
> > > >>
> > > >> So my question: How do you store many journals on SSD? My initial
> > > >> thoughts:
> > > >>
> > > >> 1)  filesystem with filebased journals
> > > >> 2) LVM with volumes
> > > >>
> > > >> Anything else? Best practice?
> > > >>
> > > >> P.S. I've done benchmarking: 3500 can support up to 16 10k-RPM
> > > >> HDD.
> > > >> _______________________________________________
> > > >> ceph-users mailing list
> > > >> ceph-users@xxxxxxxxxxxxxx
> > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > >
> > 
> > 
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> 
> 

-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com