Re: multiple journals on SSD

Hi Christian,

> -----Original Message-----
> From: Christian Balzer [mailto:chibi@xxxxxxx]
> Sent: 07 July 2016 12:57
> To: ceph-users@xxxxxxxxxxxxxx
> Cc: Nick Fisk <nick@xxxxxxxxxx>
> Subject: Re:  multiple journals on SSD
> 
> 
> Hello Nick,
> 
> On Thu, 7 Jul 2016 09:45:58 +0100 Nick Fisk wrote:
> 
> > Just to add: if you really want to go with lots of HDDs per journal
> > device, then go NVMe. They are not a lot more expensive than the
> > equivalent SATA-based 3700s, but the latency is low, low, low. Here is
> > an example of a node I have just commissioned with 12 HDDs to one P3700:
> >
> > Device:         rrqm/s   wrqm/s     r/s      w/s     rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> > sdb               0.00     0.00   68.00     0.00   8210.00      0.00   241.47     0.26    3.85    3.85    0.00   2.09  14.20
> > sdd               2.50     0.00  198.50    22.00  24938.00   9422.00   311.66     4.34   27.80    6.21  222.64   2.45  54.00
> > sdc               0.00     0.00   63.00     0.00   7760.00      0.00   246.35     0.15    2.16    2.16    0.00   1.56   9.80
> > sda               0.00     0.00   61.50    47.00   7600.00  22424.00   553.44     2.77   25.57    2.63   55.57   3.82  41.40
> > nvme0n1           0.00    22.50    2.00  2605.00      8.00 139638.00   107.13     0.14    0.05    0.00    0.05   0.03   6.60
> > sdg               0.00     0.00   61.00    28.00   6230.00  12696.00   425.30     3.66   74.79    5.84  225.00   3.87  34.40
> > sdf               0.00     0.00   34.50    47.00   4108.00  21702.00   633.37     3.56   43.75    1.51   74.77   2.85  23.20
> > sdh               0.00     0.00   75.00    15.50   9180.00   4984.00   313.02     0.45   12.55    3.28   57.42   3.51  31.80
> > sdi               1.50     0.50  142.00    48.50  18102.00  21924.00   420.22     3.60   18.92    4.99   59.71   2.70  51.40
> > sdj               0.50     0.00   74.50     5.00   9362.00   1832.00   281.61     0.33    4.10    3.33   15.60   2.44  19.40
> > sdk               0.00     0.00   54.00     0.00   6420.00      0.00   237.78     0.12    2.30    2.30    0.00   1.70   9.20
> > sdl               0.00     0.00   21.00     1.50   2286.00     16.00   204.62     0.32   18.13   13.81   78.67   6.67  15.00
> > sde               0.00     0.00   98.00     0.00  12304.00      0.00   251.10     0.30    3.10    3.10    0.00   2.08  20.40
> >
> Is that a live sample from iostat or the initial/one-shot summary?

First of all, apologies for the formatting; that looked really ugly above, fixed now. iostat had been running for a while and I just copied one of the sections, so yes, a live sample.
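
For reference, that's just extended iostat at a short interval, e.g.:

---
iostat -x 2
---

The very first report iostat prints is the since-boot summary; everything after that is a live interval, which is what's pasted above.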

> 
> > 50us latency at 2605 IOPS!!!
> >
> At less than 5% of its IOPS capacity and 14% of its bandwidth capacity, it's running more than twice as slow as the spec sheet says. ^o^ Fast, very much so.
> But not mind-numbingly so.
> 
> The real question here is, how much of that latency improvement do you see in the Ceph clients, VMs?
> 
> I'd venture not so much, given that most latency happens in Ceph.

Admittedly not much, but it's very hard to tell as it's only 1/5th of the cluster. Looking at graphs in graphite, I can see the filestore journal latency is massively lower. The subop latency is somewhere between 1/2 and 3/4 of the older nodes'. At higher queue depths the NVMe device is always showing at least 1ms lower latency, so it must be having a positive effect.
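
For anyone wanting the same numbers without graphite, they can be pulled straight from the OSD admin socket; a minimal sketch, assuming a local osd.0 and the filestore-era counter names:

---
# average journal and subop latency counters (sum / avgcount = seconds)
ceph daemon osd.0 perf dump | python -m json.tool | grep -A 2 -E '"journal_latency"|"subop_latency"'
---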

My new cluster, which should be going live in a couple of weeks, will comprise just these node types, so I will have a better idea then. They will also have 4x 3.9GHz CPUs, which go a long way towards reducing latency as well. I'm aiming for ~1ms at the client for a 4kB write.
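
The test I have in mind for that target is a queue-depth-1 4k write via librbd; a rough sketch, assuming fio was built with the rbd engine and that pool "rbd" / image "fio-test" are scratch placeholders:

---
fio --name=4k-write-lat --ioengine=rbd --clientname=admin --pool=rbd \
    --rbdname=fio-test --rw=randwrite --bs=4k --iodepth=1 \
    --runtime=60 --time_based
---

The average completion latency (clat) in the output is the number to compare against that ~1ms goal.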

> 
> That all said, I'd go for a similar setup as well, if I had a dozen storage nodes or more.
> But at my current cluster sizes that's too many eggs in one basket for me.

Yeah, I'm only at 5 nodes, but I decided that having a cold spare on hand justified the risk for the intended use (backups).

> My "largest" cluster is now up to node 5, going from 4 journal SSDs for 8 HDDs to 2 journal SSDs for 12 HDDs. Woo-Woo!
> 
> > Compared to one of the other nodes with 2x 100GB S3700s, 6 disks each:
> >
> Well, that's not really fair, is it?
> 
> Those SSDs have 5 times lower bandwidth, triple the write latency, and a SATA bus instead of a PCIe zipway, compared to the smallest P3700.
> 
> And 6 disks are a bit much for that SSD; 4 would be pushing it.
> Whereas 12 HDDs for the P model are a good match, overkill really.

No, good point, but it demonstrates the change in my mindset from 2 years ago, one that I think most newbies to Ceph also go through. Back then I was like "SSD, wow, fast, they will never be a problem"; then I started to understand the effects of latency serialisation. The S3700s have never been above 50% utilisation, as my workload is lots of very small IOs, but I quite regularly see their latency above 1ms. I guess my point was that it's not just a case of trying to make sure MB/s = # disks; there are other important factors.

Ceph itself adds latency, so try and eliminate it everywhere else that you can.

> 
> Incidentally, the NVMes are also 5 times more power-hungry than the SSDs; must be the PCIe stuff.
> 
> Christian
> 
> > Device:         rrqm/s   wrqm/s     r/s      w/s     rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> > sda               0.00    30.50    0.00   894.50      0.00  50082.00   111.98     0.36    0.41    0.00    0.41   0.20  17.80
> > sdb               0.00     9.00    0.00   551.00      0.00  32044.00   116.31     0.23    0.42    0.00    0.42   0.19  10.40
> > sdc               0.00     2.00    6.50    17.50    278.00   8422.00   725.00     1.08   44.92   18.46   54.74   8.08  19.40
> > sdd               0.00     0.00    0.00     0.00      0.00      0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> > sde               0.00     2.50   27.50    21.50   2112.00   9866.00   488.90     0.59   12.04    6.91   18.60   6.53  32.00
> > sdf               0.50     0.00   50.50     0.00   6170.00      0.00   244.36     0.18    4.63    4.63    0.00   2.10  10.60
> > md1               0.00     0.00    0.00     0.00      0.00      0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> > md0               0.00     0.00    0.00     0.00      0.00      0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> > sdg               0.00     1.50   32.00   386.50   3970.00  12188.00    77.22     0.15    0.35    0.50    0.34   0.15   6.40
> > sdh               0.00     0.00    6.00     0.00     34.00      0.00    11.33     0.07   12.67   12.67    0.00  11.00   6.60
> > sdi               0.00     0.50    1.50    19.50      6.00   8862.00   844.57     0.96   45.71   33.33   46.67   6.57  13.80
> > sdj               0.00     0.00   67.00     0.00   8214.00      0.00   245.19     0.17    2.51    2.51    0.00   1.88  12.60
> > sdk               1.50     2.50   61.00    48.00   6216.00  21020.00   499.74     2.01   18.46   11.41   27.42   5.05  55.00
> > sdm               0.00     0.00   30.50     0.00   3576.00      0.00   234.49     0.07    2.43    2.43    0.00   1.90   5.80
> > sdl               0.00     4.50   25.00    23.50   2092.00  12648.00   607.84     1.36   19.42    5.60   34.13   4.04  19.60
> > sdn               0.50     0.00   23.00     0.00   2670.00      0.00   232.17     0.07    2.96    2.96    0.00   2.43   5.60
> >
> > Pretty much 10x the latency. I'm seriously impressed with these NVMe
> > things.
> >
> >
> > > -----Original Message-----
> > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> > > Behalf Of Christian Balzer
> > > Sent: 07 July 2016 03:23
> > > To: ceph-users@xxxxxxxxxxxxxx
> > > Subject: Re:  multiple journals on SSD
> > >
> > >
> > > Hello,
> > >
> > > I have a multitude of problems with the benchmarks and conclusions
> > > here, more below.
> > >
> > > But firstly, to address the question of the OP: definitely not
> > > filesystem-based journals.
> > > Another layer of overhead and delays, something I'd be willing to
> > > ignore if we're talking about a full SSD as OSD with an inline
> > > journal, but not with journal SSDs.
> > > Similar with LVM, though with a lower impact.
> > >
> > > Partitions really are your best bet.
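
To expand on the partition route, since George hit the udev/re-read problem: a sketch only, where /dev/sdX, the partition number and the size are placeholders, and the typecode is the standard Ceph journal partition type GUID that ceph-disk uses:

---
# create a 20G journal partition, tagged with the Ceph journal type GUID
sgdisk --new=5:0:+20G \
       --typecode=5:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdX

# a full BLKRRPART re-read fails while any partition on the disk is in
# use; partx -u only adds the new entries, then wait for the device nodes
partx -u /dev/sdX
udevadm settle
---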
> > > On Wed, 6 Jul 2016 18:20:43 +0300 George Shuklin wrote:
> > >
> > > > Yes.
> > > >
> > > > In my lab (not production yet), with 9x 7200 RPM SATA HDDs (OSDs)
> > > > and one INTEL SSDSC2BB800G4 (800GB, 9 journals),
> > >
> > > First and foremost, a DC 3510 with 1 DWPD endurance is not my idea
> > > of a good journal device, even if it had the performance.
> > > If you search in the ML archives there is at least one case where
> > > somebody lost a full storage node precisely because their DC S3500s
> > > were worn out:
> > > https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg28083.html
> > >
> > > Unless you have a read-mostly cluster, a 400GB DC S3610 (same or
> > > lower price) would be a better deal, at 50% more endurance and only
> > > slightly lower sequential write speed.
> > >
> > > And depending on your expected write volume (which you should
> > > know/estimate as closely as possible before buying HW), a 400GB DC
> > > S3710 might be the best deal when it comes to TBW/$.
> > > It's 30% more expensive than your 3510, but has the same speed and
> > > an endurance that's 5 times greater.
> > >
> > > > during random writes I got ~90%
> > > > utilization of the 9 HDDs with ~5% utilization of the SSD (2.4k
> > > > IOPS). With linear writes it is somewhat worse: I got 250Mb/s on
> > > > the SSD, which translated to 240Mb/s across all OSDs combined.
> > > >
> > > This test shows us a lot of things, mostly the failings of filestore.
> > > But it only partially shows whether an SSD is a good fit for journals or not.
> > >
> > > How are you measuring these things on the storage node, iostat, atop?
> > > At 250MB/s (Mb would be mega-bit) your 800 GB DC S3500 should
> > > register about/over 50% utilization, given that its top speed is 460MB/s.
> > >
> > > With Intel DC SSDs you can pretty much take the sequential write
> > > speed from their specifications page and roughly expect that to be
> > > the speed of your journal.
> > >
> > > For example a 100GB DC S3700 (200MB/s) doing journaling for 2 plain
> > > SATA HDDs will give us this when running "ceph tell osd.nn bench" in
> > > parallel against 2 OSDs that share a journal SSD:
> > > ---
> > > Device:         rrqm/s   wrqm/s     r/s      w/s     rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> > > sdd               0.00     2.00    0.00   409.50      0.00 191370.75   934.66   146.52  356.46    0.00  356.46   2.44 100.00
> > > sdl               0.00    85.50    0.50   120.50      2.00  49614.00   820.10     2.25   18.51    0.00   18.59   8.20  99.20
> > > sdk               0.00    89.50    1.50   119.00      6.00  49348.00   819.15     2.04   16.91    0.00   17.13   8.23  99.20
> > > ---
> > >
> > > Where sdd is the journal SSD and sdl/sdk are the OSD HDDs.
> > > And the SSD is nearly at 200MB/s (and 100%).
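
For clarity, "in parallel" above just means firing the default 1GB bench at both OSDs at once; e.g. (OSD ids are placeholders):

---
for osd in 20 21; do ceph tell osd.$osd bench & done; wait
---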
> > > For the record, that bench command is good for testing, but the result:
> > > ---
> > > # ceph tell osd.30 bench
> > > {
> > >     "bytes_written": 1073741824,
> > >     "blocksize": 4194304,
> > >     "bytes_per_sec": 100960114.000000 }
> > > ---
> > > should be taken with a grain of salt; realistically those OSDs can
> > > do about 50MB/s sustained.
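
A longer sustained run is one way to get at the more honest number; a sketch, assuming a throwaway pool named "bench":

---
rados bench -p bench 300 write --no-cleanup
rados -p bench cleanup    # remove the benchmark objects afterwards
---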
> > >
> > > On another cluster I have 200GB DC S3700s (360MB/s), holding 3
> > > journals for 4-disk RAID10 (4GB HW cache Areca controller) OSDs.
> > > Thus the results are more impressive:
> > > ---
> > > Device:         rrqm/s   wrqm/s     r/s      w/s     rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> > > sda               0.00   381.00    0.00   485.00      0.00 200374.00   826.28     3.16    6.49    0.00    6.49   1.53  74.20
> > > sdb               0.00   350.50    1.00   429.00      4.00 177692.00   826.49     2.78    6.46    4.00    6.46   1.53  65.60
> > > sdg               0.00     1.00    0.00   795.00      0.00 375514.50   944.69   143.68  180.43    0.00  180.43   1.26 100.00
> > > ---
> > >
> > > Where sda/sdb are the OSD RAIDs and sdg is the journal SSD.
> > > Again, a near perfect match to the Intel specifications and also an
> > > example where the journal is the bottleneck (never mind that this
> > > cluster is all about IOPS, not throughput).
> > >
> > > As for the endurance mentioned above, these 200GB DC 3700s are/were
> > > overkill:
> > > ---
> > > 233 Media_Wearout_Indicator 0x0032   098   098   000    Old_age   Always       -       0
> > > 241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       4818100
> > > 242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       84403
> > > ---
> > >
> > > Again, this cluster is all about (small) IOPS; it only sees about
> > > 5MB/s sustained I/O.
> > > So a 3610 might have been a better fit, but not only didn't they
> > > exist back then, it would have to be the 400GB model to match the
> > > speed, which is more expensive.
> > > A DC S3510 would be down 20% in terms of wearout (assuming the same
> > > size) and of course significantly slower.
> > > With a 480GB 3510 (similar speed) it would still be about 10% worn
> > > out and thus still no match for the expected lifetime of this cluster.
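
Those attributes come straight from SMART, so wearout is cheap to keep an eye on; for example:

---
smartctl -A /dev/sdX | grep -E 'Media_Wearout_Indicator|Host_Writes_32MiB'
---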
> > >
> > > The numbers above do correlate nicely with dd or fio tests (with 4MB
> > > blocks) from VMs against the same clusters.
> > >
> > > > Obviously, it sucked with cold randread too (as expected).
> > > >
> > > Reads never touch the journal SSDs.
> > >
> > > > Just for comparison, my baseline benchmark (fio/librbd, 4k,
> > > > iodepth=32, randwrite) for a single OSD in a pool with size=1:
> > > >
> > > > Intel 53x and Pro 2500 Series SSDs - 600 IOPS
> > > Consumer models, avoid.
> > >
> > > > Intel 730 and DC S35x0/3610/3700 Series SSDs - 6605 IOPS
> > > Again, what you're comparing here is only part of the picture.
> > > With tests as shown above you'd see significant differences.
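
The usual quick suitability test for a journal SSD is a single-job synchronous 4k write straight to the device; a sketch (destructive, so only on an empty disk, /dev/sdX being a placeholder):

---
fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based
---

Consumer SSDs that look fine in async benchmarks often collapse to a few hundred IOPS on this test, while the DC parts barely slow down, which lines up with the gap above.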
> > >
> > > > Samsung SSD 840 Series - 739 IOPS
> > > Also a consumer model, with impressive and unpredicted deaths reported.
> > >
> > > Christian
> > >
> > > > EDGE Boost Pro Plus 7mm - 1000 IOPS
> > > >
> > > > (so the 3500 is the clear winner)
> > > >
> > > > On 07/06/2016 03:22 PM, Alwin Antreich wrote:
> > > > > Hi George,
> > > > >
> > > > > interesting result for your benchmark. Could you please supply
> > > > > some more numbers? We didn't get that good a result in our tests.
> > > > >
> > > > > Thanks.
> > > > >
> > > > > Cheers,
> > > > > Alwin
> > > > >
> > > > >
> > > > > On 07/06/2016 02:03 PM, George Shuklin wrote:
> > > > >> Hello.
> > > > >>
> > > > >> I've been testing an Intel 3500 as a journal store for a few
> > > > >> HDD-based OSDs. I stumbled on issues with multiple partitions
> > > > >> (>4) and UDEV (sda5, sda6, etc. sometimes do not appear after
> > > > >> partition creation).
> > > > >> And I'm thinking that partitions are not that useful for OSD
> > > > >> management, because Linux does not allow re-reading the
> > > > >> partition table while it contains in-use volumes.
> > > > >>
> > > > >> So my question: how do you store many journals on an SSD? My
> > > > >> initial thoughts:
> > > > >>
> > > > >> 1) a filesystem with file-based journals
> > > > >> 2) LVM with volumes
> > > > >>
> > > > >> Anything else? Best practice?
> > > > >>
> > > > >> P.S. I've done benchmarking: the 3500 can support up to 16
> > > > >> 10k-RPM HDDs.
> > > >
> > >
> > >
> > > --
> > > Christian Balzer        Network/Systems Engineer
> > > chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> > > http://www.gol.com/
> >
> >
> 
> 
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


