Re: multiple journals on SSD


Hi Nick,

What size NVMe drives are you running per 12 disks?

In my current setup I have 4x P3700 per 36 disks, but I feel like I could get by with 2… Just looking for community experience :-)

Cheers,
Zoltan

> On 07 Jul 2016, at 10:45, Nick Fisk <nick@xxxxxxxxxx> wrote:
> 
> Just to add: if you really want to go with lots of HDDs per journal, then go
> NVMe. They are not a lot more expensive than the equivalent SATA-based
> 3700's, but the latency is low, low, low. Here is an example of a node I have
> just commissioned with 12 HDDs to one P3700:
> 
> Device:         rrqm/s   wrqm/s     r/s      w/s     rkB/s      wkB/s  avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdb               0.00     0.00   68.00     0.00   8210.00      0.00    241.47     0.26    3.85    3.85    0.00   2.09  14.20
> sdd               2.50     0.00  198.50    22.00  24938.00   9422.00    311.66     4.34   27.80    6.21  222.64   2.45  54.00
> sdc               0.00     0.00   63.00     0.00   7760.00      0.00    246.35     0.15    2.16    2.16    0.00   1.56   9.80
> sda               0.00     0.00   61.50    47.00   7600.00  22424.00    553.44     2.77   25.57    2.63   55.57   3.82  41.40
> nvme0n1           0.00    22.50    2.00  2605.00      8.00 139638.00    107.13     0.14    0.05    0.00    0.05   0.03   6.60
> sdg               0.00     0.00   61.00    28.00   6230.00  12696.00    425.30     3.66   74.79    5.84  225.00   3.87  34.40
> sdf               0.00     0.00   34.50    47.00   4108.00  21702.00    633.37     3.56   43.75    1.51   74.77   2.85  23.20
> sdh               0.00     0.00   75.00    15.50   9180.00   4984.00    313.02     0.45   12.55    3.28   57.42   3.51  31.80
> sdi               1.50     0.50  142.00    48.50  18102.00  21924.00    420.22     3.60   18.92    4.99   59.71   2.70  51.40
> sdj               0.50     0.00   74.50     5.00   9362.00   1832.00    281.61     0.33    4.10    3.33   15.60   2.44  19.40
> sdk               0.00     0.00   54.00     0.00   6420.00      0.00    237.78     0.12    2.30    2.30    0.00   1.70   9.20
> sdl               0.00     0.00   21.00     1.50   2286.00     16.00    204.62     0.32   18.13   13.81   78.67   6.67  15.00
> sde               0.00     0.00   98.00     0.00  12304.00      0.00    251.10     0.30    3.10    3.10    0.00   2.08  20.40
> 
> 50us latency at 2605 iops!!!
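As a quick sanity check on the iostat output above, Little's law (average queue depth ≈ arrival rate × average time in system) should tie the nvme0n1 row together. A minimal sketch, using only the numbers from that row:

```python
# Sanity-check the nvme0n1 row with Little's law:
# avgqu-sz ≈ (requests per second) × (average wait in seconds).
w_per_sec = 2605.00        # w/s column from iostat
w_await_ms = 0.05          # w_await column in milliseconds, i.e. ~50us

avg_queue = w_per_sec * (w_await_ms / 1000.0)
print(round(avg_queue, 2))  # ≈ 0.13, close to the reported avgqu-sz of 0.14
```

The two iostat columns agree to within rounding, which suggests the ~50us figure is genuine rather than a sampling artifact.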
> 
> Compared to one of the other nodes with 2x 100GB S3700's, 6 disks each:
> 
> Device:         rrqm/s   wrqm/s     r/s      w/s     rkB/s      wkB/s  avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00    30.50    0.00   894.50      0.00  50082.00    111.98     0.36    0.41    0.00    0.41   0.20  17.80
> sdb               0.00     9.00    0.00   551.00      0.00  32044.00    116.31     0.23    0.42    0.00    0.42   0.19  10.40
> sdc               0.00     2.00    6.50    17.50    278.00   8422.00    725.00     1.08   44.92   18.46   54.74   8.08  19.40
> sdd               0.00     0.00    0.00     0.00      0.00      0.00      0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sde               0.00     2.50   27.50    21.50   2112.00   9866.00    488.90     0.59   12.04    6.91   18.60   6.53  32.00
> sdf               0.50     0.00   50.50     0.00   6170.00      0.00    244.36     0.18    4.63    4.63    0.00   2.10  10.60
> md1               0.00     0.00    0.00     0.00      0.00      0.00      0.00     0.00    0.00    0.00    0.00   0.00   0.00
> md0               0.00     0.00    0.00     0.00      0.00      0.00      0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdg               0.00     1.50   32.00   386.50   3970.00  12188.00     77.22     0.15    0.35    0.50    0.34   0.15   6.40
> sdh               0.00     0.00    6.00     0.00     34.00      0.00     11.33     0.07   12.67   12.67    0.00  11.00   6.60
> sdi               0.00     0.50    1.50    19.50      6.00   8862.00    844.57     0.96   45.71   33.33   46.67   6.57  13.80
> sdj               0.00     0.00   67.00     0.00   8214.00      0.00    245.19     0.17    2.51    2.51    0.00   1.88  12.60
> sdk               1.50     2.50   61.00    48.00   6216.00  21020.00    499.74     2.01   18.46   11.41   27.42   5.05  55.00
> sdm               0.00     0.00   30.50     0.00   3576.00      0.00    234.49     0.07    2.43    2.43    0.00   1.90   5.80
> sdl               0.00     4.50   25.00    23.50   2092.00  12648.00    607.84     1.36   19.42    5.60   34.13   4.04  19.60
> sdn               0.50     0.00   23.00     0.00   2670.00      0.00    232.17     0.07    2.96    2.96    0.00   2.43   5.60
> 
> Pretty much 10x the latency. I'm seriously impressed with these NVMe things.
> 
> 
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
>> Christian Balzer
>> Sent: 07 July 2016 03:23
>> To: ceph-users@xxxxxxxxxxxxxx
>> Subject: Re:  multiple journals on SSD
>> 
>> 
>> Hello,
>> 
>> I have a multitude of problems with the benchmarks and conclusions
>> here, more below.
>> 
>> But firstly, to address the question of the OP: definitely not
>> filesystem-based journals. That's another layer of overhead and delays,
>> something I'd be willing to ignore if we're talking about a full SSD as
>> OSD with an inline journal, but not with journal SSDs.
>> Similar with LVM, though with a lower impact.
>> 
>> Partitions really are your best bet.
>> 
>> On Wed, 6 Jul 2016 18:20:43 +0300 George Shuklin wrote:
>> 
>>> Yes.
>>> 
>>> On my lab (not production yet) with 9 7200 SATA (OSD) and one INTEL
>>> SSDSC2BB800G4 (800G, 9 journals)
>> 
>> First and foremost, a DC S3510 with 1 DWPD endurance is not my idea of a
>> good journal device, even if it had the performance.
>> If you search in the ML archives there is at least one case where somebody
>> lost a full storage node precisely because their DC S3500s were worn out:
>> https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg28083.html
>> 
>> Unless you have a read-mostly cluster, a 400GB DC S3610 (same or lower
>> price) would be a better deal, with 50% more endurance and only slightly
>> lower sequential write speed.
>> 
>> And depending on your expected write volume (which you should
>> know/estimate as closely as possible before buying HW), a 400GB DC S3710
>> might be the best deal when it comes to TBW/$.
>> It's 30% more expensive than your 3510, but has the same speed and an
>> endurance that's 5 times greater.
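The TBW/$ argument above can be made concrete with a back-of-the-envelope calculation. A minimal sketch using only the ratios stated in the post (30% pricier, 5x the endurance); these are the post's figures, not datasheet values:

```python
# Relative write endurance per dollar: S3710 vs the S3510 in question,
# using the relative price and endurance ratios quoted in the post.
relative_price = 1.30       # S3710 is ~30% more expensive
relative_endurance = 5.0    # S3710 has ~5x the rated endurance

tbw_per_dollar_gain = relative_endurance / relative_price
print(round(tbw_per_dollar_gain, 2))  # ≈ 3.85x the endurance per dollar
```

In other words, under these assumptions the S3710 buys you nearly 4x as many lifetime journal writes per dollar spent.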
>> 
>>> during random write I got ~90%
>>> utilization of 9 HDDs with ~5% utilization of the SSD (2.4k IOPS). With
>>> linear writing it's somewhat worse: I got 250Mb/s on the SSD, which
>>> translated to 240Mb of all OSDs combined.
>>> 
>> This test shows us a lot of things, mostly the failings of filestore,
>> but only partially whether an SSD is a good fit for journals or not.
>> 
>> How are you measuring these things on the storage node, iostat or atop?
>> At 250MB/s (Mb would be megabit), your 800GB DC S3500 should register
>> about/over 50% utilization, given that its top speed is 460MB/s.
>> 
>> With Intel DC SSDs you can pretty much take the sequential write speed from
>> their specifications page and roughly expect that to be the speed of your
>> journal.
>> 
>> For example a 100GB DC S3700 (200MB/s) doing journaling for 2 plain SATA
>> HDDs will give us this when running "ceph tell osd.nn bench" in parallel
>> against 2 OSDs that share a journal SSD:
>> ---
>> Device:         rrqm/s   wrqm/s     r/s      w/s     rkB/s      wkB/s  avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> sdd               0.00     2.00    0.00   409.50      0.00  191370.75    934.66   146.52  356.46    0.00  356.46   2.44 100.00
>> sdl               0.00    85.50    0.50   120.50      2.00   49614.00    820.10     2.25   18.51    0.00   18.59   8.20  99.20
>> sdk               0.00    89.50    1.50   119.00      6.00   49348.00    819.15     2.04   16.91    0.00   17.13   8.23  99.20
>> ---
>> 
>> Where sdd is the journal SSD and sdl/sdk are the OSD HDDs.
>> And the SSD is nearly at 200MB/s (and 100%).
>> For the record, that bench command is good for testing, but the result:
>> ---
>> # ceph tell osd.30 bench
>> {
>>    "bytes_written": 1073741824,
>>    "blocksize": 4194304,
>>    "bytes_per_sec": 100960114.000000
>> }
>> ---
>> should be taken with a grain of salt; realistically those OSDs can do about
>> 50MB/s sustained.
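For reference, the bench output above is in bytes per second, so the headline number converts like this (a trivial sketch using the values from the JSON):

```python
# Convert "ceph tell osd.nn bench" output to MB/s and note how short
# the test actually was.
bytes_written = 1073741824       # 1 GiB default test write
bytes_per_sec = 100960114.0      # as reported in "bytes_per_sec"

mb_per_sec = bytes_per_sec / 1e6
print(round(mb_per_sec, 1))      # ≈ 101.0 MB/s reported by the bench
print(round(bytes_written / bytes_per_sec, 1))  # test ran only ~10.6 s
```

A ~10-second, 1 GiB burst largely measures the journal and caches, which is why the reported ~101 MB/s is roughly double what the OSD sustains over time.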
>> 
>> On another cluster I have 200GB DC S3700s (360MB/s), each holding 3 journals for
>> 4-disk RAID10 (4GB HW cache Areca controller) OSDs.
>> Thus the results are more impressive:
>> ---
>> Device:         rrqm/s   wrqm/s     r/s      w/s     rkB/s      wkB/s  avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> sda               0.00   381.00    0.00   485.00      0.00  200374.00    826.28     3.16    6.49    0.00    6.49   1.53  74.20
>> sdb               0.00   350.50    1.00   429.00      4.00  177692.00    826.49     2.78    6.46    4.00    6.46   1.53  65.60
>> sdg               0.00     1.00    0.00   795.00      0.00  375514.50    944.69   143.68  180.43    0.00  180.43   1.26 100.00
>> ---
>> 
>> Where sda/sdb are the OSD RAIDs and sdg is the journal SSD.
>> Again, a near-perfect match to the Intel specifications, and also an example
>> where the journal is the bottleneck (never mind that this cluster is all about
>> IOPS, not throughput).
>> 
>> As for the endurance mentioned above, these 200GB DC S3700s are/were
>> overkill:
>> ---
>> 233 Media_Wearout_Indicator 0x0032   098   098   000    Old_age   Always       -       0
>> 241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       4818100
>> 242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       84403
>> ---
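The SMART data above can be turned into an absolute figure: the raw Host_Writes_32MiB value counts 32 MiB units, and (on Intel DC drives) the normalized Media_Wearout_Indicator starts at 100 and counts down, so 098 suggests roughly 2% of rated endurance consumed. A minimal sketch:

```python
# Total host writes from the SMART attribute above: the raw value is
# a count of 32MiB units written by the host.
host_writes_units = 4818100
bytes_written = host_writes_units * 32 * 1024**2

tib_written = bytes_written / 1024**4
print(round(tib_written, 1))  # ≈ 147.0 TiB written over the drive's life
```

~147 TiB written for only ~2% wear is consistent with the author's point that these S3700s are overkill for this workload.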
>> 
>> Again, this cluster is all about (small) IOPS; it only sees about 5MB/s sustained
>> I/O.
>> So a 3610 might have been a better fit, but not only didn't they exist back
>> then, it would have had to be the 400GB model to match the speed, which is
>> more expensive.
>> A DC S3510 would be down 20% in terms of wearout (assuming the same size)
>> and of course significantly slower.
>> With a 480GB 3510 (similar speed) it would still be about 10% worn out and
>> thus still no match for the expected lifetime of this cluster.
>> 
>> The numbers above do correlate nicely with dd or fio tests (with 4MB
>> blocks) from VMs against the same clusters.
>> 
>>> Obviously, it sucked with cold randread too (as expected).
>>> 
>> Reads never touch the journal SSDs.
>> 
>>> Just for comparison, my baseline benchmark (fio/librbd, 4k,
>>> iodepth=32, randwrite) for a single OSD in a pool with size=1:
>>> 
>>> Intel 53x and Pro 2500 Series SSDs - 600 IOPS
>> Consumer models, avoid.
>> 
>>> Intel 730 and DC S35x0/3610/3700 Series SSDs - 6605 IOPS
>> Again, what you're comparing here is only part of the picture.
>> With tests as shown above you'd see significant differences.
>> 
>>> Samsung SSD 840 Series - 739 IOPS
>> Also a consumer model, with impressive and unpredictable deaths reported.
>> 
>> Christian
>> 
>>> EDGE Boost Pro Plus 7mm - 1000 IOPS
>>> 
>>> (so the 3500 is the clear winner)
>>> 
>>> On 07/06/2016 03:22 PM, Alwin Antreich wrote:
>>>> Hi George,
>>>> 
>>>> Interesting result for your benchmark. Could you please supply some
>>>> more numbers? We didn't get that good a result in our tests.
>>>> 
>>>> Thanks.
>>>> 
>>>> Cheers,
>>>> Alwin
>>>> 
>>>> 
>>>> On 07/06/2016 02:03 PM, George Shuklin wrote:
>>>>> Hello.
>>>>> 
>>>>> I've been testing an Intel 3500 as a journal store for a few HDD-based
>>>>> OSDs. I stumbled on issues with multiple partitions (>4) and udev (sda5,
>>>>> sda6, etc. sometimes do not appear after partition creation). And I'm
>>>>> thinking that partitioning is not that useful for OSD management,
>>>>> because Linux does not allow re-reading the partition table while it
>>>>> contains in-use volumes.
>>>>> 
>>>>> So my question: how do you store many journals on one SSD? My initial
>>>>> thoughts:
>>>>> 
>>>>> 1) filesystem with file-based journals
>>>>> 2) LVM with volumes
>>>>> 
>>>>> Anything else? Best practice?
>>>>> 
>>>>> P.S. I've done benchmarking: the 3500 can support up to 16 10k-RPM HDDs.
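The "16 HDDs per 3500" claim in the P.S. above can be cross-checked with a simple throughput budget. A rough sketch, assuming the ~460MB/s top sequential write speed quoted for the 800GB DC S3500 elsewhere in this thread; the 30MB/s sustained-write figure per 10k-RPM OSD is an illustrative assumption, not a measured value:

```python
# Journal throughput budget: how many HDD OSDs one journal SSD can feed
# before the SSD's sequential write speed becomes the bottleneck.
ssd_seq_write_mb_s = 460.0      # quoted top speed of the 800GB DC S3500
per_hdd_journal_mb_s = 30.0     # assumed sustained write per 10k-RPM OSD

max_hdds = int(ssd_seq_write_mb_s // per_hdd_journal_mb_s)
print(max_hdds)  # 15, in the same ballpark as the quoted 16 HDDs
```

Note this budgets throughput only; as Christian's iostat samples show, utilization and latency under mixed load are what actually decide whether the ratio holds up.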
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> 
>> 
>> 
>> --
>> Christian Balzer        Network/Systems Engineer
>> chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
>> http://www.gol.com/
> 
> 

