Re: multiple journals on SSD

There are three problems I found so far:

1) You cannot alter a partition table while it is in use. That means you need
to stop every ceph-osd that keeps its journal on the given device before
changing anything on it. Worse: you can make the change, but you cannot force
the kernel to re-read the partition table.

2) I found a udev bug with detection of the 5th and later partitions.
Basically, after you create 4 GPT partitions and then create a 5th, udev does
not create /dev/sdx5 (6, 7, and so on).

3) When I tried to automate this process (OSD creation) with Ansible, I found
it very prone to timing errors, like 'partition busy', or 'too many partitions
created in a row and not all of them visible at the next stage'. Worse: even
when I added a blockdev --rereadpt step, it failed with a 'device busy'
message. I spent a whole day trying to get it right, but at the end of the day
it still failed about 50% of the time when creating 8+ OSDs in a row. (And I
can't do it one by one - see point 1.)
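Roughly the kind of sequence that hits these races (a simplified sketch, not
the actual playbook; the device name and sizes are placeholders):

---
# add a 5th journal partition while other partitions on the same SSD are in use
sgdisk --new=5:0:+20G /dev/sdb

# ask the kernel to re-read the partition table; BLKRRPART fails with
# 'Device or resource busy' while any partition on sdb is held open
blockdev --rereadpt /dev/sdb

# and even when the table change goes through, /dev/sdb5 may not show up
ls -l /dev/sdb5
udevadm settle        # does not reliably help with the missing node
---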

The next day I rewrote the playbook around LVM. It took just an hour (with
debugging) and it works perfectly - not a single race condition. And the whole
playbook shrank to about a third of its size:

All steps:
- Configure udev to change LV owner to ceph
- Create volume group for journals
- Create logical volumes for journals
- Create data partition
- Create XFS filesystem
- Create directory
- Temporary mount
- chown for directory
- Create OSD filesystem
- Create symlink for journal
- Add OSD to ceph
- Add auth in ceph
- Unmount the temporary mount
- Activate OSD via GPT

And that's all.
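For reference, the LVM side boils down to something like the following (a
sketch with made-up names; the exact udev rule and journal size depend on the
setup):

---
# one-time: a volume group on the journal SSD
pvcreate /dev/sdb
vgcreate journals /dev/sdb

# per OSD: one logical volume as its journal
lvcreate -L 20G -n journal-osd0 journals

# udev rule so the ceph user can open the LV directly, e.g. in
# /etc/udev/rules.d/90-ceph-journals.rules:
#   ENV{DM_VG_NAME}=="journals", OWNER="ceph", GROUP="ceph", MODE="0660"

# inside the (temporarily mounted) OSD data dir, point the journal at the LV
ln -s /dev/journals/journal-osd0 /var/lib/ceph/osd/ceph-0/journal
---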

About the performance impact of LVM: I think it is negligible (as long as we
don't play with copy-on-write snapshots and other exotic things). For HDD OSDs
with journals on an SSD, the main concern is not IOPS or latency on the
journal (the HDDs will add large latency anyway), but throughput. A single SSD
is capable of 300-500MB/s of sequential writes, while the ~10 HDDs behind it
can deliver up to 1.5GB/s.

Device mapper is a pretty fast thing if it is just doing remapping.


On 07/07/2016 05:22 AM, Christian Balzer wrote:
Hello,

I have a multitude of problems with the benchmarks and conclusions
here; more below.

But first, to address the question of the OP: definitely not filesystem-based
journals.
That's another layer of overhead and delays, something I'd be willing to
ignore if we're talking about a full SSD as an OSD with an inline journal, but
not with dedicated journal SSDs.
Similar with LVM, though with a lower impact.

Partitions really are your best bet.
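If you do go with partitions, creating all the journal partitions up front
with sgdisk avoids having to re-read a live partition table later; a rough
sketch, assuming /dev/sdb is a not-yet-used journal SSD and using the GPT type
code that ceph-disk uses for journals:

---
# carve out 8 journal partitions of 20 GB each (sizes are placeholders)
for i in $(seq 1 8); do
    sgdisk --new=${i}:0:+20G \
           --typecode=${i}:45b0969e-9b03-4f30-b4c6-b4b80ceff106 \
           /dev/sdb
done
partprobe /dev/sdb    # safe here because nothing on the SSD is in use yet
udevadm settle
---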

On Wed, 6 Jul 2016 18:20:43 +0300 George Shuklin wrote:

Yes.

In my lab (not production yet) with 9 x 7200 RPM SATA drives (OSDs) and one
Intel SSDSC2BB800G4 (800 GB, 9 journals)
First and foremost, a DC 3510 with 1 DWPD endurance is not my idea of a good
journal device, even if it had the performance.
If you search in the ML archives there is at least one case where somebody
lost a full storage node precisely because their DC S3500s were worn out:
https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg28083.html

Unless you have a read-mostly cluster, a 400GB DC S3610 (same or lower
price) would be a better deal, at 50% more endurance and only slightly
lower sequential write speed.

And depending on your expected write volume (which you should
know/estimate as close as possible before buying HW), a 400GB DC S3710
might be the best deal when it comes to TBW/$.
It's 30% more expensive than your 3510, but has the same speed and an
endurance that's 5 times greater.

during random writes I got ~90%
utilization of the 9 HDDs with ~5% utilization of the SSD (2.4k IOPS). With
linear writes it is somehow worse: I got 250Mb/s on the SSD, which translated
to 240Mb/s for all OSDs combined.

This test shows us a lot of things, mostly the failings of filestore.
But it only partially shows whether an SSD is a good fit for journals or not.

How are you measuring these things on the storage node, iostat, atop?
At 250MB/s (Mb would be mega-bit) your 800 GB DC S3500 should register
about/over 50% utilization, given that its top speed is 460MB/s.

With Intel DC SSDs you can pretty much take the sequential write speed
from their specifications page and roughly expect that to be the speed of
your journal.

For example a 100GB DC S3700 (200MB/s) doing journaling for 2 plain SATA
HDDs will give us this when running "ceph tell osd.nn bench" in
parallel against 2 OSDs that share a journal SSD:
---
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdd               0.00     2.00    0.00  409.50     0.00 191370.75   934.66   146.52  356.46    0.00  356.46   2.44 100.00
sdl               0.00    85.50    0.50  120.50     2.00 49614.00   820.10     2.25   18.51    0.00   18.59   8.20  99.20
sdk               0.00    89.50    1.50  119.00     6.00 49348.00   819.15     2.04   16.91    0.00   17.13   8.23  99.20
---

Where sdd is the journal SSD and sdl/sdk are the OSD HDDs.
And the SSD is nearly at 200MB/s (and 100%).
For the record, that bench command is good for testing, but the result:
---
# ceph tell osd.30 bench
{
     "bytes_written": 1073741824,
     "blocksize": 4194304,
     "bytes_per_sec": 100960114.000000
}
---
should be taken with a grain of salt; realistically those OSDs can do
about 50MB/s sustained.
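As an aside, one way to sanity-check what an SSD can actually sustain for
journal-style (sync) writes is a direct fio run against the raw device; a
sketch only, the device name is a placeholder and the run destroys data on it:

---
fio --name=journal-test --filename=/dev/sdd \
    --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based
---

4k sync writes are the pessimistic case; with larger block sizes a good DC SSD
will approach its rated sequential speed.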

On another cluster I have 200GB DC S3700s (360MB/s), holding 3 journals for
OSDs that are 4-disk RAID10s (behind an Areca controller with 4GB of HW
cache).
Thus the results are more impressive:
---
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00   381.00    0.00  485.00     0.00 200374.00   826.28     3.16    6.49    0.00    6.49   1.53  74.20
sdb               0.00   350.50    1.00  429.00     4.00 177692.00   826.49     2.78    6.46    4.00    6.46   1.53  65.60
sdg               0.00     1.00    0.00  795.00     0.00 375514.50   944.69   143.68  180.43    0.00  180.43   1.26 100.00
---

Where sda/sdb are the OSD RAIDs and sdg is the journal SSD.
Again, a near perfect match to the Intel specifications and also an
example where the journal is the bottleneck (never mind that this cluster
is all about IOPS, not throughput).

As for the endurance mentioned above, these 200GB DC 3700s are/were
overkill:
---
233 Media_Wearout_Indicator 0x0032   098   098   000    Old_age   Always       -       0
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       4818100
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       84403
---

Again, this cluster is all about (small) IOPS, it only sees about 5MB/s
sustained I/O.
So a 3610 might have been a better fit, but not only didn't they exist back
then, it would have to be the 400GB model to match the speed, which is
more expensive.
A DC S3510 would be down 20% in terms of wearout (assuming the same size) and
of course significantly slower.
With a 480GB 3510 (similar speed) it would still be about 10% worn out and
thus still no match for the expected lifetime of this cluster.

The numbers above do correlate nicely with dd or fio tests (with 4MB
blocks) from VMs against the same clusters.
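For example, a sequential write test of that sort from inside a VM could look
like this (a sketch; file name and size are placeholders):

---
fio --name=seq-write --filename=/root/fio-test --size=4G \
    --rw=write --bs=4M --iodepth=32 --ioengine=libaio --direct=1
---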

Obviously, it sucked with cold randread too (as expected).

Reads never touch the journal SSDs.

Just for comparison, my baseline benchmark (fio/librbd, 4k, iodepth=32,
randwrite) for a single OSD in a pool with size=1 (a sketch of such a fio
invocation follows the results below):

Intel 53x and Pro 2500 Series SSDs - 600 IOPS
Intel 730
Consumer models, avoid.

and DC S35x0/3610/3700 Series SSDs - 6605 IOPS
Again, what you're comparing here is only part of the picture.
With tests as shown above you'd see significant differences.

Samsung SSD 840 Series - 739 IOPS
Also a consumer model, with impressive and unexpected deaths reported.

Christian

EDGE Boost Pro Plus 7mm - 1000 IOPS

(so the 3500 is the clear winner)
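For reference, a baseline of that shape with fio's rbd engine might look like
this (a sketch; pool, image, and client names are placeholders):

---
fio --name=rbd-baseline --ioengine=rbd \
    --clientname=admin --pool=rbd --rbdname=testimg \
    --rw=randwrite --bs=4k --iodepth=32 \
    --runtime=60 --time_based
---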

On 07/06/2016 03:22 PM, Alwin Antreich wrote:
Hi George,

Interesting result for your benchmark. Could you please supply some more
numbers? We didn't get that good a result in our tests.

Thanks.

Cheers,
Alwin


On 07/06/2016 02:03 PM, George Shuklin wrote:
Hello.

I've been testing an Intel 3500 as a journal store for a few HDD-based OSDs.
I stumbled on issues with multiple partitions (>4) and udev (sda5,
sda6, etc. sometimes do not appear after partition creation). And I'm
thinking that partitions are not that useful for OSD management,
because Linux does not allow re-reading a partition table while it
contains volumes that are in use.

So my question: how do you store many journals on an SSD? My initial
thoughts:

1) a filesystem with file-based journals
2) LVM with volumes

Anything else? Best practice?

P.S. I've done benchmarking: the 3500 can support up to 16 10k-RPM HDDs.


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


