Re: any recommendation of using EnhanceIO?

On 08/18/2015 11:08 AM, Nick Fisk wrote:
-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
Mark Nelson
Sent: 18 August 2015 15:55
To: Jan Schermer <jan@xxxxxxxxxxx>
Cc: ceph-users@xxxxxxxxxxxxxx; Nick Fisk <nick@xxxxxxxxxx>
Subject: Re:  any recommendation of using EnhanceIO?



On 08/18/2015 09:24 AM, Jan Schermer wrote:

On 18 Aug 2015, at 15:50, Mark Nelson <mnelson@xxxxxxxxxx> wrote:



On 08/18/2015 06:47 AM, Nick Fisk wrote:
Just to chime in, I gave dm-cache a limited test but its lack of a proper
writeback cache ruled it out for me. It only performs write-back caching on
blocks already on the SSD, whereas I need something that works like a
battery-backed RAID controller, caching all writes.

It's amazing, the 100x performance increase you get with RBDs when
doing sync writes if you give them something like just 1GB of write-back
cache with flashcache.
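
To make that concrete, here's a minimal sketch of the kind of sync-write test I mean (plain Python with a hypothetical /mnt/rbd mount point; fio is obviously the proper tool, so treat this as illustration only):

    # Time a stream of small writes that are each forced to stable storage
    # with fsync() - exactly the pattern a write-back cache layer absorbs.
    import os, time

    path = "/mnt/rbd/testfile"   # hypothetical file on an RBD-backed filesystem
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    buf = b"\0" * 4096
    samples = []
    for _ in range(1000):
        t0 = time.time()
        os.write(fd, buf)
        os.fsync(fd)             # every write waits for stable storage
        samples.append(time.time() - t0)
    os.close(fd)
    samples.sort()
    print("median sync 4k write: %.2f ms" % (samples[len(samples) // 2] * 1000))

Run it once against the bare RBD and once with a small flashcache device in front; the gap in the median is the difference I'm describing.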

For your use case, is it OK that data may live on the flashcache for some
amount of time before making it to Ceph to be replicated?  We've wondered
internally whether this kind of trade-off is acceptable to customers should
the flashcache SSD fail.


Was it me pestering you about it? :-)
All my customers need this desperately - people don't care about having
RPO=0 seconds when all hell breaks loose.
People care about their apps being slow all the time, which is effectively an
"outage".
I (the sysadmin) care about having consistent data so that all I have to do is
start up the VMs.

Any ideas on how to approach this? I think even checkpoints (i.e. reverting
to a known point in the past) would be great and sufficient for most people...

Here's kind of how I see the field right now:

1) Cache at the client level.  Likely the fastest, but with obvious issues like the above.
RAID1 might be an option at increased cost.  Lack of barriers in some
implementations is scary.

Agreed.


2) Cache below the OSD.  Not much recent data on this.  Not likely as fast as a
client-side cache, but likely cheaper (fewer OSD nodes than client nodes?).
Lack of barriers in some implementations is scary.

This also has the benefit of caching the LevelDB on the OSD, so you get a big performance gain there too for small sequential writes. I looked at using flashcache for this as well but decided it was adding too much complexity and risk.

I thought I read somewhere that RocksDB allows you to move its WAL to an SSD; is there anything in the pipeline for something like moving the filestore to use RocksDB?

I believe you can already do this, though I haven't tested it. You can certainly move the monitors to RocksDB (tested), and NewStore uses RocksDB as well.



3) Ceph cache tiering. Network overhead and write amplification on
promotion make this primarily useful when workloads fit mostly into the
cache tier.  Overall a safe design, but care must be taken not to over-promote.

4) Separate SSD pool.  Manual and not particularly flexible, but perhaps best
for applications that need consistently high performance.

I think it depends on the definition of performance. Currently even very fast CPUs and SSDs in their own pool will still struggle to get below 1ms of write latency. If your performance requirements are for large queue depths then you will probably be alright. If you require something that mirrors the performance of a traditional write-back cache, then even pure SSD pools can start to struggle.
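
As a back-of-the-envelope illustration (my own rough numbers, nothing measured): achievable IOPS is roughly the number of outstanding IOs divided by the per-IO latency, so a queue-depth-1 workload is capped by latency no matter how much hardware sits behind it:

    # Rough IOPS ceiling from Little's law: IOPS ~= queue depth / per-IO latency.
    def iops_ceiling(queue_depth, latency_ms):
        return queue_depth / (latency_ms / 1000.0)

    print(iops_ceiling(1, 1.0))    # QD=1  at 1 ms   -> ~1,000 IOPS, whatever the pool is built from
    print(iops_ceiling(1, 0.1))    # QD=1  at 0.1 ms -> ~10,000 IOPS, the BBWC-like behaviour I'm after
    print(iops_ceiling(32, 1.0))   # QD=32 at 1 ms   -> ~32,000 IOPS, if the cluster can absorb it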

Agreed. This is definitely the crux of the problem. The example below is a great start! It would be fantastic if we could get more feedback from the list on the relative importance of low-latency operations vs. high IOPS through concurrency. We have general suspicions but not a ton of actual data regarding what folks are seeing in practice and under what scenarios.



To give a real-world example of what I see when doing various tests, here is a rough guide to IOPS when removing a snapshot on an ESX server:

Traditional array, 10K disks = 300-600 IOPS
Ceph 7.2K + SSD journal = 100-200 IOPS (LevelDB syncing on the OSD seems to be the main limitation)
Ceph pure SSD pool = 500 IOPS (Intel S3700 SSDs)

I'd be curious to see how much jemalloc or tcmalloc 2.4 + 128MB thread cache help here. Sandisk and Intel have both done some very useful investigations; I've got some additional tests replicating some of their findings coming shortly.

Ceph cache tiering = 10-500 IOPS (as we know, misses can be very painful)

Indeed. There's some work going on in this area too. Hopefully we'll know how some of our ideas pan out later this week. Assuming excessive promotions aren't a problem, I suspect the jemalloc/tcmalloc improvements will generally make cache tiering more interesting (though buffer cache will still be the primary source of really hot cached reads).

Ceph + RBD caching with flashcache = 200-1000 IOPS (readahead can give high bursts if snapshot blocks are sequential)

Good to know!


And when copying VMs to the datastore (ESXi does this in sequential 64k IOs... yes, silly, I know):

Traditional array, 10K disks = ~100MB/s (limited by the 1Gb interface; on other arrays I guess this scales)
Ceph 7.2K + SSD journal = ~20MB/s (again, LevelDB sync seems to be the limit here for sequential writes)

This is pretty bad.  Is RBD cache enabled?

Ceph pure SSD pool = ~50MB/s (a Ceph CPU bottleneck is occurring)

Again, seems pretty rough compared to what I'd expect to see!

Ceph cache tiering = ~50MB/s when writing to a new block, <10MB/s when promoting and overwriting
Ceph + RBD caching with flashcache = as fast as the SSD will go
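
The 64k transfer size explains a lot of this: with effectively one IO in flight, throughput is just the IO size divided by the per-IO round trip. A quick sketch with made-up latencies:

    # Single-stream 64 KB copies: bandwidth is bounded by per-IO round-trip time.
    io_size_mb = 64 / 1024.0
    for latency_ms in (1.0, 2.0, 5.0):
        print("%.0f ms per IO -> ~%.0f MB/s" % (latency_ms, io_size_mb / (latency_ms / 1000.0)))
    # prints roughly 62, 31 and 12 MB/s - in the ballpark of the numbers above

Which is why anything that acknowledges the write from a local SSD cache runs at whatever the SSD can sustain.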







-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
Behalf Of Jan Schermer
Sent: 18 August 2015 12:44
To: Mark Nelson <mnelson@xxxxxxxxxx>
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re:  any recommendation of using EnhanceIO?

I did not. Not sure why now - probably for the same reason I didn't
extensively test bcache.
I'm not a real fan of device mapper though, so if I had to choose I'd still
go for bcache :-)

Jan

On 18 Aug 2015, at 13:33, Mark Nelson <mnelson@xxxxxxxxxx>
wrote:

Hi Jan,

Out of curiosity, did you ever try dm-cache?  I've been meaning to give it a
spin but haven't had the spare cycles.

Mark

On 08/18/2015 04:00 AM, Jan Schermer wrote:
I already evaluated EnhanceIO in combination with CentOS 6 (and
backported 3.10 and 4.0 kernel-lt, if I remember correctly).
It worked fine during benchmarks and stress tests, but once we ran DB2
on it, it panicked within minutes and took all the data with it
(almost literally - files that weren't touched, like OS binaries,
were b0rked and the filesystem was unsalvageable).
If you disregard this warning - the performance gains weren't that great
either, at least in a VM. It had problems when flushing to disk
after reaching the dirty watermark, and the block size has some
not-well-documented implications (not sure now, but I think it only
cached IO _larger_ than the block size, so if your database keeps
incrementing an XX-byte counter it will go straight to disk).
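
For illustration, the pattern I mean looks like this sketch (hypothetical path, and my recollection of the EnhanceIO behaviour may be off): each update is far smaller than the cache block size, so it would bypass the cache entirely:

    # Repeatedly rewriting a tiny record in place and fsync()ing it - the
    # small-IO pattern that (as far as I remember) went straight to disk.
    import os, struct

    fd = os.open("/var/lib/db/counter", os.O_RDWR | os.O_CREAT, 0o644)  # hypothetical path
    for value in range(10000):
        os.lseek(fd, 0, os.SEEK_SET)
        os.write(fd, struct.pack("<Q", value))   # 8-byte counter, well below any cache block size
        os.fsync(fd)
    os.close(fd)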

Flashcache doesn't respect barriers (or does it now?) - if that's OK for you
then go for it; it should be stable, and I used it in the past in
production without problems.

bcache seemed to work fine, but I needed to
a) use it for root
b) disable and enable it on the fly (doh)
c) make it non-persistent (flush it) before reboot - not sure if that was
possible either.
d) do all that in a customer's VM, and that customer didn't have a strong
technical background to be able to fiddle with it...
So I haven't tested it heavily.

Bcache should be the obvious choice if you are in control of the
environment. At least you can cry on LKML's shoulder when you lose data :-)

Jan


On 18 Aug 2015, at 01:49, Alex Gorbachev
<ag@xxxxxxxxxxxxxxxxxxx>
wrote:

What about https://github.com/Frontier314/EnhanceIO?  Last
commit 2 months ago, but no external contributors :(

The nice thing about EnhanceIO is that there is no need to change the
device name, unlike bcache, flashcache, etc.

Best regards,
Alex

On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz
<dang@xxxxxxxxxx>
wrote:
I did some (non-Ceph) work on these, and concluded that bcache
was the best supported, most stable, and fastest.  This was ~1
year ago, so take it with a grain of salt, but that's what I would
recommend.

Daniel


________________________________
From: "Dominik Zalewski" <dzalewski@xxxxxxxxxxx>
To: "German Anders" <ganders@xxxxxxxxxxxx>
Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Sent: Wednesday, July 1, 2015 5:28:10 PM
Subject: Re:  any recommendation of using
EnhanceIO?


Hi,

I asked the same question in the last week or so (just search the
mailing list archives for EnhanceIO :) and got some interesting
answers.

Looks like the project has been pretty much dead since it was bought out by
HGST. Even their website has some broken links regarding EnhanceIO.

I'm keen to try flashcache or bcache (it's been in the mainline
kernel for some time).

Dominik

On 1 Jul 2015, at 21:13, German Anders
<ganders@xxxxxxxxxxxx>
wrote:

Hi cephers,

    Is anyone out there running EnhanceIO in a production
environment? Any recommendations? Any perf output to share showing the
difference between using it and not?

Thanks in advance,

German


Nick Fisk
Technical Support Engineer

System Professional Ltd
tel: 01825 830000
mob: 07711377522
fax: 01825 830001
mail: Nick.Fisk@xxxxxxxxxxxxx
web: www.sys-pro.co.uk

IT SUPPORT SERVICES | VIRTUALISATION | STORAGE | BACKUP AND DR | IT CONSULTING

Registered Office:
Wilderness Barns, Wilderness Lane, Hadlow Down, East Sussex, TN22 4HU
Registered in England and Wales.
Company Number: 04754200



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



