Re: any recommendation of using EnhanceIO?


 



> On 18 Aug 2015, at 16:44, Nick Fisk <nick@xxxxxxxxxx> wrote:
> 
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
>> Mark Nelson
>> Sent: 18 August 2015 14:51
>> To: Nick Fisk <nick@xxxxxxxxxx>; 'Jan Schermer' <jan@xxxxxxxxxxx>
>> Cc: ceph-users@xxxxxxxxxxxxxx
>> Subject: Re:  any recommendation of using EnhanceIO?
>> 
>> 
>> 
>> On 08/18/2015 06:47 AM, Nick Fisk wrote:
>>> Just to chime in, I gave dmcache a limited test but its lack of proper
>> writeback cache ruled it out for me. It only performs write back caching on
>> blocks already on the SSD, whereas I need something that works like a
>> Battery backed raid controller caching all writes.
>>> 
>>> It's amazing the 100x performance increase you get with RBD's when doing
>> sync writes and give it something like just 1GB write back cache with
>> flashcache.
>> 
>> For your use case, is it ok that data may live on the flashcache for some
>> amount of time before making to ceph to be replicated?  We've wondered
>> internally if this kind of trade-off is acceptable to customers or not should the
>> flashcache SSD fail.
> 
> Yes, I agree, it's not ideal. But I believe it's the only way to get the performance required for some workloads that need write latencies <1 ms.
> 
> I'm still in testing at the moment with the testing kernel that includes blk-mq fixes for large queue depths and max IO sizes. But if we decide to put it into production, it would be using two dual-port SAS SSDs in RAID1 across two servers for HA. As we are currently using iSCSI from these two servers, there is no real loss of availability by doing this. Generally I think that as long as you build this around the fault domains of the application you are caching, it shouldn't impact too much.
> 
> I guess for people using OpenStack and other direct RBD interfaces it may not be such an attractive option. I've been thinking that maybe Ceph needs an additional daemon with very low overheads, run on SSDs, to provide shared persistent cache devices for librbd. There's still a trade-off, maybe not as much as with flashcache, but for some workloads like databases, many people may decide it's worth it. Of course I realise this would be a lot of work and everyone is really busy, but in terms of performance gained it would most likely have a dramatic effect in making Ceph look comparable to other solutions like VSAN or ScaleIO when it comes to high-IOPS/low-latency stuff.
> 

An additional daemon that is persistent how? Isn't that what the journal does already, just too slowly?

I think the best (and easiest!) approach is to mimic what a monolithic SAN does.

Currently:
1) the client issues a blocking/atomic/sync IO
2) the rbd client sends this IO to all OSDs
3) after all OSDs "process the IO", the IO is finished and considered persistent

That has serious implications:
	* every IO is processed separately, with not much coalescing
	* the OSD processes add latency when processing each IO
	* one OSD can be momentarily slow, IO backs up, and the cluster stalls
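The cost of that last point can be sketched with a toy model (plain Python, not Ceph code; the latency distribution and the 5% "slow OSD" probability are invented for illustration). A write commits only when the slowest required replica acknowledges, so waiting for all replicas inherits every momentary hiccup, while waiting for a quorum does not:

```python
# Toy model: commit latency when waiting for ALL replicas vs a QUORUM.
# Per-replica service times are drawn from an exponential distribution,
# with an occasional large outlier standing in for a momentarily slow OSD
# (e.g. one busy with scrubbing or backfilling).
import random

random.seed(1)

def commit_latency(replica_latencies, wait_for):
    """Time until `wait_for` of the replicas have acknowledged the write."""
    return sorted(replica_latencies)[wait_for - 1]

def simulate(n_writes=10000, replicas=3, quorum=2):
    all_acks, quorum_acks = [], []
    for _ in range(n_writes):
        lat = [random.expovariate(1 / 0.5)
               + (5.0 if random.random() < 0.05 else 0.0)  # rare slow OSD
               for _ in range(replicas)]
        all_acks.append(commit_latency(lat, replicas))   # wait for all 3
        quorum_acks.append(commit_latency(lat, quorum))  # wait for 2 of 3
    return sum(all_acks) / n_writes, sum(quorum_acks) / n_writes

avg_all, avg_quorum = simulate()
print(f"avg commit latency waiting for all 3:  {avg_all:.2f} (arbitrary units)")
print(f"avg commit latency waiting for 2 of 3: {avg_quorum:.2f} (arbitrary units)")
```

The quorum average is dramatically lower because a single outlier almost never affects the second-fastest acknowledgement.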

Let me just decide what "processing the IO" means with respect to my architecture and I can likely get a 100x improvement.

Let me choose:

1) WHERE the IO is persisted
Do I really need all (e.g. 3) OSDs to persist the data, or is a quorum (2) sufficient?
Not waiting for one slow OSD gives me at least some SLA during planned tasks like backfilling, scrubbing and deep-scrubbing.
Hands up, who can afford to leave deep-scrub enabled in production...

2) WHEN the IO is persisted
Do I really need all OSDs to flush the data to disk?
If all the nodes are in the same cabinet and on the same UPS then this makes sense.
But my nodes are actually in different buildings ~10km apart. The chances of power failing simultaneously, N+1 UPSes failing simultaneously, diesels failing simultaneously... when nukes start falling and all of that happens at once, then I'll start looking for backups.
Even if your nodes are in one datacentre, there are likely redundant (2+) circuits.
And even if you have just one cabinet, you can add 3x UPS in there and gain a nice speed boost.

So the IO could actually be pretty safe and happy once it gets to remote buffers on enough (quorum) nodes and waits for processing. It can be batched, it can be coalesced, it can be rewritten by subsequent updates...
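The batching/coalescing win can be sketched too (plain Python, nothing Ceph-specific; the offsets and data are made up). Once a write only needs to reach a remote buffer, repeated updates to the same block can be merged in memory, so only the final contents ever hit the backing store:

```python
# Sketch: coalescing buffered writes before flush. A later write to the
# same offset supersedes the earlier one, so N logical updates of a hot
# block cost one physical write at flush time.
class WriteBuffer:
    def __init__(self):
        self.pending = {}  # offset -> data still waiting to be flushed

    def write(self, offset, data):
        self.pending[offset] = data  # overwrite coalesces in place

    def flush(self, backend):
        """Push all pending writes to the backend; return physical write count."""
        for offset, data in sorted(self.pending.items()):
            backend[offset] = data
        n = len(self.pending)
        self.pending.clear()
        return n

buf, disk = WriteBuffer(), {}
for _ in range(100):              # e.g. a database incrementing one counter
    buf.write(4096, b"counter")
buf.write(8192, b"other")
flushed = buf.flush(disk)
print(f"{flushed} physical writes for 101 logical writes")
```

This is exactly what a battery-backed RAID controller cache does; the question in this thread is only where that buffer is allowed to live.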

3) WHAT amount of IO is stored
Do I need to have the very last transaction, or can I tolerate a minute of missing data?
Checkpoints, checksums on the last transaction, rollback (the journal already does this AFAIK)...

4) I DON'T CARE mode :-)
qemu's cache=unsafe equivalent, but set on an RBD volume/pool.
Because sometimes you just need to crunch data without really storing it persistently - how are the CERN/Hadoop/Big Data guys approaching this?
And you can't always disable flushing. Filesystems have "nobarrier" (usually), but if you need a block device for a raw database tablespace, you're pretty much SOL without lots of trickery.
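For what it's worth, the closest existing knobs today are qemu's per-drive cache mode and librbd's client-side writeback cache; neither is the per-pool "don't care" mode described above. The values below are purely illustrative, not recommendations:

```
# qemu: ignore guest flushes entirely (crash = data loss, by agreement):
#   -drive file=rbd:rbd/scratch-vol,format=raw,cache=unsafe

# ceph.conf [client] section: librbd writeback cache (per-volume, in RAM)
[client]
rbd cache = true
rbd cache size = 67108864                   ; 64 MB cache
rbd cache max dirty = 50331648              ; up to 48 MB may be dirty
rbd cache max dirty age = 5                 ; seconds before writeback
rbd cache writethrough until flush = false  ; writeback from the start
```

This only buffers on the client, so it does nothing for the sync-write latency problem unless the guest stops flushing - which is the whole point of wanting an "unsafe" mode on the RBD side.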


1) is doable eventually.

2) is doable almost immediately
	a) just ACK the IO when you get it; let the client unblock on quorum
	or
	b) drop the journal, write all data asynchronously, let the filesystem handle consistency, and let me tune dirty_writeback_centisecs to get the durability I want with respect to 3)
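For option b), the VM knobs already exist; these are illustrative values only (60 s of exposure assumed acceptable), set on the OSD data filesystem's host:

```
# Widen the writeback window so dirty data may sit in page cache ~60 s.
# Values are examples, not recommendations.
sysctl -w vm.dirty_writeback_centisecs=6000   # flusher thread wakes every 60 s
sysctl -w vm.dirty_expire_centisecs=6000      # pages count as "old" after 60 s
sysctl -w vm.dirty_background_ratio=10        # background flush threshold
sysctl -w vm.dirty_ratio=40                   # hard throttle threshold
```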

4) simple to do, unusable for production (for most of us)
	but flushing is expensive, so why flush just because file metadata changed on a QA machine?
	Dev & QA often create a higher load than production itself...

sorry, got carried away, again....

Jan


>> 
>>> 
>>> 
>>>> -----Original Message-----
>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
>>>> Of Jan Schermer
>>>> Sent: 18 August 2015 12:44
>>>> To: Mark Nelson <mnelson@xxxxxxxxxx>
>>>> Cc: ceph-users@xxxxxxxxxxxxxx
>>>> Subject: Re:  any recommendation of using EnhanceIO?
>>>> 
>>>> I did not. Not sure why now - probably for the same reason I didn't
>>>> extensively test bcache.
>>>> I'm not a real fan of device mapper though, so if I had to choose I'd
>>>> still go for bcache :-)
>>>> 
>>>> Jan
>>>> 
>>>>> On 18 Aug 2015, at 13:33, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>>>>> 
>>>>> Hi Jan,
>>>>> 
>>>>> Out of curiosity did you ever try dm-cache?  I've been meaning to
>>>>> give it a
>>>> spin but haven't had the spare cycles.
>>>>> 
>>>>> Mark
>>>>> 
>>>>> On 08/18/2015 04:00 AM, Jan Schermer wrote:
>>>>>> I already evaluated EnhanceIO in combination with CentOS 6 (and
>>>> backported 3.10 and 4.0 kernel-lt if I remember correctly).
>>>>>> It worked fine during benchmarks and stress tests, but once we ran
>>>>>> DB2
>>>> on it, it panicked within minutes and took all the data with it
>>>> (almost literally - files that weren't touched, like OS binaries, were
>>>> b0rked and the filesystem was unsalvageable).
>>>>>> If you disregard this warning - the performance gains weren't that
>>>>>> great
>>>> either, at least in a VM. It had problems when flushing to disk after
>>>> reaching dirty watermark and the block size has some
>>>> not-well-documented implications (not sure now, but I think it only
>>>> cached IO _larger_ than the block size, so if your database keeps
>>>> incrementing an XX-byte counter it will go straight to disk).
>>>>>> 
>>>>>> Flashcache doesn't respect barriers (or does it now?) - if that's
>>>>>> ok for you
>>>> then go for it; it should be stable, and I used it in the past in
>>>> production without problems.
>>>>>> 
>>>>>> bcache seemed to work fine, but I needed to
>>>>>> a) use it for root
>>>>>> b) disable and enable it on the fly (doh)
>>>>>> c) make it non-persistent (flush it) before reboot - not sure if
>>>>>> that was
>>>> possible either.
>>>>>> d) all that in a customer's VM, and that customer didn't have a
>>>>>> strong
>>>> technical background to be able to fiddle with it...
>>>>>> So I haven't tested it heavily.
>>>>>> 
>>>>>> Bcache should be the obvious choice if you are in control of the
>>>>>> environment. At least you can cry on LKML's shoulder when you lose
>>>>>> data :-)
>>>>>> 
>>>>>> Jan
>>>>>> 
>>>>>> 
>>>>>>> On 18 Aug 2015, at 01:49, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>
>>>> wrote:
>>>>>>> 
>>>>>>> What about https://github.com/Frontier314/EnhanceIO?  Last commit
>>>>>>> 2 months ago, but no external contributors :(
>>>>>>> 
>>>>>>> The nice thing about EnhanceIO is there is no need to change
>>>>>>> device name, unlike bcache, flashcache etc.
>>>>>>> 
>>>>>>> Best regards,
>>>>>>> Alex
>>>>>>> 
>>>>>>> On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz
>>>>>>> <dang@xxxxxxxxxx>
>>>> wrote:
>>>>>>>> I did some (non-ceph) work on these, and concluded that bcache
>>>>>>>> was the best supported, most stable, and fastest.  This was ~1
>>>>>>>> year ago, to take it with a grain of salt, but that's what I would
>> recommend.
>>>>>>>> 
>>>>>>>> Daniel
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ________________________________
>>>>>>>> From: "Dominik Zalewski" <dzalewski@xxxxxxxxxxx>
>>>>>>>> To: "German Anders" <ganders@xxxxxxxxxxxx>
>>>>>>>> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
>>>>>>>> Sent: Wednesday, July 1, 2015 5:28:10 PM
>>>>>>>> Subject: Re:  any recommendation of using EnhanceIO?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> I asked the same question a week or so ago (just search the
>>>>>>>> mailing list archives for EnhanceIO :) and got some interesting
>> answers.
>>>>>>>> 
>>>>>>>> Looks like the project is pretty much dead since it was bought
>>>>>>>> out by
>>>> HGST.
>>>>>>>> Even their website has some broken links in regard to EnhanceIO.
>>>>>>>> 
>>>>>>>> I'm keen to try flashcache or bcache (it's been in the mainline
>>>>>>>> kernel for some time).
>>>>>>>> 
>>>>>>>> Dominik
>>>>>>>> 
>>>>>>>> On 1 Jul 2015, at 21:13, German Anders <ganders@xxxxxxxxxxxx>
>>>> wrote:
>>>>>>>> 
>>>>>>>> Hi cephers,
>>>>>>>> 
>>>>>>>>   Is anyone out there who has implemented EnhanceIO in a production
>>>> environment?
>>>>>>>> Any recommendations? Any perf output to share showing the difference
>>>>>>>> between using it and not?
>>>>>>>> 
>>>>>>>> Thanks in advance,
>>>>>>>> 
>>>>>>>> German
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
> 
> 
> 
> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



