Re: any recommendation of using EnhanceIO?

Nick Fisk <nick@xxxxxxxxxx> · Tue, 18 Aug 2015 22:24:38 +0100

Hi Sam,

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Samuel Just
> Sent: 18 August 2015 21:38
> To: Nick Fisk <nick@xxxxxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  any recommendation of using EnhanceIO?
> 
> 1.  We've kicked this around a bit.  What kind of failure semantics would
> you be comfortable with here (that is, what would be reasonable behavior
> if the client side cache fails)?

I would either expect to provide the cache with a redundant block device
(i.e. RAID1 SSDs) or the cache to allow itself to be configured to mirror
across two SSDs. Of course single SSDs can be used if the user accepts the
risk. If the cache did the mirroring then you could do fancy stuff like
mirror the writes, but leave the read cache blocks as single copies to
increase the cache capacity.
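
To make the "mirror the writes, single-copy the reads" idea concrete, here is
a minimal Python sketch. It is only a toy model under my own assumptions: the
class and method names are made up for illustration, two dicts stand in for
the two SSDs, and it is not how EnhanceIO or any existing cache works.

```python
class MirroredWriteCache:
    """Toy model: dirty (write) blocks are mirrored, clean (read) blocks are not."""

    def __init__(self):
        # Two dicts stand in for two cache SSDs (hypothetical: block_id -> bytes).
        self.devices = [{}, {}]
        self.dirty = set()

    def cache_write(self, block_id, data):
        # A dirty block exists nowhere else yet, so keep a copy on both SSDs.
        for dev in self.devices:
            dev[block_id] = data
        self.dirty.add(block_id)

    def cache_read_fill(self, block_id, data):
        # A clean block can always be re-read from the cluster, so one copy
        # is enough. Never overwrite a dirty block with older backing data.
        if block_id in self.dirty:
            return
        self.devices[block_id % 2][block_id] = data

    def flush(self, block_id):
        # Once written back to the cluster the block is clean: drop one copy.
        if block_id in self.dirty:
            self.dirty.discard(block_id)
            self.devices[1 - block_id % 2].pop(block_id, None)

    def survives_loss_of(self, failed):
        # True if every dirty block is still readable after one SSD fails.
        survivor = self.devices[1 - failed]
        return all(block in survivor for block in self.dirty)

    def slots_used(self):
        # Total block slots consumed across both SSDs.
        return sum(len(dev) for dev in self.devices)
```

In this model a write costs two slots until it is flushed, while a read-cached
block only ever costs one, which is where the extra capacity would come from.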

In either case, although an outage is undesirable, it's only data loss that
would be unacceptable, and that would hopefully be avoided by the mirroring.
As part of this, there would need to be a way to make sure a "dirty" RBD
can't be accessed unless the corresponding cache is also attached.
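
One way such a guard could look, as a rough Python sketch. Everything here is
hypothetical: I am assuming the image carries a small piece of metadata naming
the cache that holds its unflushed writes, and the names (ImageHeader,
dirty_cache_id, open_image) are invented for illustration, not librbd API.

```python
from dataclasses import dataclass
from typing import Optional


class DirtyCacheMissing(Exception):
    """Raised when an image has unflushed writes in a cache that is not attached."""


@dataclass
class ImageHeader:
    # Hypothetical metadata stored with the RBD image itself.
    name: str
    # ID of the cache holding unflushed writes, or None if the image is clean.
    dirty_cache_id: Optional[str] = None


def open_image(header: ImageHeader, attached_cache_id: Optional[str] = None) -> str:
    """Refuse to open an image whose latest writes live in an absent cache."""
    if header.dirty_cache_id is None:
        # Clean image: safe to open with or without a cache.
        return "opened"
    if attached_cache_id == header.dirty_cache_id:
        # The cache holding the dirty blocks is present, so reads are consistent.
        return "opened-with-cache"
    raise DirtyCacheMissing(
        f"{header.name} has unflushed writes in cache {header.dirty_cache_id}"
    )
```

The point is just that the "dirty" marker has to live with the image rather
than with the cache, otherwise another client has no way of knowing that the
data in the cluster is stale.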

I guess as it is caching the RBD and not the pool or entire cluster, the
cache only needs to match the failure requirements of the application it's
caching. If I need to cache an RBD that is on a single server, there is no
requirement to make the cache redundant across racks/PDUs/servers etc.

I hope I've answered your question?

> 2. We've got a branch which should merge soon (tomorrow probably) which
> actually does allow writes to be proxied, so that should alleviate some of
> these pain points somewhat.  I'm not sure it is clever enough to allow
> through writefulls for an ec base tier though (but it would be a good
> idea!) -

Excellent news, I shall look forward to testing it in the future. I did
mention proxying writefulls to someone who was working on the proxy write
code, but I'm not sure if it was ever followed up.

> Sam
> 
> On Tue, Aug 18, 2015 at 12:48 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> >
> >
> >
> >
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> >> Of Mark Nelson
> >> Sent: 18 August 2015 18:51
> >> To: Nick Fisk <nick@xxxxxxxxxx>; 'Jan Schermer' <jan@xxxxxxxxxxx>
> >> Cc: ceph-users@xxxxxxxxxxxxxx
> >> Subject: Re:  any recommendation of using EnhanceIO?
> >>
> >>
> >>
> >> On 08/18/2015 11:52 AM, Nick Fisk wrote:
> >> > <snip>
> >> >>>>
> >> >>>> Here's kind of how I see the field right now:
> >> >>>>
> >> >>>> 1) Cache at the client level.  Likely fastest but obvious issues
> >> >>>> like above.
> >> >>>> RAID1 might be an option at increased cost.  Lack of barriers in
> >> >>>> some implementations scary.
> >> >>>
> >> >>> Agreed.
> >> >>>
> >> >>>>
> >> >>>> 2) Cache below the OSD.  Not much recent data on this.  Not
> >> >>>> likely as fast as client side cache, but likely cheaper (fewer
> >> >>>> OSD nodes than client nodes?).
> >> >>>> Lack of barriers in some implementations scary.
> >> >>>
> >> >>> This also has the benefit of caching the leveldb on the OSD, so
> >> >>> you get a big performance gain from there too for small sequential
> >> >>> writes. I looked at using Flashcache for this too but decided it
> >> >>> was adding too much complexity and risk.
> >> >>>
> >> >>> I thought I read somewhere that RocksDB allows you to move its
> >> >>> WAL to SSD. Is there anything in the pipeline for something like
> >> >>> moving the filestore to use RocksDB?
> >> >>
> >> >> I believe you can already do this, though I haven't tested it.
> >> >> You can certainly move the monitors to rocksdb (tested) and
> >> >> newstore uses rocksdb as well.
> >> >>
> >> >
> >> > Interesting, I might have a look into this.
> >> >
> >> >>>
> >> >>>>
> >> >>>> 3) Ceph Cache Tiering. Network overhead and write amplification
> >> >>>> on promotion makes this primarily useful when workloads fit
> >> >>>> mostly into the cache tier.  Overall safe design but care must
> >> >>>> be taken to not over-promote.
> >> >>>>
> >> >>>> 4) separate SSD pool.  Manual and not particularly flexible, but
> >> >>>> perhaps best for applications that need consistently high
> >> >>>> performance.
> >> >>>
> >> >>> I think it depends on the definition of performance. Currently
> >> >>> even very fast CPUs and SSDs in their own pool will still struggle
> >> >>> to get less than 1ms of write latency. If your performance
> >> >>> requirements are for large queue depths then you will probably be
> >> >>> alright. If you require something that mirrors the performance of
> >> >>> a traditional write back cache, then even pure SSD pools can start
> >> >>> to struggle.
> >> >>
> >> >> Agreed.  This is definitely the crux of the problem.  The example
> >> >> below is a great start!  It would be fantastic if we could get
> >> >> more feedback from the list on the relative importance of low
> >> >> latency operations vs high IOPS through concurrency.  We have
> >> >> general suspicions but not a ton of actual data regarding what
> >> >> folks are seeing in practice and under what scenarios.
> >> >>
> >> >
> >> > If you have any specific questions that you think I might be able
> >> > to answer, please let me know. The only other main app that I can
> >> > really think of where this sort of write latency is critical is
> >> > SQL, particularly the transaction logs.
> >> Probably the big question is what are the pain points?  The most
> >> common answer we get when asking folks what applications they run on
> >> top of Ceph is "everything!".  This is wonderful, but not helpful
> >> when trying to figure out what performance issues matter most! :)
> >
> > Sort of like someone telling you their pc is broken and when asked for
> > details getting "It's not working" in return.
> >
> > In general I think a lot of it comes down to people not appreciating
> > the differences between Ceph and, say, a RAID array. For most things,
> > like larger block IO, performance tends to scale with cluster size, and
> > the cost effectiveness of Ceph makes it a no-brainer to just add a
> > handful of extra OSDs.
> >
> > I will try and be more precise. Here is my list of pain points /
> > wishes that I have come across in the last 12 months of running Ceph.
> >
> > 1. Improve small IO write latency
> > As discussed in depth in this thread. If it's possible just to make
> > Ceph a lot faster then great, but I fear even a doubling in
> > performance will still fall short compared to caching writes at the
> > client. Most things in Ceph tend to improve with scale, but write
> > latency is the same with 2 OSDs as it is with 2000. I would urge some
> > sort of investigation into the possibility of some sort of persistent
> > librbd caching. This will probably help across a large number of
> > scenarios as, in the end, most things are affected by latency, and I
> > think it will provide across-the-board improvements.
> >
> > 2. Cache Tiering
> > I know a lot of work is going into this currently, but I will cover my
> > experience.
> > 2A) Deletion of large RBDs takes forever. It seems to have to promote
> > all objects, even non-existent ones, to the cache tier before it can
> > delete them.
> > Operationally this is really poor as it has a negative effect on the
> > cache tier contents as well.
> > 2B) Erasure Coding requires all writes to be promoted first. I think it
> > should be pretty easy to allow proxy writes for erasure coded pools if
> > the IO size = object size. A lot of backup applications can be
> > configured to write out in static sized blocks and would be ideal
> > candidates for this sort of enhancement.
> > 2C) General performance. Hopefully this will be fixed by upcoming
> > changes.
> > 2D) Don't count consecutive sequential reads to the same object as a
> > trigger for promotion. I currently have problems where reading
> > sequentially through a large RBD, causes it to be completely promoted
> > because the read IO size is smaller than the underlying object size.
> >
> > 3. Kernel RBD Client
> > Either implement striping or see if it's possible to configure
> > readahead and max_sectors_kb size to be larger than the object size. I
> > started a thread about this a few days ago if you are interested in
> > more details.
> >
> > 4. Disk based OSD with SSD Journal performance
> > As I touched on earlier, I would expect a disk based OSD with SSD
> > journal to have similar performance to a pure SSD OSD when dealing with
> > sequential small IOs. Currently the levelDB sync and potentially other
> > things slow this down.
> >
> > 5. iSCSI
> > I know Mike Christie is doing a lot of good work in getting LIO to
> > work with Ceph, but currently it feels like a bit of an amateur affair
> > getting it going.
> >
> > 6. Slow xattr problem
> > I've had a weird problem a couple of times, where RBDs with data that
> > hasn't been written to for a while seem to start performing reads very
> > slowly. With the help of Somnath in a thread here we managed to track
> > it down to an xattr taking very long to be retrieved, but no idea why.
> > Overwriting the RBD with fresh data seemed to stop it happening.
> > Hopefully Newstore might stop this happening in the future.
> >
> >>
> >> IE, should we be focusing on IOPS?  Latency?  Finding a way to avoid
> >> journal overhead for large writes?  Are there specific use cases
> >> where we should specifically be focusing attention? general iscsi?
> >> S3? databases directly on RBD? etc.  There's tons of different areas
> >> that we can work on (general OSD threading improvements, different
> >> messenger implementations, newstore, client side bottlenecks, etc)
> >> but all of those things tackle different kinds of problems.
> >>
> >> Mark
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users@xxxxxxxxxxxxxx
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
