> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Nick Fisk
> Sent: Wednesday, August 19, 2015 5:25 AM
> To: 'Samuel Just'
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: any recommendation of using EnhanceIO?
>
> Hi Sam,
>
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Samuel Just
> > Sent: 18 August 2015 21:38
> > To: Nick Fisk <nick@xxxxxxxxxx>
> > Cc: ceph-users@xxxxxxxxxxxxxx
> > Subject: Re: any recommendation of using EnhanceIO?
> >
> > 1. We've kicked this around a bit. What kind of failure semantics would you be comfortable with here (that is, what would be reasonable behavior if the client side cache fails)?
>
> I would either expect to provide the cache with a redundant block device (i.e. RAID1 SSDs) or for the cache to allow itself to be configured to mirror across two SSDs. Of course, single SSDs can be used if the user accepts the risk. If the cache did the mirroring, then you could do fancy stuff like mirror the writes but leave the read cache blocks as single copies to increase the cache capacity.
>
> In either case, although an outage is undesirable, it's only data loss that would be unacceptable, and that would hopefully be avoided by the mirroring. As part of this, there would need to be a way to make sure a "dirty" RBD can't be accessed unless the corresponding cache is also attached.
>
> I guess that since it is caching the RBD and not the pool or entire cluster, the cache only needs to match the failure requirements of the application it's caching. If I need to cache an RBD that sits on a single server, there is no requirement to make the cache redundant across racks/PDUs/servers, etc.
>
> I hope I've answered your question?
>
> > 2. We've got a branch which should merge soon (tomorrow probably) which actually does allow writes to be proxied, so that should alleviate some of these pain points somewhat. I'm not sure it is clever enough to allow through writefulls for an ec base tier though (but it would be a good idea!) -
>
> Excellent news, I shall look forward to testing it in the future. I did mention the proxy write for writefulls to someone who was working on the proxy write code, but I'm not sure if it ever got followed up.

I think that someone is me. In the current code, for an EC base tier, writefull can be proxied to the base.
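
As a concrete illustration of that case: a client that writes in whole-object-sized chunks ends up issuing RADOS writefull operations, which are the ones that can be proxied straight through to the EC base tier rather than forcing a promotion. A minimal sketch using the Python rados bindings follows; the pool name, object naming scheme and 4 MiB chunk size are assumptions for illustration only.

    import rados

    OBJECT_SIZE = 4 * 1024 * 1024  # assumed chunk size, matching the common 4 MiB RADOS/RBD object size

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("backup-ec")  # hypothetical EC-backed pool sitting behind a cache tier
    try:
        with open("backup.img", "rb") as src:
            idx = 0
            while True:
                chunk = src.read(OBJECT_SIZE)
                if not chunk:
                    break
                # write_full() replaces the whole object in a single operation,
                # i.e. the writefull op that can avoid promotion into the cache tier.
                ioctx.write_full("backup.img.%08d" % idx, chunk)
                idx += 1
    finally:
        ioctx.close()
        cluster.shutdown()

Anything smaller than a full object becomes a partial overwrite and, per the discussion below, still has to be promoted first.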
> > Sam
> >
> > On Tue, Aug 18, 2015 at 12:48 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > >
> > >> -----Original Message-----
> > >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Mark Nelson
> > >> Sent: 18 August 2015 18:51
> > >> To: Nick Fisk <nick@xxxxxxxxxx>; 'Jan Schermer' <jan@xxxxxxxxxxx>
> > >> Cc: ceph-users@xxxxxxxxxxxxxx
> > >> Subject: Re: any recommendation of using EnhanceIO?
> > >>
> > >> On 08/18/2015 11:52 AM, Nick Fisk wrote:
> > >> > <snip>
> > >> >>>>
> > >> >>>> Here's kind of how I see the field right now:
> > >> >>>>
> > >> >>>> 1) Cache at the client level. Likely fastest but obvious issues like above. RAID1 might be an option at increased cost. Lack of barriers in some implementations scary.
> > >> >>>
> > >> >>> Agreed.
> > >> >>>
> > >> >>>> 2) Cache below the OSD. Not much recent data on this. Not likely as fast as client side cache, but likely cheaper (fewer OSD nodes than client nodes?). Lack of barriers in some implementations scary.
> > >> >>>
> > >> >>> This also has the benefit of caching the leveldb on the OSD, so you get a big performance gain from there too for small sequential writes. I looked at using Flashcache for this too but decided it was adding too much complexity and risk.
> > >> >>>
> > >> >>> I thought I read somewhere that RocksDB allows you to move its WAL to SSD; is there anything in the pipeline for something like moving the filestore to use RocksDB?
> > >> >>
> > >> >> I believe you can already do this, though I haven't tested it. You can certainly move the monitors to rocksdb (tested) and newstore uses rocksdb as well.
> > >> >
> > >> > Interesting, I might have a look into this.
> > >> >
> > >> >>>> 3) Ceph Cache Tiering. Network overhead and write amplification on promotion makes this primarily useful when workloads fit mostly into the cache tier. Overall safe design but care must be taken to not over-promote.
> > >> >>>>
> > >> >>>> 4) Separate SSD pool. Manual and not particularly flexible, but perhaps best for applications that need consistently high performance.
> > >> >>>
> > >> >>> I think it depends on the definition of performance. Currently even very fast CPUs and SSDs in their own pool will still struggle to get less than 1ms of write latency. If your performance requirements are for large queue depths then you will probably be alright. If you require something that mirrors the performance of a traditional write-back cache, then even pure SSD pools can start to struggle.
> > >> >>
> > >> >> Agreed. This is definitely the crux of the problem. The example below is a great start! It would be fantastic if we could get more feedback from the list on the relative importance of low latency operations vs high IOPS through concurrency. We have general suspicions but not a ton of actual data regarding what folks are seeing in practice and under what scenarios.
> > >> >
> > >> > If you have any specific questions that you think I might be able to answer, please let me know. The only other main app I can really think of where this sort of write latency is critical is SQL, particularly the transaction logs.
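
On the sub-millisecond point a few paragraphs up: single-threaded, queue-depth-1 write latency is easy to measure directly against a pool with the Python rados bindings. A rough sketch; the "ssd-pool" name, payload size and sample count are illustrative, and the numbers obviously depend entirely on the cluster being probed.

    import time
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("ssd-pool")    # hypothetical all-SSD pool
    payload = b"\0" * 4096                    # one small (4 KiB) write per iteration

    samples = []
    for _ in range(100):
        start = time.time()
        ioctx.write("latency-probe", payload)  # synchronous: returns once the write is acknowledged
        samples.append((time.time() - start) * 1000.0)

    samples.sort()
    print("median %.2f ms, worst %.2f ms" % (samples[len(samples) // 2], samples[-1]))

    ioctx.remove_object("latency-probe")
    ioctx.close()
    cluster.shutdown()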
> > >>
> > >> Probably the big question is what are the pain points? The most common answer we get when asking folks what applications they run on top of Ceph is "everything!". This is wonderful, but not helpful when trying to figure out what performance issues matter most! :)
> > >
> > > Sort of like someone telling you their PC is broken and, when asked for details, getting "It's not working" in return.
> > >
> > > In general I think a lot of it comes down to people not appreciating the differences between Ceph and, say, a RAID array. For most things like larger block IO, performance tends to scale with cluster size, and the cost effectiveness of Ceph makes it a no-brainer to just add a handful of extra OSDs.
> > >
> > > I will try and be more precise. Here is my list of pain points / wishes that I have come across in the last 12 months of running Ceph.
> > >
> > > 1. Improve small IO write latency
> > > As discussed in depth in this thread. If it's possible just to make Ceph a lot faster then great, but I fear even a doubling in performance will still fall short compared to caching writes at the client. Most things in Ceph tend to improve with scale, but write latency is the same with 2 OSDs as it is with 2000. I would urge some sort of investigation into the possibility of persistent librbd caching. This will probably help across a large number of scenarios, as in the end most things are affected by latency, and I think it will provide across-the-board improvements.
> > >
> > > 2. Cache Tiering
> > > I know a lot of work is going into this currently, but I will cover my experience.
> > > 2A) Deletion of large RBDs takes forever. It seems to have to promote all objects, even non-existent ones, to the cache tier before it can delete them. Operationally this is really poor, as it has a negative effect on the cache tier contents as well.
> > > 2B) Erasure coding requires all writes to be promoted first. I think it should be pretty easy to allow proxy writes for erasure coded pools if the IO size = object size. A lot of backup applications can be configured to write out in static sized blocks and would be an ideal candidate for this sort of enhancement.
> > > 2C) General performance; hopefully this will be fixed by upcoming changes.
> > > 2D) Don't count consecutive sequential reads to the same object as a trigger for promotion. I currently have problems where reading sequentially through a large RBD causes it to be completely promoted, because the read IO size is smaller than the underlying object size. (See the configuration sketch below this list.)
> > >
> > > 3. Kernel RBD client
> > > Either implement striping or see if it's possible to configure readahead + max_sectors_kb to be larger than the object size. I started a thread about this a few days ago if you are interested in more details. (See the sysfs sketch below this list.)
> > >
> > > 4. Disk based OSD with SSD journal performance
> > > As I touched on earlier, I would expect a disk based OSD with an SSD journal to have similar performance to a pure SSD OSD when dealing with sequential small IOs. Currently the levelDB sync and potentially other things slow this down.
> > >
> > > 5. iSCSI
> > > I know Mike Christie is doing a lot of good work in getting LIO to work with Ceph, but currently it feels like a bit of an amateur affair getting it going.
> > >
> > > 6. Slow xattr problem
> > > I've had a weird problem a couple of times where RBDs with data that hasn't been written to for a while seem to start performing reads very slowly. With the help of Somnath in a thread here we managed to track it down to an xattr taking a very long time to be retrieved, but no idea why. Overwriting the RBD with fresh data seemed to stop it happening. Hopefully Newstore might stop this happening in the future.
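
For reference against point 2: attaching and tuning a writeback cache tier is a handful of ceph CLI calls, roughly as sketched below (driven from Python via subprocess). The pool names, sizes and thresholds are purely illustrative, and min_read_recency_for_promote, which helps with the over-promotion described in 2D, is only present on newer releases.

    import subprocess

    def ceph(*args):
        # Thin wrapper around the ceph CLI; assumes admin credentials are available.
        subprocess.check_call(("ceph",) + args)

    BASE, CACHE = "rbd", "rbd-cache"  # illustrative pool names

    # Attach CACHE as a writeback tier in front of BASE and route client IO through it.
    ceph("osd", "tier", "add", BASE, CACHE)
    ceph("osd", "tier", "cache-mode", CACHE, "writeback")
    ceph("osd", "tier", "set-overlay", BASE, CACHE)

    # Hit-set tracking is required so the tier can make promotion and flush/evict decisions.
    ceph("osd", "pool", "set", CACHE, "hit_set_type", "bloom")
    ceph("osd", "pool", "set", CACHE, "hit_set_count", "4")
    ceph("osd", "pool", "set", CACHE, "hit_set_period", "1200")
    ceph("osd", "pool", "set", CACHE, "target_max_bytes", str(200 * 1024 ** 3))

    # Where supported, requiring an object to appear in more than one recent hit set
    # before promotion reduces promotion of objects that are only read once.
    ceph("osd", "pool", "set", CACHE, "min_read_recency_for_promote", "2")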
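
And for point 3: the readahead and maximum request size of a mapped krbd device are plain sysfs tunables, so experimenting looks roughly like the sketch below. The rbd0 device name and the values are assumptions, writing to sysfs needs root, and the kernel may still clamp max_sectors_kb to the device's max_hw_sectors_kb.

    def set_queue_param(device, param, value):
        # Block-layer queue tunables for a mapped RBD device live under /sys/block/<dev>/queue/.
        with open("/sys/block/%s/queue/%s" % (device, param), "w") as f:
            f.write(str(value))

    # Raise readahead above the object size and push the request size as high as the device allows
    # (values are in KiB).
    set_queue_param("rbd0", "read_ahead_kb", 8192)
    set_queue_param("rbd0", "max_sectors_kb", 4096)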
> > >>
> > >> I.e., should we be focusing on IOPS? Latency? Finding a way to avoid journal overhead for large writes? Are there specific use cases where we should specifically be focusing attention? General iSCSI? S3? Databases directly on RBD? etc. There's tons of different areas that we can work on (general OSD threading improvements, different messenger implementations, newstore, client side bottlenecks, etc.) but all of those things tackle different kinds of problems.
> > >>
> > >> Mark

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com