> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Wang, Zhiqiang
> Sent: 01 September 2015 09:48
> To: Nick Fisk <nick@xxxxxxxxxx>; 'Samuel Just' <sjust@xxxxxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: any recommendation of using EnhanceIO?

On 18 August 2015 21:38, Samuel Just wrote:

1. We've kicked this around a bit. What kind of failure semantics would you be comfortable with here (that is, what would be reasonable behavior if the client-side cache fails)?

On 19 August 2015, Nick Fisk wrote:

Hi Sam,

I would either expect to provide the cache with a redundant block device (i.e. RAID1 SSDs), or for the cache to allow itself to be configured to mirror across two SSDs. Of course, single SSDs can be used if the user accepts the risk. If the cache did the mirroring then you could do fancy stuff like mirroring the writes but leaving the read cache blocks as single copies, to increase the cache capacity.

In either case, although an outage is undesirable, it is only data loss that would be unacceptable, and the mirroring would hopefully avoid that. As part of this, there would need to be a way to make sure a "dirty" RBD can't be accessed unless the corresponding cache is also attached.
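Purely as an illustration of those failure semantics, here is a minimal sketch of the kind of attach-time check that would enforce "a dirty RBD is inaccessible without its cache". All of the names (CacheState, can_map_image, the writeback_cache metadata key) are hypothetical; nothing like this exists in librbd today.

    # Hypothetical sketch only: an RBD with dirty blocks in a client-side
    # write-back cache must not be mapped unless that same cache (or its
    # mirror) is attached along with it.

    class CacheState:
        def __init__(self, cache_id, dirty):
            self.cache_id = cache_id  # identity of the write-back cache device
            self.dirty = dirty        # True if unflushed writes exist

    def can_map_image(image_meta, attached_cache_id=None):
        """Return True only if mapping the image cannot expose stale data."""
        state = image_meta.get("writeback_cache")  # hypothetical metadata key
        if state is None or not state.dirty:
            return True                 # no dirty cache anywhere, safe to map
        # Dirty blocks exist somewhere: only allow access via the owning cache.
        return attached_cache_id == state.cache_id

The important property is that losing the cache device becomes an availability problem (the image refuses to map) rather than silent data corruption.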
I guess, as it is caching the RBD and not the pool or the entire cluster, the cache only needs to match the failure requirements of the application it's caching. If I need to cache an RBD that is on a single server, there is no requirement to make the cache redundant across racks/PDUs/servers, etc.

I hope I've answered your question?

Samuel Just wrote:

2. We've got a branch which should merge soon (tomorrow probably) which actually does allow writes to be proxied, so that should alleviate some of these pain points somewhat. I'm not sure it is clever enough to allow through writefulls for an EC base tier though (but it would be a good idea!).

Nick Fisk wrote:

Excellent news, I shall look forward to testing it in the future. I did mention the proxy write for writefulls to someone who was working on the proxy write code, but I'm not sure if it ever got followed up.

On 01 September 2015 02:48, Wang, Zhiqiang wrote:

I think that someone is me. In the current code, for an EC base tier, writefull can be proxied to the base.

Nick Fisk wrote:

Excellent news. Is it intelligent enough to determine when, say, a normal write IO from an RBD is equal to the underlying object size, and then effectively turn that normal write into a writefull?

On 01 September 2015 09:18, Wang, Zhiqiang wrote:

I checked the code; it seems we don't do this right now... Would this be very helpful? I think we can do this if the answer is yes.

Nick Fisk wrote:

Hopefully yes. Erasure coding is very well suited to storing backups capacity-wise, and a lot of backup software can be configured to write in static-sized blocks, which could be set to the object size. With the current tiering code you end up with a lot of IO amplification and poor performance; if the above feature were possible, it should perform a lot better.

Does that make sense?

On 01 September 2015 09:48, Wang, Zhiqiang wrote:

Yep, it makes sense in this case. Actually, the backup software doesn't need to write in units of the object size. As long as a write spans a full object, that object can be written with a writefull. I'll see if I can come up with an implementation of this.

Nick Fisk wrote:

Awesome, thanks for your interest in this.

If you are also caching the RBD through some sort of block cache like the ones mentioned in this thread, then small sequential writes could also be assembled in the cache and then flushed straight through to the erasure tier as proxied full writes. This is probably less appealing than the backup case, but it gives the same advantages as RAID5/6 equipped with a battery-backed cache, which also sees massive performance gains when it can write a full stripe.
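To make the "spans a full object" condition concrete, here is a rough sketch (plain Python, not Ceph code; it assumes the default 4 MB RBD object size, and the function names are made up for illustration) of the bookkeeping that would classify which objects touched by a client write are fully covered, and could therefore be proxied as writefulls, versus only partially overwritten:

    # Illustrative sketch only: split a client write of (offset, length)
    # against an image striped into fixed-size objects into per-object
    # extents, and mark which extents cover a whole object (writefull
    # candidates) versus a partial overwrite (still needs promotion).

    OBJECT_SIZE = 4 * 1024 * 1024  # assumed default RBD object size

    def classify_write(offset, length, object_size=OBJECT_SIZE):
        """Return per-object extents, flagging full-object covers."""
        extents = []
        end = offset + length
        pos = offset
        while pos < end:
            obj_no = pos // object_size
            obj_start = obj_no * object_size
            obj_end = obj_start + object_size
            seg_end = min(end, obj_end)
            full = (pos == obj_start and seg_end == obj_end)
            extents.append({
                "object": obj_no,
                "offset_in_object": pos - obj_start,
                "length": seg_end - pos,
                "full_object": full,   # candidate for a proxied writefull
            })
            pos = seg_end
        return extents

    if __name__ == "__main__":
        # A backup stream writing 12 MB starting 1 MB into the image.
        for e in classify_write(1 * 1024 * 1024, 12 * 1024 * 1024):
            print(e)

Run against that 12 MB write starting 1 MB into the image, it reports the first and last objects as partial overwrites and the two in the middle as full covers.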
Earlier in the thread, Mark Nelson had written:

Here's kind of how I see the field right now:

1) Cache at the client level. Likely fastest, but obvious issues like the above. RAID1 might be an option at increased cost. Lack of barriers in some implementations is scary.

Nick Fisk wrote:

Agreed.

Mark Nelson wrote:

2) Cache below the OSD. Not much recent data on this. Not likely as fast as a client-side cache, but likely cheaper (fewer OSD nodes than client nodes?). Lack of barriers in some implementations is scary.

Nick Fisk wrote:

This also has the benefit of caching the leveldb on the OSD, so you get a big performance gain from there too for small sequential writes. I looked at using Flashcache for this too, but decided it was adding too much complexity and risk.

I thought I read somewhere that RocksDB allows you to move its WAL to SSD; is there anything in the pipeline for something like moving the filestore to use RocksDB?

Mark Nelson wrote:

I believe you can already do this, though I haven't tested it. You can certainly move the monitors to RocksDB (tested), and newstore uses RocksDB as well.

On 08/18/2015 11:52 AM, Nick Fisk wrote:

Interesting, I might have a look into this.

Mark Nelson wrote:

3) Ceph cache tiering. Network overhead and write amplification on promotion make this primarily useful when workloads fit mostly into the cache tier. Overall a safe design, but care must be taken not to over-promote.

4) Separate SSD pool. Manual and not particularly flexible, but perhaps best for applications that need consistently high performance.

Nick Fisk wrote:

I think it depends on the definition of performance. Currently, even very fast CPUs and SSDs in their own pool will still struggle to get under 1 ms of write latency. If your performance requirements are for large queue depths then you will probably be alright. If you require something that mirrors the performance of a traditional write-back cache, then even pure SSD pools can start to struggle.

Mark Nelson wrote:

Agreed. This is definitely the crux of the problem. The example below is a great start! It would be fantastic if we could get more feedback from the list on the relative importance of low-latency operations vs high IOPS through concurrency. We have general suspicions but not a ton of actual data regarding what folks are seeing in practice and under what scenarios.
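One way to frame that latency-versus-concurrency question: the IOPS a single client can achieve is roughly its queue depth divided by per-operation latency, so with the ~1 ms write latency mentioned above, a queue-depth-1 workload tops out around 1,000 IOPS no matter how many OSDs sit behind it, while highly concurrent workloads can still scale. A quick illustrative calculation (the figures are examples, not measurements):

    # Illustrative only: Little's-law style estimate of client IOPS for a
    # fixed per-write latency and varying amounts of concurrency.
    write_latency_s = 0.001  # ~1 ms per replicated write, as discussed above

    for queue_depth in (1, 4, 32, 128):
        iops = queue_depth / write_latency_s
        print(f"QD={queue_depth:>3}: ~{iops:,.0f} IOPS achievable")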
Nick Fisk wrote:

If you have any specific questions that you think I might be able to answer, please let me know. The only other main app I can really think of where this sort of write latency is critical is SQL, particularly the transaction logs.

On 18 August 2015 18:51, Mark Nelson wrote:

Probably the big question is: what are the pain points? The most common answer we get when asking folks what applications they run on top of Ceph is "everything!". This is wonderful, but not helpful when trying to figure out what performance issues matter most! :)

On Tue, Aug 18, 2015 at 12:48 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:

Sort of like someone telling you their PC is broken and, when asked for details, getting "It's not working" in return.

In general I think a lot of it comes down to people not appreciating the differences between Ceph and, say, a RAID array. For most things like larger-block IO, performance tends to scale with cluster size, and the cost-effectiveness of Ceph makes it a no-brainer to just add a handful of extra OSDs.

I will try to be more precise. Here is my list of pain points / wishes that I have come across in the last 12 months of running Ceph.

1. Improve small IO write latency
As discussed in depth in this thread. If it's possible to just make Ceph a lot faster then great, but I fear even a doubling in performance will still fall short compared to caching writes at the client. Most things in Ceph tend to improve with scale, but write latency is the same with 2 OSDs as it is with 2000. I would urge some investigation into the possibility of persistent librbd caching. This will probably help across a large number of scenarios, as in the end most things are affected by latency, and I think it would provide across-the-board improvements.

2. Cache tiering
I know a lot of work is going into this currently, but I will cover my experience.
2A) Deletion of large RBDs takes forever. It seems to have to promote all objects, even non-existent ones, to the cache tier before it can delete them. Operationally this is really poor, as it has a negative effect on the cache tier contents as well.
2B) Erasure coding requires all writes to be promoted first. I think it should be pretty easy to allow proxy writes for erasure-coded pools if the IO size equals the object size. A lot of backup applications can be configured to write out in static-sized blocks and would be an ideal candidate for this sort of enhancement.
2C) General performance; hopefully this will be fixed by upcoming changes.
2D) Don't count consecutive sequential reads of the same object as a trigger for promotion. I currently have problems where reading sequentially through a large RBD causes it to be completely promoted, because the read IO size is smaller than the underlying object size (see the sketch just after this list).

3. Kernel RBD client
Either implement striping, or see if it's possible to configure readahead + max_sectors_kb to be larger than the object size. I started a thread about this a few days ago if you are interested in more details.

4. Disk-based OSD with SSD journal performance
As I touched on earlier, I would expect a disk-based OSD with an SSD journal to have similar performance to a pure SSD OSD when dealing with small sequential IOs. Currently the leveldb sync, and potentially other things, slow this down.

5. iSCSI
I know Mike Christie is doing a lot of good work in getting LIO to work with Ceph, but currently it feels like a bit of an amateur affair getting it going.

6. Slow xattr problem
I've hit a weird problem a couple of times, where RBDs with data that hasn't been written to for a while seem to start performing reads very slowly. With the help of Somnath in a thread here we managed to track it down to an xattr taking a very long time to be retrieved, but no idea why. Overwriting the RBD with fresh data seemed to stop it happening. Hopefully newstore might stop this happening in the future.
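For 2D, a back-of-the-envelope illustration of why a small-block sequential scan defeats per-object hit counting (the 4 MB object size is the RBD default; the read size and image size are only example figures):

    # Illustrative numbers only: a small-block sequential scan makes the
    # whole image look "hot" to per-object hit counting in the cache tier.
    object_size = 4 * 1024 * 1024   # default RBD object size
    read_size = 64 * 1024           # a typical small sequential read
    image_size = 500 * 1024**3      # a 500 GiB RBD scanned end to end

    reads_per_object = object_size // read_size   # 64 reads land on each object
    objects_touched = image_size // object_size   # 128,000 objects in the image
    print(f"{reads_per_object} reads hit every object,")
    print(f"so all {objects_touched} objects look hot and get promoted")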
Mark Nelson continued:

I.e., should we be focusing on IOPS? Latency? Finding a way to avoid journal overhead for large writes? Are there specific use cases where we should specifically be focusing attention? General iSCSI? S3? Databases directly on RBD? etc. There are tons of different areas we can work on (general OSD threading improvements, different messenger implementations, newstore, client-side bottlenecks, etc.), but all of those things tackle different kinds of problems.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com