Re: any recommendation of using EnhanceIO?


 






> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Wang, Zhiqiang
> Sent: 01 September 2015 09:18
> To: Nick Fisk <nick@xxxxxxxxxx>; 'Samuel Just' <sjust@xxxxxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  any recommendation of using EnhanceIO?
> 
> > -----Original Message-----
> > From: Nick Fisk [mailto:nick@xxxxxxxxxx]
> > Sent: Tuesday, September 1, 2015 3:55 PM
> > To: Wang, Zhiqiang; 'Nick Fisk'; 'Samuel Just'
> > Cc: ceph-users@xxxxxxxxxxxxxx
> > Subject: RE:  any recommendation of using EnhanceIO?
> >
> >
> >
> >
> >
> > > -----Original Message-----
> > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> > > Behalf Of Wang, Zhiqiang
> > > Sent: 01 September 2015 02:48
> > > To: Nick Fisk <nick@xxxxxxxxxx>; 'Samuel Just' <sjust@xxxxxxxxxx>
> > > Cc: ceph-users@xxxxxxxxxxxxxx
> > > Subject: Re:  any recommendation of using EnhanceIO?
> > >
> > > > -----Original Message-----
> > > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> > > > Behalf Of Nick Fisk
> > > > Sent: Wednesday, August 19, 2015 5:25 AM
> > > > To: 'Samuel Just'
> > > > Cc: ceph-users@xxxxxxxxxxxxxx
> > > > Subject: Re:  any recommendation of using EnhanceIO?
> > > >
> > > > Hi Sam,
> > > >
> > > > > -----Original Message-----
> > > > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> > > > > Behalf Of Samuel Just
> > > > > Sent: 18 August 2015 21:38
> > > > > To: Nick Fisk <nick@xxxxxxxxxx>
> > > > > Cc: ceph-users@xxxxxxxxxxxxxx
> > > > > Subject: Re:  any recommendation of using EnhanceIO?
> > > > >
> > > > > 1.  We've kicked this around a bit.  What kind of failure
> > > > > semantics would you be comfortable with here (that is, what
> > > > > would be reasonable behavior if the client side cache fails)?
> > > >
> > > > I would either expect to provide the cache with a redundant block
> > > > device (i.e. RAID1 SSDs) or for the cache to allow itself to be
> > > > configured to mirror across two SSDs. Of course single SSDs can be
> > > > used if the user accepts the risk. If the cache did the mirroring
> > > > then you could do fancy stuff like mirroring the writes but leaving
> > > > the read cache blocks as single copies to increase the cache
> > > > capacity.
> > > >
> > > > In either case, although an outage is undesirable, it is only data
> > > > loss which would be unacceptable, and that would hopefully be
> > > > avoided by the mirroring. As part of this, there would need to be a
> > > > way to make sure a "dirty" RBD can't be accessed unless the
> > > > corresponding cache is also attached.
> > > >
> > > > I guess, as it is caching the RBD and not the pool or entire
> > > > cluster, the cache only needs to match the failure requirements of
> > > > the application it's caching. If I need to cache an RBD that is on
> > > > a single server, there is no requirement to make the cache
> > > > redundant across racks/PDUs/servers, etc.
> > > >
> > > > I hope I've answered your question?
> > > >
> > > >
> > > > > 2. We've got a branch which should merge soon (tomorrow
> > > > > probably) which actually does allow writes to be proxied, so
> > > > > that should alleviate some of these pain points somewhat.  I'm
> > > > > not sure it is clever enough to allow through writefulls for an
> > > > > ec base tier though (but it would be a good idea!) -
> > > >
> > > > Excellent news, I shall look forward to testing it in the future. I
> > > > did mention the proxy write for writefulls to someone who was
> > > > working on the proxy write code, but I'm not sure if it ever got
> > > > followed up.
> > >
> > > I think that someone is me. In the current code, for an ec base tier,
> > > writefull can be proxied to the base.
> >
> > Excellent news. Is this intelligent enough to determine when, say, a
> > normal write IO from an RBD is equal to the underlying object size,
> > and then effectively turn that normal write into a writefull?
> 
> I checked the code; it seems we don't do this right now... Would this be
> very helpful? I think we can do this if the answer is yes.

Hopefully, yes. Erasure coding is very well suited to storing backups
capacity-wise, and a lot of backup software can be configured to write in
static-size blocks, which could be set to the object size. With the current
tiering code you end up with a lot of IO amplification and poor performance;
if the above feature were possible, it should perform a lot better.
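
As a rough sketch of what I mean (using the python-rbd bindings; the pool
and image names below are made up, and the sizes are just examples), a
backup target image could be created with an object size that matches the
backup application's block size, so every aligned backup write covers
exactly one object:

import rados
import rbd

# Connect with the default config; 'backup-pool' and 'backup-target' are
# purely illustrative names, and the 500 GB image size is arbitrary.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('backup-pool')
    try:
        # order=22 -> 2**22 bytes = 4 MB objects, matching a backup
        # application configured to write 4 MB blocks, so each write
        # spans exactly one object and could be proxied as a writefull.
        rbd.RBD().create(ioctx, 'backup-target', 500 * 2**30, order=22)
    finally:
        ioctx.close()
finally:
    cluster.shutdown()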

Does that make sense?

If you are also caching the RBD through some sort of block cache, as
mentioned in this thread, then small sequential writes could also be
assembled in the cache and then flushed straight through to the erasure
tier as proxied full writes. This is probably less appealing than the
backup case, but it gives the same advantage as RAID5/6 equipped with a
battery-backed cache, which also sees massive performance gains when it is
able to write a full stripe.
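
To illustrate that second idea, here is a minimal, self-contained sketch
(plain Python, all names invented; this is not how librbd's cache works
today) of a client-side buffer that coalesces small sequential writes and
only hands complete object-sized chunks on to be flushed as full writes:

class ObjectCoalescer:
    """Toy write-back buffer: accumulate contiguous writes and emit only
    full-object-sized chunks, so the base tier only ever sees writes that
    cover a whole object. Assumes writes start object-aligned."""

    OBJECT_SIZE = 4 * 1024 * 1024  # 4 MB, the default RBD object size

    def __init__(self, flush_full_object):
        self.flush_full_object = flush_full_object  # callback(offset, data)
        self.buf = bytearray()
        self.buf_offset = None  # logical offset of buf[0]

    def write(self, offset, data):
        # A non-contiguous write starts a new run; a real cache would first
        # flush the partial run as an ordinary (non-full) write.
        if self.buf_offset is None or offset != self.buf_offset + len(self.buf):
            self.buf = bytearray()
            self.buf_offset = offset
        self.buf += data
        # Emit every complete object's worth of data as a single full write.
        while len(self.buf) >= self.OBJECT_SIZE:
            chunk = bytes(self.buf[:self.OBJECT_SIZE])
            self.flush_full_object(self.buf_offset, chunk)
            self.buf_offset += self.OBJECT_SIZE
            del self.buf[:self.OBJECT_SIZE]

# 1024 sequential 4 KB writes end up as one 4 MB full-object flush:
coalescer = ObjectCoalescer(lambda off, data: print("full write at", off, len(data)))
for i in range(1024):
    coalescer.write(i * 4096, b"x" * 4096)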

> 
> >
> > >
> > > >
> > > > > Sam
> > > > >
> > > > > > On Tue, Aug 18, 2015 at 12:48 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >> -----Original Message-----
> > > > > >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx]
> > > > > >> On Behalf Of Mark Nelson
> > > > > >> Sent: 18 August 2015 18:51
> > > > > >> To: Nick Fisk <nick@xxxxxxxxxx>; 'Jan Schermer'
> > > > > >> <jan@xxxxxxxxxxx>
> > > > > >> Cc: ceph-users@xxxxxxxxxxxxxx
> > > > > >> Subject: Re:  any recommendation of using EnhanceIO?
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> On 08/18/2015 11:52 AM, Nick Fisk wrote:
> > > > > >> > <snip>
> > > > > >> >>>>
> > > > > >> >>>> Here's kind of how I see the field right now:
> > > > > >> >>>>
> > > > > >> >>>> 1) Cache at the client level.  Likely fastest but
> > > > > >> >>>> obvious issues like above.
> > > > > >> >>>> RAID1 might be an option at increased cost.  Lack of
> > > > > >> >>>> barriers in some implementations scary.
> > > > > >> >>>
> > > > > >> >>> Agreed.
> > > > > >> >>>
> > > > > >> >>>>
> > > > > >> >>>> 2) Cache below the OSD.  Not much recent data on this.
> > > > > >> >>>> Not likely as fast as client side cache, but likely
> > > > > >> >>>> cheaper (fewer OSD nodes than client nodes?).
> > > > > >> >>>> Lack of barriers in some implementations scary.
> > > > > >> >>>
> > > > > >> >>> This also has the benefit of caching the leveldb on the
> > > > > >> >>> OSD, so you get a big performance gain from there too for
> > > > > >> >>> small sequential writes. I looked at using Flashcache for
> > > > > >> >>> this too but decided it was adding too much complexity and
> > > > > >> >>> risk.
> > > > > >> >>>
> > > > > >> >>> I thought I read somewhere that RocksDB allows you to
> > > > > >> >>> move its WAL to SSD; is there anything in the pipeline for
> > > > > >> >>> something like moving the filestore to use RocksDB?
> > > > > >> >>
> > > > > >> >> I believe you can already do this, though I haven't tested it.
> > > > > >> >> You can certainly move the monitors to rocksdb (tested)
> > > > > >> >> and newstore uses
> > > > > >> rocksdb as well.
> > > > > >> >>
> > > > > >> >
> > > > > >> > Interesting, I might have a look into this.
> > > > > >> >
> > > > > >> >>>
> > > > > >> >>>>
> > > > > >> >>>> 3) Ceph Cache Tiering. Network overhead and write
> > > > > >> >>>> amplification on promotion make this primarily useful
> > > > > >> >>>> when workloads fit mostly into the cache tier.  Overall
> > > > > >> >>>> safe design but care must be taken to not over-promote.
> > > > > >> >>>>
> > > > > >> >>>> 4) Separate SSD pool.  Manual and not particularly
> > > > > >> >>>> flexible, but perhaps best for applications that need
> > > > > >> >>>> consistently high performance.
> > > > > >> >>>
> > > > > >> >>> I think it depends on the definition of performance.
> > > > > >> >>> Currently, even very fast CPUs and SSDs in their own pool
> > > > > >> >>> will still struggle to get less than 1ms of write latency.
> > > > > >> >>> If your performance requirements are for large queue
> > > > > >> >>> depths then you will probably be alright. If you require
> > > > > >> >>> something that mirrors the performance of a traditional
> > > > > >> >>> write-back cache, then even pure SSD pools can start to
> > > > > >> >>> struggle.
> > > > > >> >>
> > > > > >> >> Agreed.  This is definitely the crux of the problem.  The
> > > > > >> >> example below is a great start!  It would be fantastic
> > > > > >> >> if we could get more feedback from the list on the
> > > > > >> >> relative importance of low latency operations vs high IOPS
> > > > > >> >> through concurrency.  We have general suspicions but not a
> > > > > >> >> ton of actual data regarding what folks are seeing in
> > > > > >> >> practice and under what scenarios.
> > > > > >> >>
> > > > > >> >
> > > > > >> > If you have any specific questions that you think I might
> > > > > >> > be able to answer, please let me know. The only other main
> > > > > >> > app that I can really think of where this sort of write
> > > > > >> > latency is critical is SQL, particularly the transaction
> > > > > >> > logs.
> > > > > >>
> > > > > >> Probably the big question is: what are the pain points?  The
> > > > > >> most common answer we get when asking folks what applications
> > > > > >> they run on top of Ceph is "everything!".  This is wonderful,
> > > > > >> but not helpful when trying to figure out what performance
> > > > > >> issues matter most! :)
> > > > > >
> > > > > > Sort of like someone telling you their PC is broken and, when
> > > > > > asked for details, getting "It's not working" in return.
> > > > > >
> > > > > > In general I think a lot of it comes down to people not
> > > > > > appreciating the differences between Ceph and, say, a RAID
> > > > > > array. For most things, like larger block IO, performance tends
> > > > > > to scale with cluster size, and the cost effectiveness of Ceph
> > > > > > makes it a no-brainer to just add a handful of extra OSDs.
> > > > > >
> > > > > > I will try to be more precise. Here is my list of pain points /
> > > > > > wishes that I have come across in the last 12 months of running
> > > > > > Ceph.
> > > > > >
> > > > > > 1. Improve small IO write latency
> > > > > > As discussed in depth in this thread. If it's possible just to
> > > > > > make Ceph a lot faster then great, but I fear even a doubling
> > > > > > in performance will still fall short compared to caching writes
> > > > > > at the client. Most things in Ceph tend to improve with scale,
> > > > > > but write latency is the same with 2 OSDs as it is with 2000. I
> > > > > > would urge some sort of investigation into the possibility of
> > > > > > persistent librbd caching. This will probably help across a
> > > > > > large number of scenarios, as in the end most things are
> > > > > > affected by latency, and I think it will provide
> > > > > > across-the-board improvements.
> > > > > >
> > > > > > 2. Cache Tiering
> > > > > > I know a lot of work is going into this currently, but I will
> > > > > > cover my experience.
> > > > > > 2A) Deletion of large RBDs takes forever. It seems to have to
> > > > > > promote all objects, even non-existent ones, to the cache tier
> > > > > > before it can delete them. Operationally this is really poor,
> > > > > > as it has a negative effect on the cache tier contents as well.
> > > > > > 2B) Erasure coding requires all writes to be promoted first. I
> > > > > > think it should be pretty easy to allow proxy writes for
> > > > > > erasure coded pools if the IO size = object size. A lot of
> > > > > > backup applications can be configured to write out in static
> > > > > > sized blocks and would be an ideal candidate for this sort of
> > > > > > enhancement.
> > > > > > 2C) General performance; hopefully this will be fixed by
> > > > > > upcoming changes.
> > > > > > 2D) Don't count consecutive sequential reads to the same
> > > > > > object as a trigger for promotion. I currently have problems
> > > > > > where reading sequentially through a large RBD causes it to
> > > > > > be completely promoted because the read IO size is smaller
> > > > > > than the underlying object size.
> > > > > >
> > > > > > 3. Kernel RBD Client
> > > > > > Either implement striping or see if it's possible to configure
> > > > > > readahead + max_sectors_kb size to be larger than the object
> > > > > > size. I started a thread about this a few days ago if you are
> > > > > > interested in more details.
> > > > > >
> > > > > > 4. Disk based OSD with SSD journal performance
> > > > > > As I touched on earlier, I would expect a disk based OSD with
> > > > > > an SSD journal to have similar performance to a pure SSD OSD
> > > > > > when dealing with small sequential IOs. Currently the levelDB
> > > > > > sync and potentially other things slow this down.
> > > > > >
> > > > > > 5. iSCSI
> > > > > > I know Mike Christie is doing a lot of good work in getting
> > > > > > LIO to work with Ceph, but currently it feels like a bit of an
> > > > > > amateur affair getting it going.
> > > > > >
> > > > > > 6. Slow xattr problem
> > > > > > I've had a weird problem a couple of times, where RBDs with
> > > > > > data that hasn't been written to for a while seem to start
> > > > > > performing reads very slowly. With the help of Somnath in a
> > > > > > thread here we managed to track it down to an xattr taking very
> > > > > > long to be retrieved, but no idea why.
> > > > > > Overwriting the RBD with fresh data seemed to stop it happening.
> > > > > > Hopefully Newstore might stop this happening in the future.
> > > > > >
> > > > > >>
> > > > > >> I.e., should we be focusing on IOPS?  Latency?  Finding a way
> > > > > >> to avoid journal overhead for large writes?  Are there
> > > > > >> specific use cases where we should specifically be focusing
> > > > > >> attention? General iSCSI? S3? Databases directly on RBD? Etc.
> > > > > >> There are tons of different areas that we can work on (general
> > > > > >> OSD threading improvements, different messenger
> > > > > >> implementations, newstore, client side bottlenecks, etc.) but
> > > > > >> all of those things tackle different kinds of problems.
> > > > > >>
> > > > > >> Mark
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > >
> > > >
> > > >
> > > >
> >
> >
> >
> 




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


