> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> Of Mark Nelson
> Sent: 18 August 2015 18:51
> To: Nick Fisk <nick@xxxxxxxxxx>; 'Jan Schermer' <jan@xxxxxxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: any recommendation of using EnhanceIO?
>
> On 08/18/2015 11:52 AM, Nick Fisk wrote:
>
> <snip>
>
> >>>> Here's kind of how I see the field right now:
> >>>>
> >>>> 1) Cache at the client level. Likely fastest, but obvious issues
> >>>> like the above. RAID1 might be an option at increased cost. Lack
> >>>> of barriers in some implementations is scary.
> >>>
> >>> Agreed.
> >>>
> >>>> 2) Cache below the OSD. Not much recent data on this. Not likely
> >>>> as fast as a client-side cache, but likely cheaper (fewer OSD
> >>>> nodes than client nodes?). Lack of barriers in some
> >>>> implementations is scary.
> >>>
> >>> This also has the benefit of caching the leveldb on the OSD, so you
> >>> get a big performance gain from there too for small sequential
> >>> writes. I looked at using Flashcache for this too, but decided it
> >>> was adding too much complexity and risk.
> >>>
> >>> I thought I read somewhere that RocksDB allows you to move its WAL
> >>> to SSD; is there anything in the pipeline for something like moving
> >>> the filestore to use RocksDB?
> >>
> >> I believe you can already do this, though I haven't tested it. You
> >> can certainly move the monitors to rocksdb (tested), and newstore
> >> uses rocksdb as well.
> >>
> >
> > Interesting, I might have a look into this.
> >
> >>>> 3) Ceph Cache Tiering. Network overhead and write amplification
> >>>> on promotion make this primarily useful when workloads fit mostly
> >>>> into the cache tier. Overall a safe design, but care must be taken
> >>>> to not over-promote.
> >>>>
> >>>> 4) Separate SSD pool.
> >>>> Manual and not particularly flexible, but perhaps best for
> >>>> applications that need consistently high performance.
> >>>
> >>> I think it depends on the definition of performance. Currently even
> >>> very fast CPUs and SSDs in their own pool will still struggle to
> >>> get below 1ms of write latency. If your performance requirements
> >>> are for large queue depths then you will probably be alright. If
> >>> you require something that mirrors the performance of a traditional
> >>> write-back cache, then even pure SSD pools can start to struggle.
> >>
> >> Agreed. This is definitely the crux of the problem. The example
> >> below is a great start! It would be fantastic if we could get more
> >> feedback from the list on the relative importance of low-latency
> >> operations vs high IOPS through concurrency. We have general
> >> suspicions but not a ton of actual data regarding what folks are
> >> seeing in practice and under what scenarios.
> >
> > If you have any specific questions that you think I might be able to
> > answer, please let me know. The only other main app I can really
> > think of where this sort of write latency is critical is SQL,
> > particularly the transaction logs.
>
> Probably the big question is what are the pain points? The most common
> answer we get when asking folks what applications they run on top of
> Ceph is "everything!". This is wonderful, but not helpful when trying
> to figure out what performance issues matter most! :)

Sort of like someone telling you their PC is broken and, when asked for
details, getting "It's not working" in return. In general I think a lot
of it comes down to people not appreciating the differences between Ceph
and, say, a RAID array. For most things, like larger-block IO,
performance tends to scale with cluster size, and the cost-effectiveness
of Ceph makes it a no-brainer to just add a handful of extra OSDs. I
will try and be more precise.
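As a back-of-envelope illustration of why the latency-vs-queue-depth
distinction above matters so much: for a serialised workload, achievable
IOPS are simply bounded by queue depth divided by per-op latency
(Little's Law). The latency figures below are illustrative assumptions,
not measurements from any particular cluster:

```python
# Rough upper bound on IOPS for a given queue depth and per-op latency
# (Little's Law: concurrency = throughput x latency).
# The latency numbers are assumptions for illustration only.

def max_iops(queue_depth, latency_s):
    """IOPS ceiling if every op takes latency_s and the queue stays full."""
    return queue_depth / latency_s

ceph_write_lat = 1e-3    # assumed ~1 ms per small write on an all-SSD pool
wb_cache_lat = 50e-6     # assumed ~50 us for a local write-back cache

print(f"Ceph pool,   QD1 : {max_iops(1, ceph_write_lat):8.0f} IOPS")
print(f"Ceph pool,   QD32: {max_iops(32, ceph_write_lat):8.0f} IOPS "
      "(assuming latency holds at 1 ms)")
print(f"local cache, QD1 : {max_iops(1, wb_cache_lat):8.0f} IOPS")
```

In other words, at ~1 ms per write a single-threaded writer (the SQL
transaction log case) is capped around 1k IOPS regardless of cluster
size, which is exactly why client-side write caching keeps coming up.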
Here is my list of pain points / wishes that I have come across in the
last 12 months of running Ceph.

1. Improve small IO write latency
As discussed in depth in this thread. If it's possible just to make Ceph
a lot faster then great, but I fear even a doubling in performance will
still fall short compared to caching writes at the client. Most things
in Ceph tend to improve with scale, but write latency is the same with 2
OSDs as it is with 2000. I would urge some sort of investigation into
the possibility of persistent librbd caching. This would probably help
across a large number of scenarios, as in the end most things are
affected by latency, so I think it would provide across-the-board
improvements.

2. Cache Tiering
I know a lot of work is going into this currently, but I will cover my
experience.

2A) Deletion of large RBDs takes forever. It seems to have to promote
all objects, even non-existent ones, to the cache tier before it can
delete them. Operationally this is really poor, as it has a negative
effect on the cache tier contents as well.

2B) Erasure coding requires all writes to be promoted first. I think it
should be pretty easy to allow proxy writes for erasure-coded pools when
the IO size equals the object size. A lot of backup applications can be
configured to write out in static-sized blocks and would be ideal
candidates for this sort of enhancement.

2C) General performance; hopefully this will be fixed by the upcoming
changes.

2D) Don't count consecutive sequential reads of the same object as a
trigger for promotion. I currently have a problem where reading
sequentially through a large RBD causes it to be completely promoted,
because the read IO size is smaller than the underlying object size.

3. Kernel RBD client
Either implement striping, or see if it's possible to configure
readahead + max_sectors_kb to be larger than the object size. I started
a thread about this a few days ago if you are interested in more
details.

4. Disk-based OSD with SSD journal performance
As I touched on above, I would expect a disk-based OSD with an SSD
journal to have performance similar to a pure SSD OSD when dealing with
small sequential IOs. Currently the levelDB sync, and potentially other
things, slow this down.

5. iSCSI
I know Mike Christie is doing a lot of good work on getting LIO to work
with Ceph, but currently it feels like a bit of an amateur affair
getting it going.

6. Slow xattr problem
I've hit a weird problem a couple of times where RBDs with data that
hasn't been written to for a while start performing reads very slowly.
With the help of Somnath in a thread here we managed to track it down to
an xattr taking a very long time to be retrieved, but we have no idea
why. Overwriting the RBD with fresh data seemed to stop it happening.
Hopefully Newstore might prevent this in the future.

> IE, should we be focusing on IOPS? Latency? Finding a way to avoid
> journal overhead for large writes? Are there specific use cases where
> we should specifically be focusing attention? General iSCSI? S3?
> Databases directly on RBD? etc. There are tons of different areas that
> we can work on (general OSD threading improvements, different
> messenger implementations, newstore, client-side bottlenecks, etc.)
> but all of those things tackle different kinds of problems.
>
> Mark
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com