> -----Original Message-----
> From: Nick Fisk [mailto:nick@xxxxxxxxxx]
> Sent: Tuesday, September 1, 2015 4:37 PM
> To: Wang, Zhiqiang; 'Samuel Just'
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: RE: any recommendation of using EnhanceIO?
>
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Wang, Zhiqiang
> > Sent: 01 September 2015 09:18
> > To: Nick Fisk <nick@xxxxxxxxxx>; 'Samuel Just' <sjust@xxxxxxxxxx>
> > Cc: ceph-users@xxxxxxxxxxxxxx
> > Subject: Re: any recommendation of using EnhanceIO?
> >
> > > -----Original Message-----
> > > From: Nick Fisk [mailto:nick@xxxxxxxxxx]
> > > Sent: Tuesday, September 1, 2015 3:55 PM
> > > To: Wang, Zhiqiang; 'Nick Fisk'; 'Samuel Just'
> > > Cc: ceph-users@xxxxxxxxxxxxxx
> > > Subject: RE: any recommendation of using EnhanceIO?
> > >
> > > > -----Original Message-----
> > > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Wang, Zhiqiang
> > > > Sent: 01 September 2015 02:48
> > > > To: Nick Fisk <nick@xxxxxxxxxx>; 'Samuel Just' <sjust@xxxxxxxxxx>
> > > > Cc: ceph-users@xxxxxxxxxxxxxx
> > > > Subject: Re: any recommendation of using EnhanceIO?
> > > >
> > > > > -----Original Message-----
> > > > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Nick Fisk
> > > > > Sent: Wednesday, August 19, 2015 5:25 AM
> > > > > To: 'Samuel Just'
> > > > > Cc: ceph-users@xxxxxxxxxxxxxx
> > > > > Subject: Re: any recommendation of using EnhanceIO?
> > > > >
> > > > > Hi Sam,
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Samuel Just
> > > > > > Sent: 18 August 2015 21:38
> > > > > > To: Nick Fisk <nick@xxxxxxxxxx>
> > > > > > Cc: ceph-users@xxxxxxxxxxxxxx
> > > > > > Subject: Re: any recommendation of using EnhanceIO?
> > > > > >
> > > > > > 1. We've kicked this around a bit.
> > > > > > What kind of failure semantics would you be comfortable with here (that is, what would be reasonable behavior if the client side cache fails)?
> > > > >
> > > > > I would either expect to provide the cache with a redundant block device (i.e. RAID1 SSDs) or the cache to allow itself to be configured to mirror across two SSDs. Of course single SSDs can be used if the user accepts the risk. If the cache did the mirroring then you could do fancy stuff like mirror the writes, but leave the read cache blocks as single copies to increase the cache capacity.
> > > > >
> > > > > In either case, although an outage is undesirable, it's only data loss which would be unacceptable, and that would hopefully be avoided by the mirroring. As part of this, there would need to be a way to make sure a "dirty" RBD can't be accessed unless the corresponding cache is also attached.
> > > > >
> > > > > I guess as it is caching the RBD and not the pool or entire cluster, the cache only needs to match the failure requirements of the application it's caching. If I need to cache an RBD that is on a single server, there is no requirement to make the cache redundant across racks/PDUs/servers, etc.
> > > > >
> > > > > I hope I've answered your question?
> > > > >
> > > > > > 2. We've got a branch which should merge soon (tomorrow probably) which actually does allow writes to be proxied, so that should alleviate some of these pain points somewhat. I'm not sure it is clever enough to allow through writefulls for an ec base tier though (but it would be a good idea!) -
> > > > >
> > > > > Excellent news, I shall look forward to testing in the future.
> > > > > I did mention the proxy write for writefulls to someone who was working on the proxy write code, but I'm not sure if it ever got followed up.
> > > >
> > > > I think that someone is me. In the current code, for an ec base tier, writefull can be proxied to the base.
> > >
> > > Excellent news. Is this intelligent enough to determine when, say, a normal write IO from an RBD is equal to the underlying object size and then turn this normal write effectively into a writefull?
> >
> > Checked the code, seems we don't do this right now... Would this be very helpful? I think we can do this if the answer is yes.
>
> Hopefully yes. Erasure coding is well suited to storing backups capacity-wise, and a lot of backup software can be configured to write in static size blocks, which could be set to the object size. With the current tiering code you end up with a lot of IO amplification and poor performance; if the above feature were possible, it should perform a lot better.
>
> Does that make sense?

Yep. It makes sense in this case. Actually, the backup software doesn't need to write in units of object size. As long as a write spans a full object, that object can be written with writefull. I'll see if I can come up with an implementation of this.

> If you are also caching the RBD, through some sort of block cache like those mentioned in this thread, then small sequential writes could also be assembled in cache and then flushed straight through to the erasure tier as proxy full writes. This is probably less appealing than the backup case but gives the same advantages as RAID5/6 when equipped with a battery backed cache, which also has massive performance gains when able to write a full stripe.
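
To make the "spans a full object" condition above concrete, here is a minimal sketch in Python. It is not Ceph code: the function name `split_write` and the tuple layout are made up for illustration, and it assumes the image is cut into fixed-size objects with no extra striping, so the object index is simply offset // object_size. Each extent that covers a whole object is flagged as a writefull candidate; partial head and tail extents are not.

```python
from typing import List, Tuple

# (object_index, offset_within_object, length, covers_whole_object)
Extent = Tuple[int, int, int, bool]


def split_write(offset: int, length: int, object_size: int) -> List[Extent]:
    """Split the byte range [offset, offset + length) into per-object extents.

    Hypothetical helper for illustration only. Assumes fixed-size objects
    and no striping across objects.
    """
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")

    extents: List[Extent] = []
    pos = offset
    end = offset + length
    while pos < end:
        index = pos // object_size           # which object this byte falls in
        in_object = pos % object_size        # where the extent starts in it
        # Take bytes up to the end of this object or the end of the write.
        n = min(object_size - in_object, end - pos)
        # The extent covers the whole object only if it is object_size long,
        # which also implies it starts at offset 0 within the object.
        extents.append((index, in_object, n, n == object_size))
        pos += n
    return extents
```

For example, with 4 MiB objects, an 8 MiB write starting at offset 1 MiB touches three objects: the tail of object 0, all of object 1, and the head of object 2. Only object 1 would qualify for writefull here, which is why the write does not need to be aligned to object boundaries for some of it to benefit.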
> > > > > >
> > > > > > Sam
> > > > > >
> > > > > > On Tue, Aug 18, 2015 at 12:48 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > > > > > >> -----Original Message-----
> > > > > > >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Mark Nelson
> > > > > > >> Sent: 18 August 2015 18:51
> > > > > > >> To: Nick Fisk <nick@xxxxxxxxxx>; 'Jan Schermer' <jan@xxxxxxxxxxx>
> > > > > > >> Cc: ceph-users@xxxxxxxxxxxxxx
> > > > > > >> Subject: Re: any recommendation of using EnhanceIO?
> > > > > > >>
> > > > > > >> On 08/18/2015 11:52 AM, Nick Fisk wrote:
> > > > > > >> > <snip>
> > > > > > >> >>>>
> > > > > > >> >>>> Here's kind of how I see the field right now:
> > > > > > >> >>>>
> > > > > > >> >>>> 1) Cache at the client level. Likely fastest but obvious issues like above. RAID1 might be an option at increased cost. Lack of barriers in some implementations scary.
> > > > > > >> >>>
> > > > > > >> >>> Agreed.
> > > > > > >> >>>
> > > > > > >> >>>> 2) Cache below the OSD. Not much recent data on this. Not likely as fast as client side cache, but likely cheaper (fewer OSD nodes than client nodes?). Lack of barriers in some implementations scary.
> > > > > > >> >>>
> > > > > > >> >>> This also has the benefit of caching the leveldb on the OSD, so you get a big performance gain from there too for small sequential writes. I looked at using Flashcache for this too but decided it was adding too much complexity and risk.
> > > > > > >> >>>
> > > > > > >> >>> I thought I read somewhere that RocksDB allows you to move its WAL to SSD. Is there anything in the pipeline for something like moving the filestore to use RocksDB?
> > > > > > >> >>
> > > > > > >> >> I believe you can already do this, though I haven't tested it. You can certainly move the monitors to rocksdb (tested) and newstore uses rocksdb as well.
> > > > > > >> >
> > > > > > >> > Interesting, I might have a look into this.
> > > > > > >> >
> > > > > > >> >>>> 3) Ceph Cache Tiering. Network overhead and write amplification on promotion makes this primarily useful when workloads fit mostly into the cache tier. Overall safe design but care must be taken to not over-promote.
> > > > > > >> >>>>
> > > > > > >> >>>> 4) separate SSD pool. Manual and not particularly flexible, but perhaps best for applications that need consistently high performance.
> > > > > > >> >>>
> > > > > > >> >>> I think it depends on the definition of performance. Currently even very fast CPUs and SSDs in their own pool will still struggle to get less than 1ms of write latency. If your performance requirements are for large queue depths then you will probably be alright. If you require something that mirrors the performance of a traditional write back cache, then even pure SSD pools can start to struggle.
> > > > > > >> >>
> > > > > > >> >> Agreed. This is definitely the crux of the problem. The example below is a great start!
> > > > > > >> >> It would be fantastic if we could get more feedback from the list on the relative importance of low latency operations vs high IOPS through concurrency. We have general suspicions but not a ton of actual data regarding what folks are seeing in practice and under what scenarios.
> > > > > > >> >
> > > > > > >> > If you have any specific questions that you think I might be able to answer, please let me know. The only other main app that I can really think of where this sort of write latency is critical is SQL, particularly the transaction logs.
> > > > > > >>
> > > > > > >> Probably the big question is what are the pain points? The most common answer we get when asking folks what applications they run on top of Ceph is "everything!". This is wonderful, but not helpful when trying to figure out what performance issues matter most! :)
> > > > > > >
> > > > > > > Sort of like someone telling you their PC is broken and, when asked for details, getting "It's not working" in return.
> > > > > > >
> > > > > > > In general I think a lot of it comes down to people not appreciating the differences between Ceph and, say, a RAID array. For most things, like larger block IO, performance tends to scale with cluster size, and the cost effectiveness of Ceph makes it a no-brainer to just add a handful of extra OSDs.
> > > > > > >
> > > > > > > I will try and be more precise. Here is my list of pain points / wishes that I have come across in the last 12 months of running Ceph.
> > > > > > >
> > > > > > > 1. Improve small IO write latency
> > > > > > > As discussed in depth in this thread. If it's possible just to make Ceph a lot faster then great, but I fear even a doubling in performance will still fall short compared to caching writes at the client. Most things in Ceph tend to improve with scale, but write latency is the same with 2 OSDs as it is with 2000. I would urge some sort of investigation into the possibility of persistent librbd caching. This will probably help across a large number of scenarios since, in the end, most things are affected by latency, and I think it will provide across the board improvements.
> > > > > > >
> > > > > > > 2. Cache Tiering
> > > > > > > I know a lot of work is going into this currently, but I will cover my experience.
> > > > > > > 2A) Deletion of large RBDs takes forever. It seems to have to promote all objects, even non-existent ones, to the cache tier before it can delete them. Operationally this is really poor as it has a negative effect on the cache tier contents as well.
> > > > > > > 2B) Erasure Coding requires all writes to be promoted first. I think it should be pretty easy to allow proxy writes for erasure coded pools if the IO size = object size. A lot of backup applications can be configured to write out in static sized blocks and would be an ideal candidate for this sort of enhancement.
> > > > > > > 2C) General Performance, hopefully this will be fixed by upcoming changes.
> > > > > > > 2D) Don't count consecutive sequential reads to the same object as a trigger for promotion.
> > > > > > > I currently have problems where reading sequentially through a large RBD causes it to be completely promoted because the read IO size is smaller than the underlying object size.
> > > > > > >
> > > > > > > 3. Kernel RBD Client
> > > > > > > Either implement striping or see if it's possible to configure readahead and max_sectors_kb size to be larger than the object size. I started a thread about this a few days ago if you are interested in more details.
> > > > > > >
> > > > > > > 4. Disk based OSD with SSD Journal performance
> > > > > > > As I touched on earlier, I would expect a disk based OSD with SSD journal to have similar performance to a pure SSD OSD when dealing with sequential small IOs. Currently the levelDB sync and potentially other things slow this down.
> > > > > > >
> > > > > > > 5. iSCSI
> > > > > > > I know Mike Christie is doing a lot of good work in getting LIO to work with Ceph, but currently it feels like a bit of an amateur affair getting it going.
> > > > > > >
> > > > > > > 6. Slow xattr problem
> > > > > > > I've had a weird problem a couple of times, where RBDs with data that hasn't been written to for a while seem to start performing reads very slowly. With the help of Somnath in a thread here we managed to track it down to an xattr taking very long to be retrieved, but no idea why. Overwriting the RBD with fresh data seemed to stop it happening. Hopefully Newstore might stop this happening in the future.
> > > > > > >
> > > > > > >> IE, should we be focusing on IOPS? Latency? Finding a way to avoid journal overhead for large writes?
> > > > > > >> Are there specific use cases where we should specifically be focusing attention? general iscsi? S3? databases directly on RBD? etc. There's tons of different areas that we can work on (general OSD threading improvements, different messenger implementations, newstore, client side bottlenecks, etc) but all of those things tackle different kinds of problems.
> > > > > > >>
> > > > > > >> Mark
> > > > > > >> _______________________________________________
> > > > > > >> ceph-users mailing list
> > > > > > >> ceph-users@xxxxxxxxxxxxxx
> > > > > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com