> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Nick Fisk
> Sent: Wednesday, August 19, 2015 5:25 AM
> To: 'Samuel Just'
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: any recommendation of using EnhanceIO?
>
> Hi Sam,
>
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Samuel Just
> > Sent: 18 August 2015 21:38
> > To: Nick Fisk <nick@xxxxxxxxxx>
> > Cc: ceph-users@xxxxxxxxxxxxxx
> > Subject: Re: any recommendation of using EnhanceIO?
> >
> > 1. We've kicked this around a bit. What kind of failure semantics would you be comfortable with here (that is, what would be reasonable behavior if the client side cache fails)?
>
> I would either expect to provide the cache with a redundant block device (i.e. RAID1 SSDs) or for the cache to allow itself to be configured to mirror across two SSDs. Of course, single SSDs can be used if the user accepts the risk. If the cache did the mirroring, then you could do fancy stuff like mirror the writes but leave the read cache blocks as single copies to increase the cache capacity.
>
> In either case, although an outage is undesirable, it's only data loss that would be unacceptable, and that would hopefully be avoided by the mirroring. As part of this, there would need to be a way to make sure a "dirty" RBD can't be accessed unless the corresponding cache is also attached.
>
> I guess that since it is caching the RBD and not the pool or entire cluster, the cache only needs to match the failure requirements of the application it's caching. If I need to cache an RBD that sits on a single server, there is no requirement to make the cache redundant across racks/PDUs/servers, etc.
>
> I hope I've answered your question?
>
> > 2. We've got a branch which should merge soon (tomorrow probably) which actually does allow writes to be proxied, so that should alleviate some of these pain points somewhat. I'm not sure it is clever enough to allow through writefulls for an ec base tier though (but it would be a good idea!) -
>
> Excellent news, I shall look forward to testing it in the future. I did mention the proxy write for writefulls to someone who was working on the proxy write code, but I'm not sure if it ever got followed up.

I think that someone is me. In the current code, for an EC base tier, writefull can be proxied to the base.
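
As a concrete illustration of that case: a client that writes in whole-object-sized chunks ends up issuing RADOS writefull operations, which are the ones that can be proxied straight through to the EC base tier rather than forcing a promotion. A minimal sketch using the Python rados bindings follows; the pool name, object naming scheme and 4 MiB chunk size are assumptions for illustration only.

    import rados

    OBJECT_SIZE = 4 * 1024 * 1024  # assumed chunk size, matching the common 4 MiB RADOS/RBD object size

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("backup-ec")  # hypothetical EC-backed pool sitting behind a cache tier
    try:
        with open("backup.img", "rb") as src:
            idx = 0
            while True:
                chunk = src.read(OBJECT_SIZE)
                if not chunk:
                    break
                # write_full() replaces the whole object in a single operation,
                # i.e. the writefull op that can avoid promotion into the cache tier.
                ioctx.write_full("backup.img.%08d" % idx, chunk)
                idx += 1
    finally:
        ioctx.close()
        cluster.shutdown()

Anything smaller than a full object becomes a partial overwrite and, per the discussion below, still has to be promoted first.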
> > Sam
> >
> > On Tue, Aug 18, 2015 at 12:48 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > >
> > >> -----Original Message-----
> > >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Mark Nelson
> > >> Sent: 18 August 2015 18:51
> > >> To: Nick Fisk <nick@xxxxxxxxxx>; 'Jan Schermer' <jan@xxxxxxxxxxx>
> > >> Cc: ceph-users@xxxxxxxxxxxxxx
> > >> Subject: Re: any recommendation of using EnhanceIO?
> > >>
> > >> On 08/18/2015 11:52 AM, Nick Fisk wrote:
> > >> > <snip>
> > >> >>>>
> > >> >>>> Here's kind of how I see the field right now:
> > >> >>>>
> > >> >>>> 1) Cache at the client level. Likely fastest but obvious issues like above. RAID1 might be an option at increased cost. Lack of barriers in some implementations scary.
> > >> >>>
> > >> >>> Agreed.
> > >> >>>
> > >> >>>> 2) Cache below the OSD. Not much recent data on this. Not likely as fast as client side cache, but likely cheaper (fewer OSD nodes than client nodes?). Lack of barriers in some implementations scary.
> > >> >>>
> > >> >>> This also has the benefit of caching the leveldb on the OSD, so you get a big performance gain from there too for small sequential writes. I looked at using Flashcache for this too but decided it was adding too much complexity and risk.
> > >> >>>
> > >> >>> I thought I read somewhere that RocksDB allows you to move its WAL to SSD; is there anything in the pipeline for something like moving the filestore to use RocksDB?
> > >> >>
> > >> >> I believe you can already do this, though I haven't tested it. You can certainly move the monitors to rocksdb (tested) and newstore uses rocksdb as well.
> > >> >
> > >> > Interesting, I might have a look into this.
> > >> >
> > >> >>>> 3) Ceph Cache Tiering. Network overhead and write amplification on promotion makes this primarily useful when workloads fit mostly into the cache tier. Overall safe design but care must be taken to not over-promote.
> > >> >>>>
> > >> >>>> 4) Separate SSD pool. Manual and not particularly flexible, but perhaps best for applications that need consistently high performance.
> > >> >>>
> > >> >>> I think it depends on the definition of performance. Currently even very fast CPUs and SSDs in their own pool will still struggle to get less than 1ms of write latency. If your performance requirements are for large queue depths then you will probably be alright. If you require something that mirrors the performance of a traditional write-back cache, then even pure SSD pools can start to struggle.
> > >> >>
> > >> >> Agreed. This is definitely the crux of the problem. The example below is a great start! It would be fantastic if we could get more feedback from the list on the relative importance of low latency operations vs high IOPS through concurrency. We have general suspicions but not a ton of actual data regarding what folks are seeing in practice and under what scenarios.
> > >> >
> > >> > If you have any specific questions that you think I might be able to answer, please let me know. The only other main app I can really think of where this sort of write latency is critical is SQL, particularly the transaction logs.
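
On the sub-millisecond point a few paragraphs up: single-threaded, queue-depth-1 write latency is easy to measure directly against a pool with the Python rados bindings. A rough sketch; the "ssd-pool" name, payload size and sample count are illustrative, and the numbers obviously depend entirely on the cluster being probed.

    import time
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("ssd-pool")    # hypothetical all-SSD pool
    payload = b"\0" * 4096                    # one small (4 KiB) write per iteration

    samples = []
    for _ in range(100):
        start = time.time()
        ioctx.write("latency-probe", payload)  # synchronous: returns once the write is acknowledged
        samples.append((time.time() - start) * 1000.0)

    samples.sort()
    print("median %.2f ms, worst %.2f ms" % (samples[len(samples) // 2], samples[-1]))

    ioctx.remove_object("latency-probe")
    ioctx.close()
    cluster.shutdown()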
> > >>
> > >> Probably the big question is what are the pain points? The most common answer we get when asking folks what applications they run on top of Ceph is "everything!". This is wonderful, but not helpful when trying to figure out what performance issues matter most! :)
> > >
> > > Sort of like someone telling you their PC is broken and, when asked for details, getting "It's not working" in return.
> > >
> > > In general I think a lot of it comes down to people not appreciating the differences between Ceph and, say, a RAID array. For most things like larger block IO, performance tends to scale with cluster size, and the cost effectiveness of Ceph makes it a no-brainer to just add a handful of extra OSDs.
> > >
> > > I will try and be more precise. Here is my list of pain points / wishes that I have come across in the last 12 months of running Ceph.
> > >
> > > 1. Improve small IO write latency
> > > As discussed in depth in this thread. If it's possible just to make Ceph a lot faster then great, but I fear even a doubling in performance will still fall short compared to caching writes at the client. Most things in Ceph tend to improve with scale, but write latency is the same with 2 OSDs as it is with 2000. I would urge some sort of investigation into the possibility of persistent librbd caching. This will probably help across a large number of scenarios, as in the end most things are affected by latency, and I think it will provide across-the-board improvements.
> > >
> > > 2. Cache Tiering
> > > I know a lot of work is going into this currently, but I will cover my experience.
> > > 2A) Deletion of large RBDs takes forever. It seems to have to promote all objects, even non-existent ones, to the cache tier before it can delete them. Operationally this is really poor, as it has a negative effect on the cache tier contents as well.
> > > 2B) Erasure coding requires all writes to be promoted first. I think it should be pretty easy to allow proxy writes for erasure coded pools if the IO size = object size. A lot of backup applications can be configured to write out in static sized blocks and would be an ideal candidate for this sort of enhancement.
> > > 2C) General performance; hopefully this will be fixed by upcoming changes.
> > > 2D) Don't count consecutive sequential reads to the same object as a trigger for promotion. I currently have problems where reading sequentially through a large RBD causes it to be completely promoted, because the read IO size is smaller than the underlying object size. (See the configuration sketch below this list.)
> > >
> > > 3. Kernel RBD client
> > > Either implement striping or see if it's possible to configure readahead + max_sectors_kb to be larger than the object size. I started a thread about this a few days ago if you are interested in more details. (See the sysfs sketch below this list.)
> > >
> > > 4. Disk based OSD with SSD journal performance
> > > As I touched on earlier, I would expect a disk based OSD with an SSD journal to have similar performance to a pure SSD OSD when dealing with sequential small IOs. Currently the levelDB sync and potentially other things slow this down.
> > >
> > > 5. iSCSI
> > > I know Mike Christie is doing a lot of good work in getting LIO to work with Ceph, but currently it feels like a bit of an amateur affair getting it going.
> > >
> > > 6. Slow xattr problem
> > > I've had a weird problem a couple of times where RBDs with data that hasn't been written to for a while seem to start performing reads very slowly. With the help of Somnath in a thread here we managed to track it down to an xattr taking a very long time to be retrieved, but no idea why. Overwriting the RBD with fresh data seemed to stop it happening. Hopefully Newstore might stop this happening in the future.
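
For reference against point 2: attaching and tuning a writeback cache tier is a handful of ceph CLI calls, roughly as sketched below (driven from Python via subprocess). The pool names, sizes and thresholds are purely illustrative, and min_read_recency_for_promote, which helps with the over-promotion described in 2D, is only present on newer releases.

    import subprocess

    def ceph(*args):
        # Thin wrapper around the ceph CLI; assumes admin credentials are available.
        subprocess.check_call(("ceph",) + args)

    BASE, CACHE = "rbd", "rbd-cache"  # illustrative pool names

    # Attach CACHE as a writeback tier in front of BASE and route client IO through it.
    ceph("osd", "tier", "add", BASE, CACHE)
    ceph("osd", "tier", "cache-mode", CACHE, "writeback")
    ceph("osd", "tier", "set-overlay", BASE, CACHE)

    # Hit-set tracking is required so the tier can make promotion and flush/evict decisions.
    ceph("osd", "pool", "set", CACHE, "hit_set_type", "bloom")
    ceph("osd", "pool", "set", CACHE, "hit_set_count", "4")
    ceph("osd", "pool", "set", CACHE, "hit_set_period", "1200")
    ceph("osd", "pool", "set", CACHE, "target_max_bytes", str(200 * 1024 ** 3))

    # Where supported, requiring an object to appear in more than one recent hit set
    # before promotion reduces promotion of objects that are only read once.
    ceph("osd", "pool", "set", CACHE, "min_read_recency_for_promote", "2")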
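
And for point 3: the readahead and maximum request size of a mapped krbd device are plain sysfs tunables, so experimenting looks roughly like the sketch below. The rbd0 device name and the values are assumptions, writing to sysfs needs root, and the kernel may still clamp max_sectors_kb to the device's max_hw_sectors_kb.

    def set_queue_param(device, param, value):
        # Block-layer queue tunables for a mapped RBD device live under /sys/block/<dev>/queue/.
        with open("/sys/block/%s/queue/%s" % (device, param), "w") as f:
            f.write(str(value))

    # Raise readahead above the object size and push the request size as high as the device allows
    # (values are in KiB).
    set_queue_param("rbd0", "read_ahead_kb", 8192)
    set_queue_param("rbd0", "max_sectors_kb", 4096)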
> > >>
> > >> I.e., should we be focusing on IOPS? Latency? Finding a way to avoid journal overhead for large writes? Are there specific use cases where we should specifically be focusing attention? General iSCSI? S3? Databases directly on RBD? etc. There's tons of different areas that we can work on (general OSD threading improvements, different messenger implementations, newstore, client side bottlenecks, etc.) but all of those things tackle different kinds of problems.
> > >>
> > >> Mark

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com