1. We've kicked this around a bit. What kind of failure semantics
would you be comfortable with here (that is, what would be reasonable
behavior if the client side cache fails)?
2. We've got a branch which should merge soon (tomorrow probably)
which actually does allow writes to be proxied, so that should
alleviate some of these pain points somewhat. I'm not sure it is
clever enough to allow through writefulls for an ec base tier though
(but it would be a good idea!)
-Sam

On Tue, Aug 18, 2015 at 12:48 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
>> Mark Nelson
>> Sent: 18 August 2015 18:51
>> To: Nick Fisk <nick@xxxxxxxxxx>; 'Jan Schermer' <jan@xxxxxxxxxxx>
>> Cc: ceph-users@xxxxxxxxxxxxxx
>> Subject: Re: any recommendation of using EnhanceIO?
>>
>> On 08/18/2015 11:52 AM, Nick Fisk wrote:
>> > <snip>
>> >>>>
>> >>>> Here's kind of how I see the field right now:
>> >>>>
>> >>>> 1) Cache at the client level. Likely fastest but obvious issues
>> >>>> like above. RAID1 might be an option at increased cost. Lack of
>> >>>> barriers in some implementations scary.
>> >>>
>> >>> Agreed.
>> >>>
>> >>>>
>> >>>> 2) Cache below the OSD. Not much recent data on this. Not likely
>> >>>> as fast as client side cache, but likely cheaper (fewer OSD nodes
>> >>>> than client nodes?). Lack of barriers in some implementations
>> >>>> scary.
>> >>>
>> >>> This also has the benefit of caching the leveldb on the OSD, so
>> >>> you get a big performance gain from there too for small sequential
>> >>> writes. I looked at using Flashcache for this too but decided it
>> >>> was adding too much complexity and risk.
>> >>>
>> >>> I thought I read somewhere that RocksDB allows you to move its WAL
>> >>> to SSD. Is there anything in the pipeline for something like
>> >>> moving the filestore to use RocksDB?
>> >>
>> >> I believe you can already do this, though I haven't tested it. You
>> >> can certainly move the monitors to rocksdb (tested) and newstore
>> >> uses rocksdb as well.
>> >>
>> >
>> > Interesting, I might have a look into this.
>> >
>> >>>
>> >>>>
>> >>>> 3) Ceph Cache Tiering. Network overhead and write amplification on
>> >>>> promotion makes this primarily useful when workloads fit mostly
>> >>>> into the cache tier. Overall safe design but care must be taken to
>> >>>> not over-promote.
>> >>>>
>> >>>> 4) separate SSD pool. Manual and not particularly flexible, but
>> >>>> perhaps best for applications that need consistently high
>> >>>> performance.
>> >>>
>> >>> I think it depends on the definition of performance. Currently even
>> >>> very fast CPUs and SSDs in their own pool will still struggle to
>> >>> get less than 1ms of write latency. If your performance
>> >>> requirements are for large queue depths then you will probably be
>> >>> alright. If you require something that mirrors the performance of
>> >>> traditional write back cache, then even pure SSD pools can start to
>> >>> struggle.
>> >>
>> >> Agreed. This is definitely the crux of the problem. The example
>> >> below is a great start! It would be fantastic if we could get more
>> >> feedback from the list on the relative importance of low latency
>> >> operations vs high IOPS through concurrency. We have general
>> >> suspicions but not a ton of actual data regarding what folks are
>> >> seeing in practice and under what scenarios.
>> >>
>> >
>> > If you have any specific questions that you think I might be able to
>> > answer, please let me know. The only other main app that I can really
>> > think of where this sort of write latency is critical is SQL,
>> > particularly the transaction logs.
>>
>> Probably the big question is what are the pain points? The most common
>> answer we get when asking folks what applications they run on top of Ceph
>> is "everything!".
>> This is wonderful, but not helpful when trying to figure out
>> what performance issues matter most! :)
>
> Sort of like someone telling you their PC is broken and, when asked for
> details, getting "It's not working" in return.
>
> In general I think a lot of it comes down to people not appreciating the
> differences between Ceph and, say, a RAID array. For most things, like
> larger block IO, performance tends to scale with cluster size, and the
> cost effectiveness of Ceph makes it a no-brainer to just add a handful of
> extra OSDs.
>
> I will try and be more precise. Here is my list of pain points / wishes
> that I have come across in the last 12 months of running Ceph.
>
> 1. Improve small IO write latency
> As discussed in depth in this thread. If it's possible just to make Ceph a
> lot faster then great, but I fear even a doubling in performance will still
> fall short compared to caching writes at the client. Most things in Ceph
> tend to improve with scale, but write latency is the same with 2 OSDs as
> it is with 2000. I would urge some sort of investigation into the
> possibility of some sort of persistent librbd caching. This will probably
> help across a large number of scenarios, as in the end most things are
> affected by latency, so I think it would provide across-the-board
> improvements.
>
> 2. Cache Tiering
> I know a lot of work is going into this currently, but I will cover my
> experience.
> 2A) Deletion of large RBDs takes forever. It seems to have to promote all
> objects, even non-existent ones, to the cache tier before it can delete
> them. Operationally this is really poor as it has a negative effect on the
> cache tier contents as well.
> 2B) Erasure Coding requires all writes to be promoted first. I think it
> should be pretty easy to allow proxy writes for erasure coded pools if the
> IO size = Object Size.
> A lot of backup applications can be configured to write out in static
> sized blocks and would be ideal candidates for this sort of enhancement.
> 2C) General Performance, hopefully this will be fixed by upcoming changes.
> 2D) Don't count consecutive sequential reads to the same object as a
> trigger for promotion. I currently have problems where reading
> sequentially through a large RBD causes it to be completely promoted
> because the read IO size is smaller than the underlying object size.
>
> 3. Kernel RBD Client
> Either implement striping or see if it's possible to configure readahead
> + max_sectors_kb size to be larger than the object size. I started a
> thread about this a few days ago if you are interested in more details.
>
> 4. Disk based OSD with SSD Journal performance
> As I touched on earlier, I would expect a disk based OSD with SSD journal
> to have similar performance to a pure SSD OSD when dealing with sequential
> small IOs. Currently the levelDB sync and potentially other things slow
> this down.
>
> 5. iSCSI
> I know Mike Christie is doing a lot of good work in getting LIO to work
> with Ceph, but currently it feels like a bit of an amateur affair getting
> it going.
>
> 6. Slow xattr problem
> I've had a weird problem a couple of times where RBDs with data that
> hasn't been written to for a while seem to start performing reads very
> slowly. With the help of Somnath in a thread here we managed to track it
> down to an xattr taking very long to be retrieved, but no idea why.
> Overwriting the RBD with fresh data seemed to stop it happening. Hopefully
> Newstore might stop this happening in the future.
>
>>
>> IE, should we be focusing on IOPS? Latency? Finding a way to avoid
>> journal overhead for large writes? Are there specific use cases where we
>> should specifically be focusing attention? general iscsi? S3?
>> databases directly on RBD? etc.
>> There's tons of different areas that we can
>> work on (general OSD threading improvements, different messenger
>> implementations, newstore, client side bottlenecks, etc) but all of those
>> things tackle different kinds of problems.
>>
>> Mark
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
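
To make Nick's point 2D above a bit more concrete, here is a toy model in Python. It is only a sketch and not how Ceph's cache tier actually tracks hits: the 4 MiB object size, the 64 KiB read size, the promote-after-2-reads threshold and the function names are all assumptions made up for illustration. It compares a plain per-object read counter with one that ignores a read starting exactly where the previous read on the same object ended.

```python
from collections import defaultdict

OBJECT_SIZE = 4 * 1024 * 1024   # assumed 4 MiB object size
READ_SIZE = 64 * 1024           # assumed 64 KiB read IO size
PROMOTE_AFTER = 2               # hypothetical threshold: promote on 2nd counted read


def promoted_objects(reads, skip_sequential):
    """Return the set of object indexes that reach the promotion threshold.

    reads: iterable of (offset, length) byte ranges on the image.
    skip_sequential: if True, a read that starts exactly where the previous
    read on the same object ended is not counted as a new hit.
    """
    hits = defaultdict(int)
    last_end = {}       # object index -> end offset of the last read on it
    promoted = set()
    for offset, length in reads:
        obj = offset // OBJECT_SIZE
        contiguous = last_end.get(obj) == offset
        last_end[obj] = offset + length
        if skip_sequential and contiguous:
            continue    # part of the same sequential pass, not a re-read
        hits[obj] += 1
        if hits[obj] >= PROMOTE_AFTER:
            promoted.add(obj)
    return promoted


def sequential_scan(num_objects):
    """Read an image front to back in READ_SIZE chunks."""
    total = num_objects * OBJECT_SIZE
    return [(off, READ_SIZE) for off in range(0, total, READ_SIZE)]


scan = sequential_scan(num_objects=8)
naive = promoted_objects(scan, skip_sequential=False)
filtered = promoted_objects(scan, skip_sequential=True)
print("plain counter promotes:", len(naive), "objects")
print("sequential-aware counter promotes:", len(filtered), "objects")
```

With these assumed sizes every object sees 64 small reads in a single pass, so the plain counter promotes all 8 objects while the sequential-aware one promotes none. An object that is genuinely read twice from the same offset still gets promoted in both variants.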