<snip>
> >>
> >> Here's kind of how I see the field right now:
> >>
> >> 1) Cache at the client level. Likely fastest but obvious issues like above.
> >> RAID1 might be an option at increased cost. Lack of barriers in some
> >> implementations scary.
> >
> > Agreed.
> >
> >> 2) Cache below the OSD. Not much recent data on this. Not likely as fast
> >> as client-side cache, but likely cheaper (fewer OSD nodes than client
> >> nodes?). Lack of barriers in some implementations scary.
> >
> > This also has the benefit of caching the leveldb on the OSD, so you get a
> > big performance gain there too for small sequential writes. I looked at
> > using Flashcache for this too but decided it was adding too much
> > complexity and risk.
> >
> > I thought I read somewhere that RocksDB allows you to move its WAL to
> > SSD; is there anything in the pipeline for something like moving the
> > filestore to use RocksDB?
>
> I believe you can already do this, though I haven't tested it. You can
> certainly move the monitors to rocksdb (tested) and newstore uses rocksdb
> as well.

Interesting, I might have a look into this.

> >> 3) Ceph Cache Tiering. Network overhead and write amplification on
> >> promotion make this primarily useful when workloads fit mostly into the
> >> cache tier. Overall safe design, but care must be taken to not
> >> over-promote.
> >>
> >> 4) Separate SSD pool. Manual and not particularly flexible, but perhaps
> >> best for applications that need consistently high performance.
> >
> > I think it depends on the definition of performance. Currently even very
> > fast CPUs and SSDs in their own pool will still struggle to get less than
> > 1ms of write latency. If your performance requirements are for large
> > queue depths then you will probably be alright. If you require something
> > that mirrors the performance of a traditional write-back cache, then even
> > pure SSD pools can start to struggle.
>
> Agreed. This is definitely the crux of the problem. The example below is a
> great start! It would be fantastic if we could get more feedback from the
> list on the relative importance of low-latency operations vs high IOPS
> through concurrency. We have general suspicions but not a ton of actual
> data regarding what folks are seeing in practice and under what scenarios.

If you have any specific questions that you think I might be able to answer,
please let me know. The only other main app I can really think of where this
sort of write latency is critical is SQL, particularly the transaction logs.

> > To give a real-world example of what I see when doing various tests, here
> > is a rough guide to IOPS when removing a snapshot on an ESX server:
> >
> > Traditional Array 10K disks = 300-600 IOPS
> > Ceph 7.2K + SSD Journal = 100-200 IOPS (LevelDB syncing on the OSD seems
> > to be the main limitation)
> > Ceph Pure SSD Pool = 500 IOPS (Intel S3700 SSDs)
>
> I'd be curious to see how much jemalloc or tcmalloc 2.4 + 128MB TC help
> here. Sandisk and Intel have both done some very useful investigations, and
> I've got some additional tests replicating some of their findings coming
> shortly.

Ok, that will be interesting to see. I will see if I can change it in my
environment and whether it makes any improvement. I think I came to the
conclusion that Ceph takes a certain amount of time to do a write, and by the
time you add in a replica copy I was struggling to get much below 2ms per IO
with my 2.1GHz CPUs. 2ms = ~500 IOPS.
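As a quick back-of-the-envelope illustration of the QD=1 arithmetic above (a
Python sketch using the ~2 ms per replicated write quoted in the message, not
a new measurement):

    # At queue depth 1 a new write is not issued until the previous one has
    # been ACKed by the primary OSD and its replica, so per-write latency is
    # a hard ceiling on IOPS no matter how fast the SSDs are.
    per_write_latency_ms = 2.0           # ~2 ms per replicated write (quoted above)
    qd1_iops = 1000.0 / per_write_latency_ms
    print(f"~{qd1_iops:.0f} IOPS ceiling at QD=1")   # prints ~500 IOPS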
> > Ceph Cache Tiering = 10-500 IOPS (As we know, misses can be very painful)
>
> Indeed. There's some work going on in this area too. Hopefully we'll know
> how some of our ideas pan out later this week. Assuming excessive promotions
> aren't a problem, I suspect the jemalloc/tcmalloc improvements will
> generally make cache tiering more interesting (though buffer cache will
> still be the primary source of really hot cached reads).
>
> > Ceph + RBD Caching with Flashcache = 200-1000 IOPS (Readahead can give
> > high bursts if snapshot blocks are sequential)
>
> Good to know!
>
> > And when copying VMs to the datastore (ESXi does this in sequential 64k
> > IOs.....yes, silly I know):
> >
> > Traditional Array 10K disks = ~100MB/s (Limited by the 1GbE interface; on
> > other arrays I guess this scales)
> > Ceph 7.2K + SSD Journal = ~20MB/s (Again, the LevelDB sync seems to be
> > the limit here for sequential writes)
>
> This is pretty bad. Is RBD cache enabled?

Tell me about it, moving a 2TB VM is a painful experience. Yes, the librbd
cache is on, but iSCSI effectively turns all writes into sync writes, which
bypasses the cache, so you are dependent on the time it takes for each OSD to
ACK the write. In this case, waiting each time for 64kb IOs to complete due to
the LevelDB sync, you end up with transfer speeds somewhere in the region of
15-20MB/s. You can do the same thing with something like IOmeter (64k,
sequential write, directio, QD=1). NFS is even worse, as every ESX write also
requires an FS journal sync on the filesystem being used for NFS, so you have
to wait for two ACKs from Ceph, normally meaning <10MB/s. Here is a thread
from earlier in the year when I stumbled on the reason behind this (last
post): http://comments.gmane.org/gmane.comp.file-systems.ceph.user/18393

> > Ceph Pure SSD Pool = ~50MB/s (Ceph CPU bottleneck is occurring)
>
> Again, seems pretty rough compared to what I'd expect to see!

Same as above, but in my findings the CPU replaces the LevelDB sync as the
bottleneck.

> > Ceph Cache Tiering = ~50MB/s when writing to a new block, <10MB/s when
> > promote+overwrite
> > Ceph + RBD Caching with Flashcache = As fast as the SSD will go

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
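The 64k, sequential-write, direct-I/O, QD=1 test mentioned in the message
above can be roughly approximated without IOmeter. The sketch below is
illustrative only: the target path is a placeholder, the flags are
Linux-specific, and fio or IOmeter remain the proper tools, but it reproduces
the one-write-at-a-time, wait-for-completion pattern that iSCSI imposes.

    #!/usr/bin/env python3
    # Approximate the ESXi-style pattern described above: 64 KiB sequential
    # writes, direct + synchronous I/O, queue depth 1 (each write must
    # complete before the next is issued).
    import mmap, os, time

    PATH = "/mnt/testlun/qd1-test.bin"   # placeholder: a file on the RBD-backed datastore
    IO_SIZE = 64 * 1024                  # 64 KiB per write
    COUNT = 4096                         # 256 MiB total, keeps the run short

    # O_DIRECT needs an aligned buffer; an anonymous mmap is page-aligned.
    buf = mmap.mmap(-1, IO_SIZE)
    buf.write(b"\xab" * IO_SIZE)

    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_DIRECT | os.O_DSYNC, 0o644)
    start = time.monotonic()
    for _ in range(COUNT):
        os.write(fd, buf)                # QD=1: blocks until this write is durable
    os.close(fd)
    elapsed = time.monotonic() - start

    total = COUNT * IO_SIZE
    print(f"{total / elapsed / 1e6:.1f} MB/s, {COUNT / elapsed:.0f} IOPS at QD=1")

Something like an fio job with direct=1, sync=1, bs=64k, rw=write, iodepth=1
should exercise the same pattern with far better instrumentation.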