OK, you asked ;-)

This is all via RBD. I am running a single filesystem on top of 8 RBD devices in an effort to get data striping across more OSDs; I had been using that setup before adding the cache tier. The base RBD pool is 3 nodes with 11 x 6 Tbyte SATA drives each, set up with replication size 3. No SSDs are involved in those OSDs, since ceph-disk does not let you break a bluestore configuration across more than one device at the moment.

The 600 Mbytes/sec is an approximate sustained number for the data rate I can get going into this pool via RBD; that turns into 3 times as much raw data rate, so across 33 drives it works out to mid-50s Mbytes/sec per drive. I have pushed it harder than that from time to time, but the OSD really wants to use fdatasync a lot and that tends to eat up a lot of the potential of a device; these disks will do 160 Mbytes/sec if you stream data to them. I just checked with rados bench against this set of 33 OSDs with a 3-replica pool, and 600 Mbytes/sec is what it will do from the same client host.

All the networking is 40 Gbit Ethernet, single port per host. Generally I can push 2.2 Gbytes/sec in one direction between two hosts over a single TCP link; the max I have seen is about 2.7 Gbytes/sec coming into a node. Short of going to RDMA, that appears to be about the limit for these processors.

There are a grand total of two 400 GB P3700s, in two other nodes, running a pool with a replication factor of 1. Once I add in replication, performance goes downhill. If I had more hardware I would be running more of these and using replication, but I am out of network cards right now. So: 5 nodes running OSDs, and a 6th node running the RBD client using the kernel implementation.

Here is the complete set of commands for creating the cache tier. I pulled this from my shell history, and the line in the middle was actually a failed command, so sorry for the red herring:

  982  ceph osd pool create nvme 512 512 replicated_nvme
  983  ceph osd pool set nvme size 1
  984  ceph osd tier add rbd nvme
  985  ceph osd tier cache-mode nvme writeback
  986  ceph osd tier set-overlay rbd nvme
  987  ceph osd pool set nvme hit_set_type bloom
  988  ceph osd pool set target_max_bytes 500000000000   <<-- typo here, so never mind
  989  ceph osd pool set nvme target_max_bytes 500000000000
  990  ceph osd pool set nvme target_max_objects 500000
  991  ceph osd pool set nvme cache_target_dirty_ratio 0.5
  992  ceph osd pool set nvme cache_target_full_ratio 0.8

I wish the cache tier would raise a health warning if it does not have a max size set; as it is, it lets you do that, flushes nothing, and fills the OSDs.

As for what the actual test is: 4K uncompressed DPX video frames, so 50 Mbyte files written at a rate of at least 24 per second on a good day, ideally more. This needs to sustain around 1.3 Gbytes/sec in either direction from a single application, and it needs to do it consistently; there is a certain amount of buffering to deal with fluctuations in performance. I am pushing 4096 of these files sequentially with a queue depth of 32, so there is rather a lot of data in flight at any one time. I know I do not have enough hardware to achieve this rate on writes. They are being written with direct I/O into a pool of 8 RBD LUNs. The 8-LUN setup will not really help here given the small number of OSDs in the cache pool; it does help when the RBD LUNs go directly to a large pool of disk-based OSDs, as it gets all the OSDs moving in parallel.
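For anyone who wants to reproduce something like the access pattern with fio rather than the real application, it would look roughly like the job below. This is only a sketch under assumptions: the actual writer is the frame playout application, and the mount point and 4 Mbyte block size here are stand-ins rather than what I actually ran.

  # rough approximation only: sequential writes, direct I/O, deep queue
  fio --name=dpx-writes --directory=/mnt/rbd_stripe \
      --ioengine=libaio --direct=1 --rw=write \
      --bs=4m --iodepth=32 --size=200g

The real test writes 4096 separate 50 Mbyte files through the filesystem on top of the RBD devices, so treat the numbers above as illustrative.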
My basic point here is that there is a lot more potential bandwidth to be had in the backing pool, but I cannot get the cache tier to use more than a small fraction of it when flushing content. The front end of the cache can sustain around 900 Mbytes/sec over RBD, so I am somewhat out of balance here:

  cache input rate:        900 Mbytes/sec
  backing pool input rate: 600 Mbytes/sec

but not by a significant amount. The question really is: is there anything I can do to get cache flushing to take advantage of more of that bandwidth? If I do this without the cache tier, the latency of the disk-based OSDs is too variable and you cannot sustain a consistent data rate. The NVMe devices are better about consistent device latency, but the cache tier implementation seems to have a problem driving the backing pool at anything close to its capabilities; it really only needs to move 40 or 50 objects in parallel to achieve that.

I am not attempting to provision a cache tier large enough for the whole workload, but rather a debounce zone to keep jitter from making it back to the application. I am trying to categorize what can and cannot be achieved with Ceph for this type of workload, not build a complete production setup. My test represents 170 seconds of content and generates 209 Gbytes of data, so this is a small scale test ;-) Fortunately this stuff is not always used in realtime.

All of those extra config options look to be about how fast promotion into the cache can go, not how fast you can get things out of it :-( I have been using readforward and that is working OK; there is sufficient read bandwidth that it does not matter whether data is coming from the cache pool or the disk-based backing pool.

Steve

> On Apr 19, 2016, at 7:47 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> 
> 
> Hello,
> 
> On Tue, 19 Apr 2016 20:21:39 +0000 Stephen Lord wrote:
> 
>> 
>> 
>> I Have a setup using some Intel P3700 devices as a cache tier, and 33
>> sata drives hosting the pool behind them.
> 
> A bit more details about the setup would be nice, as in how many nodes,
> interconnect, replication size of the cache tier and the backing HDD
> pool, etc.
> And "some" isn't a number, how many P3700s (which size?) in how many nodes?
> One assumes there are no further SSDs involved with those SATA HDDs?
> 
>> I setup the cache tier with
>> writeback, gave it a size and max object count etc:
>> 
>> ceph osd pool set target_max_bytes 500000000000
>                ^^^
> This should have given you an error, it needs the pool name, as in your
> next line.
> 
>> ceph osd pool set nvme target_max_bytes 500000000000
>> ceph osd pool set nvme target_max_objects 500000
>> ceph osd pool set nvme cache_target_dirty_ratio 0.5
>> ceph osd pool set nvme cache_target_full_ratio 0.8
>> 
>> This is all running Jewel using bluestore OSDs (I know experimental).
> Make sure to report all pyrotechnics, trap doors and sharp edges. ^_-
> 
>> The cache tier will write at about 900 Mbytes/sec and read at 2.2
>> Gbytes/sec, the sata pool can take writes at about 600 Mbytes/sec in
>> aggregate.
>  ^^^^^^^^^
> Key word there.
> 
> That's just 18MB/s per HDD (60MB/s with a replication of 3), a pretty
> disappointing result for the supposedly twice as fast BlueStore.
> Again, replication size and topology might explain that up to a point, but
> we don't know them (yet).
> 
> Also exact methodology of your tests please, i.e. the fio command line, how
> was the RBD device (if you tested with one) mounted and where, etc...
> 
>> However, it looks like the mechanism for cleaning the cache
>> down to the disk layer is being massively rate limited and I see about
>> 47 Mbytes/sec of read activity from each SSD while this is going on.
>> 
> This number is meaningless w/o knowing home many NVMe's you have.
> That being said, there are 2 levels of flushing past Hammer, but if you
> push the cache tier to the 2nd limit (cache_target_dirty_high_ratio) you
> will get full speed.
> 
>> This means that while I could be pushing data into the cache at high
>> speed, It cannot evict old content very fast at all, and it is very easy
>> to hit the high water mark and the application I/O drops dramatically as
>> it becomes throttled by how fast the cache can flush.
>> 
>> I suspect it is operating on a placement group at a time so ends up
>> targeting a very limited number of objects and hence disks at any one
>> time. I can see individual disk drives going busy for very short
>> periods, but most of them are idle at any one point in time. The only
>> way to drive the disk based OSDs fast is to hit a lot of them at once
>> which would mean issuing many cache flush operations in parallel.
>> 
> Yes, it is all PG based, so your observations match the expectations and
> what everybody else is seeing.
> See also the thread "Cache tier operation clarifications" by me, version 2
> is in the works.
> There are also some new knobs in Jewel that may be helpful, see:
> http://www.spinics.net/lists/ceph-users/msg25679.html
> 
> If you have a use case with a clearly defined idle/low use time and a
> small enough growth in dirty objects, consider what I'm doing, dropping the
> cache_target_dirty_ratio a few percent (in my case 2-3% is enough for a
> whole day) via cron job, wait a bit and then up again to it's normal value.
> 
> That way flushes won't normally happen at all during your peak usage
> times, though in my case that's purely cosmetic, flushes are not
> problematic at any time in that cluster currently.
> 
>> Are there any controls which can influence this behavior?
>> 
> See above (cache_target_dirty_high_ratio).
> 
> Aside from that you might want to reflect on what your use case, workload
> is going to be and how your testing reflects on it.
> 
> As in, are you really going to write MASSIVE amounts of data at very high
> speeds or is it (like in 90% of common cases) the amount of small
> write IOPS that is really going to be the limiting factor.
> Which is something that cache tiers can deal with very well (or
> sufficiently large and well designed "plain" clusters).
> 
> Another thing to think about is using the "readforward" cache mode,
> leaving your cache tier free to just handle writes and thus giving it more
> space to work with.
> 
> Christian
> -- 
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/