I would advise you to take a look at the osd_agent_max_ops (and osd_agent_max_low_ops) settings; these should in theory dictate how many parallel threads will be used for flushing.
Do a conf dump from the admin socket to see what you are currently running with and then bump them up to see if it helps.

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Josef Johansson
> Sent: 20 April 2016 06:57
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: ceph cache tier clean rate too low
>
> Hi,
> response in line
> On 20 Apr 2016 7:45 a.m., "Christian Balzer" <chibi@xxxxxxx> wrote:
> >
> > Hello,
> >
> > On Wed, 20 Apr 2016 03:42:00 +0000 Stephen Lord wrote:
> >
> > >
> > > OK, you asked ;-)
> > >
> > I certainly did. ^o^
> >
> > > This is all via RBD, I am running a single filesystem on top of 8 RBD
> > > devices in an effort to get data striping across more OSDs, I had been
> > > using that setup before adding the cache tier.
> > >
> > Nods.
> > Depending on your use case (sequential writes) actual RADOS striping might
> > be more advantageous than this (with 4MB writes still going to the same
> > PG/OSD all the time).
> >
> > > 3 nodes with 11 6 Tbyte SATA drives each for a base RBD pool, this is
> > > setup with replication size 3. No SSDs involved in those OSDs, since
> > > ceph-disk does not let you break a bluestore configuration into more
> > > than one device at the moment.
> > >
> > That's a pity, but supposedly just a limitation of ceph-disk.
> > I'd venture you can work around that with symlinks to a raw SSD
> > partition, same as with current filestore journals.
> >
> > As Sage recently wrote:
> > ---
> > BlueStore can use as many as three devices: one for the WAL (journal,
> > though it can be much smaller than FileStores, e.g., 128MB), one for
> > metadata (e.g., an SSD partition), and one for data.
> > ---
> I believe he also mentioned the use of bcache and friends for the osd,
> maybe a way forward in this case?
> Regards
> Josef
> >
> > > The 600 Mbytes/sec is an approx sustained number for the data rate I can
> > > get going into this pool via RBD, that turns into 3 times that for raw
> > > data rate, so at 33 drives that is mid 50s Mbytes/sec per drive. I have
> > > pushed it harder than that from time to time, but the OSD really wants
> > > to use fdatasync a lot and that tends to suck up a lot of the potential
> > > of a device, these disks will do 160 Mbytes/sec if you stream data to
> > > them.
> > >
> > > I just checked with rados bench to this set of 33 OSDs with a 3 replica
> > > pool, and 600 Mbytes/sec is what it will do from the same client host.
> > >
> > This matches a cluster of mine with 32 OSDs (filestore of course) and SSD
> > journals on 4 nodes with a replica of 3.
> >
> > So BlueStore is indeed faster than filestore.
> >
> > > All the networking is 40 GB ethernet, single port per host, generally I
> > > can push 2.2 Gbytes/sec in one direction between two hosts over a single
> > > tcp link, the max I have seen is about 2.7 Gbytes/sec coming into a
> > > node. Short of going to RDMA that appears to be about the limit for
> > > these processors.
> > >
> > Yeah, didn't expect your network to be involved here bottleneck wise, but
> > a good data point to have nevertheless.
> >
> > > There are a grand total of 2 400 GB P3700s which are running a pool with
> > > a replication factor of 1, these are in 2 other nodes. Once I add in
> > > replication perf goes downhill.
> > > If I had more hardware I would be
> > > running more of these and using replication, but I am out of network
> > > cards right now.
> > >
> > Alright, so at 900MB/s you're pretty close to what one would expect from 2
> > of these: 1080MB/s*2/2(journal).
> >
> > How much downhill is that?
> >
> > I have a production cache tier with 2 nodes (replica 2 of course) and 4
> > 800GB DC S3610s each, IPoIB QDR (40Gbs) interconnect and the performance
> > is pretty much what I would expect.
> >
> > > So 5 nodes running OSDs, and a 6th node running the RBD client using the
> > > kernel implementation.
> > >
> > I assume there's a reason for using the kernel RBD client (which kernel?),
> > given that it tends to be behind the curve in terms of features and speed?
> >
> > > Complete set of commands for creating the cache tier, I pulled this from
> > > history, so the line in the middle was a failed command actually so
> > > sorry for the red herring.
> > >
> > >  982 ceph osd pool create nvme 512 512 replicated_nvme
> > >  983 ceph osd pool set nvme size 1
> > >  984 ceph osd tier add rbd nvme
> > >  985 ceph osd tier cache-mode nvme writeback
> > >  986 ceph osd tier set-overlay rbd nvme
> > >  987 ceph osd pool set nvme hit_set_type bloom
> > >  988 ceph osd pool set target_max_bytes 500000000000 <<-- typo here, so never mind
> > >  989 ceph osd pool set nvme target_max_bytes 500000000000
> > >  990 ceph osd pool set nvme target_max_objects 500000
> > >  991 ceph osd pool set nvme cache_target_dirty_ratio 0.5
> > >  992 ceph osd pool set nvme cache_target_full_ratio 0.8
> > >
> > > I wish the cache tier would cause a health warning if it does not have
> > > a max size set, it lets you do that, flushes nothing and fills the OSDs.
> > >
> > Oh yes, people have been bitten by this over and over again.
> > At least it's documented now.
> >
> > > As for what the actual test is, this is 4K uncompressed DPX video frames,
> > > so 50 Mbyte files written at least 24 a second on a good day, ideally
> > > more. This needs to sustain around 1.3 Gbytes/sec in either direction
> > > from a single application and needs to do it consistently. There is a
> > > certain amount of buffering to deal with fluctuations in perf. I am
> > > pushing 4096 of these files sequentially with a queue depth of 32 so
> > > there is rather a lot of data in flight at any one time. I know I do not
> > > have enough hardware to achieve this rate on writes.
> > >
> > So this is your test AND actual intended use case I presume, right?
> >
> > > They are being written with direct I/O into a pool of 8 RBD LUNs. The 8
> > > LUN setup will not really help here with the small number of OSDs in the
> > > cache pool, it does help when the RBD LUNs are going directly to a large
> > > pool of disk based OSDs as it gets all the OSDs moving in parallel.
> > >
> > > My basic point here is that there is a lot more potential bandwidth to
> > > be had in the backing pool, but I cannot get the cache tier to use more
> > > than a small fraction of the available bandwidth when flushing content.
> > > Since the front end of the cache can sustain around 900 Mbytes/sec over
> > > RBD, I am somewhat out of balance here:
> > >
> > > cache input rate 900 Mbytes/sec
> > > backing pool input rate 600 Mbytes/sec
> > >
> > > But not by a significant amount.
> > >
> > > The question is really: is there anything I can do to get cache
> > > flushing to take advantage of more of the bandwidth?
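To make the suggestion at the top of this reply about the tiering agent settings concrete, here is a rough sketch; the OSD id, socket path and values below are only examples, so substitute the OSDs backing the nvme pool and experiment with the numbers:

    # on a cache tier OSD host, check what the tiering agent is currently allowed to do
    ceph daemon osd.33 config show | grep osd_agent
    # (or: ceph --admin-daemon /var/run/ceph/ceph-osd.33.asok config show | grep osd_agent)

    # raise the number of concurrent flush/evict operations at runtime
    ceph tell osd.* injectargs '--osd_agent_max_ops 8 --osd_agent_max_low_ops 4'

injectargs only affects the running daemons, so if it helps, put the same settings under [osd] in ceph.conf to make them stick across restarts.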
> > > If I do this
> > > without the cache tier then the latency of the disk based OSDs is too
> > > variable and you cannot sustain a consistent data rate.
> > >
> > This should hopefully be reduced by having WAL and metadata on SSDs with
> > bluestore.
> > But HDD based storage will be more jittery, that's a given.
> >
> > > The NVMe devices
> > > are better about consistent device latency, but the cache tier
> > > implementation seems to have a problem driving the backing pool at
> > > anything close to its capabilities. It really only needs to move 40 or
> > > 50 objects in parallel to achieve that.
> > >
> > And this is clearly where the Ceph (cache-tier) code could probably use
> > some attention.
> > Mind you, most people have the exact OPPOSITE requirement of yours, they
> > want a steady, slow, low-impact stream of flushes to the HDDs, not an
> > avalanche. ^o^
> >
> > We know that this is all based on per PG ratios, so if one PG goes over
> > the dirty ratio the tier-agent will start flushing objects from it.
> > The singular spikes you're seeing.
> >
> > My bet is that this process is sequential and not parallel (or at least
> > not massively so), meaning that until one PG has finished flushing,
> > the next dirty'ish one won't start flushing.
> > Would be nice to have this confirmed by somebody familiar with the code,
> > though.
> >
> > Of course having fast OSDs in the backing pool will alleviate this
> > somewhat.
> >
> > What's your cache_target_dirty_high_ratio set to and have you tried
> > setting it below/identical to cache_target_dirty_ratio?
> >
> > Christian
> >
> > > I am not attempting to provision a cache tier large enough for the whole
> > > workload, but as more of a debounce zone to avoid jitter making it back
> > > to the application. I am trying to categorize what can and cannot be
> > > achieved with ceph here for this type of workload, not build a complete
> > > production setup. My test represents 170 seconds of content and
> > > generates 209 Gbytes of data, so this is a small scale test ;-)
> > > fortunately this stuff is not always used realtime.
> > >
> > > All of those extra config options look to be around how fast promotion
> > > into the cache can go, not how fast you can get things out of it :-(
> > >
> > > I have been using readforward and that is working OK, there is
> > > sufficient read bandwidth that it does not matter if data is coming from
> > > the cache pool or the disk backing pool.
> > >
> > > Steve
> > >
> > > > On Apr 19, 2016, at 7:47 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> > > >
> > > > Hello,
> > > >
> > > > On Tue, 19 Apr 2016 20:21:39 +0000 Stephen Lord wrote:
> > > >
> > > >>
> > > >> I have a setup using some Intel P3700 devices as a cache tier, and 33
> > > >> sata drives hosting the pool behind them.
> > > >
> > > > A bit more details about the setup would be nice, as in how many nodes,
> > > > interconnect, replication size of the cache tier and the backing HDD
> > > > pool, etc.
> > > > And "some" isn't a number, how many P3700s (which size?) in how many
> > > > nodes? One assumes there are no further SSDs involved with those SATA
> > > > HDDs?
> > > >
> > > >> I setup the cache tier with
> > > >> writeback, gave it a size and max object count etc:
> > > >>
> > > >> ceph osd pool set target_max_bytes 500000000000
> > > >    ^^^
> > > > This should have given you an error, it needs the pool name, as in your
> > > > next line.
> > > >
> > > >> ceph osd pool set nvme target_max_bytes 500000000000
> > > >> ceph osd pool set nvme target_max_objects 500000
> > > >> ceph osd pool set nvme cache_target_dirty_ratio 0.5
> > > >> ceph osd pool set nvme cache_target_full_ratio 0.8
> > > >>
> > > >> This is all running Jewel using bluestore OSDs (I know experimental).
> > > > Make sure to report all pyrotechnics, trap doors and sharp edges. ^_-
> > > >
> > > >> The cache tier will write at about 900 Mbytes/sec and read at 2.2
> > > >> Gbytes/sec, the sata pool can take writes at about 600 Mbytes/sec in
> > > >> aggregate.
> > > >    ^^^^^^^^^
> > > > Key word there.
> > > >
> > > > That's just 18MB/s per HDD (60MB/s with a replication of 3), a pretty
> > > > disappointing result for the supposedly twice as fast BlueStore.
> > > > Again, replication size and topology might explain that up to a point,
> > > > but we don't know them (yet).
> > > >
> > > > Also exact methodology of your tests please, i.e. the fio command
> > > > line, how was the RBD device (if you tested with one) mounted and
> > > > where, etc...
> > > >
> > > >> However, it looks like the mechanism for cleaning the cache
> > > >> down to the disk layer is being massively rate limited and I see about
> > > >> 47 Mbytes/sec of read activity from each SSD while this is going on.
> > > >>
> > > > This number is meaningless w/o knowing how many NVMe's you have.
> > > > That being said, there are 2 levels of flushing past Hammer, but if you
> > > > push the cache tier to the 2nd limit (cache_target_dirty_high_ratio)
> > > > you will get full speed.
> > > >
> > > >> This means that while I could be pushing data into the cache at high
> > > >> speed, it cannot evict old content very fast at all, and it is very
> > > >> easy to hit the high water mark and the application I/O drops
> > > >> dramatically as it becomes throttled by how fast the cache can flush.
> > > >>
> > > >> I suspect it is operating on a placement group at a time so ends up
> > > >> targeting a very limited number of objects and hence disks at any one
> > > >> time. I can see individual disk drives going busy for very short
> > > >> periods, but most of them are idle at any one point in time. The only
> > > >> way to drive the disk based OSDs fast is to hit a lot of them at once
> > > >> which would mean issuing many cache flush operations in parallel.
> > > >>
> > > > Yes, it is all PG based, so your observations match the expectations
> > > > and what everybody else is seeing.
> > > > See also the thread "Cache tier operation clarifications" by me,
> > > > version 2 is in the works.
> > > > There are also some new knobs in Jewel that may be helpful, see:
> > > > http://www.spinics.net/lists/ceph-users/msg25679.html
> > > >
> > > > If you have a use case with a clearly defined idle/low use time and a
> > > > small enough growth in dirty objects, consider what I'm doing,
> > > > dropping the cache_target_dirty_ratio a few percent (in my case 2-3%
> > > > is enough for a whole day) via cron job, wait a bit and then up again
> > > > to its normal value.
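For anyone who wants to try that approach, a minimal sketch of such a cron job; it assumes the cache pool is called nvme with a normal dirty ratio of 0.5 (as set earlier in this thread), and the file name and window times are only examples:

    # /etc/cron.d/ceph-cache-flush -- pre-flush the cache tier ahead of the busy window
    # m  h  dom mon dow user  command
    30   1  *   *   *   root  ceph osd pool set nvme cache_target_dirty_ratio 0.47
    30   5  *   *   *   root  ceph osd pool set nvme cache_target_dirty_ratio 0.5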
> > > >
> > > > That way flushes won't normally happen at all during your peak usage
> > > > times, though in my case that's purely cosmetic, flushes are not
> > > > problematic at any time in that cluster currently.
> > > >
> > > >> Are there any controls which can influence this behavior?
> > > >>
> > > > See above (cache_target_dirty_high_ratio).
> > > >
> > > > Aside from that you might want to reflect on what your use case,
> > > > workload is going to be and how your testing reflects on it.
> > > >
> > > > As in, are you really going to write MASSIVE amounts of data at very
> > > > high speeds or is it (like in 90% of common cases) the amount of small
> > > > write IOPS that is really going to be the limiting factor.
> > > > Which is something that cache tiers can deal with very well (or
> > > > sufficiently large and well designed "plain" clusters).
> > > >
> > > > Another thing to think about is using the "readforward" cache mode,
> > > > leaving your cache tier free to just handle writes and thus giving it
> > > > more space to work with.
> > > >
> > > > Christian
> > > > --
> > > > Christian Balzer        Network/Systems Engineer
> > > > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > > > http://www.gol.com/
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com