Re: ceph cache tier clean rate too low

I would advise you to take a look at osd_agent_max_ops (and osd_agent_max_low_ops); in theory these dictate how many parallel flush operations the tier agent will issue. Do a config dump from the admin socket to see what you are currently running with, then bump them up to see if it helps.
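
A minimal sketch of what I mean (the OSD id and the values are only
examples; check your own defaults first and adjust from there):

  # current values, via the admin socket on a cache tier OSD host
  ceph daemon osd.0 config show | grep osd_agent

  # bump them at runtime to see if flush parallelism improves
  ceph tell 'osd.*' injectargs '--osd_agent_max_ops 8 --osd_agent_max_low_ops 4'

If it does help, make the change permanent in the [osd] section of ceph.conf.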

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Josef Johansson
> Sent: 20 April 2016 06:57
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  ceph cache tier clean rate too low
> 
> Hi,
> response inline
> On 20 Apr 2016 7:45 a.m., "Christian Balzer" <chibi@xxxxxxx> wrote:
> >
> >
> > Hello,
> >
> > On Wed, 20 Apr 2016 03:42:00 +0000 Stephen Lord wrote:
> >
> > >
> > > OK, you asked ;-)
> > >
> >
> > I certainly did. ^o^
> >
> > > This is all via RBD, I am running a single filesystem on top of 8 RBD
> > > devices in an effort to get data striping across more OSDs, I had been
> > > using that setup before adding the cache tier.
> > >
> > Nods.
> > Depending on your use case (sequential writes) actual RADOS striping might
> > be more advantageous than this (with 4MB writes still going to the same
> > PG/OSD all the time).
> >
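> > A sketch of what fancy striping would look like on the RBD side (the
> > image name and numbers are only placeholders, and it needs
> > --image-format 2; note the kernel client may not support non-default
> > striping, so this would mean librbd):
> >
> >   rbd create rbd/stripe-test --size 102400 --image-format 2 \
> >       --stripe-unit 1048576 --stripe-count 8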
> >
> > > 3 nodes with eleven 6 TByte SATA drives each for a base RBD pool; this is
> > > setup with replication size 3. No SSDs involved in those OSDs, since
> > > ceph-disk does not let you break a bluestore configuration into more
> > > than one device at the moment.
> > >
> > That's a pity, but supposedly just a limitation of ceph-disk.
> > I'd venture you can work around that with symlinks to a raw SSD
> > partition, same as with current filestore journals.
> >
> > As Sage recently wrote:
> > ---
> > BlueStore can use as many as three devices: one for the WAL (journal,
> > though it can be much smaller than FileStores, e.g., 128MB), one for
> > metadata (e.g., an SSD partition), and one for data.
> > ---
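> >
> > An untested sketch of that workaround, assuming the Jewel-era layout
> > where the OSD data dir holds block/block.db/block.wal (device names
> > and the OSD id are placeholders):
> >
> >   # after ceph-disk prepare, before the OSD is activated/mkfs'd
> >   ln -sf /dev/nvme0n1p1 /var/lib/ceph/osd/ceph-12/block.db    # metadata
> >   ln -sf /dev/nvme0n1p2 /var/lib/ceph/osd/ceph-12/block.wal   # WAL
> >   chown -h ceph:ceph /var/lib/ceph/osd/ceph-12/block.*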
> I believe he also mentioned the use of bcache and friends for the OSDs,
> maybe a way forward in this case?
> Regards
> Josef
> >
> > > The 600 Mbytes/sec is an approximate sustained number for the data rate I
> > > can get going into this pool via RBD; that turns into 3 times as much raw
> > > data, so across 33 drives that is mid-50s Mbytes/sec per drive. I have
> > > pushed it harder than that from time to time, but the OSD really wants
> > > to use fdatasync a lot and that tends to eat up a lot of the potential
> > > of a device; these disks will do 160 Mbytes/sec if you stream data to
> > > them.
> > >
> > > I just checked with rados bench to this set of 33 OSDs with a 3 replica
> > > pool, and 600 Mbytes/sec is what it will do from the same client host.
> > >
> > This matches a cluster of mine with 32 OSDs (filestore of course) and SSD
> > journals on 4 nodes with a replica of 3.
> >
> > So BlueStore is indeed faster than filestore.
> >
> > > All the networking is 40 Gb Ethernet, single port per host; generally I
> > > can push 2.2 Gbytes/sec in one direction between two hosts over a single
> > > tcp link, the max I have seen is about 2.7 Gbytes/sec coming into a
> > > node. Short of going to RDMA that appears to be about the limit for
> > > these processors.
> > >
> > Yeah, didn't expect your network to be involved here bottleneck wise, but
> > a good data point to have nevertheless.
> >
> > > There are a grand total of two 400 GB P3700s, which are running a pool with
> > > a replication factor of 1, these are in 2 other nodes. Once I add in
> > > replication perf goes downhill. If I had more hardware I would be
> > > running more of these and using replication, but I am out of network
> > > cards right now.
> > >
> > Alright, so at 900MB/s you're pretty close to what one would expect from 2
> > of these: 1080MB/s*2/2(journal).
> >
> > How much downhill is that?
> >
> > I have a production cache tier with 2 nodes (replica 2 of course) and 4
> > 800GB DC S3610s each, IPoIB QDR (40Gbs) interconnect, and the performance
> > is pretty much what I would expect.
> >
> > > So 5 nodes running OSDs, and a 6th node running the RBD client using the
> > > kernel implementation.
> > >
> > I assume there's a reason for using the kernel RBD client (which kernel?),
> > given that it tends to be behind the curve in terms of features and speed?
> >
> > > Complete set of commands for creating the cache tier; I pulled this from
> > > history, so note the line in the middle was actually a failed command,
> > > sorry for the red herring.
> > >
> > >   982  ceph osd pool create nvme 512 512 replicated_nvme
> > >   983  ceph osd pool set nvme size 1
> > >   984  ceph osd tier add rbd nvme
> > >   985  ceph osd tier cache-mode  nvme writeback
> > >   986  ceph osd tier set-overlay rbd nvme
> > >   987  ceph osd pool set nvme  hit_set_type bloom
> > >   988  ceph osd pool set target_max_bytes 500000000000  <<-- typo here, so never mind
> > >   989  ceph osd pool set nvme target_max_bytes 500000000000
> > >   990  ceph osd pool set nvme target_max_objects 500000
> > >   991  ceph osd pool set nvme cache_target_dirty_ratio 0.5
> > >   992  ceph osd pool set nvme cache_target_full_ratio 0.8
> > >
> > > I wish the cache tier would raise a health warning if it does not have
> > > a max size set; as it is, it lets you do that, flushes nothing, and fills the OSDs.
> > >
> > Oh yes, people have been bitten by this over and over again.
> > At least it's documented now.
> >
> > > As for what the actual test is, this is 4K uncompressed DPX video frames,
> > > so 50 Mbyte files written at a rate of at least 24 per second on a good day,
> > > ideally more. This needs to sustain around 1.3 Gbytes/sec in either direction
> > > from a single application and needs to do it consistently. There is a
> > > certain amount of buffering to deal with fluctuations in perf. I am
> > > pushing 4096 of these files sequentially with a queue depth of 32 so
> > > there is rather a lot of data in flight at any one time. I know I do not
> > > have enough hardware to achieve this rate on writes.
> > >
> > So this is your test AND actual intended use case I presume, right?
> >
> > > They are being written with direct I/O into a pool of 8 RBD LUNs. The 8
> > > LUN setup will not really help here with the small number of OSDs in the
> > > cache pool, but it does help when the RBD LUNs are going directly to a large
> > > pool of disk based OSDs as it gets all the OSDs moving in parallel.
> > >
> > > My basic point here is that there is a lot more potential bandwidth to
> > > be had in the backing pool, but I cannot get the cache tier to use more
> > > than a small fraction of the available bandwidth when flushing content.
> > > Since the front end of the cache can sustain around 900 Mbytes/sec over
> > > RBD, I am somewhat out of balance here:
> > >
> > > cache input rate 900 Mbytes/sec
> > > backing pool input rate 600 Mbytes/sec
> > >
> > > But not by a significant amount.
> > >
> > > The question is really about is there anything I can do to get cache
> > > flushing to take advantage of more of the bandwidth. If I do this
> > > without the cache tier then the latency of the disk based OSDs is too
> > > variable and you cannot sustain a consistent data rate.
> >
> > This should hopefully be reduced by having WAL and metadata on SSDs with
> > bluestore.
> > But HDD based storage will be more jittery, that's a given.
> >
> > > The NVMe devices
> > > are better about consistent device latency, but the cache tier
> > > implementation seems to have a problem driving the backing pool at
> > > anything close to its capabilities. It really only needs to move 40 or
> > > 50 objects in parallel to achieve that.
> > >
> > And this is clearly where the Ceph (cache-tier) code could probably use
> > some attention.
> > Mind you, most people have the exact OPPOSITE requirement of yours: they
> > want a steady, slow, low-impact stream of flushes to the HDDs, not an
> > avalanche. ^o^
> >
> > We know that this is all based on per PG ratios, so if one PG goes over
> > the dirty ratio the tier-agent will start flushing objects from it.
> > Hence the singular spikes you're seeing.
> >
> > My bet is that this process is sequential and not parallel (or at least
> > not massively so), meaning that until one PG has finished flushing,
> > the next dirty'ish one won't start flushing.
> > Would be nice to have this confirmed by somebody familiar with the code,
> > though.
> >
> > Of course having fast OSDs in the backing pool will alleviate this
> > somewhat.
> >
> > What's your cache_target_dirty_high_ratio set to and have you tried
> > setting it below/identical to cache_target_dirty_ratio?
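> >
> > i.e. something along the lines of (0.6 should be the default for the
> > high ratio):
> >
> >   ceph osd pool get nvme cache_target_dirty_high_ratio
> >   ceph osd pool set nvme cache_target_dirty_high_ratio 0.5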
> >
> > Christian
> >
> > > I am not attempting to provision a cache tier large enough for the whole
> > > workload, but rather as a debounce zone to avoid jitter making it back
> > > to the application. I am trying to categorize what can and cannot be
> > > achieved with ceph here for this type of workload, not build a complete
> > > production setup. My test represents 170 seconds of content and
> > > generates 209 Gbytes of data, so this is a small scale test ;-)
> > > fortunately this stuff is not always used realtime.
> > >
> > > All of those extra config options look to be around how fast promotion
> > > into the cache can go, not how fast you can get things out of it :-(
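> > >
> > > Presumably these are the Jewel promotion throttles, something along the
> > > lines of (the numbers here are made up):
> > >
> > >   ceph tell 'osd.*' injectargs \
> > >       '--osd_tier_promote_max_objects_sec 200 --osd_tier_promote_max_bytes_sec 10485760'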
> > >
> > > I have been using readforward and that is working OK, there is
> > > sufficient read bandwidth that it does not matter if data is coming from
> > > the cache pool or the disk backing pool.
> > >
> > > Steve
> > >
> > >
> > > > On Apr 19, 2016, at 7:47 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> > > >
> > > >
> > > > Hello,
> > > >
> > > > On Tue, 19 Apr 2016 20:21:39 +0000 Stephen Lord wrote:
> > > >
> > > >>
> > > >>
> > > >> I have a setup using some Intel P3700 devices as a cache tier, and 33
> > > >> sata drives hosting the pool behind them.
> > > >
> > > > A bit more detail about the setup would be nice, as in how many nodes,
> > > > interconnect, replication size of the cache tier and the backing HDD
> > > > pool, etc.
> > > > And "some" isn't a number, how many P3700s (which size?) in how many
> > > > nodes? One assumes there are no further SSDs involved with those SATA
> > > > HDDs?
> > >
> > > >
> > > >> I setup the cache tier with
> > > >> writeback, gave it a size and max object count etc:
> > > >>
> > > >> ceph osd pool set target_max_bytes 500000000000
> > > >                    ^^^
> > > > This should have given you an error, it needs the pool name, as in your
> > > > next line.
> > > >
> > > >> ceph osd pool set nvme target_max_bytes 500000000000
> > > >> ceph osd pool set nvme target_max_objects 500000
> > > >> ceph osd pool set nvme cache_target_dirty_ratio 0.5
> > > >> ceph osd pool set nvme cache_target_full_ratio 0.8
> > > >>
> > > >> This is all running Jewel using bluestore OSDs (I know experimental).
> > > > Make sure to report all pyrotechnics, trap doors and sharp edges. ^_-
> > > >
> > > >> The cache tier will write at about 900 Mbytes/sec and read at 2.2
> > > >> Gbytes/sec, the sata pool can take writes at about 600 Mbytes/sec in
> > > >> aggregate.
> > > >  ^^^^^^^^^
> > > > Key word there.
> > > >
> > > > That's just 18MB/s per HDD (60MB/s with a replication of 3), a pretty
> > > > disappointing result for the supposedly twice as fast BlueStore.
> > > > Again, replication size and topology might explain that up to a point,
> > > > but we don't know them (yet).
> > > >
> > > > Also exact methodology of your tests please, i.e. the fio command
> > > > line, how was the RBD device (if you tested with one) mounted and
> > > > where, etc...
> > > >
> > > >> However, it looks like the mechanism for cleaning the cache
> > > >> down to the disk layer is being massively rate limited and I see about
> > > >> 47 Mbytes/sec of read activity from each SSD while this is going on.
> > > >>
> > > > This number is meaningless w/o knowing how many NVMe's you have.
> > > > That being said, there are 2 levels of flushing past Hammer, but if you
> > > > push the cache tier to the 2nd limit (cache_target_dirty_high_ratio)
> > > > you will get full speed.
> > > >
> > > >> This means that while I could be pushing data into the cache at high
> > > >> speed, It cannot evict old content very fast at all, and it is very
> > > >> easy to hit the high water mark and the application I/O drops
> > > >> dramatically as it becomes throttled by how fast the cache can flush.
> > > >>
> > > >> I suspect it is operating on a placement group at a time so ends up
> > > >> targeting a very limited number of objects and hence disks at any one
> > > >> time. I can see individual disk drives going busy for very short
> > > >> periods, but most of them are idle at any one point in time. The only
> > > >> way to drive the disk based OSDs fast is to hit a lot of them at once
> > > >> which would mean issuing many cache flush operations in parallel.
> > > >>
> > > > Yes, it is all PG based, so your observations match the expectations
> > > > and what everybody else is seeing.
> > > > See also the thread "Cache tier operation clarifications" by me,
> > > > version 2 is in the works.
> > > > There are also some new knobs in Jewel that may be helpful, see:
> > > > http://www.spinics.net/lists/ceph-users/msg25679.html
> > > >
> > > > If you have a use case with a clearly defined idle/low use time and a
> > > > small enough growth in dirty objects, consider what I'm doing,
> > > > dropping the cache_target_dirty_ratio a few percent (in my case 2-3%
> > > > is enough for a whole day) via cron job, wait a bit and then up again
> > > > to its normal value.
> > > >
> > > > That way flushes won't normally happen at all during your peak usage
> > > > times, though in my case that's purely cosmetic, flushes are not
> > > > problematic at any time in that cluster currently.
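> > > >
> > > > A rough cron.d-style sketch of that, using this thread's "nvme" pool
> > > > (the times and ratios are only examples):
> > > >
> > > >   # drop the dirty ratio a few percent at the start of the idle window...
> > > >   0 2 * * * root ceph osd pool set nvme cache_target_dirty_ratio 0.47
> > > >   # ...and restore it once the flush has had time to run
> > > >   0 5 * * * root ceph osd pool set nvme cache_target_dirty_ratio 0.50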
> > > >
> > > >> Are there any controls which can influence this behavior?
> > > >>
> > > > See above (cache_target_dirty_high_ratio).
> > > >
> > > > Aside from that you might want to reflect on what your use case,
> > > > workload is going to be and how your testing reflects on it.
> > > >
> > > > As in, are you really going to write MASSIVE amounts of data at very
> > > > high speeds or is it (like in 90% of common cases) the amount of small
> > > > write IOPS that is really going to be the limiting factor?
> > > > Which is something that cache tiers can deal with very well (or
> > > > sufficiently large and well designed "plain" clusters).
> > > >
> > > > Another thing to think about is using the "readforward" cache mode,
> > > > leaving your cache tier free to just handle writes and thus giving it
> > > > more space to work with.
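> > > >
> > > > Switching an existing tier over would be something like this (some
> > > > releases ask for --yes-i-really-mean-it since readforward is less
> > > > well tested):
> > > >
> > > >   ceph osd tier cache-mode nvme readforward --yes-i-really-mean-it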
> > > >
> > > > Christian
> > > > --
> > > > Christian Balzer        Network/Systems Engineer
> > > > chibi@xxxxxxx       Global OnLine Japan/Rakuten Communications
> > > > http://www.gol.com/
> > >
> > >
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



