Re: ceph cache tier clean rate too low

Hello,

On Wed, 20 Apr 2016 03:42:00 +0000 Stephen Lord wrote:

> 
> OK, you asked ;-)
>

I certainly did. ^o^
 
> This is all via RBD, I am running a single filesystem on top of 8 RBD
> devices in an effort to get data striping across more OSDs, I had been
> using that setup before adding the cache tier.
>
Nods.
Depending on your use case (sequential writes), actual RADOS striping might
be more advantageous than this, since with the default layout consecutive
4MB writes all land on the same object (and thus the same PG/OSD) before
moving on.

 
> 3 nodes with 11 6 Tbyte SATA drives each for a base RBD pool, this is
> setup with replication size 3. No SSDs involved in those OSDs, since
> ceph-disk does not let you break a bluestore configuration into more
> than one device at the moment.
> 
That's a pity, but supposedly just a limitation of ceph-disk.
I'd venture you can work around that with symlinks to a raw SSD
partition, same as with current filestore journals.

As Sage recently wrote:
---
BlueStore can use as many as three devices: one for the WAL (journal, 
though it can be much smaller than FileStores, e.g., 128MB), one for 
metadata (e.g., an SSD partition), and one for data.
---
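
A rough sketch of that symlink workaround (untested on my side; this assumes
BlueStore honours block.db/block.wal symlinks in the OSD data directory the
same way filestore honours the journal symlink, and the partition names are
of course placeholders):

  # before the OSD's first start / mkfs
  ln -s /dev/disk/by-partuuid/<ssd-db-partuuid>  /var/lib/ceph/osd/ceph-0/block.db
  ln -s /dev/disk/by-partuuid/<ssd-wal-partuuid> /var/lib/ceph/osd/ceph-0/block.wal

If that doesn't fly, there are also bluestore_block_db_path /
bluestore_block_wal_path style config options to poke at, but verify against
your actual Jewel build before trusting either approach.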

> The 600 Mbytes/sec is an approx sustained number for the data rate I can
> get going into this pool via RBD, that turns into 3 times that for raw
> data rate, so at 33 drives that is mid 50s Mbytes/sec per drive. I have
> pushed it harder than that from time to time, but the OSD really wants
> to use fdatasync a lot and that tends to suck up a lot of the potential
> of a device, these disks will do 160 Mbytes/sec if you stream data to
> them.
> 
> I just checked with rados bench to this set of 33 OSDs with a 3 replica
> pool, and 600 Mbytes/sec is what it will do from the same client host.
> 
This matches a cluster of mine with 32 OSDs (filestore of course) and SSD
journals on 4 nodes with a replica of 3.

So BlueStore does indeed look faster than filestore, given that you're
matching that result without any SSD journals.

> All the networking is 40 GB ethernet, single port per host, generally I
> can push 2.2 Gbytes/sec in one direction between two hosts over a single
> tcp link, the max I have seen is about 2.7 Gbytes/sec coming into a
> node. Short of going to RDMA that appears to be about the limit for
> these processors.
> 
Yeah, I didn't expect your network to be the bottleneck here, but it's a
good data point to have nevertheless.

> There are a grand total of 2 400 GB P3700s which are running a pool with
> a replication factor of 1, these are in 2 other nodes. Once I add in
> replication perf goes downhill. If I had more hardware I would be
> running more of these and using replication, but I am out of network
> cards right now.
> 
Alright, so at 900MB/s you're pretty close to what one would expect from 2
of these: roughly 1080MB/s sequential write each, times 2 drives, divided
by 2 for the journal double-write.

How much downhill is that?

I have a production cache tier with 2 nodes (replica 2 of course) and 4
800GB DC S3610s each, IPoIB QDR (40Gb/s) interconnect, and the performance
is pretty much what I would expect.

> So 5 nodes running OSDs, and a 6th node running the RBD client using the
> kernel implementation.
> 
I assume there's a reason for using the kernel RBD client (which kernel
version?), given that it tends to be behind the curve in terms of features
and speed?
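
If you want to take krbd out of the picture for comparison, fio can talk to
librbd directly; roughly something like this (pool/image names are
placeholders, tune bs/iodepth to match your workload):

  fio --name=librbd-seqwrite --ioengine=rbd --clientname=admin \
      --pool=rbd --rbdname=testimg \
      --rw=write --bs=4M --iodepth=32 --size=20G

That would also tell you whether the 900MB/s into the cache tier is a
client-side or a cluster-side limit.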

> Complete set of commands for creating the cache tier, I pulled this from
> history, so the line in the middle was a failed command actually so
> sorry for the red herring.
> 
>   982  ceph osd pool create nvme 512 512 replicated_nvme 
>   983  ceph osd pool set nvme size 1
>   984  ceph osd tier add rbd nvme
>   985  ceph osd tier cache-mode  nvme writeback
>   986  ceph osd tier set-overlay rbd nvme 
>   987  ceph osd pool set nvme  hit_set_type bloom 
>   988  ceph osd pool set target_max_bytes 500000000000 <<—— typo here, so never mind
>   989  ceph osd pool set nvme target_max_bytes 500000000000
>   990  ceph osd pool set nvme target_max_objects 500000
>   991  ceph osd pool set nvme cache_target_dirty_ratio 0.5
>   992  ceph osd pool set nvme cache_target_full_ratio 0.8
> 
> I wish the cache tier would raise a health warning if it does not have
> a max size set; as it is, it lets you do that, flushes nothing and fills
> the OSDs.
> 
Oh yes, people have been bitten by this over and over again.
At least it's documented now.
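
For anyone finding this thread later, it's worth double-checking that the
limits actually took, e.g. (pool name as per your setup, assuming your
version lets you get these keys):

  ceph osd pool get nvme target_max_bytes
  ceph osd pool get nvme target_max_objects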

> As for what the actual test is, this is 4K uncompressed DPX video frames,
> so 50 Mbyte files written at least 24 a second on a good day, ideally
> more. This needs to sustain around 1.3 Gbytes/sec in either direction
> from a single application and needs to do it consistently. There is a
> certain amount of buffering to deal with fluctuations in perf. I am
> pushing 4096 of these files sequentially with a queue depth of 32 so
> there is rather a lot of data in flight at any one time. I know I do not
> have enough hardware to achieve this rate on writes.
>
So this is your test AND actual intended use case I presume, right? 

> They are being written with direct I/O into a pool of 8 RBD LUNs. The 8
> LUN setup will not really help here with the small number of OSDs in the
> cache pool, it does help when the RBD LUNs are going directly to a large
> pool of disk based OSDs as it gets all the OSDs moving in parallel.
> 
> My basic point here is that there is a lot more potential bandwidth to
> be had in the backing pool, but I cannot get the cache tier to use more
> than a small fraction of the available bandwidth when flushing content.
> Since the front end of the cache can sustain around 900 Mbytes/sec over
> RBD, I am somewhat out of balance here:
> 
> cache input rate 900 Mbytes/sec
> backing pool input rate 600 Mbytes/sec
> 
> But not by a significant amount.
> 
> The question is really about is there anything I can do to get cache
> flushing to take advantage of more of the bandwidth. If I do this
> without the cache tier then the latency of the disk based OSDs is too
> variable and you cannot sustain a consistent data rate. 

This should hopefully be reduced by having WAL and metadata on SSDs with
bluestore.
But HDD based storage will be more jittery, that's a given.

>The NVMe devices
> are better about consistent device latency, but the cache tier
> implementation seems to have a problem driving the backing pool at
> anything close to its capabilities. It really only needs to move 40 or
> 50 objects in parallel to achieve that.
> 
And this is clearly where the Ceph (cache-tier) code could probably use
some attention.
Mind you, most people have the exact OPPOSITE requirement of yours, they
want a steady, slow, low-impact stream of flushes to the HDDs, not an
avalanche. ^o^

We know that this is all based on per-PG ratios, so if one PG goes over
the dirty ratio the tier agent will start flushing objects from it; hence
the singular spikes you're seeing.

My bet is that this process is sequential and not parallel (or at least
not massively so), meaning that until one PG has finished flushing,
the next dirty'ish one won't start flushing.
Would be nice to have this confirmed by somebody familiar with the code,
though.
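
If you feel like experimenting, the tier agent does have concurrency knobs
(osd_agent_max_ops / osd_agent_max_high_ops); I haven't pushed these myself,
so treat the values below as a guess and watch the impact on client I/O:

  # check what you're currently running with (on the node hosting osd.0)
  ceph daemon osd.0 config show | grep osd_agent

  # then bump at runtime, e.g.
  ceph tell osd.* injectargs '--osd_agent_max_ops 8 --osd_agent_max_high_ops 16'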

Of course having fast OSDs in the backing pool will alleviate this
somewhat.

What's your cache_target_dirty_high_ratio set to and have you tried
setting it below/identical to cache_target_dirty_ratio?
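
I.e. something along these lines (the 0.5 just mirrors your
cache_target_dirty_ratio, per the question above):

  ceph osd pool get nvme cache_target_dirty_high_ratio
  ceph osd pool set nvme cache_target_dirty_high_ratio 0.5

That way the agent should switch to its high-speed flush path as soon as it
starts flushing at all.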

Christian

> I am not attempting to provision a cache tier large enough for the whole
> workload, but as more of a debounce zone to avoid jitter making it back
> to the application. I am trying to categorize what can and cannot be
> achieved with ceph here for this type of workload, not build a complete
> production setup. My test represents 170 seconds of content and
> generates 209 Gbytes of data, so this is a small scale test ;-)
> fortunately this stuff is not always used realtime.
> 
> All of those extra config options look to be around how fast promotion
> into the cache can go, not how fast you can get things out of it :-(
> 
> I have been using readforward and that is working OK, there is
> sufficient read bandwidth that it does not matter if data is coming from
> the cache pool or the disk backing pool.
> 
> Steve
> 
> 
> > On Apr 19, 2016, at 7:47 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> > 
> > 
> > Hello,
> > 
> > On Tue, 19 Apr 2016 20:21:39 +0000 Stephen Lord wrote:
> > 
> >> 
> >> 
> >> I Have a setup using some Intel P3700 devices as a cache tier, and 33
> >> sata drives hosting the pool behind them. 
> > 
> > A few more details about the setup would be nice, as in how many nodes,
> > interconnect, replication size of the cache tier and the backing HDD
> > pool, etc. 
> > And "some" isn't a number, how many P3700s (which size?) in how many
> > nodes? One assumes there are no further SSDs involved with those SATA
> > HDDs?
> 
> > 
> >> I setup the cache tier with
> >> writeback, gave it a size and max object count etc:
> >> 
> >> ceph osd pool set target_max_bytes 500000000000
> >                    ^^^
> > This should have given you an error, it needs the pool name, as in your
> > next line.
> > 
> >> ceph osd pool set nvme target_max_bytes 500000000000
> >> ceph osd pool set nvme target_max_objects 500000
> >> ceph osd pool set nvme cache_target_dirty_ratio 0.5
> >> ceph osd pool set nvme cache_target_full_ratio 0.8
> >> 
> >> This is all running Jewel using bluestore OSDs (I know experimental).
> > Make sure to report all pyrotechnics, trap doors and sharp edges. ^_-
> > 
> >> The cache tier will write at about 900 Mbytes/sec and read at 2.2
> >> Gbytes/sec, the sata pool can take writes at about 600 Mbytes/sec in
> >> aggregate. 
> >  ^^^^^^^^^
> > Key word there.
> > 
> > That's just 18MB/s per HDD (60MB/s with a replication of 3), a pretty
> > disappointing result for the supposedly twice as fast BlueStore. 
> > Again, replication size and topology might explain that up to a point,
> > but we don't know them (yet).
> > 
> > Also exact methodology of your tests please, i.e. the fio command
> > line, how was the RBD device (if you tested with one) mounted and
> > where, etc...
> > 
> >> However, it looks like the mechanism for cleaning the cache
> >> down to the disk layer is being massively rate limited and I see about
> >> 47 Mbytes/sec of read activity from each SSD while this is going on.
> >> 
> > This number is meaningless w/o knowing how many NVMe's you have.
> > That being said, there are 2 levels of flushing past Hammer, but if you
> > push the cache tier to the 2nd limit (cache_target_dirty_high_ratio)
> > you will get full speed.
> > 
> >> This means that while I could be pushing data into the cache at high
> >> speed, It cannot evict old content very fast at all, and it is very
> >> easy to hit the high water mark and the application I/O drops
> >> dramatically as it becomes throttled by how fast the cache can flush.
> >> 
> >> I suspect it is operating on a placement group at a time so ends up
> >> targeting a very limited number of objects and hence disks at any one
> >> time. I can see individual disk drives going busy for very short
> >> periods, but most of them are idle at any one point in time. The only
> >> way to drive the disk based OSDs fast is to hit a lot of them at once
> >> which would mean issuing many cache flush operations in parallel.
> >> 
> > Yes, it is all PG based, so your observations match the expectations
> > and what everybody else is seeing. 
> > See also the thread "Cache tier operation clarifications" by me,
> > version 2 is in the works.
> > There are also some new knobs in Jewel that may be helpful, see:
> > http://www.spinics.net/lists/ceph-users/msg25679.html
> > 
> > If you have a use case with a clearly defined idle/low use time and a
> > small enough growth in dirty objects, consider what I'm doing,
> > dropping the cache_target_dirty_ratio a few percent (in my case 2-3%
> > is enough for a whole day) via cron job, wait a bit and then up again
> > to its normal value.
> > 
> > That way flushes won't normally happen at all during your peak usage
> > times, though in my case that's purely cosmetic, flushes are not
> > problematic at any time in that cluster currently.
> > 
> >> Are there any controls which can influence this behavior?
> >> 
> > See above (cache_target_dirty_high_ratio).
> > 
> > Aside from that you might want to reflect on what your use case,
> > workload is going to be and how your testing reflects on it.
> > 
> > As in, are you really going to write MASSIVE amounts of data at very
> > high speeds or is it (like in 90% of common cases) the amount of small
> > write IOPS that is really going to be the limiting factor. 
> > Which is something that cache tiers can deal with very well (or
> > sufficiently large and well designed "plain" clusters).
> > 
> > Another thing to think about is using the "readforward" cache mode,
> > leaving your cache tier free to just handle writes and thus giving it
> > more space to work with.
> > 
> > Christian
> > -- 
> > Christian Balzer        Network/Systems Engineer                
> > chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



