Hi,
Response inline below.
On 20 Apr 2016 7:45 a.m., "Christian Balzer" <chibi@xxxxxxx> wrote:
>
>
> Hello,
>
> On Wed, 20 Apr 2016 03:42:00 +0000 Stephen Lord wrote:
>
> >
> > OK, you asked ;-)
> >
>
> I certainly did. ^o^
>
> > This is all via RBD, I am running a single filesystem on top of 8 RBD
> > devices in an effort to get data striping across more OSDs, I had been
> > using that setup before adding the cache tier.
> >
> Nods.
> Depending on your use case (sequential writes), actual RADOS striping might
> be more advantageous than this setup, where consecutive 4MB writes still go
> to the same PG/OSD all the time.
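(As a hedged aside: with RBD that could look roughly like the following --
image name and sizes are made up, and the kernel RBD client of that era may
not support non-default striping, so this mostly applies to librbd users:

   rbd create video01 --pool rbd --size 102400 --image-format 2 \
       --stripe-unit 1048576 --stripe-count 8

i.e. 1MB stripe units spread round-robin across 8 objects, rather than
filling one 4MB object before moving on.)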
>
>
> > 3 nodes with eleven 6 Tbyte SATA drives each for a base RBD pool; this is
> > set up with replication size 3. No SSDs involved in those OSDs, since
> > ceph-disk does not let you break a bluestore configuration into more
> > than one device at the moment.
> >
> That's a pity, but supposedly just a limitation of ceph-disk.
> I'd venture you can work around that with symlinks to a raw SSD
> partition, same as with current filestore journals.
>
> As Sage recently wrote:
> ---
> BlueStore can use as many as three devices: one for the WAL (journal,
> though it can be much smaller than FileStores, e.g., 128MB), one for
> metadata (e.g., an SSD partition), and one for data.
> ---
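(Untested sketch of that symlink workaround for BlueStore -- OSD id, paths
and partition UUIDs are hypothetical, and the links would have to be in
place before the OSD's mkfs so BlueStore picks them up:

   ln -s /dev/disk/by-partuuid/<wal-part-uuid> /var/lib/ceph/osd/ceph-12/block.wal
   ln -s /dev/disk/by-partuuid/<db-part-uuid>  /var/lib/ceph/osd/ceph-12/block.db

leaving the large HDD partition as the "block" data device.)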
I believe he also mentioned the use of bcache and friends for the OSD; maybe a way forward in this case?
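Something along these lines, perhaps (device names purely hypothetical, one
NVMe partition caching one SATA disk):

   make-bcache -C /dev/nvme0n1p1 -B /dev/sdb    # should expose /dev/bcache0
   ceph-disk prepare --bluestore /dev/bcache0   # OSD then lives on the cached device

No idea yet how well bluestore behaves on top of bcache, though.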
Regards
Josef
>
> > The 600 Mbytes/sec is an approximate sustained number for the data rate I
> > can get going into this pool via RBD; that turns into three times as much
> > raw data, so across 33 drives that is mid-50s Mbytes/sec per drive. I have
> > pushed it harder than that from time to time, but the OSD really wants to
> > use fdatasync a lot, and that tends to eat up a lot of a device's
> > potential; these disks will do 160 Mbytes/sec if you stream data to them.
> >
> > I just checked with rados bench against this set of 33 OSDs with a
> > 3-replica pool, and 600 Mbytes/sec is what it will do from the same
> > client host.
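(For reference, that sort of check is just the stock benchmark, e.g.
something like the following -- pool name and runtime are examples only:

   rados bench -p rbd 60 write -b 4194304 -t 32

i.e. 4MB writes with 32 in flight from the client host.)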
> >
> This matches a cluster of mine with 32 OSDs (filestore of course) and SSD
> journals on 4 nodes with a replica of 3.
>
> So BlueStore is indeed faster than filestore.
>
> > All the networking is 40 Gb Ethernet, single port per host. Generally I
> > can push 2.2 Gbytes/sec in one direction between two hosts over a single
> > TCP link; the max I have seen is about 2.7 Gbytes/sec coming into a
> > node. Short of going to RDMA that appears to be about the limit for
> > these processors.
> >
> Yeah, didn't expect your network to be involved here bottleneck wise, but
> a good data point to have nevertheless.
>
> > There are a grand total of two 400 GB P3700s, which are running a pool
> > with a replication factor of 1; these are in 2 other nodes. Once I add in
> > replication, performance goes downhill. If I had more hardware I would be
> > running more of these and using replication, but I am out of network
> > cards right now.
> >
> Alright, so at 900MB/s you're pretty close to what one would expect from 2
> of these: 1080MB/s*2/2(journal).
>
> How much downhill is that?
>
> I have a production cache tier with 2 nodes (replica 2 of course) and 4
> 800GB DC S3610s each, IPoIB QDR (40Gbs) interconnect and the performance
> is pretty much what I would expect.
>
> > So 5 nodes running OSDs, and a 6th node running the RBD client using the
> > kernel implementation.
> >
> I assume there's a reason for using the kernel RBD client (which kernel?),
> given that it tends to be behind the curve in terms of features and speed?
>
> > Complete set of commands for creating the cache tier. I pulled this from
> > history; the line in the middle was actually a failed command, so sorry
> > for the red herring.
> >
> > 982 ceph osd pool create nvme 512 512 replicated_nvme
> > 983 ceph osd pool set nvme size 1
> > 984 ceph osd tier add rbd nvme
> > 985 ceph osd tier cache-mode nvme writeback
> > 986 ceph osd tier set-overlay rbd nvme
> > 987 ceph osd pool set nvme hit_set_type bloom
> > 988 ceph osd pool set target_max_bytes 500000000000   <<-- typo here, so never mind
> > 989 ceph osd pool set nvme target_max_bytes 500000000000
> > 990 ceph osd pool set nvme target_max_objects 500000
> > 991 ceph osd pool set nvme cache_target_dirty_ratio 0.5
> > 992 ceph osd pool set nvme cache_target_full_ratio 0.8
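(A quick way to confirm which of those actually stuck is to read them back;
the typo'd line 988 would simply have errored out, so nothing was set by it:

   ceph osd pool get nvme target_max_bytes
   ceph osd pool get nvme cache_target_dirty_ratio
   ceph osd pool get nvme cache_target_full_ratio

All of the above should return the values from the history listing.)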
> >
> > I wish the cache tier would raise a health warning if it does not have
> > a max size set; as it is, it lets you do that, flushes nothing, and fills
> > the OSDs.
> >
> Oh yes, people have been bitten by this over and over again.
> At least it's documented now.
>
> > As for what the actual test is, this is 4K uncompressed DPX video frames,
> > so 50 Mbyte files written at a rate of at least 24 per second on a good
> > day, ideally more. This needs to sustain around 1.3 Gbytes/sec in either
> > direction from a single application and needs to do it consistently.
> > There is a certain amount of buffering to deal with fluctuations in perf.
> > I am pushing 4096 of these files sequentially with a queue depth of 32,
> > so there is rather a lot of data in flight at any one time. I know I do
> > not have enough hardware to achieve this rate on writes.
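(Roughly approximated with fio, purely as a sketch -- the mount point and
tuning are hypothetical and this ignores the real application's file naming:

   fio --name=dpx --directory=/mnt/rbdfs --ioengine=libaio --direct=1 \
       --rw=write --bs=2m --iodepth=32 --nrfiles=4096 --filesize=50m \
       --file_service_type=sequential

4096 files of 50MB written sequentially with up to 32 requests in flight.)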
> >
> So this is your test AND actual intended use case I presume, right?
>
> > They are being written with direct I/O into a pool of 8 RBD LUNs. The 8
> > LUN setup will not really help here with the small number of OSDs in the
> > cache pool; it does help when the RBD LUNs are going directly to a large
> > pool of disk based OSDs, as it gets all the OSDs moving in parallel.
> >
> > My basic point here is that there is a lot more potential bandwidth to
> > be had in the backing pool, but I cannot get the cache tier to use more
> > than a small fraction of the available bandwidth when flushing content.
> > Since the front end of the cache can sustain around 900 Mbytes/sec over
> > RBD, I am somewhat out of balance here:
> >
> > cache input rate 900 Mbytes/sec
> > backing pool input rate 600 Mbytes/sec
> >
> > But not by a significant amount.
> >
> > The question is really whether there is anything I can do to get cache
> > flushing to take advantage of more of the bandwidth. If I do this
> > without the cache tier then the latency of the disk based OSDs is too
> > variable and you cannot sustain a consistent data rate.
>
> This should hopefully be reduced by having WAL and metadata on SSDs with
> bluestore.
> But HDD based storage will be more jittery, that's a given.
>
> > The NVMe devices
> > are better about consistent device latency, but the cache tier
> > implementation seems to have a problem driving the backing pool at
> > anything close to its capabilities. It really only needs to move 40 or
> > 50 objects in parallel to achieve that.
> >
> And this is clearly where the Ceph (cache-tier) code could probably use
> some attention.
> Mind you, most people have the exact OPPOSITE requirement of yours: they
> want a steady, slow, low-impact stream of flushes to the HDDs, not an
> avalanche. ^o^
>
> We know that this is all based on per-PG ratios, so if one PG goes over
> the dirty ratio the tier-agent will start flushing objects from it;
> hence the singular spikes you're seeing.
>
> My bet is that this process is sequential and not parallel (or at least
> not massively so), meaning that until one PG has finished flushing,
> the next dirty'ish one won't start flushing.
> Would be nice to have this confirmed by somebody familiar with the code,
> though.
>
> Of course having fast OSDs in the backing pool will alleviate this
> somewhat.
>
> What's your cache_target_dirty_high_ratio set to and have you tried
> setting it below/identical to cache_target_dirty_ratio?
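(For reference, a hedged example of what that would look like, values only
as an illustration:

   ceph osd pool get nvme cache_target_dirty_high_ratio
   ceph osd pool set nvme cache_target_dirty_high_ratio 0.5

With cache_target_dirty_ratio also at 0.5, the agent should switch to its
high-speed flush behaviour as soon as it starts flushing at all.)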
>
> Christian
>
> > I am not attempting to provision a cache tier large enough for the whole
> > workload, but more of a debounce zone to avoid jitter making it back
> > to the application. I am trying to categorize what can and cannot be
> > achieved with Ceph here for this type of workload, not build a complete
> > production setup. My test represents 170 seconds of content and
> > generates 209 Gbytes of data, so this is a small-scale test ;-)
> > Fortunately this stuff is not always used in real time.
> >
> > All of those extra config options look to be around how fast promotion
> > into the cache can go, not how fast you can get things out of it :-(
> >
> > I have been using readforward and that is working OK; there is
> > sufficient read bandwidth that it does not matter whether data is coming
> > from the cache pool or the disk backing pool.
> >
> > Steve
> >
> >
> > > On Apr 19, 2016, at 7:47 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> > >
> > >
> > > Hello,
> > >
> > > On Tue, 19 Apr 2016 20:21:39 +0000 Stephen Lord wrote:
> > >
> > >>
> > >>
> > >> I have a setup using some Intel P3700 devices as a cache tier, and 33
> > >> SATA drives hosting the pool behind them.
> > >
> > > A bit more detail about the setup would be nice, as in how many nodes,
> > > interconnect, replication size of the cache tier and the backing HDD
> > > pool, etc.
> > > And "some" isn't a number, how many P3700s (which size?) in how many
> > > nodes? One assumes there are no further SSDs involved with those SATA
> > > HDDs?
> >
> > >
> > >> I setup the cache tier with
> > >> writeback, gave it a size and max object count etc:
> > >>
> > >> ceph osd pool set target_max_bytes 500000000000
> > > ^^^
> > > This should have given you an error, it needs the pool name, as in your
> > > next line.
> > >
> > >> ceph osd pool set nvme target_max_bytes 500000000000
> > >> ceph osd pool set nvme target_max_objects 500000
> > >> ceph osd pool set nvme cache_target_dirty_ratio 0.5
> > >> ceph osd pool set nvme cache_target_full_ratio 0.8
> > >>
> > >> This is all running Jewel using bluestore OSDs (I know experimental).
> > > Make sure to report all pyrotechnics, trap doors and sharp edges. ^_-
> > >
> > >> The cache tier will write at about 900 Mbytes/sec and read at 2.2
> > >> Gbytes/sec; the SATA pool can take writes at about 600 Mbytes/sec in
> > >> aggregate.
> > > ^^^^^^^^^
> > > Key word there.
> > >
> > > That's just 18MB/s per HDD (about 55MB/s with a replication of 3), a pretty
> > > disappointing result for the supposedly twice as fast BlueStore.
> > > Again, replication size and topology might explain that up to a point,
> > > but we don't know them (yet).
> > >
> > > Also exact methodology of your tests please, i.e. the fio command
> > > line, how was the RBD device (if you tested with one) mounted and
> > > where, etc...
> > >
> > >> However, it looks like the mechanism for cleaning the cache
> > >> down to the disk layer is being massively rate limited and I see about
> > >> 47 Mbytes/sec of read activity from each SSD while this is going on.
> > >>
> > > This number is meaningless w/o knowing how many NVMes you have.
> > > That being said, there are 2 levels of flushing past Hammer, but if you
> > > push the cache tier to the 2nd limit (cache_target_dirty_high_ratio)
> > > you will get full speed.
> > >
> > >> This means that while I could be pushing data into the cache at high
> > >> speed, It cannot evict old content very fast at all, and it is very
> > >> easy to hit the high water mark and the application I/O drops
> > >> dramatically as it becomes throttled by how fast the cache can flush.
> > >>
> > >> I suspect it is operating on a placement group at a time so ends up
> > >> targeting a very limited number of objects and hence disks at any one
> > >> time. I can see individual disk drives going busy for very short
> > >> periods, but most of them are idle at any one point in time. The only
> > >> way to drive the disk based OSDs fast is to hit a lot of them at once
> > >> which would mean issuing many cache flush operations in parallel.
> > >>
> > > Yes, it is all PG based, so your observations match the expectations
> > > and what everybody else is seeing.
> > > See also the thread "Cache tier operation clarifications" by me,
> > > version 2 is in the works.
> > > There are also some new knobs in Jewel that may be helpful, see:
> > > http://www.spinics.net/lists/ceph-users/msg25679.html
> > >
> > > If you have a use case with a clearly defined idle/low use time and a
> > > small enough growth in dirty objects, consider what I'm doing:
> > > dropping the cache_target_dirty_ratio a few percent (in my case 2-3%
> > > is enough for a whole day) via cron job, waiting a bit and then raising
> > > it again to its normal value.
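(As a sketch -- pool name, times and values purely illustrative:

   # crontab on a node with admin credentials
   0 2 * * * ceph osd pool set <cachepool> cache_target_dirty_ratio 0.47
   0 4 * * * ceph osd pool set <cachepool> cache_target_dirty_ratio 0.50

i.e. drop the ratio a few percent during the quiet window so flushing
happens then, and restore it before the busy period starts.)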
> > >
> > > That way flushes won't normally happen at all during your peak usage
> > > times, though in my case that's purely cosmetic, flushes are not
> > > problematic at any time in that cluster currently.
> > >
> > >> Are there any controls which can influence this behavior?
> > >>
> > > See above (cache_target_dirty_high_ratio).
> > >
> > > Aside from that, you might want to reflect on what your use case and
> > > workload are going to be and how well your testing represents them.
> > >
> > > As in, are you really going to write MASSIVE amounts of data at very
> > > high speeds or is it (like in 90% of common cases) the amount of small
> > > write IOPS that is really going to be the limiting factor.
> > > Which is something that cache tiers can deal with very well (or
> > > sufficiently large and well designed "plain" clusters).
> > >
> > > Another thing to think about is using the "readforward" cache mode,
> > > leaving your cache tier free to just handle writes and thus giving it
> > > more space to work with.
> > >
> > > Christian
> > > --
> > > Christian Balzer Network/Systems Engineer
> > > chibi@xxxxxxx Global OnLine Japan/Rakuten Communications
> > > http://www.gol.com/
> >
> >
>
>
> --
> Christian Balzer Network/Systems Engineer
> chibi@xxxxxxx Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com