Re: ceph cache tier clean rate too low

Hello,

On Tue, 19 Apr 2016 20:21:39 +0000 Stephen Lord wrote:

> 
> 
> I Have a setup using some Intel P3700 devices as a cache tier, and 33
> sata drives hosting the pool behind them. 

A bit more details about the setup would be nice, as in how many nodes,
interconnect, replication size of the cache tier and the backing HDD
pool, etc. 
And "some" isn't a number, how many P3700s (which size?) in how many nodes?
One assumes there are no further SSDs involved with those SATA HDDs?

>I setup the cache tier with
> writeback, gave it a size and max object count etc:
> 
>  ceph osd pool set target_max_bytes 500000000000
                    ^^^
This should have given you an error, it needs the pool name, as in your
next line.

>  ceph osd pool set nvme target_max_bytes 500000000000
>  ceph osd pool set nvme target_max_objects 500000
>  ceph osd pool set nvme cache_target_dirty_ratio 0.5
>  ceph osd pool set nvme cache_target_full_ratio 0.8
> 
> This is all running Jewel using bluestore OSDs (I know experimental).
Make sure to report all pyrotechnics, trap doors and sharp edges. ^_-

> The cache tier will write at about 900 Mbytes/sec and read at 2.2
> Gbytes/sec, the sata pool can take writes at about 600 Mbytes/sec in
> aggregate. 
  ^^^^^^^^^
Key word there.

That's just 18MB/s per HDD (600MB/s over 33 drives; with a replication of
3 each drive actually writes about 55MB/s), a pretty disappointing result
for the supposedly twice as fast BlueStore.
Again, replication size and topology might explain that up to a point, but
we don't know them (yet).
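
For reference, something like this would show them (the backing pool name
here is just a placeholder):

 ceph osd pool get nvme size
 ceph osd pool get <backing-pool> size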

Also, please give the exact methodology of your tests, i.e. the fio command
line, how the RBD device (if you tested with one) was mounted and where,
etc...
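
Purely as an illustration of the level of detail I mean (pool, image and
client names here are placeholders, not a suggestion), a sequential write
test via fio's rbd engine would look something like:

 fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testimg \
     --rw=write --bs=4M --iodepth=32 --name=seqwrite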

> However, it looks like the mechanism for cleaning the cache
> down to the disk layer is being massively rate limited and I see about
> 47 Mbytes/sec of read activity from each SSD while this is going on.
> 
This number is meaningless w/o knowing how many NVMes you have.
That being said, there are 2 levels of flushing past Hammer, but if you
push the cache tier to the 2nd limit (cache_target_dirty_high_ratio) you
will get full speed.
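
For example (0.7 is just an illustration, anything between your dirty and
full ratios will do):

 ceph osd pool set nvme cache_target_dirty_high_ratio 0.7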

> This means that while I could be pushing data into the cache at high
> speed, It cannot evict old content very fast at all, and it is very easy
> to hit the high water mark and the application I/O drops dramatically as
> it becomes throttled by how fast the cache can flush.
> 
> I suspect it is operating on a placement group at a time so ends up
> targeting a very limited number of objects and hence disks at any one
> time. I can see individual disk drives going busy for very short
> periods, but most of them are idle at any one point in time. The only
> way to drive the disk based OSDs fast is to hit a lot of them at once
> which would mean issuing many cache flush operations in parallel.
>
Yes, it is all PG based, so your observations match the expectations and
what everybody else is seeing. 
See also the thread "Cache tier operation clarifications" by me; version 2
is in the works.
There are also some new knobs in Jewel that may be helpful, see:
http://www.spinics.net/lists/ceph-users/msg25679.html
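
If memory serves those include the promotion recency settings, e.g.
(whether and which values help depends entirely on your workload):

 ceph osd pool set nvme min_read_recency_for_promote 1
 ceph osd pool set nvme min_write_recency_for_promote 1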

If you have a use case with a clearly defined idle/low use time and a
small enough growth in dirty objects, consider what I'm doing, dropping the
cache_target_dirty_ratio a few percent (in my case 2-3% is enough for a
whole day) via a cron job, wait a bit and then raise it back to its normal
value.

That way flushes won't normally happen at all during your peak usage
times, though in my case that's purely cosmetic, flushes are not
problematic at any time in that cluster currently.
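
A minimal sketch of what that looks like (times, values and pool name are
of course just placeholders for your environment):

 # crontab: lower the dirty ratio during the idle window, restore it later
 0 1 * * *  ceph osd pool set nvme cache_target_dirty_ratio 0.47
 0 5 * * *  ceph osd pool set nvme cache_target_dirty_ratio 0.50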

> Are there any controls which can influence this behavior?
> 
See above (cache_target_dirty_high_ratio).

Aside from that, you might want to reflect on what your actual use case and
workload are going to be and how well your testing reflects them.

As in, are you really going to write MASSIVE amounts of data at very high
speeds, or is it (like in 90% of common cases) the amount of small
write IOPS that is really going to be the limiting factor?
Which is something that cache tiers can deal with very well (or
sufficiently large and well designed "plain" clusters).

Another thing to think about is using the "readforward" cache mode,
leaving your cache tier free to just handle writes and thus giving it more
space to work with.
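
Switching modes is a one-liner, something along the lines of (depending on
the version it may ask for an additional confirmation flag):

 ceph osd tier cache-mode nvme readforward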

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


