Re: ceph cache tier clean rate too low


 



OK, you asked ;-)

This is all via RBD. I am running a single filesystem on top of 8 RBD devices in an
effort to get data striping across more OSDs; I had been using that setup before adding
the cache tier.
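
The shape of it is roughly the following, with mdadm standing in for the striping
layer and the image names and sizes made up:

  # create and map 8 RBD images (names and sizes illustrative)
  for i in $(seq 0 7); do
      rbd create rbd/stripe$i --size 1024G
      rbd map rbd/stripe$i
  done
  # stripe across the 8 block devices and put a single filesystem on top
  mdadm --create /dev/md0 --level=0 --raid-devices=8 /dev/rbd[0-7]
  mkfs.xfs /dev/md0
  mount /dev/md0 /mnt/dpx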

The base RBD pool lives on 3 nodes, each with 11 x 6 Tbyte SATA drives, and is set up
with replication size 3. No SSDs are involved in those OSDs, since ceph-disk does not
currently let you split a bluestore configuration across more than one device.

The 600 Mbytes/sec is an approximate sustained number for the data rate I can push
into this pool via RBD. That turns into 3 times as much raw data, so across 33 drives
it works out to the mid 50s Mbytes/sec per drive. I have pushed it harder than that from
time to time, but the OSD really wants to use fdatasync a lot, and that tends to eat up a
lot of the potential of a device; these disks will do 160 Mbytes/sec if you stream
data to them.
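
Spelling the arithmetic out:

  600 Mbytes/sec of client writes x 3 replicas  = 1800 Mbytes/sec raw
  1800 Mbytes/sec across 33 drives             ~=   55 Mbytes/sec per drive
  versus ~160 Mbytes/sec streaming to a bare drive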

I just checked with rados bench against this set of 33 OSDs with a 3-replica pool,
and 600 Mbytes/sec is what it will do from the same client host.
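
The invocation was along these lines (pool name and thread count from memory, so
treat as approximate):

  rados bench -p <pool> 60 write -t 32 --no-cleanup
  rados -p <pool> cleanup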

All the networking is 40 Gbit Ethernet, a single port per host. Generally I can push 2.2 Gbytes/sec
in one direction between two hosts over a single TCP link; the most I have seen is about 2.7 Gbytes/sec
coming into a node. Short of going to RDMA, that appears to be about the limit for these processors.
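
Those numbers are from single-stream TCP tests between the nodes, something like the
following (iperf3 here is just an example tool, the host name is a placeholder):

  iperf3 -s                      # on the receiving node
  iperf3 -c <receiver> -t 30     # on the sender; roughly 18-22 Gbit/sec per stream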

There are a grand total of two 400 GB P3700s, running a pool with a replication factor of 1;
these are in 2 other nodes. Once I add in replication, performance goes downhill. If I had more
hardware I would be running more of these and using replication, but I am out of network cards right now.
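
For reference, the replicated_nvme ruleset used in the commands below points at a
separate crush root containing the two P3700 hosts; it was set up roughly along these
lines (bucket and host names here are approximate):

  ceph osd crush add-bucket nvme-root root
  ceph osd crush move nvme-node1 root=nvme-root
  ceph osd crush move nvme-node2 root=nvme-root
  ceph osd crush rule create-simple replicated_nvme nvme-root host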

So 5 nodes running OSDs, and a 6th node running the RBD client using the kernel implementation.

Here is the complete set of commands for creating the cache tier. I pulled this from
shell history, so one line in the middle is actually a failed command; sorry for the red herring.

  ceph osd pool create nvme 512 512 replicated_nvme
  ceph osd pool set nvme size 1
  ceph osd tier add rbd nvme
  ceph osd tier cache-mode nvme writeback
  ceph osd tier set-overlay rbd nvme
  ceph osd pool set nvme hit_set_type bloom
  ceph osd pool set target_max_bytes 500000000000    <-- typo here (missing pool name), so never mind
  ceph osd pool set nvme target_max_bytes 500000000000
  ceph osd pool set nvme target_max_objects 500000
  ceph osd pool set nvme cache_target_dirty_ratio 0.5
  ceph osd pool set nvme cache_target_full_ratio 0.8

I wish the cache tier would raise a health warning if it does not have a max size set;
as it stands it lets you do that, flushes nothing, and fills the OSDs.
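
Checking after the fact is easy enough, the trap is just that nothing makes you do it:

  ceph osd pool get nvme target_max_bytes
  ceph osd pool get nvme target_max_objects
  ceph df        # watch the cache pool usage while writing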

As for what the actual test is: this is 4K uncompressed DPX video frames, so 50 Mbyte
files written at a rate of at least 24 per second on a good day, ideally more.
This needs to sustain around 1.3 Gbytes/sec in either direction from a single
application, and it needs to do so consistently; there is a certain amount of
buffering to deal with fluctuations in performance. I am pushing 4096 of these files
sequentially with a queue depth of 32, so there is rather a lot of data in flight
at any one time. I know I do not have enough hardware to achieve this rate
on writes.
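
I am not driving this with fio, but a rough fio equivalent of the write pattern
(mount point and job name made up) would be:

  fio --name=dpx --directory=/mnt/dpx --ioengine=libaio --direct=1 \
      --rw=write --bs=4m --iodepth=32 \
      --nrfiles=4096 --filesize=50m --size=200g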

They are being written with direct I/O into a pool of 8 RBD LUNs. The 8-LUN
setup will not really help here with the small number of OSDs in the cache
pool; it does help when the RBD LUNs go directly to a large pool of
disk-based OSDs, as it gets all the OSDs moving in parallel.

My basic point here is that there is a lot more potential bandwidth to be had in the
backing pool, but I cannot get the cache tier to use more than a small fraction of the
available bandwidth when flushing content. Since the front end of the cache can
sustain around 900 Mbytes/sec over RBD, I am somewhat out of balance here:

cache input rate 900 Mbytes/sec
backing pool input rate 600 Mbytes/sec

But not by a significant amount.

The question is really whether there is anything I can do to get cache flushing to
take advantage of more of the bandwidth. If I do this without the cache tier, the
latency of the disk-based OSDs is too variable and I cannot sustain a
consistent data rate. The NVMe devices are better about consistent device
latency, but the cache tier implementation seems to have a problem driving
the backing pool at anything close to its capability. It really only needs to
move 40 or 50 objects in parallel to achieve that.
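
The knobs that look related are the per-OSD tier agent limits and the high-speed flush
threshold. I am going from my reading of the Jewel options here, so treat the names and
values as a starting point rather than gospel:

  # allow each tiering agent to keep more flushes in flight (defaults are small)
  ceph tell 'osd.*' injectargs '--osd_agent_max_ops 8 --osd_agent_max_low_ops 4'
  # kick in the aggressive flushing sooner after the normal dirty target
  ceph osd pool set nvme cache_target_dirty_high_ratio 0.55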

I am not attempting to provision a cache tier large enough for the whole workload,
but rather a debounce zone to keep jitter from making it back to the application.
I am trying to characterize what can and cannot be achieved with ceph for
this type of workload, not build a complete production setup. My test represents
170 seconds of content and generates 209 Gbytes of data, so this is a small
scale test ;-) Fortunately this stuff is not always used realtime.
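
The numbers hang together like this:

  4096 frames / 24 frames/sec ~= 170 seconds of content
  4096 frames x ~51 Mbytes    ~= 209 Gbytes of data
  209 Gbytes / 170 seconds    ~= 1.2-1.3 Gbytes/sec to keep up in realtime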

All of those extra config options look to be about how fast promotion into the
cache can go, not how fast you can get things out of it :-(
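
The ones I am thinking of (my reading of the Jewel options, so double check the
names) are the pool recency requirements and the per-OSD promotion throttles:

  ceph osd pool set nvme min_read_recency_for_promote 1
  ceph osd pool set nvme min_write_recency_for_promote 1
  ceph tell 'osd.*' injectargs '--osd_tier_promote_max_bytes_sec 10485760 --osd_tier_promote_max_objects_sec 50'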

I have been using readforward and that is working OK; there is sufficient read
bandwidth that it does not matter whether data is coming from the cache pool or the
disk-based backing pool.
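
For completeness, switching the mode is just the following (newer releases may ask for
a --yes-i-really-mean-it confirmation on the forward-style modes):

  ceph osd tier cache-mode nvme readforward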

Steve


> On Apr 19, 2016, at 7:47 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> 
> 
> Hello,
> 
> On Tue, 19 Apr 2016 20:21:39 +0000 Stephen Lord wrote:
> 
>> 
>> 
>> I Have a setup using some Intel P3700 devices as a cache tier, and 33
>> sata drives hosting the pool behind them. 
> 
> A bit more details about the setup would be nice, as in how many nodes,
> interconnect, replication size of the cache tier and the backing HDD
> pool, etc. 
> And "some" isn't a number, how many P3700s (which size?) in how many nodes?
> One assumes there are no further SSDs involved with those SATA HDDs?

> 
>> I setup the cache tier with
>> writeback, gave it a size and max object count etc:
>> 
>> ceph osd pool set target_max_bytes 500000000000
>                    ^^^
> This should have given you an error, it needs the pool name, as in your
> next line.
> 
>> ceph osd pool set nvme target_max_bytes 500000000000
>> ceph osd pool set nvme target_max_objects 500000
>> ceph osd pool set nvme cache_target_dirty_ratio 0.5
>> ceph osd pool set nvme cache_target_full_ratio 0.8
>> 
>> This is all running Jewel using bluestore OSDs (I know experimental).
> Make sure to report all pyrotechnics, trap doors and sharp edges. ^_-
> 
>> The cache tier will write at about 900 Mbytes/sec and read at 2.2
>> Gbytes/sec, the sata pool can take writes at about 600 Mbytes/sec in
>> aggregate. 
>  ^^^^^^^^^
> Key word there.
> 
> That's just 18MB/s per HDD (60MB/s with a replication of 3), a pretty
> disappointing result for the supposedly twice as fast BlueStore. 
> Again, replication size and topology might explain that up to a point, but
> we don't know them (yet).
> 
> Also exact methodology of your tests please, i.e. the fio command line, how
> was the RBD device (if you tested with one) mounted and where, etc...
> 
>> However, it looks like the mechanism for cleaning the cache
>> down to the disk layer is being massively rate limited and I see about
>> 47 Mbytes/sec of read activity from each SSD while this is going on.
>> 
> This number is meaningless w/o knowing how many NVMes you have.
> That being said, there are 2 levels of flushing past Hammer, but if you
> push the cache tier to the 2nd limit (cache_target_dirty_high_ratio) you
> will get full speed.
> 
>> This means that while I could be pushing data into the cache at high
>> speed, It cannot evict old content very fast at all, and it is very easy
>> to hit the high water mark and the application I/O drops dramatically as
>> it becomes throttled by how fast the cache can flush.
>> 
>> I suspect it is operating on a placement group at a time so ends up
>> targeting a very limited number of objects and hence disks at any one
>> time. I can see individual disk drives going busy for very short
>> periods, but most of them are idle at any one point in time. The only
>> way to drive the disk based OSDs fast is to hit a lot of them at once
>> which would mean issuing many cache flush operations in parallel.
>> 
> Yes, it is all PG based, so your observations match the expectations and
> what everybody else is seeing. 
> See also the thread "Cache tier operation clarifications" by me, version 2
> is in the works.
> There are also some new knobs in Jewel that may be helpful, see:
> http://www.spinics.net/lists/ceph-users/msg25679.html
> 
> If you have a use case with a clearly defined idle/low use time and a
> small enough growth in dirty objects, consider what I'm doing, dropping the
> cache_target_dirty_ratio a few percent (in my case 2-3% is enough for a
> whole day) via cron job, wait a bit and then up again to its normal value. 
> 
> That way flushes won't normally happen at all during your peak usage
> times, though in my case that's purely cosmetic, flushes are not
> problematic at any time in that cluster currently.
> 
>> Are there any controls which can influence this behavior?
>> 
> See above (cache_target_dirty_high_ratio).
> 
> Aside from that you might want to reflect on what your use case, workload
> is going to be and how your testing reflects on it.
> 
> As in, are you really going to write MASSIVE amounts of data at very high
> speeds or is it (like in 90% of common cases) the amount of small
> write IOPS that is really going to be the limiting factor. 
> Which is something that cache tiers can deal with very well (or
> sufficiently large and well designed "plain" clusters).
> 
> Another thing to think about is using the "readforward" cache mode,
> leaving your cache tier free to just handle writes and thus giving it more
> space to work with.
> 
> Christian
> -- 
> Christian Balzer        Network/Systems Engineer                
> chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> http://www.gol.com/





