On Mon, 5 Jun 2017 15:32:00 +0800 TYLin wrote:

> Hi Christian,
>
> Thanks for your quick reply.
>
> > On Jun 5, 2017, at 2:01 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> >
> > Hello,
> >
> > On Mon, 5 Jun 2017 12:25:25 +0800 TYLin wrote:
> >
> >> Hi all,
> >>
> >> We’re using cache-tier with write-back mode but the write throughput
> >> is not as good as we expect.
> >
> > Numbers (what did you see and what did you expect?), versions, cluster
> > HW/SW, etc. etc.
> >
>
> We use kraken 11.2.0. Our cluster has 8 nodes; each node has 7 HDDs for
> the storage pool (CephFS data and metadata), 3 SSDs for the data pool
> cache and 1 SSD for the metadata pool cache. The public network and
> cluster network share the same 10G NIC. We mount CephFS with the kernel
> client on one of the nodes and use dd/fio to test its performance. The
> throughput of creating a new file is about 400MB/s, whereas overwriting
> an existing file can reach more than 800MB/s. In our view, creating a
> new file and overwriting an existing one should not differ that much.
>

Personally I avoid odd numbered releases, but my needs for stability and
low update frequency seem to be far off the scale for "normal" Ceph users.

W/o precise numbers of files and the size of your SSDs (which type?) it
is hard to say, but you're likely to be better off just having all
metadata on an SSD pool instead of cache-tiering.

800MB/s sounds about right for your network and cluster in general (no
telling for sure w/o SSD/HDD details of course).

As I pointed out before and will try to explain again below, that speed
difference, while pretty daunting, isn't all that surprising.

> >> We use CephFS and create a 20GB file in it. While the data is being
> >> written, we use iostat to get the disk statistics. From iostat, we
> >> saw that the SSDs (cache tier) are idle most of the time and the HDDs
> >> (storage tier) are busy all the time. From the documentation:
> >
> > While having no real experience with CephFS (with or w/o cache-tiers),
> > I do think I know what you're seeing here, see below.
> >
> >> “When admins configure tiers with writeback mode, Ceph clients write
> >> data to the cache tier and receive an ACK from the cache tier. In
> >> time, the data written to the cache tier migrates to the storage tier
> >> and gets flushed from the cache tier.”
> >>
> >> So the data is written to the cache tier and then flushed to the
> >> storage tier when the dirty ratio exceeds 0.4? The phrase “in time”
> >> in the document confused me.
> >>
> >> We found that creating a new file is slower than overwriting an
> >> existing file, and the SSDs see more writes during overwrites. We
> >> then looked into the source code and logs. A newly created file goes
> >> through proxy_write, which is followed by a promote_object. Does this
> >> mean that the object actually goes to the storage pool directly and
> >> is then promoted to the cache tier when creating a new file?
> >>
> >
> > Creating a new file means creating new Ceph objects, which need to be
> > present on both the backing store and the cache-tier.
> > That overhead of creating them is the difference in time you see.
> > The actual data of the initial write will still be only on the
> > cache-tier, btw.
>
> You mean that when we create a new object, the client will not get an
> ACK until the data is written to the storage pool (only the journal?)
> and then promoted to the cache tier? If this is true, why should we
> wait until the object is written to both the storage pool and the cache
> tier? Can we use any configuration to force writes to go to the cache
> tier only and then be flushed to the storage pool when the dirty ratio
> is reached? Just as happens when overwriting an existing file.
>

No, not quite.
Re-read what I wrote, there's a difference between RADOS object creation
and actual data (contents).

The devs or other people with more code familiarity will correct me, but
essentially, as I understand it, this is what happens when a new RADOS
object gets created in conjunction with a cache-tier:

1. The client (CephFS, RBD, whatever) talks to the cache tier and the
transaction causes a new object to be created.
Since the tier is an overlay of the actual backing storage, the object
(but not necessarily the current data in it) needs to exist on both.

2. The object gets created on the backing storage, which involves
creating the file (at zero length), any needed directories above it and
the entry in the OMAP leveldb. All on HDDs, all slow.
I'm pretty sure this needs to be done and finished before the object is
usable, no journals to speed this up.

3. The cache tier pseudo-promotes the new object (it is empty after all)
and starts accepting writes.

This is leaving out any metadata stuff CephFS needs to do for new "blocks"
and files, which may also be more involved than overwrites.
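As for the flushing side of your question: when dirty data leaves the
cache tier is controlled entirely by per-pool settings on the cache pool,
not by anything on the client. A minimal sketch of the knobs involved
(the pool name and the numbers are placeholders only, size them to your
actual SSDs):

    # the ratios below are relative to this, so it must be set
    ceph osd pool set cephfs-data-cache target_max_bytes 1099511627776

    # start flushing dirty objects at 40% of target_max_bytes (the 0.4
    # default you quoted), start evicting clean objects at 80%
    ceph osd pool set cephfs-data-cache cache_target_dirty_ratio 0.4
    ceph osd pool set cephfs-data-cache cache_target_full_ratio 0.8

    # don't flush or evict objects younger than this (seconds)
    ceph osd pool set cephfs-data-cache cache_min_flush_age 600
    ceph osd pool set cephfs-data-cache cache_min_evict_age 1800

None of that changes the object-creation path described above, though; it
only decides when data that is already in the cache tier gets written
back to the HDDs.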
Christian

> >
> > Once a file exists and is properly (not sparsely) allocated, writes
> > should indeed just go to the cache-tier until flushing
> > (space/time/object#) becomes necessary.
> > That of course also requires the cache being big enough and not too
> > busy, so that things actually stay in it.
> > Otherwise those objects need to be promoted back in from the HDDs,
> > making things slow again.
> >
> > Tuning a cache-tier (both parameters and size in general) isn't easy
> > and with some workloads it is pretty much impossible to get desirable
> > results.
> >
> > Christian
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Rakuten Communications
>

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Rakuten Communications

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com