https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/md/dm-writecache.c?h=v5.8.10&id=c1005322ff02110a4df7f0033368ea015062b583
On 19/09/2020 10:31, huxiaoyu@xxxxxxxxxxxx wrote:
Dear Maged,
Thanks a lot for the detailed explanation on dm-writecache with Ceph.
You mentioned a REQ_FUA support patch for dm-writecache; is such a
patch not included in the recent dm-writecache source code? I am using
4.4 and 4.15/4.19 kernels, where do I get the mentioned patch?
best regards,
Samuel
------------------------------------------------------------------------
huxiaoyu@xxxxxxxxxxxx
From: Maged Mokhtar <mmokhtar@xxxxxxxxxxx>
Date: 2020-09-18 18:20
To: vitalif <vitalif@xxxxxxxxxx>; huxiaoyu <huxiaoyu@xxxxxxxxxxxx>;
    ceph-users <ceph-users@xxxxxxx>
Subject: Re: Re: Benchmark WAL/DB on SSD and HDD for RGW RBD CephFS
dm-writecache works using low and high watermarks, set at 45% and 50%.
All writes land in the cache; once the cache fills to the high
watermark, backfilling to the slow device starts, and it stops when
reaching the low watermark. Backfilling uses a b-tree with LRU blocks
and tries to merge blocks to reduce hdd seeks; this is further helped
by the io scheduler (cfq/deadline) ordering.
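As a rough illustration (device names are placeholders and the
watermark values just repeat the ones above), a dm-writecache target
can be loaded directly with dmsetup along these lines:

    # writecache over an SSD ("s"), 4096-byte blocks; the trailing
    # "4" is the number of optional args that follow
    dmsetup create wc_vol --table "0 $(blockdev --getsz /dev/vg/slow_lv) \
        writecache s /dev/vg/slow_lv /dev/vg/fast_lv 4096 \
        4 high_watermark 50 low_watermark 45"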
Each sync write op to the device requires 2 sync write ops, one for
data and one for metadata. Metadata is always kept in ram, so there is
no additional metadata read op, at the expense of using about 2.5% of
your cache partition size in ram (e.g. roughly 2.5 GB of ram for a
100 GB cache partition). So pure sync writes (those with REQ_FUA or
REQ_FLUSH, which is what Ceph uses) get half the SSD iops performance
at the device level.
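To get a feel for this, a sync 4k random-write test at queue depth 1
against the cached device (the device path is just an example) should
show roughly half the raw SSD sync-write iops; something like:

    fio --name=syncwrite --filename=/dev/mapper/wc_vol \
        --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
        --direct=1 --sync=1 --time_based --runtime=60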
Now to the question of what sustained performance you would get during
backfilling: it totally depends on whether your workload is sequential
or random. For pure sequential workloads, all blocks are merged, so
there will be no drop in input iops and backfilling occurs in small,
step-like intervals; but for such workloads you could get good
performance even without a cache. For purely random writes you should
theoretically drop to the hdd random iops speed (ie 80-150 iops), but
in our testing with fio pure random writes we would get 400-450
sustained iops; this is probably related to the non-randomness of fio
rather than any magic.
For real-life workloads that have a mix of both, this is where the
real benefit of the cache will be felt; however, it is not easy to
simulate such workloads. fio does offer a zipf/theta random
distribution control, but it was difficult for us to simulate
real-life workloads with it. We did some manual workloads such as
installing and copying multiple vms, and we found the cache cut the
time to complete by a factor of 3-4.
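For what it's worth, the zipf control we tried looks roughly like the
following (the theta value and device path are just examples, not a
recommendation):

    fio --name=zipf_write --filename=/dev/mapper/wc_vol \
        --rw=randwrite --bs=4k --iodepth=16 --direct=1 \
        --random_distribution=zipf:1.2 --time_based --runtime=300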
dm-writecache does serve reads if the data is in cache; however, the
OSD cache helps for reads, as does any client read-ahead, and in
general writes are the performance issue with hdds in Ceph.
For bcache, the only configuration we did was to enable writeback
mode; we did not set the block size to 4k.
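In case someone wants to test that combination, setting the 4k block
size at format time and switching to writeback would look something
like this (device names are examples; we only did the writeback part):

    # format cache and backing devices with a 4k block size
    make-bcache --block 4k -C /dev/nvme0n1p1
    make-bcache --block 4k -B /dev/sdb
    # switch the resulting bcache device to writeback mode
    echo writeback > /sys/block/bcache0/bcache/cache_mode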
If you want to try dm-writecache, use a recent 5.4+ kernel or a kernel
with the REQ_FUA support patch we did. You would need a recent lvm
tools package to support dm-writecache. We also limit the number of
backfill blocks in flight to 100k blocks, ie 400 MB.
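With a recent lvm2 this can be set up via lvconvert; a sketch, with
placeholder VG/LV names and assuming lvm passes the setting through to
dm-writecache as described in lvmcache(7):

    # attach the fast LV as a writecache for the slow LV, limiting
    # blocks in flight during backfill (100k x 4 KB = 400 MB)
    lvconvert --type writecache --cachevol fast_lv \
        --cachesettings "writeback_jobs=102400" vg/slow_lv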
/Maged
On 18/09/2020 13:38, vitalif@xxxxxxxxxx wrote:
>> we did test dm-cache, bcache and dm-writecache, we found the latter
>> to be much better.
> Did you set the bcache block size to 4096 during your tests? Without
> this setting it's slow because 99.9% of SSDs don't handle 512-byte
> overwrites well. Otherwise I don't think bcache should be worse than
> dm-writecache. Also dm-writecache only caches writes, while bcache
> also caches reads. And lvmcache is trash because it only writes to
> the SSD when the block is already on the SSD.
>
> Please post some details about the comparison if you have them :)
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx