Re: CephFS thrashing through the page cache

Xiubo Li <xiubli@xxxxxxxxxx> · Tue, 4 Apr 2023 17:04:49 +0800

Hi Ashu,

Yeah, please see 
https://patchwork.kernel.org/project/ceph-devel/list/?series=733010.

Sorry, I forgot to reply here.

- Xiubo

On 4/4/23 13:58, Ashu Pachauri wrote:
Hi Xiubo,

Did you get a chance to work on this? I am curious to test out the 
improvements.

Thanks and Regards,
Ashu Pachauri

On Fri, Mar 17, 2023 at 3:33 PM Frank Schilder <frans@xxxxxx> wrote:

    Hi Ashu,

    thanks for the clarification. That's not an option that is easy to
    change. I hope that the modifications to the fs clients Xiubo has
    in mind will improve that. Thanks for flagging this performance
    issue. Would be great if this becomes part of a test suite.

    Best regards,
    =================
    Frank Schilder
    AIT Risø Campus
    Bygning 109, rum S14

    ________________________________________
    From: Ashu Pachauri <ashu210890@xxxxxxxxx>
    Sent: 17 March 2023 09:55:25
    To: Xiubo Li
    Cc: Frank Schilder; ceph-users@xxxxxxx
    Subject: Re:  Re: CephFS thrashing through the page cache

    Hi Xiubo,

    As you have correctly pointed out, I was talking about the
    stripe_unit setting in the file layout configuration. Here is the
    documentation for that for anyone else's reference:
    https://docs.ceph.com/en/quincy/cephfs/file-layouts/

    As with any RAID0 setup, the right stripe_unit is definitely
    workload dependent. Our use case requires us to read anywhere from
    a few kilobytes to a few hundred kilobytes at once. Having a 4MB
    default stripe_unit definitely hurts quite a bit. We were able to
    achieve an almost 2x improvement in average latency and overall
    throughput (for useful data) by reducing the stripe_unit. The rule
    of thumb is that you want to align the stripe_unit to your most
    common IO size.
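
For anyone wanting to try this, the layout is set through extended attributes, as described in the documentation linked above. The following is a minimal sketch, not the commands that were actually run here: the helper name `layout_cmd`, the directory `/mnt/cephfs/db` and the 64 KiB value are illustrative assumptions, and the helper only prints the command so it can be reviewed before being run. The 64 KiB granularity check is an assumption about Ceph's layout validation and should be confirmed against the release in use.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (do not run) the setfattr command that would
# set a new stripe_unit on a directory layout.
MIN_STRIPE_UNIT=65536   # assumed minimum granularity (64 KiB)

layout_cmd() {
    local dir=$1 stripe_unit=$2 object_size=${3:-4194304}  # 4MB default object_size
    if (( stripe_unit <= 0 || stripe_unit % MIN_STRIPE_UNIT != 0 )); then
        echo "stripe_unit must be a positive multiple of ${MIN_STRIPE_UNIT}" >&2
        return 1
    fi
    if (( object_size % stripe_unit != 0 )); then
        echo "object_size must be a multiple of stripe_unit" >&2
        return 1
    fi
    echo "setfattr -n ceph.dir.layout.stripe_unit -v ${stripe_unit} ${dir}"
}

# Example: 64 KiB stripe_unit on a hypothetical database directory.
layout_cmd /mnt/cephfs/db 65536
```

A directory layout is inherited only by files created afterwards, so existing files have to be copied or rewritten to pick up the new stripe_unit.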

    > BTW, have you tried to set 'rasize' option to a small size instead of 0?
    > Won't this work?

    No, this won't work; I have already tried it. Since rasize only
    affects readahead, the minimum I/O size for the cephfs client will
    still be max(rasize, stripe_unit). rasize is therefore a useful
    setting only when it needs to be larger than the stripe_unit.
    It's also worth pointing out that simply setting rasize is not
    sufficient; one also needs to change the corresponding
    configurations that control maximum/minimum readahead for ceph
    clients.
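
The readahead options in question appear to be the ones listed in the config dump of the original report. As a sketch, assuming they are applied to the `client` section with `ceph config set` (the helper name `readahead_cmds` is hypothetical, and it only prints the commands instead of running them):

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the commands rather than run them.
# Option names and values are copied from the config dump in the original report.
readahead_cmds() {
    local option value
    while read -r option value; do
        echo "ceph config set client ${option} ${value}"
    done <<'EOF'
client_readahead_max_bytes 4096
client_readahead_max_periods 1
client_readahead_min 0
EOF
}

readahead_cmds
```

These options are assumed here to be read by the userspace client (ceph-fuse/libcephfs); for a kernel mount the readahead knob is the rasize mount option shown in the original report.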

    Thanks and Regards,
    Ashu Pachauri

    On Fri, Mar 17, 2023 at 2:14 PM Xiubo Li <xiubli@xxxxxxxxxx> wrote:

    On 15/03/2023 17:20, Frank Schilder wrote:
    > Hi Ashu,
    >
    > are you talking about the kernel client? I can't find "stripe size"
    > anywhere in its mount-documentation. Could you possibly post exactly
    > what you did? Mount fstab line, config setting?

    There is no mount option for this in either the userspace or the
    kernel client. Instead, you need to change the file layout, which by
    default is 4MB stripe_unit, 1 stripe_count and 4MB object_size.

    A smaller stripe_unit will certainly work. But IMO it depends on the
    workload, so be careful: changing the layout may cause other
    performance issues in some cases. For example, a stripe_unit that is
    too small may split a sync read into more OSD requests to different
    OSDs.
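
This trade-off can be sketched with some arithmetic, assuming that a read is split at stripe_unit boundaries. Whether each piece really becomes a separate OSD request, and whether the pieces land on different OSDs, depends on stripe_count and on the client, so the helper below (the name `units_spanned` is hypothetical) only counts how many stripe units a read touches.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the stripe units covered by a read of
# <length> bytes starting at byte <offset>.
units_spanned() {
    local offset=$1 length=$2 stripe_unit=$3
    local first=$(( offset / stripe_unit ))
    local last=$(( (offset + length - 1) / stripe_unit ))
    echo $(( last - first + 1 ))
}

# One file from the test in this thread read in full (bs=4k count=100 = 409600 bytes).
echo "4 MB stripe_unit:   $(units_spanned 0 409600 4194304) unit(s)"
echo "64 KiB stripe_unit: $(units_spanned 0 409600 65536) unit(s)"
```

With the default 4MB stripe_unit the whole file sits in one unit, while a 64 KiB stripe_unit spreads the same read over seven units.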

    I will prepare a patch to make the kernel client smarter, instead of
    always blindly setting it to stripe_unit.

    Thanks

    - Xiubo

    >
    > Thanks!
    > =================
    > Frank Schilder
    > AIT Risø Campus
    > Bygning 109, rum S14
    >
    > ________________________________________
    > From: Ashu Pachauri <ashu210890@xxxxxxxxx>
    > Sent: 14 March 2023 19:23:42
    > To: ceph-users@xxxxxxx
    > Subject:  Re: CephFS thrashing through the page cache
    >
    > Got the answer to my own question; posting here if someone else
    > encounters the same problem. The issue is that the default stripe size
    > in a cephfs mount is 4 MB. If you are doing small reads (like 4k reads
    > in the test I posted) inside the file, you'll end up pulling at least
    > 4MB to the client (and then discarding most of the pulled data) even if
    > you set readahead to zero. So, the solution for us was to set a lower
    > stripe size, which aligns better with our workloads.
    >
    > Thanks and Regards,
    > Ashu Pachauri
    >
    >
    > On Fri, Mar 10, 2023 at 9:41 PM Ashu Pachauri <ashu210890@xxxxxxxxx> wrote:
    >
    >> Also, I am able to reproduce the network read amplification when I try
    >> to do very small reads from larger files, e.g.
    >>
    >> for i in $(seq 1 10000); do
    >>    dd if=test_${i} of=/dev/null bs=5k count=10
    >> done
    >>
    >>
    >> This piece of code generates 3.3 GB of network traffic while it actually
    >> reads approx 500 MB of data.
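
As a rough check of the figures above, the amplification factor can be worked out from the dd parameters and the reported traffic. This is only arithmetic on the numbers in the message; treating "3.3 GB" as a decimal figure is an assumption.

```shell
#!/usr/bin/env bash
files=10000
bytes_per_file=$(( 5 * 1024 * 10 ))         # dd bs=5k count=10
useful_bytes=$(( files * bytes_per_file ))  # data actually requested
traffic_bytes=3300000000                    # 3.3 GB observed, assumed decimal

# Integer arithmetic scaled by 100 to keep two decimal places.
ratio_x100=$(( traffic_bytes * 100 / useful_bytes ))
printf 'useful data:   %d bytes\n' "$useful_bytes"
printf 'amplification: %d.%02dx\n' $(( ratio_x100 / 100 )) $(( ratio_x100 % 100 ))
printf 'traffic/file:  %d bytes\n' $(( traffic_bytes / files ))
```

That is about 512 MB of requested data and roughly 6.4 times as much on the wire, or about 330 KB of traffic per file.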
    >>
    >>
    >> Thanks and Regards,
    >> Ashu Pachauri
    >>
    >> On Fri, Mar 10, 2023 at 9:22 PM Ashu Pachauri <ashu210890@xxxxxxxxx> wrote:
    >>
    >>> We have an internal use case where we back the storage of a proprietary
    >>> database by a shared file system. We noticed something very odd when
    >>> testing some workload with a local block device backed file system vs
    >>> cephfs. We noticed that the amount of network IO done by cephfs is almost
    >>> double compared to the IO done in case of a local file system backed by an
    >>> attached block device.
    >>>
    >>> We also noticed that CephFS thrashes through the page cache very quickly
    >>> compared to the amount of data being read and think that the two issues
    >>> might be related. So, I wrote a simple test.
    >>>
    >>> 1. I wrote 10k files 400KB each using dd (approx 4 GB data).
    >>> 2. I dropped the page cache completely.
    >>> 3. I then read these files serially, again using dd. The page cache usage
    >>> shot up to 39 GB for reading such a small amount of data.
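
A quick check of the numbers in the steps above, using only the dd parameters and the reported page cache usage (treating "39 GB" as a decimal figure is an assumption):

```shell
#!/usr/bin/env bash
files=10000
file_bytes=$(( 4 * 1024 * 100 ))   # dd bs=4k count=100 -> 409600 bytes per file
cache_bytes=39000000000            # 39 GB of page cache observed, assumed decimal

total_written=$(( files * file_bytes ))
per_file_cache=$(( cache_bytes / files ))

echo "data written:        ${total_written} bytes"
echo "page cache per file: ${per_file_cache} bytes"
```

This works out to roughly 3.9 MB of page cache per 400 KB file, which is close to the 4 MB default stripe size identified elsewhere in this thread.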
    >>>
    >>> Following is the code used to repro this in bash:
    >>>
    >>> for i in $(seq 1 10000); do
    >>>    dd if=/dev/zero of=test_${i} bs=4k count=100
    >>> done
    >>>
    >>> sync; echo 1 > /proc/sys/vm/drop_caches
    >>>
    >>> for i in $(seq 1 10000); do
    >>>    dd if=test_${i} of=/dev/null bs=4k count=100
    >>> done
    >>>
    >>>
    >>> The ceph version being used is:
    >>> ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)
    >>>
    >>> The ceph configs being overridden:
    >>> WHO     MASK  LEVEL     OPTION                                 VALUE        RO
    >>> mon           advanced  auth_allow_insecure_global_id_reclaim  false
    >>> mgr           advanced  mgr/balancer/mode                      upmap
    >>> mgr           advanced  mgr/dashboard/server_addr              127.0.0.1    *
    >>> mgr           advanced  mgr/dashboard/server_port              8443         *
    >>> mgr           advanced  mgr/dashboard/ssl                      false        *
    >>> mgr           advanced  mgr/prometheus/server_addr             0.0.0.0      *
    >>> mgr           advanced  mgr/prometheus/server_port             9283         *
    >>> osd           advanced  bluestore_compression_algorithm        lz4
    >>> osd           advanced  bluestore_compression_mode             aggressive
    >>> osd           advanced  bluestore_throttle_bytes               536870912
    >>> osd           advanced  osd_max_backfills                      3
    >>> osd           advanced  osd_op_num_threads_per_shard_ssd       8            *
    >>> osd           advanced  osd_scrub_auto_repair                  true
    >>> mds           advanced  client_oc                              false
    >>> mds           advanced  client_readahead_max_bytes             4096
    >>> mds           advanced  client_readahead_max_periods           1
    >>> mds           advanced  client_readahead_min                   0
    >>> mds           basic     mds_cache_memory_limit                 21474836480
    >>> client        advanced  client_oc                              false
    >>> client        advanced  client_readahead_max_bytes             4096
    >>> client        advanced  client_readahead_max_periods           1
    >>> client        advanced  client_readahead_min                   0
    >>> client        advanced  fuse_disable_pagecache                 false
    >>>
    >>> The cephfs mount options (note that readahead was disabled for this test):
    >>> /mnt/cephfs type ceph (rw,relatime,name=cephfs,secret=<hidden>,acl,rasize=0)
    >>>
    >>> Any help or pointers are appreciated; this is a major performance issue
    >>> for us.
    >>>
    >>>
    >>> Thanks and Regards,
    >>> Ashu Pachauri
    >>>
    > _______________________________________________
    > ceph-users mailing list -- ceph-users@xxxxxxx
    > To unsubscribe send an email to ceph-users-leave@xxxxxxx
    --
    Best Regards,

    Xiubo Li (李秀波)

    Email: xiubli@xxxxxxxxxx/xiubli@xxxxxxx
    Slack: @Xiubo Li
