Re: CephFS thrashing through the page cache

Hi Xiubo,

As you have correctly pointed out, I was talking about the stripe_unit
setting in the file layout configuration. Here is the documentation for
that, for anyone else's reference:
https://docs.ceph.com/en/quincy/cephfs/file-layouts/

As with any RAID0 setup, the right stripe_unit is definitely workload
dependent. Our use case requires us to read anywhere from a few kilobytes
to a few hundred kilobytes at a time, and the default 4MB stripe_unit
hurts quite a bit there. By reducing the stripe_unit we were able to
achieve an almost 2x improvement in average latency and in overall
throughput of useful data. The rule of thumb is to align the stripe_unit
with your most common IO size.
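
For anyone who wants to try this: the layout is set through virtual
extended attributes, as described in the documentation linked above. A
minimal sketch (the 64KB value and the path are illustrative only; pick
a stripe_unit that matches your own common IO size):

  # New files inherit the layout of their parent directory:
  setfattr -n ceph.dir.layout.stripe_unit -v 65536 /mnt/cephfs/mydir
  # Verify:
  getfattr -n ceph.dir.layout /mnt/cephfs/mydir

Keep in mind that the layout of an individual file can only be changed
while the file is still empty, so directory layouts are usually the
practical way to roll this out.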

> BTW, have you tried to set the 'rasize' option to a small size instead
> of 0? Won't this work?

No, this won't work; I have tried it already. Since rasize only affects
readahead, the minimum IO size to the cephfs client will still be
max(rasize, stripe_unit). In other words, rasize is a useful knob only
when you need it to be larger than the stripe_unit. It's also worth
pointing out that setting rasize alone is not sufficient; one also needs
to change the corresponding client configurations that control the
maximum/minimum readahead for ceph clients.
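
For reference, a rough sketch of both knobs (the monitor address, paths,
and the 128KB value are illustrative only): the kernel client takes
rasize as a mount option, while the userspace client uses the
client_readahead_* options that also appear in the config dump further
down in this thread:

  # Kernel client: readahead window in bytes
  mount -t ceph mon1:6789:/ /mnt/cephfs -o name=cephfs,secretfile=/etc/ceph/cephfs.secret,rasize=131072

  # Userspace (fuse) client: cap readahead via the client options
  ceph config set client client_readahead_max_bytes 131072
  ceph config set client client_readahead_min 0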

Thanks and Regards,
Ashu Pachauri


On Fri, Mar 17, 2023 at 2:14 PM Xiubo Li <xiubli@xxxxxxxxxx> wrote:

>
> On 15/03/2023 17:20, Frank Schilder wrote:
> > Hi Ashu,
> >
> > are you talking about the kernel client? I can't find "stripe size"
> > anywhere in its mount documentation. Could you possibly post exactly
> > what you did? Mount fstab line, config setting?
>
> There is no mount option for this in either the userspace or the
> kernel client. You need to change the file layout instead, which by
> default is (4MB stripe_unit, 1 stripe_count and 4MB object_size).
>
> Certainly a smaller stripe_unit will work. But IMO it depends, and you
> should be careful: changing the layout may cause other performance
> issues in some cases. For example, a too-small stripe_unit may split a
> sync read into many more osd requests to different OSDs (roughly
> read_size / stripe_unit requests for a contiguous read, so a 4MB read
> that needs one request at the default would need 64 at a 64KB
> stripe_unit).
>
> I will prepare a patch to make the kernel client wiser here, instead
> of always blindly setting it to the stripe_unit.
>
> Thanks
>
> - Xiubo
>
>
> >
> > Thanks!
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Ashu Pachauri <ashu210890@xxxxxxxxx>
> > Sent: 14 March 2023 19:23:42
> > To: ceph-users@xxxxxxx
> > Subject:  Re: CephFS thrashing through the page cache
> >
> > Got the answer to my own question; posting here in case someone else
> > encounters the same problem. The issue is that the default stripe
> > size in a cephfs mount is 4 MB. If you are doing small reads (like
> > the 4k reads in the test I posted) inside a file, you'll end up
> > pulling at least 4MB to the client (and then discarding most of the
> > pulled data) even if you set readahead to zero. So, the solution for
> > us was to set a lower stripe size, which aligns better with our
> > workloads.
> >
> > Thanks and Regards,
> > Ashu Pachauri
> >
> >
> > On Fri, Mar 10, 2023 at 9:41 PM Ashu Pachauri <ashu210890@xxxxxxxxx> wrote:
> >
> >> Also, I am able to reproduce the network read amplification when I
> >> try to do very small reads from larger files, e.g.:
> >>
> >> for i in $(seq 1 10000); do
> >>    dd if=test_${i} of=/dev/null bs=5k count=10
> >> done
> >>
> >>
> >> This piece of code generates 3.3 GB of network traffic while it
> >> actually reads only approx 500 MB of data.
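> >>
> >> (Back-of-the-envelope: 10,000 files x 50 KB read per file is ~500 MB
> >> of useful data, so 3.3 GB on the wire works out to roughly 330 KB
> >> transferred per file, i.e. most of each 400 KB file gets pulled even
> >> though only its first 50 KB is read.)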
> >>
> >>
> >> Thanks and Regards,
> >> Ashu Pachauri
> >>
> >> On Fri, Mar 10, 2023 at 9:22 PM Ashu Pachauri <ashu210890@xxxxxxxxx>
> >> wrote:
> >>
> >>> We have an internal use case where we back the storage of a
> >>> proprietary database by a shared file system. We noticed something
> >>> very odd when testing some workload with a local block device backed
> >>> file system vs cephfs: the amount of network IO done by cephfs is
> >>> almost double the IO done in the case of a local file system backed
> >>> by an attached block device.
> >>>
> >>> We also noticed that CephFS thrashes through the page cache very
> >>> quickly compared to the amount of data being read, and we think the
> >>> two issues might be related. So, I wrote a simple test:
> >>>
> >>> 1. I wrote 10k files, 400KB each, using dd (approx 4 GB of data).
> >>> 2. I dropped the page cache completely.
> >>> 3. I then read these files serially, again using dd. The page cache
> >>> usage shot up to 39 GB for reading such a small amount of data.
> >>>
> >>> Following is the code used to repro this in bash:
> >>>
> >>> for i in $(seq 1 10000); do
> >>>    dd if=/dev/zero of=test_${i} bs=4k count=100
> >>> done
> >>>
> >>> sync; echo 1 > /proc/sys/vm/drop_caches
> >>>
> >>> for i in $(seq 1 10000); do
> >>>    dd if=test_${i} of=/dev/null bs=4k count=100
> >>> done
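> >>>
> >>> A simple way to watch the page cache grow during the read loop,
> >>> from another shell (command illustrative only):
> >>>
> >>> watch -n1 "grep -E '^(Cached|MemFree)' /proc/meminfo"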
> >>>
> >>>
> >>> The ceph version being used is:
> >>> ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus
> >>> (stable)
> >>>
> >>> The ceph configs being overridden:
> >>> WHO     MASK  LEVEL     OPTION                                  VALUE        RO
> >>> mon           advanced  auth_allow_insecure_global_id_reclaim  false
> >>> mgr           advanced  mgr/balancer/mode                       upmap
> >>> mgr           advanced  mgr/dashboard/server_addr               127.0.0.1    *
> >>> mgr           advanced  mgr/dashboard/server_port              8443         *
> >>> mgr           advanced  mgr/dashboard/ssl                       false        *
> >>> mgr           advanced  mgr/prometheus/server_addr              0.0.0.0      *
> >>> mgr           advanced  mgr/prometheus/server_port              9283         *
> >>> osd           advanced  bluestore_compression_algorithm         lz4
> >>> osd           advanced  bluestore_compression_mode              aggressive
> >>> osd           advanced  bluestore_throttle_bytes                536870912
> >>> osd           advanced  osd_max_backfills                       3
> >>> osd           advanced  osd_op_num_threads_per_shard_ssd        8            *
> >>> osd           advanced  osd_scrub_auto_repair                   true
> >>> mds           advanced  client_oc                               false
> >>> mds           advanced  client_readahead_max_bytes              4096
> >>> mds           advanced  client_readahead_max_periods            1
> >>> mds           advanced  client_readahead_min                    0
> >>> mds           basic     mds_cache_memory_limit                  21474836480
> >>> client        advanced  client_oc                               false
> >>> client        advanced  client_readahead_max_bytes              4096
> >>> client        advanced  client_readahead_max_periods            1
> >>> client        advanced  client_readahead_min                    0
> >>> client        advanced  fuse_disable_pagecache                  false
> >>>
> >>>
> >>> The cephfs mount options (note that readahead was disabled for
> >>> this test):
> >>> /mnt/cephfs type ceph
> >>> (rw,relatime,name=cephfs,secret=<hidden>,acl,rasize=0)
> >>>
> >>> Any help or pointers are appreciated; this is a major performance issue
> >>> for us.
> >>>
> >>>
> >>> Thanks and Regards,
> >>> Ashu Pachauri
> >>>
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >
> --
> Best Regards,
>
> Xiubo Li (李秀波)
>
> Email: xiubli@xxxxxxxxxx/xiubli@xxxxxxx
> Slack: @Xiubo Li
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



