Hi Xiubo,

As you have correctly pointed out, I was talking about the stripe_unit setting
in the file layout configuration. Here is the documentation for anyone else's
reference: https://docs.ceph.com/en/quincy/cephfs/file-layouts/

As with any RAID0-style setup, the right stripe_unit is definitely workload
dependent. Our use case requires us to read anywhere from a few kilobytes to a
few hundred kilobytes at a time, so the 4MB default stripe_unit hurts quite a
bit. By reducing the stripe_unit we were able to achieve almost a 2x
improvement in average latency and overall throughput (for useful data). The
rule of thumb is to align the stripe_unit with your most common IO size.

> BTW, have you tried to set 'rasize' option to a small size instead of 0?
> Won't this work?

No, this won't work; I have tried it already. Since rasize only affects
readahead, the minimum IO size issued by the cephfs client will still be the
maximum of (rasize, stripe_unit). rasize is a useful knob only when you need
it to be larger than the stripe_unit. It is also worth pointing out that
setting rasize alone is not sufficient; one needs to change the corresponding
client configurations that control the maximum/minimum readahead as well.
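For reference, here is roughly how we apply the smaller stripe_unit via the
directory layout. This is only a sketch: the mount point, directory name and
the 64 KiB value are placeholders for whatever matches your workload, and a
directory layout only affects files created after it is set, so existing files
have to be rewritten to pick it up.

# Hypothetical path and value; object_size must stay a multiple of
# stripe_unit (the 4MB default object_size is fine for 64 KiB here).
setfattr -n ceph.dir.layout.stripe_unit -v 65536 /mnt/cephfs/mydata

# Verify the directory layout and the layout a new file inherits from it
getfattr -n ceph.dir.layout /mnt/cephfs/mydata
touch /mnt/cephfs/mydata/newfile
getfattr -n ceph.file.layout /mnt/cephfs/mydata/newfile

Whether 64 KiB (or any other value) is right depends entirely on your IO
pattern; as Xiubo notes below, going too small spreads a single read across
more OSD requests.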
Thanks and Regards,
Ashu Pachauri

On Fri, Mar 17, 2023 at 2:14 PM Xiubo Li <xiubli@xxxxxxxxxx> wrote:
>
> On 15/03/2023 17:20, Frank Schilder wrote:
> > Hi Ashu,
> >
> > are you talking about the kernel client? I can't find "stripe size"
> > anywhere in its mount documentation. Could you possibly post exactly what
> > you did? Mount fstab line, config setting?
>
> There is no mount option to do this in either the userspace or the kernel
> client. You need to change the file layout, which is (4MB stripe_unit,
> 1 stripe_count and 4MB object_size) by default, instead.
>
> Certainly a smaller stripe_unit will work. But IMO it will depend, and be
> careful: changing the layout may cause other performance issues in some
> cases, for example a too small stripe_unit may split a sync read into more
> osd requests to different OSDs.
>
> I will generate one patch to make the kernel client wiser instead of
> blindly setting it to the stripe_unit always.
>
> Thanks
>
> - Xiubo
>
> > Thanks!
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Ashu Pachauri <ashu210890@xxxxxxxxx>
> > Sent: 14 March 2023 19:23:42
> > To: ceph-users@xxxxxxx
> > Subject: Re: CephFS thrashing through the page cache
> >
> > Got the answer to my own question; posting here in case someone else
> > encounters the same problem. The issue is that the default stripe size in
> > a cephfs mount is 4 MB. If you are doing small reads (like the 4k reads in
> > the test I posted) inside the file, you'll end up pulling at least 4MB to
> > the client (and then discarding most of the pulled data) even if you set
> > readahead to zero. So, the solution for us was to set a lower stripe size,
> > which aligns better with our workloads.
> >
> > Thanks and Regards,
> > Ashu Pachauri
> >
> > On Fri, Mar 10, 2023 at 9:41 PM Ashu Pachauri <ashu210890@xxxxxxxxx> wrote:
> >
> >> Also, I am able to reproduce the network read amplification when I try to
> >> do very small reads from larger files, e.g.:
> >>
> >> for i in $(seq 1 10000); do
> >>   dd if=test_${i} of=/dev/null bs=5k count=10
> >> done
> >>
> >> This piece of code generates 3.3 GB of network traffic while it actually
> >> reads approx. 500 MB of data.
> >>
> >> Thanks and Regards,
> >> Ashu Pachauri
> >>
> >> On Fri, Mar 10, 2023 at 9:22 PM Ashu Pachauri <ashu210890@xxxxxxxxx> wrote:
> >>
> >>> We have an internal use case where we back the storage of a proprietary
> >>> database by a shared file system. We noticed something very odd when
> >>> testing some workloads with a local block device backed file system vs
> >>> cephfs: the amount of network IO done by cephfs is almost double the IO
> >>> done in the case of a local file system backed by an attached block
> >>> device.
> >>>
> >>> We also noticed that CephFS thrashes through the page cache very quickly
> >>> compared to the amount of data being read, and think that the two issues
> >>> might be related. So, I wrote a simple test:
> >>>
> >>> 1. I wrote 10k files of 400 KB each using dd (approx 4 GB of data).
> >>> 2. I dropped the page cache completely.
> >>> 3. I then read these files serially, again using dd. The page cache
> >>> usage shot up to 39 GB for reading such a small amount of data.
> >>>
> >>> Following is the code used to repro this in bash:
> >>>
> >>> for i in $(seq 1 10000); do
> >>>   dd if=/dev/zero of=test_${i} bs=4k count=100
> >>> done
> >>>
> >>> sync; echo 1 > /proc/sys/vm/drop_caches
> >>>
> >>> for i in $(seq 1 10000); do
> >>>   dd if=test_${i} of=/dev/null bs=4k count=100
> >>> done
> >>>
> >>> The ceph version being used is:
> >>> ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)
> >>>
> >>> The ceph configs being overridden:
> >>> WHO     MASK  LEVEL     OPTION                                  VALUE        RO
> >>> mon           advanced  auth_allow_insecure_global_id_reclaim  false
> >>> mgr           advanced  mgr/balancer/mode                       upmap
> >>> mgr           advanced  mgr/dashboard/server_addr               127.0.0.1    *
> >>> mgr           advanced  mgr/dashboard/server_port               8443         *
> >>> mgr           advanced  mgr/dashboard/ssl                       false        *
> >>> mgr           advanced  mgr/prometheus/server_addr              0.0.0.0      *
> >>> mgr           advanced  mgr/prometheus/server_port              9283         *
> >>> osd           advanced  bluestore_compression_algorithm         lz4
> >>> osd           advanced  bluestore_compression_mode              aggressive
> >>> osd           advanced  bluestore_throttle_bytes                536870912
> >>> osd           advanced  osd_max_backfills                       3
> >>> osd           advanced  osd_op_num_threads_per_shard_ssd        8            *
> >>> osd           advanced  osd_scrub_auto_repair                   true
> >>> mds           advanced  client_oc                               false
> >>> mds           advanced  client_readahead_max_bytes              4096
> >>> mds           advanced  client_readahead_max_periods            1
> >>> mds           advanced  client_readahead_min                    0
> >>> mds           basic     mds_cache_memory_limit                  21474836480
> >>> client        advanced  client_oc                               false
> >>> client        advanced  client_readahead_max_bytes              4096
> >>> client        advanced  client_readahead_max_periods            1
> >>> client        advanced  client_readahead_min                    0
> >>> client        advanced  fuse_disable_pagecache                  false
> >>>
> >>> The cephfs mount options (note that readahead was disabled for this test):
> >>> /mnt/cephfs type ceph (rw,relatime,name=cephfs,secret=<hidden>,acl,rasize=0)
> >>>
> >>> Any help or pointers are appreciated; this is a major performance issue
> >>> for us.
> >>>
> >>> Thanks and Regards,
> >>> Ashu Pachauri
> >>>
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
> --
> Best Regards,
>
> Xiubo Li (李秀波)
>
> Email: xiubli@xxxxxxxxxx/xiubli@xxxxxxx
> Slack: @Xiubo Li
>
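P.S. For anyone trying to reproduce the read-amplification numbers from the
quoted test above (roughly 3.3 GB on the wire for about 500 MB of useful
reads), this is approximately how we compare bytes received over the network
with bytes actually requested. It is only a sketch: the interface name is a
placeholder, and rx_bytes counts all traffic on that NIC, so treat the result
as a rough measure.

IFACE=eth0    # placeholder: the NIC that carries your Ceph client traffic

# Drop the page cache so the reads actually hit the OSDs (run as root)
sync; echo 1 > /proc/sys/vm/drop_caches

RX_BEFORE=$(cat /sys/class/net/${IFACE}/statistics/rx_bytes)
for i in $(seq 1 10000); do
  dd if=test_${i} of=/dev/null bs=5k count=10
done
RX_AFTER=$(cat /sys/class/net/${IFACE}/statistics/rx_bytes)

echo "bytes over the network: $((RX_AFTER - RX_BEFORE))"
echo "bytes requested by dd:  $((10000 * 10 * 5 * 1024))"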
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx