On 15/03/2023 17:20, Frank Schilder wrote:
Hi Ashu,
Are you talking about the kernel client? I can't find "stripe size" anywhere in its mount documentation. Could you post exactly what you did, e.g. the fstab mount line and any config settings?
There is no mount option for this in either the userspace or the kernel
client. Instead, you need to change the file layout, which defaults to
stripe_unit=4MB, stripe_count=1 and object_size=4MB.
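To check which layout a given file actually has, CephFS exposes it through the ceph.file.layout virtual xattr, which getfattr can read. The sketch below assumes the value has the usual space-separated key=value form (e.g. stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs_data; the pool name is only an example). The helper names layout_field and show_layout are hypothetical.

```bash
# layout_field LAYOUT_STRING FIELD
# Print the value of FIELD from a "key=value key=value ..." layout string;
# return non-zero if FIELD is not present.
layout_field() {
    local layout=$1 field=$2 kv
    for kv in $layout; do            # intentional word splitting on spaces
        if [ "${kv%%=*}" = "$field" ]; then
            echo "${kv#*=}"
            return 0
        fi
    done
    return 1
}

# show_layout FILE
# Read the layout vxattr of FILE on a CephFS mount (getfattr comes from the
# attr package); --only-values prints the raw value without the xattr name.
show_layout() {
    getfattr --only-values -n ceph.file.layout "$1"
}
```

For example, layout_field "$(show_layout /mnt/cephfs/test_1)" stripe_unit should print 4194304 for a file that still has the default layout.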
A smaller stripe_unit will certainly work. But IMO whether it helps
depends on the workload, so be careful: changing the layout may cause
other performance issues in some cases. For example, a stripe_unit that
is too small may split a sync read into more OSD requests to different OSDs.
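To make that trade-off concrete, the sketch below applies the standard Ceph file-striping arithmetic (block number, then stripe number and position, then object set, then object number) to a single read. It prints how many stripe-unit pieces the byte range is cut into and how many distinct RADOS objects those pieces fall in. Treating each piece as one OSD request is a simplifying assumption, and read_split is a hypothetical helper, not a Ceph tool.

```bash
# read_split OFFSET LENGTH STRIPE_UNIT STRIPE_COUNT OBJECT_SIZE
# Print "<stripe-unit pieces> <distinct objects>" for one read.
read_split() {
    local off=$1 len=$2 su=$3 sc=$4 os=$5
    local first=$(( off / su ))                 # first stripe unit touched
    local last=$(( (off + len - 1) / su ))      # last stripe unit touched
    local su_per_obj=$(( os / su ))             # stripe units per object
    local -A objs=()                            # set of object numbers
    local b stripeno pos objset
    for (( b = first; b <= last; b++ )); do
        stripeno=$(( b / sc ))                  # which stripe
        pos=$(( b % sc ))                       # position within the stripe
        objset=$(( stripeno / su_per_obj ))     # which object set
        objs[$(( objset * sc + pos ))]=1
    done
    echo "$(( last - first + 1 )) ${#objs[@]}"
}
```

In this model, read_split 0 1048576 65536 1 4194304 gives "16 1" (16 pieces, all in one object), while the same read with stripe_count=4 gives "16 4". A smaller stripe_unit raises the number of pieces per read; whether they land in different objects, and hence possibly on different OSDs, depends on stripe_count and object_size.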
I will write a patch to make the kernel client smarter here, instead of
always blindly setting it to the stripe_unit.
Thanks
- Xiubo
Thanks!
=================
Frank Schilder
AIT Risø Campus
Building 109, room S14
________________________________________
From: Ashu Pachauri <ashu210890@xxxxxxxxx>
Sent: 14 March 2023 19:23:42
To: ceph-users@xxxxxxx
Subject: Re: CephFS thrashing through the page cache
Got the answer to my own question; posting it here in case someone else
encounters the same problem. The issue is that the default stripe size
(stripe_unit) in a CephFS mount is 4 MB. If you are doing small reads
(like the 4k reads in the test I posted) inside a file, you'll end up
pulling at least 4 MB to the client (and then discarding most of the
pulled data) even if you set readahead to zero. So the solution for us
was to set a lower stripe size, which aligns better with our workloads.
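The message does not say exactly how the stripe size was lowered. One way, sketched here under assumptions, is to set the ceph.dir.layout.stripe_unit virtual xattr on the directory that will hold the files: new files inherit the directory's layout, while files that already contain data keep theirs. The 64 KiB granularity check reflects what we believe Ceph enforces for stripe_unit and object_size, so verify it against your release. valid_layout and set_dir_stripe_unit are hypothetical helpers, and the fixed 4 MB object_size in the wrapper assumes the default layout.

```bash
# valid_layout STRIPE_UNIT STRIPE_COUNT OBJECT_SIZE
# Succeed only if the combination passes the checks assumed here:
# non-zero values, 64 KiB granularity, object_size a multiple of stripe_unit.
valid_layout() {
    local su=$1 sc=$2 os=$3 min=65536
    [ "$su" -gt 0 ] && [ "$sc" -gt 0 ] && [ "$os" -gt 0 ] || return 1
    [ $(( su % min )) -eq 0 ] && [ $(( os % min )) -eq 0 ] || return 1
    [ $(( os % su )) -eq 0 ]
}

# set_dir_stripe_unit DIR STRIPE_UNIT
# Only files created in DIR afterwards pick up the new stripe_unit.
# Assumes the default object_size of 4 MB is kept.
set_dir_stripe_unit() {
    local dir=$1 su=$2
    valid_layout "$su" 1 4194304 || { echo "invalid stripe_unit: $su" >&2; return 1; }
    setfattr -n ceph.dir.layout.stripe_unit -v "$su" "$dir"
}
```

For example, set_dir_stripe_unit /mnt/cephfs/db 65536 (a hypothetical path and an illustrative value) would have to run before the database creates its files; the caveat above about extra OSD requests still applies to whatever value is chosen.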
Thanks and Regards,
Ashu Pachauri
On Fri, Mar 10, 2023 at 9:41 PM Ashu Pachauri <ashu210890@xxxxxxxxx> wrote:
Also, I am able to reproduce the network read amplification when doing
very small reads from larger files, e.g.:
# Read 10 x 5 KiB = 50 KiB from the start of each file
for i in $(seq 1 10000); do
    dd if=test_${i} of=/dev/null bs=5k count=10
done
This generates 3.3 GB of network traffic while actually reading only
about 500 MB of data (10,000 files x 10 blocks x 5 KiB = 512,000,000 bytes).
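The amplification factor follows from a little shell arithmetic, assuming bs=5k means 5 x 1024 bytes (as in GNU dd) and reading "3.3 GB" as 3.3 x 10^9 bytes. Bash arithmetic is integer-only, so the ratio is scaled by 100. The helper names are hypothetical.

```bash
# requested_bytes FILES BLOCK_SIZE BLOCKS_PER_FILE
# Bytes the dd loop actually asks for.
requested_bytes() { echo $(( $1 * $2 * $3 )); }

# ratio_x100 NUMERATOR DENOMINATOR
# Integer ratio scaled by 100 (truncated, no floating point in bash).
ratio_x100() { echo $(( $1 * 100 / $2 )); }

req=$(requested_bytes 10000 5120 10)     # 512000000 bytes requested
amp=$(ratio_x100 3300000000 "$req")      # wire bytes / requested bytes, x100
per_file=$(( 3300000000 / 10000 ))       # average wire bytes per file
echo "requested=$req amplification_x100=$amp wire_per_file=$per_file"
```

That is roughly a 6.4x amplification, or about 330,000 bytes on the wire per file for 51,200 bytes requested.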
Thanks and Regards,
Ashu Pachauri
On Fri, Mar 10, 2023 at 9:22 PM Ashu Pachauri <ashu210890@xxxxxxxxx> wrote:
We have an internal use case where we back the storage of a proprietary
database with a shared file system. We noticed something very odd when
testing a workload on a file system backed by a local block device versus
on CephFS: the amount of network I/O done by CephFS is almost double the
I/O done by a local file system on an attached block device.
We also noticed that CephFS thrashes through the page cache very quickly
compared to the amount of data being read, and we think the two issues
might be related. So I wrote a simple test:
1. I wrote 10k files of 400 KB each using dd (approx. 4 GB of data).
2. I dropped the page cache completely.
3. I then read these files serially, again using dd. The page cache usage
shot up to 39 GB for reading this small amount of data.
The following bash code reproduces this:
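# Write 10,000 files of 400 KiB each (100 x 4 KiB blocks)
for i in $(seq 1 10000); do
    dd if=/dev/zero of=test_${i} bs=4k count=100
done
# Flush dirty data, then drop the page cache (requires root)
sync; echo 1 > /proc/sys/vm/drop_caches
# Read every file back serially in 4 KiB blocks
for i in $(seq 1 10000); do
    dd if=test_${i} of=/dev/null bs=4k count=100
done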
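To put a number on the page-cache growth, one option is to sample the Cached: line of /proc/meminfo (reported in KiB) before and after the read loop and compare the difference with the amount of data the loop requests. This assumes a Linux host; cached_kib and expected_kib are hypothetical helpers.

```bash
# cached_kib
# Print the current page-cache size in KiB from /proc/meminfo. The pattern
# is anchored so that the "SwapCached:" line is not matched.
cached_kib() {
    awk '/^Cached:/ { print $2 }' /proc/meminfo
}

# expected_kib FILES BLOCK_SIZE_BYTES BLOCKS_PER_FILE
# Data the read loop actually requests, in KiB.
expected_kib() { echo $(( $1 * $2 * $3 / 1024 )); }
```

For example, record before=$(cached_kib), run the read loop, then compare $(( $(cached_kib) - before )) with expected_kib 10000 4096 100, which is 4,000,000 KiB (about 3.8 GiB). The reported growth to 39 GB is roughly ten times that.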
The Ceph version being used is:
ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)
The Ceph configs being overridden:
WHO     MASK  LEVEL     OPTION                                 VALUE        RO
mon           advanced  auth_allow_insecure_global_id_reclaim  false
mgr           advanced  mgr/balancer/mode                      upmap
mgr           advanced  mgr/dashboard/server_addr              127.0.0.1    *
mgr           advanced  mgr/dashboard/server_port              8443         *
mgr           advanced  mgr/dashboard/ssl                      false        *
mgr           advanced  mgr/prometheus/server_addr             0.0.0.0      *
mgr           advanced  mgr/prometheus/server_port             9283         *
osd           advanced  bluestore_compression_algorithm        lz4
osd           advanced  bluestore_compression_mode             aggressive
osd           advanced  bluestore_throttle_bytes               536870912
osd           advanced  osd_max_backfills                      3
osd           advanced  osd_op_num_threads_per_shard_ssd       8            *
osd           advanced  osd_scrub_auto_repair                  true
mds           advanced  client_oc                              false
mds           advanced  client_readahead_max_bytes             4096
mds           advanced  client_readahead_max_periods           1
mds           advanced  client_readahead_min                   0
mds           basic     mds_cache_memory_limit                 21474836480
client        advanced  client_oc                              false
client        advanced  client_readahead_max_bytes             4096
client        advanced  client_readahead_max_periods           1
client        advanced  client_readahead_min                   0
client        advanced  fuse_disable_pagecache                 false
The CephFS mount options (note that readahead was disabled for this test with rasize=0):
/mnt/cephfs type ceph (rw,relatime,name=cephfs,secret=<hidden>,acl,rasize=0)
Any help or pointers are appreciated; this is a major performance issue
for us.
Thanks and Regards,
Ashu Pachauri
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
--
Best Regards,
Xiubo Li (李秀波)
Email: xiubli@xxxxxxxxxx/xiubli@xxxxxxx
Slack: @Xiubo Li