On 17/03/2023 16:55, Ashu Pachauri wrote:
Hi Xiubo,
As you have correctly pointed out, I was talking about the stripe_unit
setting in the file layout configuration. Here is the documentation
for it, for anyone else's reference:
https://docs.ceph.com/en/quincy/cephfs/file-layouts/
As with any RAID0 setup, the stripe_unit is definitely workload
dependent. Our use case requires us to read anywhere from a few
kilobytes to a few hundred kilobytes at once, so the default 4MB
stripe_unit hurts quite a bit. We were able to achieve almost 2x
improvement in average latency and overall throughput (for useful
data) by reducing the stripe_unit. The rule of thumb is to align the
stripe_unit with your most common IO size.
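For example, one way to change it (a sketch; the directory path and the
64 KB value are illustrative, and only newly created files inherit a
changed directory layout, existing files keep theirs):

setfattr -n ceph.dir.layout.stripe_unit -v 65536 /mnt/cephfs/db
getfattr -n ceph.dir.layout /mnt/cephfs/db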
Yeah, IMO this is ugly anyway. It is not as smart as the mm layer.
> BTW, have you tried to set 'rasize' option to a small size instead of 0
> ? Won't this work ?
No, this won't work; I have tried it already. Since rasize only
affects readahead, the minimum IO size issued by the cephfs client
will still be max(rasize, stripe_unit). rasize is only useful when it
needs to be larger than the stripe_unit. Also, it's worth pointing out
that simply setting rasize is not sufficient; one also needs to change
the corresponding configurations that control maximum/minimum
readahead for ceph clients.
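To illustrate (the monitor address and paths are examples only; the
values match the settings shown later in this thread), the kernel
client's readahead cap is the rasize mount option, while
ceph-fuse/libcephfs clients use the client_readahead_* options:

mount -t ceph mon1:6789:/ /mnt/cephfs -o name=cephfs,secretfile=/etc/ceph/cephfs.secret,rasize=0
ceph config set client client_readahead_max_bytes 4096
ceph config set client client_readahead_max_periods 1
ceph config set client client_readahead_min 0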
Yeah, this should work for old kernels, before
ceph_netfs_expand_readahead() was introduced.
I will improve it next week.
Thanks for reporting this.
Thanks
- Xiubo
Thanks and Regards,
Ashu Pachauri
On Fri, Mar 17, 2023 at 2:14 PM Xiubo Li <xiubli@xxxxxxxxxx> wrote:
On 15/03/2023 17:20, Frank Schilder wrote:
> Hi Ashu,
>
> are you talking about the kernel client? I can't find "stripe size"
> anywhere in its mount-documentation. Could you possibly post exactly
> what you did? Mount fstab line, config setting?
There is no mount option for this in either the userspace or the
kernel client. You need to change the file layout instead, which
defaults to a 4MB stripe_unit, a stripe_count of 1 and a 4MB
object_size.
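For example (the file path is illustrative), you can inspect a file's
current layout via the virtual xattr; with the defaults the output looks
roughly like this:

getfattr -n ceph.file.layout /mnt/cephfs/somefile
# ceph.file.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs_data"

(the pool name will differ per cluster)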
Certainly a smaller stripe_unit will work. But IMO it depends, and be
careful: changing the layout may cause other performance issues in
some cases. For example, a stripe_unit that is too small may split a
sync read into more OSD requests to different OSDs.
I will prepare a patch to make the kernel client smarter, instead of
always blindly expanding reads to the stripe_unit.
Thanks
- Xiubo
>
> Thanks!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Ashu Pachauri <ashu210890@xxxxxxxxx>
> Sent: 14 March 2023 19:23:42
> To: ceph-users@xxxxxxx
> Subject: Re: CephFS thrashing through the page cache
>
> Got the answer to my own question; posting here if someone else
> encounters the same problem. The issue is that the default stripe size
> in a cephfs mount is 4 MB. If you are doing small reads (like 4k reads
> in the test I posted) inside the file, you'll end up pulling at least
> 4MB to the client (and then discarding most of the pulled data) even if
> you set readahead to zero. So, the solution for us was to set a lower
> stripe size, which aligns better with our workloads.
>
> Thanks and Regards,
> Ashu Pachauri
>
>
> On Fri, Mar 10, 2023 at 9:41 PM Ashu Pachauri <ashu210890@xxxxxxxxx> wrote:
>
>> Also, I am able to reproduce the network read amplification when I
>> try to do very small reads from larger files. e.g.
>>
>> for i in $(seq 1 10000); do
>> dd if=test_${i} of=/dev/null bs=5k count=10
>> done
>>
>>
>> This piece of code generates 3.3 GB of network traffic while it
>> actually reads approx 500 MB of data.
>>
>>
>> Thanks and Regards,
>> Ashu Pachauri
>>
>> On Fri, Mar 10, 2023 at 9:22 PM Ashu Pachauri <ashu210890@xxxxxxxxx> wrote:
>>
>>> We have an internal use case where we back the storage of a
>>> proprietary database with a shared file system. We noticed something
>>> very odd when testing some workloads with a local block-device-backed
>>> file system vs cephfs: the amount of network IO done by cephfs is
>>> almost double the IO done in the case of a local file system backed
>>> by an attached block device.
>>>
>>> We also noticed that CephFS thrashes through the page cache very
>>> quickly compared to the amount of data being read, and we think the
>>> two issues might be related. So, I wrote a simple test.
>>>
>>> 1. I wrote 10k files 400KB each using dd (approx 4 GB data).
>>> 2. I dropped the page cache completely.
>>> 3. I then read these files serially, again using dd. The page cache
>>> usage shot up to 39 GB for reading such a small amount of data.
>>>
>>> Following is the code used to repro this in bash:
>>>
>>> for i in $(seq 1 10000); do
>>> dd if=/dev/zero of=test_${i} bs=4k count=100
>>> done
>>>
>>> sync; echo 1 > /proc/sys/vm/drop_caches
>>>
>>> for i in $(seq 1 10000); do
>>> dd if=test_${i} of=/dev/null bs=4k count=100
>>> done
>>>
>>>
>>> The ceph version being used is:
>>> ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)
>>>
>>> The ceph configs being overridden:
>>> WHO     MASK  LEVEL     OPTION                                  VALUE        RO
>>> mon           advanced  auth_allow_insecure_global_id_reclaim  false
>>> mgr           advanced  mgr/balancer/mode                       upmap
>>> mgr           advanced  mgr/dashboard/server_addr               127.0.0.1    *
>>> mgr           advanced  mgr/dashboard/server_port               8443         *
>>> mgr           advanced  mgr/dashboard/ssl                       false        *
>>> mgr           advanced  mgr/prometheus/server_addr              0.0.0.0      *
>>> mgr           advanced  mgr/prometheus/server_port              9283         *
>>> osd           advanced  bluestore_compression_algorithm         lz4
>>> osd           advanced  bluestore_compression_mode              aggressive
>>> osd           advanced  bluestore_throttle_bytes                536870912
>>> osd           advanced  osd_max_backfills                       3
>>> osd           advanced  osd_op_num_threads_per_shard_ssd        8            *
>>> osd           advanced  osd_scrub_auto_repair                   true
>>> mds           advanced  client_oc                               false
>>> mds           advanced  client_readahead_max_bytes              4096
>>> mds           advanced  client_readahead_max_periods            1
>>> mds           advanced  client_readahead_min                    0
>>> mds           basic     mds_cache_memory_limit                  21474836480
>>> client        advanced  client_oc                               false
>>> client        advanced  client_readahead_max_bytes              4096
>>> client        advanced  client_readahead_max_periods            1
>>> client        advanced  client_readahead_min                    0
>>> client        advanced  fuse_disable_pagecache                  false
>>>
>>> The cephfs mount options (note that readahead was disabled for this
>>> test):
>>> /mnt/cephfs type ceph (rw,relatime,name=cephfs,secret=<hidden>,acl,rasize=0)
>>>
>>> Any help or pointers are appreciated; this is a major performance
>>> issue for us.
>>>
>>>
>>> Thanks and Regards,
>>> Ashu Pachauri
>>>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
--
Best Regards,
Xiubo Li (李秀波)
Email: xiubli@xxxxxxxxxx/xiubli@xxxxxxx
Slack: @Xiubo Li
--
Best Regards,
Xiubo Li (李秀波)
Email: xiubli@xxxxxxxxxx/xiubli@xxxxxxx
Slack: @Xiubo Li
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx