Hi,
On 07.12.22 11:58, Stefan Kooman wrote:
On 5/13/22 09:38, Xiubo Li wrote:
On 5/12/22 12:06 AM, Stefan Kooman wrote:
Hi List,
We have quite a few linux kernel clients for CephFS. One of our
customers has been running mainline kernels (CentOS 7 elrepo) for
the past two years. They started out with 3.x kernels (default
CentOS 7), but upgraded to mainline when those kernels would
frequently generate MDS warnings like "failing to respond to
capability release". That worked fine until 5.14 kernel. 5.14 and up
would use a lot of CPU and *way* more bandwidth on CephFS than older
kernels (order of magnitude). After the MDS was upgraded from
Nautilus to Octopus that behavior is gone (comparable CPU /
bandwidth usage as older kernels). However, the newer kernels are
now the ones that give "failing to respond to capability release",
and worse, clients get evicted (unresponsive as far as the MDS is
concerned). Even the latest 5.17 kernels have that. No difference is
observed between using messenger v1 or v2. MDS version is 15.2.16.
Surprisingly the latest stable kernels from CentOS 7 work flawlessly
now. Although that is good news, newer operating systems come with
newer kernels.
Does anyone else observe the same behavior with newish kernel clients?
There are some known bugs which have recently been fixed, or are being
fixed, even in mainline; I'm not sure whether they are related. See for
example [1][2][3][4]. For more detail please see the ceph-client repo
testing branch [5].
None of the issues you mentioned were related. We gained some more
experience with newer kernel clients, specifically on Ubuntu Focal /
Jammy (5.15). Performance issues seem to arise in certain workloads,
specifically load-balanced Apache shared web hosting clusters with
CephFS. We have tested Linux kernel clients from 5.8 up to and
including 6.0 with a production workload, and the short summary is:
< 5.13, everything works fine
5.13 and up is giving issues
We tested 5.13-rc1 as well, and already that kernel shows the issues.
So something changed in 5.13 that results in a performance regression
for certain workloads, and I wonder whether it has something to do with
the fscache-related changes that have been, and still are, happening in
the kernel. These web servers might access the same directories / files
concurrently.
Note: we have quite a few 5.15 kernel clients that do not run any
(load-balanced) web-based workload (container clusters on CephFS), and
those do not show any performance issues on these kernels.
Issue: poor CephFS performance
Symptom / result: excessive CephFS network usage (an order of magnitude
higher than for older kernels that don't have this issue); within a
minute there are a number of slow web service processes claiming large
amounts of virtual memory, which results in heavy swap usage and renders
the node unusably slow.
Other users that replied to this thread experienced similar symptoms.
It is reproducible both on CentOS (ELRepo mainline kernels) and on
Ubuntu (HWE as well as the default release kernel).
MDS version used: 15.2.16 (with a backported patch from 15.2.17)
(single active / standby-replay)
We are making similar observations, running Ubuntu 20.04 with kernel
5.15.0-56-generic (the latest generic HWE kernel). The Ceph version is 16.2.10.
If I run a single dd or cp to write a file to a CephFS directory,
performance itself is fine:
$ dd if=some_large_file of=bar bs=4k status=progress
10599006208 bytes (11 GB, 9.9 GiB) copied, 59 s, 179 MB/s
2621352+1 records in
2621352+1 records out
10737059840 bytes (11 GB, 10 GiB) copied, 59.5095 s, 180 MB/s
(The 4k block size is just for testing; performance with a larger block size is better.)
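For comparison, the same copy with a larger block size would look like this
(command only; I've left out the numbers since they vary between runs, and
4M is an arbitrary choice):
$ dd if=some_large_file of=bar bs=4M status=progress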
The troubling aspect is the cache state of both files. The host I'm
testing on has more than enough free RAM and no other running processes,
so there's no memory pressure at all. But both the input and output
files are only partially cached:
$ fincore some_large_file
RES PAGES SIZE FILE
963.1M 246542 10G some_large_file
$ fincore bar
RES PAGES SIZE FILE
1017.8M 260540 10G bar
I would have expected both files to be cached completely, since there's no
reason to evict cached data. The actual amount of cached data varies, e.g.:
$ dd if=some_large_file of=/dev/null status=progress
10682888704 bytes (11 GB, 9.9 GiB) copied, 69 s, 155 MB/s
20970820+0 records in
20970820+0 records out
10737059840 bytes (11 GB, 10 GiB) copied, 69.3027 s, 155 MB/s
$ fincore some_large_file
RES PAGES SIZE FILE
1.9G 494537 10G some_large_file
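(For anyone trying to reproduce this: to start a read test from a clean
state, the dirty pages can be flushed and the page cache dropped first;
this needs root and affects the whole host:
# sync
# echo 3 > /proc/sys/vm/drop_caches
)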
The worst case I'm currently observing is gzip jobs on our Slurm
cluster. They have been running for over 40 hours now with about 70 GB
of input and have progressed roughly 50%. Their output file is completely
uncached:
# fincore compressed_input
RES PAGES SIZE FILE
73.9G 19367998 73.9G compressed_input
# fincore uncompressed_output
RES PAGES SIZE FILE
0B 0 68.1G uncompressed_output
gzip uses small writes and writes to stdout, which is redirected to a
file. I'm not sure whether any application-level cache is used in this
setup. The process is using just about 1% CPU, is blocked in D state
most of the time, and each write call in strace takes up to half a
second:
# strace -t -r -p 2880242
strace: Process 2880242 attached
10:22:15 (+ 0.000000) read(3,
"Y\262\333\371\375\0336\307w0_\346\355\367\270\364\356\312\21)\27\264dtc\341\210\21\16\304o\30"...,
262144) = 262144
10:22:15 (+ 0.000621) write(1,
"@f8729197-00e1-4bb5-bd0f-da527d7"..., 32768) = 32768
10:22:16 (+ 0.134735) write(1,
"{{{{{{{{{{{{{{8--3{{H/+,{;-''')'"..., 1689) = 1689
10:22:16 (+ 0.305767) write(1,
"@1fd49721-18cd-42c9-80b6-d125735"..., 32768) = 32768
10:22:16 (+ 0.523892) write(1,
"88:9<<<>@62235?>?>>?B{?>DBB@@CAC"..., 27888) = 27888
(The input file was read into cache for testing purposes, and it also stays
there with 100% cache coverage...)
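To get an aggregate picture rather than individual calls, a syscall timing
summary could be collected for a while and the tracer then detached (a
sketch; same gzip pid as above, and -w makes the summary use wall-clock
time, which is what matters here since the process is blocked rather than
burning CPU):
# timeout 30 strace -c -w -p 2880242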
As far as I know there is currently no cache pressure on the MDS itself,
nor are there processes on other clients trying to access the output
file.
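One way to cross-check that (a sketch; the debugfs directory name contains
the cluster fsid and client id, and <mds-name> is a placeholder) would be
to compare what the client thinks it holds with what the MDS sees:
# cat /sys/kernel/debug/ceph/*/caps
# ceph tell mds.<mds-name> session ls
The first command lists the capabilities the kernel client currently holds
per inode; the second shows the sessions and cap counts from the MDS side.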
As a last test I read the already written content of the output file with
dd. Similar to the case above, this should trigger caching of its content
(or at least part of it):
# dd if=output_file of=/dev/null bs=1M status=progress
920649728 bytes (921 MB, 878 MiB) copied, 86 s, 10.7 MB/s^C
922+0 records in
921+0 records out
965738496 bytes (966 MB, 921 MiB) copied, 87.2725 s, 11.1 MB/s
I aborted this test after a short time since the result is obvious. I
haven't tried accessing the same file from multiple clients yet (the web
server cache problem reported by the thread starter), but I don't expect
a different result.
Summary:
The cache management in this kernel seems to be broken; content is
evicted too early, and writes seem to trigger a complete flush of
files. I assume that extra round trips to the MDS are necessary to
request cache capabilities (and to release them afterwards), which
results in a massive performance drop.
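If that assumption is correct, the slow writes should correlate with MDS
traffic from this client. A way to watch for that while the job runs
(a sketch; the debugfs path again depends on fsid and client id):
# watch -n1 cat /sys/kernel/debug/ceph/*/mdsc
The mdsc file lists the MDS requests currently in flight from this kernel
client.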
Since there are currently other metadata-heavy workloads on our cluster,
I'll recheck after these are finished, but I don't expect any
significant changes.
Question to the developers:
Is it possible to revert to the old cache behavior without downgrading
to a kernel < 5.13?
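If there is a mount option or client setting that restores the old
behavior, I'd be happy to test it; for reference, our current mount
options can be pulled from /proc/mounts:
# grep ceph /proc/mounts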
Regards,
Burkhard
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx