Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

Özkan Göksu <ozkangksu@xxxxxxxxx> · Wed, 17 Jan 2024 08:29:32 +0300

All of my clients are servers two hops away on a 10 Gbit network, each with
2x Xeon CPUs (16+ cores), at least 64GB of RAM, and an SSD OS drive + 8GB spare.
I use the ceph kernel mount only, and this is the command:
- mount.ceph admin@$fsid.ud-data=/volumes/subvolumegroup ${MOUNT_DIR} -o name=admin,secret=XXX==,mon_addr=XXX

I think all of my clients have enough resources to answer MDS requests very
quickly. The only remaining explanation for a client failing to respond to
cache pressure is the default settings of the cephfs client or the MDS server.

I have trouble understanding how the cephfs client works and why it
needs to communicate with the MDS server to manage its local cache.
Even at the beginning I didn't understand why the MDS server needs direct
control over clients and tells them what to do. I can't grasp the concept
or the logic behind it.
To me, clients should be independent and manage their data flow
without any server-side control. The client should send read and write
requests to the remote server and return the answer to the kernel.
A client can have a read cache management feature, but it should not need
to communicate with the remote server for it. When a client detects multiple
reads of the same object, it should cache the object according to a set of
rules and release it when needed.
I don't understand why the MDS needs to tell clients to release their
allocations and why the client needs to report the release status back...
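
One way to frame the question above: in CephFS the MDS hands out
capabilities ("caps") on inodes to clients, and it cannot drop an inode from
its own cache while a client still holds a cap on it. The sketch below is
only a toy model of that dependency, not Ceph code; the class name, method
names, and the idea of counting inodes against a limit are all made up for
illustration.

```python
from collections import defaultdict


class ToyMDS:
    """Toy model: an inode stays pinned in the MDS cache while any
    client holds a capability on it. Not real Ceph logic."""

    def __init__(self, cache_limit):
        self.cache_limit = cache_limit      # hypothetical limit, in inodes
        self.holders = defaultdict(set)     # inode -> clients holding a cap

    def grant_cap(self, client, inode):
        self.holders[inode].add(client)

    def release_cap(self, client, inode):
        clients = self.holders.get(inode)
        if clients is None:
            return
        clients.discard(client)
        if not clients:
            # No client pins the inode any more, so the MDS may trim it.
            del self.holders[inode]

    def over_limit(self):
        return len(self.holders) > self.cache_limit

    def caps_per_client(self):
        counts = defaultdict(int)
        for clients in self.holders.values():
            for c in clients:
                counts[c] += 1
        return dict(counts)
```

In this model the only way the server gets back under its limit is for
clients to give caps back, which is why it has to ask them and why it needs
to hear whether they did.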

The logical answer, I think, is that I'm looking at this from the wrong
angle and this is not the kind of cache I know from block filesystems.

In my use case, clients read 50-100GB of data (10,000+ objects) only
once or twice per run, over a few hours.

------------------------------------------------------------------------------------
While I was researching, I saw that some users recommend decreasing
"mds_max_caps_per_client" from 1M to 64K:
# ceph config set mds mds_max_caps_per_client 65536

But if you check the client session output in my previous mail you will see
"num_caps": 52092 for the client that is failing to respond to cache pressure.
So it's already under 64K, and I'm not sure whether changing this value would help.
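
The comparison above can be sketched for all sessions at once. This is a
minimal example that assumes the output of "ceph tell mds.<name> session ls"
has been saved as JSON with the fields shown in the previous mail
(num_caps, recall_caps.value, release_caps.value). The function name and
the choice to compare recall_caps against mds_recall_warning_threshold
(262144 in my settings) are assumptions for illustration, not a statement
of how the MDS decides to raise the warning.

```python
def summarize_sessions(sessions, max_caps_per_client=65536,
                       recall_warning_threshold=262144):
    """Return one row per client session, worst recall backlog first."""
    rows = []
    for s in sessions:
        recall = s["recall_caps"]["value"]
        release = s["release_caps"]["value"]
        rows.append({
            "id": s["id"],
            "hostname": s.get("client_metadata", {}).get("hostname", "?"),
            "num_caps": s["num_caps"],
            # True only if the client holds more caps than the proposed limit.
            "over_cap_limit": s["num_caps"] > max_caps_per_client,
            # Assumption: a large recall counter is what the warning reflects.
            "recall_over_warning": recall > recall_warning_threshold,
            "recall_minus_release": recall - release,
        })
    rows.sort(key=lambda r: r["recall_minus_release"], reverse=True)
    return rows
```

To use it, load the saved JSON with json.load() and pass the resulting list
to summarize_sessions(). For the session in the previous mail it reports
over_cap_limit as False, which matches the doubt above about whether
lowering the limit to 64K would change anything for that client.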

I want to repeat my main goal.
I'm not trying to solve the cache pressure warning.
Ceph random read and write performance is not good, and a lot of reads
from 80+ clients create latency.
I'm trying to increase the speed by running multiple active MDS daemons,
maybe even binding subvolumes to specific MDS servers, to decrease the latency.

Also, when I check MDS CPU usage I see 120%+ usage from time to time. But
when I check the CPU load on the server hosting the MDS, I see the MDS only
uses 2-4 cores and the other CPU cores are almost idle.
I think the MDS has a CPU core limit and I need to raise it to
decrease the latency. How can I do that?
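
On binding subvolumes to specific MDS servers: CephFS supports pinning a
directory to an MDS rank through the ceph.dir.pin extended attribute, which
only has an effect when more than one MDS is active. The sketch below just
generates the setfattr commands for a round-robin assignment so they can be
reviewed before running anything. The function name and the example paths
are hypothetical, and round-robin is only one possible policy. As far as I
understand, an MDS daemon does most of its request handling in a single
thread, so spreading directories over several ranks is the usual way to use
more cores, rather than raising a per-daemon core setting.

```python
import shlex


def pin_commands(subvolume_dirs, num_ranks):
    """Build setfattr commands that pin each directory to an MDS rank,
    assigning ranks round-robin over the sorted directory list."""
    if num_ranks < 1:
        raise ValueError("num_ranks must be at least 1")
    cmds = []
    for i, path in enumerate(sorted(subvolume_dirs)):
        rank = i % num_ranks
        cmds.append(
            f"setfattr -n ceph.dir.pin -v {rank} {shlex.quote(path)}"
        )
    return cmds
```

For example, pin_commands(["/mnt/ud-data/volumes/b", "/mnt/ud-data/volumes/a"], 2)
would assign rank 0 to .../a and rank 1 to .../b (both paths are made up).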

On Wed, 17 Jan 2024 at 07:44, Özkan Göksu <ozkangksu@xxxxxxxxx> wrote:

> Let me share some outputs about my cluster.
>
> root@ud-01:~# ceph fs status
> ud-data - 84 clients
> =======
> RANK  STATE           MDS              ACTIVITY     DNS    INOS   DIRS   CAPS
>  0    active  ud-data.ud-02.xcoojt  Reqs:   31 /s  3022k  3021k  52.6k   385k
>         POOL           TYPE     USED  AVAIL
> cephfs.ud-data.meta  metadata   136G  44.4T
> cephfs.ud-data.data    data    45.2T  44.4T
>     STANDBY MDS
> ud-data.ud-03.lhwkml
> ud-data.ud-05.rnhcfe
> ud-data.ud-01.uatjle
> ud-data.ud-04.seggyv
>
> --------------------------------------------------------------------------
> This is the "ceph tell mds.ud-data.ud-02.xcoojt session ls" output for the
> client reported in the cache pressure warning.
>
>     {
>         "id": 1282205,
>         "entity": {
>             "name": {
>                 "type": "client",
>                 "num": 1282205
>             },
>             "addr": {
>                 "type": "v1",
>                 "addr": "172.16.3.48:0",
>                 "nonce": 2169935642
>             }
>         },
>         "state": "open",
>         "num_leases": 0,
>         "num_caps": 52092,
>         "request_load_avg": 1,
>         "uptime": 75754.745608647994,
>         "requests_in_flight": 0,
>         "num_completed_requests": 0,
>         "num_completed_flushes": 1,
>         "reconnecting": false,
>         "recall_caps": {
>             "value": 2577232.0049106553,
>             "halflife": 60
>         },
>         "release_caps": {
>             "value": 1.4093491463510395,
>             "halflife": 60
>         },
>         "recall_caps_throttle": {
>             "value": 63733.985544098425,
>             "halflife": 1.5
>         },
>         "recall_caps_throttle2o": {
>             "value": 19452.428409271757,
>             "halflife": 0.5
>         },
>         "session_cache_liveness": {
>             "value": 14.100272208890081,
>             "halflife": 300
>         },
>         "cap_acquisition": {
>             "value": 0,
>             "halflife": 10
>         },
>         "delegated_inos": [
>             {
>                 "start": "0x10004a1c031",
>                 "length": 282
>             },
>             {
>                 "start": "0x10004a1c33f",
>                 "length": 207
>             },
>             {
>                 "start": "0x10004a1cdda",
>                 "length": 6
>             },
>             {
>                 "start": "0x10004a3c12e",
>                 "length": 3
>             },
>             {
>                 "start": "0x1000f9831fe",
>                 "length": 2
>             }
>         ],
>         "inst": "client.1282205 v1:172.16.3.48:0/2169935642",
>         "completed_requests": [],
>         "prealloc_inos": [
>             {
>                 "start": "0x10004a1c031",
>                 "length": 282
>             },
>             {
>                 "start": "0x10004a1c33f",
>                 "length": 207
>             },
>             {
>                 "start": "0x10004a1cdda",
>                 "length": 6
>             },
>             {
>                 "start": "0x10004a3c12e",
>                 "length": 3
>             },
>             {
>                 "start": "0x1000f9831fe",
>                 "length": 2
>             },
>             {
>                 "start": "0x1000fa86e5f",
>                 "length": 54
>             },
>             {
>                 "start": "0x1000faa069c",
>                 "length": 501
>             }
>         ],
>         "client_metadata": {
>             "client_features": {
>                 "feature_bits": "0x0000000000007bff"
>             },
>             "metric_spec": {
>                 "metric_flags": {
>                     "feature_bits": "0x00000000000003ff"
>                 }
>             },
>             "entity_id": "admin",
>             "hostname": "bennevis-2",
>             "kernel_version": "5.15.0-91-generic",
>             "root": "/volumes/babblians"
>         }
>     }
>
> On Wed, 17 Jan 2024 at 07:22, Özkan Göksu <ozkangksu@xxxxxxxxx> wrote:
>
>> Hello Eugen.
>>
>> Thank you for the answer.
>> Based on the knowledge and test results in this issue:
>> https://github.com/ceph/ceph/pull/38574
>> I tried their advice and applied the following changes.
>>
>> max_mds = 4
>> standby_mds = 1
>> mds_cache_memory_limit = 16GB
>> mds_recall_max_caps = 40000
>>
>> One day after I set these parameters, I saw this log:
>> [8531248.982954] Out of memory: Killed process 1580586 (ceph-mds)
>> total-vm:70577592kB, anon-rss:70244236kB, file-rss:0kB, shmem-rss:0kB,
>> UID:167 pgtables:137832kB oom_score_adj:0
>>
>> All the MDS services leaked memory and were killed by the kernel.
>> Because of this I changed the settings as below. It is stable now, but
>> performance is very poor and I still get cache pressure alerts.
>>
>> max_mds = 1
>> standby_mds = 5
>> mds_cache_memory_limit = 8GB
>> mds_recall_max_caps = 30000
>>
>> I'm very surprised that you advise decreasing
>> "mds_recall_max_caps", because it is the opposite of what the developers
>> advised in the issue I sent.
>> It is very hard to play around with MDS parameters without an expert-level
>> understanding of what these parameters stand for and how they will affect
>> the behavior.
>> Because of this I'm trying to understand the MDS code flow, and I'm very
>> interested in learning more and tuning my system by debugging and
>> understanding my own data flow and MDS usage.
>>
>> I have a very unusual data flow and I think I need to configure the system
>> for this case.
>> I have 80+ clients, and through all of them my users read a range of
>> objects, compare them on the GPU, generate new data, and write the new
>> data back to the cluster.
>> This means my clients usually read each object only once and do not
>> read the same object again. Sometimes the same user runs multiple services
>> on multiple clients, and these services can read the same data from
>> different clients.
>>
>> So a large cache is useless for my use case. I need to set up the MDS
>> and the CephFS client for this data flow.
>> When I debug MDS RAM usage, I see high allocation all the time and I
>> wonder why. If none of my clients is reading an object, why doesn't the
>> MDS remove that data from RAM?
>> I need to configure the MDS to read data and remove it very quickly,
>> except when the data is constantly requested by clients. In that case,
>> of course, I want a RAM cache tier.
>>
>> I'm a little confused and I need to learn more about how the MDS works and
>> how I should make multiple active MDS daemons faster for my subvolumes and
>> client data flow.
>>
>> Best regards.
>>
>>
>>
>> On Tue, 16 Jan 2024 at 11:36, Eugen Block <eblock@xxxxxx> wrote:
>>
>>> Hi,
>>>
>>> I have dealt with this topic multiple times, and the SUSE team helped me
>>> understand what's going on under the hood. The summary can be found
>>> in this thread [1].
>>>
>>> What helped in our case was to reduce mds_recall_max_caps from 30k
>>> (default) to 3k. We tried it in steps of 1k IIRC. So I suggest
>>> reducing that value step by step (maybe start with 20k or something) to
>>> find the optimal value.
>>>
>>> Regards,
>>> Eugen
>>>
>>> [1] https://www.spinics.net/lists/ceph-users/msg73188.html
>>>
>>> Quoting Özkan Göksu <ozkangksu@xxxxxxxxx>:
>>>
>>> > Hello.
>>> >
>>> > I have a 5-node ceph cluster and I'm constantly getting the "clients
>>> > failing to respond to cache pressure" warning.
>>> >
>>> > I have 84 cephfs kernel clients (servers) and my users are accessing
>>> > their personal subvolumes located on one pool.
>>> >
>>> > My users are software developers and the data is home and user data
>>> > (Git, python projects, sample data and newly generated data).
>>> >
>>> >
>>> > ---------------------------------------------------------------------------------
>>> > --- RAW STORAGE ---
>>> > CLASS     SIZE    AVAIL    USED  RAW USED  %RAW USED
>>> > ssd    146 TiB  101 TiB  45 TiB    45 TiB      30.71
>>> > TOTAL  146 TiB  101 TiB  45 TiB    45 TiB      30.71
>>> >
>>> > --- POOLS ---
>>> > POOL                 ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
>>> > .mgr                  1     1  356 MiB       90  1.0 GiB      0     30 TiB
>>> > cephfs.ud-data.meta   9   256   69 GiB    3.09M  137 GiB   0.15     45 TiB
>>> > cephfs.ud-data.data  10  2048   26 TiB  100.83M   44 TiB  32.97     45 TiB
>>> > ---------------------------------------------------------------------------------
>>> > root@ud-01:~# ceph fs status
>>> > ud-data - 84 clients
>>> > =======
>>> > RANK  STATE           MDS              ACTIVITY     DNS    INOS   DIRS   CAPS
>>> >  0    active  ud-data.ud-04.seggyv  Reqs:  142 /s  2844k  2798k   303k   720k
>>> >         POOL           TYPE     USED  AVAIL
>>> > cephfs.ud-data.meta  metadata   137G  44.9T
>>> > cephfs.ud-data.data    data    44.2T  44.9T
>>> >     STANDBY MDS
>>> > ud-data.ud-02.xcoojt
>>> > ud-data.ud-05.rnhcfe
>>> > ud-data.ud-03.lhwkml
>>> > ud-data.ud-01.uatjle
>>> > MDS version: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
>>> >
>>> > -----------------------------------------------------------------------------------
>>> > My MDS settings are below:
>>> >
>>> > mds_cache_memory_limit                | 8589934592
>>> > mds_cache_trim_threshold              | 524288
>>> > mds_recall_global_max_decay_threshold | 131072
>>> > mds_recall_max_caps                   | 30000
>>> > mds_recall_max_decay_rate             | 1.500000
>>> > mds_recall_max_decay_threshold        | 131072
>>> > mds_recall_warning_threshold          | 262144
>>> >
>>> >
>>> > I have 2 questions:
>>> > 1- What should I do to prevent the cache pressure warning?
>>> > 2- What can I do to increase speed?
>>> >
>>> > - Thanks
>>> > _______________________________________________
>>> > ceph-users mailing list -- ceph-users@xxxxxxx
>>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx