CephFS MDS tuning for deep-learning data flow

Hello everyone! How are you doing?
I wasn't around for two years, but I'm back and working on a new development.

I deployed 2x Ceph clusters:
1- user_data: 5x nodes [8x 4TB SATA SSD, 2x 25Gbit network]
2- data-gen: 3x nodes [8x 4TB SATA SSD, 2x 25Gbit network]

Note: the hardware was not my choice. I know I have a TRIM issue, and I couldn't
use any PCIe NVMe for WAL+DB because these are 1U servers with no empty slots.
---------------------

During the test phase everything was good: I reached 1 GB/s with 18 clients at
the same time.
But when I migrated to production (60 GPU server clients + 40 CPU server
clients), the speed issues began, as usual because of the default parameters.
Now I'm working on adapting the cluster by debugging my current data flow and
researching how I can improve my environment.

So far I couldn't find a useful guide or the relevant information in one place,
so I just wanted to share my findings, benchmarks and ideas with the community.
If I'm lucky enough, maybe I will get some great recommendations from old
friends and enjoy getting back in touch after a while. :)

Starting from here, I will only share technical information about my
environment:

1- Cluster user_data: 5x nodes [8x 4TB SATA SSD, 2x 25Gbit network], replication 2
- A: I only have 1 data pool in this cluster; the information is below:
- ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL    USED  RAW USED  %RAW USED
ssd    146 TiB  106 TiB  40 TiB    40 TiB      27.50
TOTAL  146 TiB  106 TiB  40 TiB    40 TiB      27.50

--- POOLS ---
POOL                 ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr                  1     1  286 MiB       73  859 MiB      0     32 TiB
cephfs.ud-data.meta   9   512   65 GiB    2.87M  131 GiB   0.13     48 TiB
cephfs.ud-data.data  10  2048   23 TiB   95.34M   40 TiB  29.39     48 TiB


- B: In this cluster, every user (50 in total) has a subvolume, and the quota is
1 TB per user (a creation sketch follows the fs status output below).
- C: In each subvolume, users have "home" and "data" directories.
- D: The home directory is 5-10 GB, and the client uses it as the Docker home
directory at each login.
- E: I'm also storing users' personal or development data, around 2 TB per user.
- F: I only have 1x active MDS server and 4x standby, as shown below.

- ceph fs status

> ud-data - 84 clients
> =======
> RANK  STATE           MDS              ACTIVITY     DNS    INOS   DIRS
> CAPS
>  0    active  ud-data.ud-04.seggyv  Reqs:  372 /s  4343k  4326k  69.7k
>  2055k
>         POOL           TYPE     USED  AVAIL
> cephfs.ud-data.meta  metadata   130G  47.5T
> cephfs.ud-data.data    data    39.5T  47.5T
>     STANDBY MDS
> ud-data.ud-01.uatjle
> ud-data.ud-02.xcoojt
> ud-data.ud-05.rnhcfe
> ud-data.ud-03.lhwkml
> MDS version: ceph version 17.2.6
> (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
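
For reference, this is roughly how one of those 1 TB subvolumes can be created
and checked. The volume name "ud-data" matches my setup; "user01" is just a
placeholder:

# create a subvolume with a 1 TiB quota (size is given in bytes)
ceph fs subvolume create ud-data user01 --size 1099511627776
# print the path that gets mounted on the client
ceph fs subvolume getpath ud-data user01
# verify quota and current usage
ceph fs subvolume info ud-data user01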



- What is my issue?
2023-12-15T21:07:47.175542+0000 mon.ud-01 [WRN] Health check failed: 1
clients failing to respond to cache pressure (MDS_CLIENT_RECALL)
2023-12-15T21:09:35.002112+0000 mon.ud-01 [INF] MDS health message cleared
(mds.?): Client gpu-server-11 failing to respond to cache pressure
2023-12-15T21:09:35.391235+0000 mon.ud-01 [INF] Health check cleared:
MDS_CLIENT_RECALL (was: 1 clients failing to respond to cache pressure)
2023-12-15T21:09:35.391304+0000 mon.ud-01 [INF] Cluster is now healthy
2023-12-15T21:10:00.000169+0000 mon.ud-01 [INF] overall HEALTH_OK
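
When this warning shows up, I find it useful to check how many caps the
offending client actually holds before touching any recall settings. A rough
check against my active MDS (field names may differ slightly between releases,
and jq is only there for readability):

ceph health detail
# list client sessions on the active MDS; num_caps shows how many capabilities each client holds
ceph tell mds.ud-data.ud-04.seggyv session ls
# same output, reduced to the interesting fields
ceph tell mds.ud-data.ud-04.seggyv session ls | jq '.[] | {id, num_caps, hostname: .client_metadata.hostname}'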

For every read and write, the clients try to reach the Ceph MDS server and
request some data:
Issue 1: home data is around 5-10 GB and users need it all the time. I need the
clients to fetch it once and avoid new requests for it.

Issue 2: user processes generate new data by reading some input data only once,
and they write the generated data only once. There is no need to cache this
data at all.
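
For issue 2, one thing I'm considering (an untested idea so far): if those
data-generation clients mount via ceph-fuse, the client-side data cache can
simply be switched off for them, so read-once/write-once data is never cached
on the client. The kernel client uses the page cache instead, so this
particular knob would not apply there:

# ceph.conf on the data-generation clients (ceph-fuse / libcephfs only)
[client]
    client_oc = false        # disable the object cacher, i.e. client-side data caching
    client_cache_size = 4096 # shrink the client metadata (inode/dentry) cache, default 16384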

What do I want to do?

1- I want to deploy 2x active MDS servers for only the "home" directory in each
subvolume:
- These 2x home MDS servers must send the data to the client and have it cached
on the client to reduce new requests, even for a simple "ls" command.

2- I want to deploy 2x active MDS servers for only the "data" directory in each
subvolume:
- These 2x MDS servers must be configured to not hold any cache for data that
is not requested constantly. The cache lifetime must be short and configurable
independently.
- Data requested constantly by one client must be cached locally on that client
to reduce requests and load on the MDS server.
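
As far as I can tell, CephFS cannot dedicate an MDS to "home" or "data"
directories by itself, but with max_mds 4 the static export pins should get me
close: every home directory gets pinned to rank 0 or 1 and every data directory
to rank 2 or 3. A rough sketch of what I plan to try, with a hypothetical
client mount path:

# run on a client that has the filesystem mounted
setfattr -n ceph.dir.pin -v 0 /mnt/ud-data/volumes/_nogroup/user01/home
setfattr -n ceph.dir.pin -v 2 /mnt/ud-data/volumes/_nogroup/user01/data
# verify the pin; setting -v -1 removes it again
getfattr -n ceph.dir.pin /mnt/ud-data/volumes/_nogroup/user01/home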

------------------------------------------------------------
I believe you understand my data flow and my needs. Let's talk about what we
can do about it.

Note: I'm still researching, and these are my findings and my plan so far. It
is not complete, and that is the main reason why I'm writing this mail.

ceph fs set $MYFS max_mds 4
mds_cache_memory_limit      | default 4 GiB   --> 16 GiB
mds_cache_reservation       | default 0.05    --> ??
mds_health_cache_threshold  | default 1.5     --> ??
mds_cache_trim_threshold    | default 256 KiB --> ??
mds_cache_trim_decay_rate   | default 1.0     --> ??
mds_cache_mid
mds_decay_halflife
mds_client_prealloc_inos
mds_dirstat_min_interval
mds_session_cache_liveness_magnitude
mds_session_cache_liveness_decay_rate
mds_max_caps_per_client
mds_recall_max_caps
mds_recall_max_decay_threshold
mds_recall_max_decay_rate
mds_recall_global_max_decay_threshold
mds_session_cap_acquisition_throttle
mds_session_cap_acquisition_decay_rate
mds_session_max_caps_throttle_ratio
mds_cap_acquisition_throttle_retry_request_timeout
- Manually pinning directory trees to a particular rank
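
To make this list a bit more concrete, this is how I intend to apply the first
values centrally via the config database (the 16 GiB figure is only my starting
guess for these nodes, not a recommendation, and the recall/trim values are
still open questions):

ceph fs set ud-data max_mds 4
# raise the MDS cache limit from the default 4 GiB to 16 GiB for all MDS daemons
ceph config set mds mds_cache_memory_limit 17179869184
# confirm what a running daemon actually picked up
ceph config show mds.ud-data.ud-04.seggyv | grep mds_cache_memory_limit
ceph config get mds mds_cache_memory_limit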


As you can see, I'm at the beginning of this journey and I will be grateful if
you can help me and share your knowledge. As always, I'm even ready to let
developers use my system as a test bench to improve Ceph!

Best regards folks!
- Özkan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



