I found something useful and I think I need to dig into this and use it 100%:
https://docs.ceph.com/en/reef/cephfs/multimds/#dynamic-subtree-partitioning-with-balancer-on-specific-ranks

DYNAMIC SUBTREE PARTITIONING WITH BALANCER ON SPECIFIC RANKS

The CephFS file system provides the bal_rank_mask option to enable the balancer to dynamically rebalance subtrees within particular active MDS ranks. This allows administrators to employ both the dynamic subtree partitioning and static pinning schemes in different active MDS ranks so that metadata loads are optimized based on user demand. For instance, in realistic cloud storage environments, where a lot of subvolumes are allotted to multiple computing nodes (e.g., VMs and containers), some subvolumes that require high performance are managed by static partitioning, whereas most subvolumes that experience a moderate workload are managed by the balancer. As the balancer evenly spreads the metadata workload to all active MDS ranks, performance of statically pinned subvolumes inevitably may be affected or degraded. If this option is enabled, subtrees managed by the balancer are not affected by statically pinned subtrees.

This option can be configured with the ceph fs set command. For example:

    ceph fs set <fs_name> bal_rank_mask <hex>

Each bitfield of the <hex> number represents a dedicated rank. If the <hex> is set to 0x3, the balancer runs on active ranks 0 and 1. For example:

    ceph fs set <fs_name> bal_rank_mask 0x3

If the bal_rank_mask is set to -1 or all, all active ranks are masked and utilized by the balancer. As an example:

    ceph fs set <fs_name> bal_rank_mask -1

On the other hand, if the balancer needs to be disabled, the bal_rank_mask should be set to 0x0. For example:

    ceph fs set <fs_name> bal_rank_mask 0x0
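To make this concrete for the ud-data cluster quoted below, here is a rough sketch of how I think bal_rank_mask could be combined with static pinning. The fs name ud-data comes from that cluster; the number of ranks, the mask value, and the mount path are only assumptions I still have to test:

    # assumption: 4 active ranks; let the balancer manage only ranks 2 and 3 (0xc = binary 1100)
    ceph fs set ud-data max_mds 4
    ceph fs set ud-data bal_rank_mask 0xc
    # statically pin a latency-sensitive subtree to rank 0, outside the balancer mask
    # (the client mount path below is hypothetical)
    setfattr -n ceph.dir.pin -v 0 /mnt/ud-data/volumes/_nogroup/user01/home

If I read the documentation correctly, the balancer would then spread the unpinned subtrees across ranks 2 and 3 and leave anything pinned to ranks 0 and 1 alone.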
On Sat, Dec 16, 2023 at 03:43, mhnx <morphinwithyou@xxxxxxxxx> wrote:
>
> Hello everyone! How are you doing?
> I wasn't around for two years, but I'm back and working on a new development.
>
> I deployed 2x Ceph clusters:
> 1- user_data: 5x nodes [8x 4TB SATA SSD, 2x 25Gbit network]
> 2- data-gen: 3x nodes [8x 4TB SATA SSD, 2x 25Gbit network]
>
> Note: the hardware is not my choice. I know I have a TRIM issue, and I also couldn't use any PCIe NVMe for WAL+DB because these are 1U servers with no empty slots.
> ---------------------
>
> In the test phase everything was good; I reached 1 GB/s for 18 clients at the same time.
> But when I migrated to production (60 GPU server clients + 40 CPU server clients), the speed issues began because of the default parameters, as usual. Now I'm working on adapting the cluster by debugging my current data workflow, and I'm researching how I can improve my environment.
>
> So far, I couldn't find a useful guide or the information in one place, so I just wanted to share my findings, benchmarks, and ideas with the community. If I'm lucky enough, maybe I will get awesome recommendations from some old friends and enjoy getting in touch after a while. :)
>
> Starting from here, I will only share technical information about my environment:
>
> 1- Cluster user_data: 5x nodes [8x 4TB SATA SSD, 2x 25Gbit network] = Replication 2
> - A: I only have 1 pool in this cluster; the information is below:
> - ceph df
>
>   --- RAW STORAGE ---
>   CLASS     SIZE    AVAIL    USED  RAW USED  %RAW USED
>   ssd    146 TiB  106 TiB  40 TiB    40 TiB      27.50
>   TOTAL  146 TiB  106 TiB  40 TiB    40 TiB      27.50
>
>   --- POOLS ---
>   POOL                 ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
>   .mgr                  1     1  286 MiB       73  859 MiB      0     32 TiB
>   cephfs.ud-data.meta   9   512   65 GiB    2.87M  131 GiB   0.13     48 TiB
>   cephfs.ud-data.data  10  2048   23 TiB   95.34M   40 TiB  29.39     48 TiB
>
> - B: In this cluster, every user (50 users) has a subvolume, and the quota is 1 TB for each user.
> - C: In each subvolume, users have "home" and "data" directories.
> - D: The home directory size is 5-10 GB, and the client uses it as the Docker home directory at each login.
> - E: I'm also storing users' personal or development data, around 2 TB per user.
> - F: I only have 1x active MDS server and 4x standby, as below:
>
> - ceph fs status
>>
>> ud-data - 84 clients
>> =======
>> RANK  STATE          MDS             ACTIVITY      DNS    INOS   DIRS   CAPS
>>  0    active  ud-data.ud-04.seggyv  Reqs: 372 /s  4343k  4326k  69.7k  2055k
>>        POOL            TYPE     USED  AVAIL
>> cephfs.ud-data.meta  metadata   130G  47.5T
>> cephfs.ud-data.data    data    39.5T  47.5T
>> STANDBY MDS
>> ud-data.ud-01.uatjle
>> ud-data.ud-02.xcoojt
>> ud-data.ud-05.rnhcfe
>> ud-data.ud-03.lhwkml
>> MDS version: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
>
> - What is my issue?
> 2023-12-15T21:07:47.175542+0000 mon.ud-01 [WRN] Health check failed: 1 clients failing to respond to cache pressure (MDS_CLIENT_RECALL)
> 2023-12-15T21:09:35.002112+0000 mon.ud-01 [INF] MDS health message cleared (mds.?): Client gpu-server-11 failing to respond to cache pressure
> 2023-12-15T21:09:35.391235+0000 mon.ud-01 [INF] Health check cleared: MDS_CLIENT_RECALL (was: 1 clients failing to respond to cache pressure)
> 2023-12-15T21:09:35.391304+0000 mon.ud-01 [INF] Cluster is now healthy
> 2023-12-15T21:10:00.000169+0000 mon.ud-01 [INF] overall HEALTH_OK
>
> For every read and write, the clients try to reach the Ceph MDS server and request some data:
>
> Issue 1: Home data is around 5-10 GB and users need it all the time. I need to store it once and prevent new requests.
>
> Issue 2: Users' processes generate new data by reading some input only once, and they write the generated data once. There is no need to cache this data at all.
>
> What do I want to do?
>
> 1- I want to deploy 2x active MDS servers for only the "home" directory in each subvolume:
> - These 2x home MDS servers must send the data to the client and cache it on the client to reduce new requests, even for a simple "ls" command.
>
> 2- I want to deploy 2x active MDS servers for only the "data" directory in each subvolume:
> - These 2x MDS servers must be configured to not hold any cache if it is not required constantly. The cache lifetime must be short and must be independent.
> - Data constantly requested by one client must be cached locally on that client to reduce requests and load on the MDS server.
>
> ------------------------------------------------------------
> I believe you understand my data flow and my needs. Let's talk about what we can do about it.
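A rough sketch of how the home/data split above could be expressed with static pinning, assuming 4 active ranks (0-1 for home, 2-3 for data) and a client mount at /mnt/ud-data; the subvolume path and rank numbers are only assumptions to be tested:

    # with max_mds 4 as above: ranks 0-1 serve the "home" subtrees, ranks 2-3 the "data" subtrees
    # pin one user's subtrees to different ranks (hypothetical subvolume path)
    setfattr -n ceph.dir.pin -v 0 /mnt/ud-data/volumes/_nogroup/user01/home
    setfattr -n ceph.dir.pin -v 2 /mnt/ud-data/volumes/_nogroup/user01/data

As far as I understand, the client-side caching behaviour is driven by the capabilities the MDS hands out, so the short cache lifetime wanted for the data ranks would still have to come from the recall/decay options rather than from the pinning itself.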
> Note: I'm still researching; these are my findings and my plan so far. It is not complete, and this is the main reason why I'm writing this mail.
>
> ceph fs set $MYFS max_mds 4
> mds_cache_memory_limit | default 4 GiB --> 16 GiB
> mds_cache_reservation | default 0.05 --> ??
> mds_health_cache_threshold | default 1.5 --> ??
> mds_cache_trim_threshold | default 256 KiB --> ??
> mds_cache_trim_decay_rate | default 1.0 --> ??
> mds_cache_mid
> mds_decay_halflife
> mds_client_prealloc_inos
> mds_dirstat_min_interval
> mds_session_cache_liveness_magnitude
> mds_session_cache_liveness_decay_rate
> mds_max_caps_per_client
> mds_recall_max_caps
> mds_recall_max_decay_threshold
> mds_recall_max_decay_rate
> mds_recall_global_max_decay_threshold
> mds_session_cap_acquisition_throttle
> mds_session_cap_acquisition_decay_rate
> mds_session_max_caps_throttle_ratio
> mds_cap_acquisition_throttle_retry_request_timeout
> - Manually pinning directory trees to a particular rank
>
> As you can see, I'm at the beginning of this journey, and I will be grateful if you can help me and share your knowledge. I'm even ready to offer my system to the developers as a test bench to improve Ceph, as always!
>
> Best regards, folks!
> - Özkan
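One note below the quote: from the list above, the first knobs I actually plan to touch are the cache size and extra ranks, before going near the recall/decay settings tied to the MDS_CLIENT_RECALL warning. A minimal sketch of how I would apply and verify that with ceph config (the 16 GiB value is only my untested starting point; the daemon name is the current active MDS from the fs status output above):

    # raise the MDS cache memory limit from the 4 GiB default to 16 GiB (value in bytes)
    ceph config set mds mds_cache_memory_limit 17179869184
    # check what the running active MDS actually uses
    ceph config show mds.ud-data.ud-04.seggyv | grep mds_cache_memory_limit

The rest of the recall/decay options I would rather change one at a time while watching ceph health detail.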