Re: Robust cephfs design/best practice

On 15.03.24 08:57, Szabo, Istvan (Agoda) wrote:

I'd like to add CephFS to our production object store/block storage cluster, so I'd like to collect hands-on experiences (good to know, be careful, avoid, etc.) beyond the Ceph documentation.

Just some aspects that might not be obvious at first sight:

1. CephFS does not perform authorization on the MDS. You have to trust all your clients to behave correctly. If this is not possible, you can have an NFS server export CephFS (and use Kerberos authentication, GID management e.g. from LDAP/AD, etc.).

2. CephFS behaves differently from e.g. NFS. It has a much stronger consistency model. Certain operations which are fine on NFS are an anti-pattern for CephFS, e.g. running hundreds of jobs on a compute cluster and redirecting their output to a single file. Your users will have to adapt.
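
One way around the single shared output file from point 2, sketched in Python: each job writes to its own file, and a single process concatenates the files afterwards. The `job-<id>.out` naming, the directory layout and both function names are assumptions made for illustration; nothing here is prescribed by CephFS.

```python
import os
import tempfile


def write_job_output(out_dir, job_id, lines):
    """Write one job's output to its own file, so no two jobs share a file."""
    path = os.path.join(out_dir, f"job-{job_id:05d}.out")
    with open(path, "w") as handle:
        for line in lines:
            handle.write(line + "\n")
    return path


def merge_outputs(out_dir, merged_path):
    """Concatenate all per-job files from a single process, in job-id order."""
    names = sorted(n for n in os.listdir(out_dir)
                   if n.startswith("job-") and n.endswith(".out"))
    with open(merged_path, "w") as merged:
        for name in names:
            with open(os.path.join(out_dir, name)) as part:
                merged.write(part.read())
    return len(names)


if __name__ == "__main__":
    # Stand-in for a directory on CephFS; a temp dir keeps the sketch runnable.
    with tempfile.TemporaryDirectory() as out_dir:
        for job_id in range(3):
            write_job_output(out_dir, job_id, [f"result from job {job_id}"])
        count = merge_outputs(out_dir, os.path.join(out_dir, "merged.txt"))
        print(f"merged {count} job files")
```

The merge step runs once, after the jobs have finished, so only one client ever writes to the combined file.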

3. CephFS maintains an object for each file in the first data pool. This object has at least one xattr value attached that is crucial for disaster recovery. Your first data pool thus cannot be an EC pool. I'm not sure how often this value is updated (e.g. does it contain mtime?). If you plan to use an EC pool for data storage, you need three pools: the metadata pool, a replicated data pool as the first data pool, and the EC pool as the third pool. You can use filesystem attributes to control which pool is used for data storage. This is the setup of our main filesystem (using only the EC pool for data):

--- POOLS ---
POOL          ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
xxx_metadata  74    64  203 GiB   27.87M  608 GiB   3.05    6.3 TiB
xxx_data_rep  76   256      0 B  357.92M      0 B      0    6.3 TiB
xxx_data_ec   77  4096  2.0 PiB  852.56M  2.4 PiB  50.97    1.8 PiB
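
The "filesystem attributes" in point 3 are CephFS's virtual extended attributes, in particular `ceph.dir.layout.pool` on a directory. A minimal Python sketch, assuming the filesystem is mounted and the EC pool has already been added to it as a data pool; the function name and its injectable `setxattr` parameter are illustrative, and the pool name is the one from the listing above:

```python
import os


def pin_directory_to_pool(directory, pool_name, setxattr=None):
    """Set the CephFS layout so new files under `directory` go to `pool_name`.

    `setxattr` is injectable so the function can be exercised without a
    CephFS mount; by default it uses os.setxattr (available on Linux).
    """
    if setxattr is None:
        setxattr = os.setxattr
    # ceph.dir.layout.pool is the CephFS virtual xattr for a directory's data pool.
    setxattr(directory, "ceph.dir.layout.pool", pool_name.encode("utf-8"))
```

A call such as `pin_directory_to_pool("/mnt/cephfs/data", "xxx_data_ec")` would do this for a directory, where `/mnt/cephfs/data` is a hypothetical mount path. The layout only applies to files created afterwards; existing files keep the pool they were written to.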

4. The MDS is a memory hog and mostly single-threaded. Metadata performance depends on CPU speed, and especially on the amount of RAM available: the more RAM, the more inode information can be cached.
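
The amount of RAM the MDS uses for its cache is controlled by the `mds_cache_memory_limit` option, given in bytes. The post above does not name this option, so treat the following as a sketch rather than the author's configuration. It only builds the `ceph config set` command line; the helper name is hypothetical and the 64 GiB figure is an arbitrary example, not a recommendation.

```python
import shlex

GIB = 1024 ** 3


def mds_cache_limit_command(cache_gib):
    """Return the argv for setting the MDS cache memory limit to `cache_gib` GiB."""
    if cache_gib <= 0:
        raise ValueError("cache size must be positive")
    limit_bytes = int(cache_gib * GIB)
    return ["ceph", "config", "set", "mds",
            "mds_cache_memory_limit", str(limit_bytes)]


if __name__ == "__main__":
    # Print the command instead of running it, so nothing changes on the cluster.
    print(shlex.join(mds_cache_limit_command(64)))
```

The limit is a target for the cache rather than a hard cap on the process, so the MDS host needs headroom beyond the configured value.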

5. Avoid workloads that keep too many files open at the same time. Each file being accessed requires a capability reservation on the MDS, which consumes a certain amount of memory. More clients with more open files -> more RAM needed.

6. Finally: in case of a failover, all clients have to reconnect to the newly active MDS. During this phase the MDS collects inode information for all open files. This can consume a lot of memory, and it will definitely take some time, depending on the performance of the Ceph cluster. If too many files are open, the MDS may run into a timeout and restart, resulting in a restart loop. I fixed this problem in the past by extending the timeout.

Overall, CephFS is a great system, but you need to know your current and future workloads to configure it accordingly. This is also true for any other shared filesystem.

Best regards,

Burkhard Linke
