Re: Robust cephfs design/best practice

Burkhard Linke <Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> · Fri, 15 Mar 2024 09:27:45 +0100

Hi,

On 15.03.24 08:57, Szabo, Istvan (Agoda) wrote:
> Hi,
>
> I'd like to add CephFS to our production object store/block storage
> cluster, so I'd like to collect hands-on experience (good to know, be
> careful, avoid, etc.) beyond what the Ceph documentation covers.

Just some aspects that might not be obvious at first sight:

1. CephFS does not perform authorization on the MDS. You have to trust 
all your clients to behave correctly. If this is not possible, you can 
have an NFS server export CephFS (and use Kerberos authentication, GID 
management e.g. from LDAP/AD, etc.).

2. CephFS behaves differently from e.g. NFS. It has a much stronger 
consistency model. Certain operations that are fine on NFS are an 
anti-pattern for CephFS, e.g. running hundreds of jobs on a compute 
cluster and redirecting their output to a single file. Your users will 
have to adapt.

3. CephFS maintains an object for each file in the first data pool. This 
object has at least one xattr value attached that is crucial for 
disaster recovery. Your first data pool thus cannot be an EC pool. I'm 
not sure how often this value is updated (e.g. does it contain mtime?). 
If you plan to use an EC pool for data storage, you need three pools: 
a metadata pool, a replicated pool as the first data pool, and the EC 
pool as an additional data pool. You can use filesystem attributes to 
control which pool is used for data storage. This is the setup of our 
main filesystem (using only the EC pool for data):

--- POOLS ---
POOL           ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
xxx_metadata   74    64  203 GiB   27.87M  608 GiB   3.05    6.3 TiB
xxx_data_rep   76   256      0 B  357.92M      0 B      0    6.3 TiB
xxx_data_ec    77  4096  2.0 PiB  852.56M  2.4 PiB  50.97    1.8 PiB

4. The MDS is a memory hog and mostly single-threaded. Metadata 
performance depends on CPU speed, and especially on the amount of RAM 
available. More RAM means more inode information can be cached.

5. Avoid workloads that have too many files open at the same time. Each 
file being accessed requires a capability reservation on the MDS, which 
consumes a certain amount of memory. More clients with more open files 
-> more RAM needed.

6. Finally: in case of a failover, all clients have to reconnect to the 
newly active MDS. It will collect inode information for all open files 
during this phase. This can consume a lot of memory, and it will 
definitely take some time, depending on the performance of the Ceph 
cluster. If you have too many files open, the MDS may run into a timeout 
and restart, resulting in a restart loop. I fixed this problem in the 
past by extending the timeout.
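
To illustrate point 2: instead of having every job append to one shared 
file, each job can write to its own file and a single client can merge 
them afterwards. The following is only a minimal sketch of that pattern; 
the function names and the "job-NNNNNN.out" naming scheme are made up for 
this example and are not part of CephFS.

```python
import os
import shutil


def job_output_path(out_dir: str, job_id: int) -> str:
    """Return a private output file for one job (hypothetical naming scheme)."""
    return os.path.join(out_dir, f"job-{job_id:06d}.out")


def write_job_output(out_dir: str, job_id: int, text: str) -> None:
    # Each job appends only to its own file, so the jobs never
    # compete for write access to the same file.
    with open(job_output_path(out_dir, job_id), "a", encoding="utf-8") as fh:
        fh.write(text)


def merge_outputs(out_dir: str, merged_path: str) -> int:
    """Concatenate all per-job files in job order; run once, from one client."""
    names = sorted(
        n for n in os.listdir(out_dir)
        if n.startswith("job-") and n.endswith(".out")
    )
    with open(merged_path, "w", encoding="utf-8") as merged:
        for name in names:
            with open(os.path.join(out_dir, name), encoding="utf-8") as part:
                shutil.copyfileobj(part, merged)
    return len(names)
```

Each job would call write_job_output() with its own job ID (for example 
the array index handed out by the scheduler), and merge_outputs() would 
run once after all jobs have finished.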
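
To illustrate point 3: a three-pool layout like the one above can be set 
up roughly as follows. This is a sketch, not our exact procedure. It only 
prints the commands so they can be reviewed first. The pool names are 
taken from the listing above; the filesystem name "xxx" and the mount 
point /mnt/cephfs are assumptions for this example, and PG counts, CRUSH 
rules and the EC profile are left at their defaults here, which you will 
almost certainly want to change.

```python
import shlex
from typing import List


def cephfs_ec_setup_commands(fs: str, meta: str, rep: str, ec: str,
                             data_dir: str) -> List[List[str]]:
    """Build (but do not run) the commands for a metadata + replicated + EC layout."""
    return [
        ["ceph", "osd", "pool", "create", meta],
        ["ceph", "osd", "pool", "create", rep],
        ["ceph", "osd", "pool", "create", ec, "erasure"],
        # CephFS needs overwrite support on the EC pool.
        ["ceph", "osd", "pool", "set", ec, "allow_ec_overwrites", "true"],
        # The replicated pool becomes the first data pool of the filesystem.
        ["ceph", "fs", "new", fs, meta, rep],
        # The EC pool is added as an additional data pool.
        ["ceph", "fs", "add_data_pool", fs, ec],
        # New files below data_dir store their data in the EC pool.
        ["setfattr", "-n", "ceph.dir.layout.pool", "-v", ec, data_dir],
    ]


if __name__ == "__main__":
    # "xxx" and "/mnt/cephfs" are placeholders, not values from our cluster.
    for cmd in cephfs_ec_setup_commands(
            "xxx", "xxx_metadata", "xxx_data_rep", "xxx_data_ec", "/mnt/cephfs"):
        print(shlex.join(cmd))
```

Setting the layout attribute on the root directory of the mounted 
filesystem sends all file data to the EC pool, which matches the 0 B 
stored in xxx_data_rep above. The layout only applies to files created 
after the attribute is set.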
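
To illustrate points 4 and 5: the amount of RAM the MDS aims to use for 
its cache is controlled by the mds_cache_memory_limit option, which takes 
a value in bytes. The small helper below is a sketch that converts a size 
in GiB and builds the matching "ceph config set" command. The helper name 
and the 16 GiB in the example are assumptions, not a recommendation. As 
far as I know the limit is a target rather than a hard cap, so leave some 
headroom on the MDS host.

```python
from typing import List


def mds_cache_limit_command(gib: float) -> List[str]:
    """Build (but do not run) the command that sets the MDS cache memory target."""
    limit_bytes = int(gib * 1024 ** 3)  # the option expects bytes
    return ["ceph", "config", "set", "mds",
            "mds_cache_memory_limit", str(limit_bytes)]


if __name__ == "__main__":
    # 16 GiB is only an example value.
    print(" ".join(mds_cache_limit_command(16)))
```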

Overall CephFS is a great system, but you need to know your current and 
future workloads to configure it accordingly. This is also true for any 
other shared filesystem.

Best regards,

Burkhard Linke
