Hi,
On 15.03.24 08:57, Szabo, Istvan (Agoda) wrote:
Hi,
I'd like to add CephFS to our production object store/block storage cluster, so I'd like to collect hands-on experience (good to know / be careful / avoid, etc.) beyond the Ceph documentation.
Just some aspects that might not be obvious at first sight:
1. CephFS does not perform authorization on the MDS. You have to trust
all your clients to behave correctly. If this is not possible, you can
have an NFS server export CephFS instead (and use Kerberos
authentication, GID management e.g. from LDAP/AD, etc.).
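For illustration, an NFS-Ganesha export of CephFS with Kerberos might
look roughly like the sketch below; the export path, pseudo path and
cephx user are placeholders, not a config I'm recommending verbatim:

    EXPORT {
        Export_ID = 1;
        Path = "/";                          # CephFS path to export (placeholder)
        Pseudo = "/cephfs";                  # NFSv4 pseudo path (placeholder)
        Access_Type = RW;
        SecType = "krb5", "krb5i", "krb5p";  # Kerberos security flavours
        FSAL {
            Name = CEPH;                     # CephFS FSAL
            User_Id = "nfs.ganesha";         # cephx user (placeholder)
        }
    }

Whether you run Ganesha by hand or via the Ceph NFS orchestration is up
to you; the point is that UID/GID and Kerberos checks then happen on the
NFS server instead of on untrusted clients.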
2. CephFS behaves differently from e.g. NFS. It has a much stronger
consistency model. Certain operations that are fine on NFS are an
anti-pattern for CephFS, e.g. running hundreds of jobs on a compute
cluster and redirecting their output to a single file. Your users will
have to adapt.
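To make the anti-pattern concrete, a purely illustrative sketch (the job
script and variable names are made up): instead of

    # hundreds of clients appending to one shared file
    ./my_job >> /cephfs/results/all.log

have every job write its own file and merge once at the end:

    ./my_job > /cephfs/results/part.${JOB_ID}
    cat /cephfs/results/part.* > /cephfs/results/all.log

That way the clients don't constantly fight over the capabilities and
locks of a single inode.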
3. CephFS maintains an object for each file in the first data pool.
This object has at least one xattr attached that is crucial for
disaster recovery. Your first data pool thus cannot be an EC pool. I'm
not sure how often this value is updated (e.g. does it contain mtime?).
If you plan to use an EC pool for data storage, you need three pools:
a metadata pool, a replicated data pool as the first data pool, and the
EC pool as an additional data pool. You can use file layout attributes
(xattrs) to control which pool is used for data storage (a sketch
follows after the pool listing below). This is the setup of our main
filesystem (using only the EC pool for data):
--- POOLS ---
POOL          ID   PGS  STORED    OBJECTS   USED      %USED  MAX AVAIL
xxx_metadata  74    64  203 GiB    27.87M   608 GiB    3.05    6.3 TiB
xxx_data_rep  76   256      0 B   357.92M       0 B    0       6.3 TiB
xxx_data_ec   77  4096  2.0 PiB   852.56M   2.4 PiB   50.97    1.8 PiB
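As a rough sketch of how such a layout can be created and used - pool
names, PG counts and the mount point are placeholders, and the exact
commands depend on your Ceph version:

    # create metadata, replicated and EC data pools (names/PG counts are examples)
    ceph osd pool create xxx_metadata 64
    ceph osd pool create xxx_data_rep 256
    ceph osd pool create xxx_data_ec 4096 4096 erasure
    ceph osd pool set xxx_data_ec allow_ec_overwrites true

    # replicated pool as the first data pool, EC pool added on top
    ceph fs new xxx xxx_metadata xxx_data_rep
    ceph fs add_data_pool xxx xxx_data_ec

    # direct file data below the mount point (or any directory) to the EC pool
    setfattr -n ceph.dir.layout.pool -v xxx_data_ec /mnt/cephfs

New files inherit the layout of their parent directory, so setting the
xattr once on the root (or on a subtree) is usually enough.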
4. The MDS is a memory hog and mostly single-threaded. Metadata
performance depends on CPU speed and especially on the amount of RAM
available: more RAM means more inode information can be cached.
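For reference, the main knob here is the MDS cache memory limit; a
hedged example (the 64 GiB value is purely illustrative, size it to
your hardware):

    # raise the MDS cache memory limit (example value)
    ceph config set mds mds_cache_memory_limit 68719476736   # 64 GiB

    # inspect cache usage on the active MDS (<name> is a placeholder)
    ceph tell mds.<name> cache status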
5. Avoid workloads that keep too many files open at the same time. Each
file being accessed requires a capability on the MDS, which consumes a
certain amount of memory. More clients with more open files -> more RAM
needed.
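A hedged example of how to keep an eye on this (<name> is a placeholder
for your active MDS):

    # list client sessions including the number of caps each one holds
    ceph tell mds.<name> session ls

    # optionally limit the caps a single client may hold (example value,
    # the default differs between releases)
    ceph config set mds mds_max_caps_per_client 1048576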
6. Finally: in case of a failover, all clients have to reconnect to the
newly active MDS. It collects inode information for all open files
during this phase. This can consume a lot of memory, and it will
definitely take some time depending on the performance of the Ceph
cluster. If too many files are open, the MDS may run into a timeout and
restart, resulting in a restart loop. I fixed this problem in the past
by extending the timeout (see the sketch below).
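Which timeout applies depends on your release and on where exactly the
MDS gets killed, so treat the following only as a sketch of the kind of
settings involved (values are examples, not recommendations):

    # give the MDS more time before the mons consider it laggy and replace it
    ceph config set mds mds_beacon_grace 300       # default is 15 seconds

    # give clients more time to reconnect after a failover
    ceph config set mds mds_reconnect_timeout 120  # default is 45 seconds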
Overall CephFS is a great system, but you need to know your current and
future workloads to configure it accordingly. This is also true for any
other shared filesystem.
Best regards,
Burkhard Linke