Hi Istvan,

I would like to add a few notes to what Burkhard mentioned already.

First, CephFS has a built-in feature that allows restricting access to a
certain directory:

ceph fs authorize cephfs client.public-only /public rw

This creates a key with the following caps:

caps mds = "allow rw path=/public"
caps mon = "allow r"
caps osd = "allow rw tag cephfs data=cephfs"

However, this alone still leaves a hole: a malicious client that guesses an
inode number can access the file data directly through the OSDs by
predicting object names. This can be thwarted by using namespaces, but it
is a bit cumbersome. First, before placing a single file into this
directory, you have to assign a namespace to it (here "public", but it is
just a string, you can use any other string):

setfattr -n ceph.dir.layout.pool_namespace -v public /mnt/cephfs/public

Then, adjust the key to have the following caps (an example "ceph auth
caps" invocation is shown below):

caps mds = "allow rw path=/public"
caps mon = "allow r"
caps osd = "allow rw pool=cephfs_data namespace=public"

Second, CephFS has a feature where any directory can be snapshotted. There
is also an mgr module, "snap_schedule", that allows scheduling snapshot
creation - e.g., creating daily or weekly snapshots of your data. This
module also allows setting retention policies. However, a lot of users who
tried it ended up with unsatisfactory performance: when removing snapshots
(scheduled or not), and in fact during any big removals, the MR_Finisher
thread of the MDS often ends up consuming 100% of a CPU core, sometimes
impacting other operations. The only advice that I have here is to
snapshot only what you need and only as often as you need.
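To apply the adjusted caps to the existing key, something along these lines
should work ("ceph auth caps" replaces all caps of the entity, so all three
daemon types have to be listed again; the client and pool names are the
ones from the example above - double-check them against your cluster):

ceph auth caps client.public-only \
    mds 'allow rw path=/public' \
    mon 'allow r' \
    osd 'allow rw pool=cephfs_data namespace=public'
ceph auth get client.public-only    # verify the resulting caps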
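For the snapshots: a manual snapshot is just a mkdir in the hidden .snap
directory, and the snap_schedule module is driven from the CLI. A rough
sketch (the path, the daily schedule and the 7-day retention are only
examples, and the snap-schedule syntax has varied a bit between releases,
so check the snap_schedule documentation for your version):

mkdir /mnt/cephfs/public/.snap/before-migration    # create a snapshot by hand
rmdir /mnt/cephfs/public/.snap/before-migration    # ...and remove it again

ceph mgr module enable snap_schedule
ceph fs snap-schedule add /public 1d               # one snapshot per day
ceph fs snap-schedule retention add /public d 7    # keep the last 7 daily snapshots
ceph fs snap-schedule status /public

On older releases you may also have to allow snapshots first with
"ceph fs set cephfs allow_new_snaps true".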
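One small addition to Burkhard's point 3 below: directing file data to the
EC pool uses the same layout xattr mechanism as the namespace trick above.
Roughly (the pool, filesystem and path names here are just placeholders,
and the layout only affects files created after it is set):

ceph osd pool set cephfs_data_ec allow_ec_overwrites true   # required for CephFS data on EC
ceph fs add_data_pool cephfs cephfs_data_ec
setfattr -n ceph.dir.layout.pool -v cephfs_data_ec /mnt/cephfs/bulk-data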
On Fri, Mar 15, 2024 at 4:28 PM Burkhard Linke
<Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 15.03.24 08:57, Szabo, Istvan (Agoda) wrote:
> > Hi,
> >
> > I'd like to add CephFS to our production object store/block storage
> > cluster, so I'd like to collect hands-on experience - good to know /
> > be careful / avoid etc ... - beyond the Ceph documentation.
>
> Just some aspects that might not be obvious at first sight:
>
> 1. CephFS does not perform authorization on the MDS. You have to trust
> all your clients to behave correctly. If this is not possible, you can
> have an NFS server export CephFS (and use Kerberos authentication, GID
> management e.g. from LDAP/AD, etc.).
>
> 2. CephFS behaves differently from e.g. NFS. It has a much stronger
> consistency model. Certain operations that are fine on NFS are an
> anti-pattern for CephFS, e.g. running hundreds of jobs on a compute
> cluster and redirecting their output to a single file. Your users will
> have to adapt.
>
> 3. CephFS maintains an object for each file in the first data pool. This
> object has at least one xattr value attached that is crucial for
> disaster recovery. Your first data pool thus cannot be an EC pool. I'm
> not sure how often this value is updated (e.g. does it contain mtime?).
> If you plan to use an EC pool for data storage, you need three pools:
> metadata, a replicated data pool as the first data pool, and the EC pool
> as the third pool. You can use filesystem attributes to control which
> pool is used for data storage. This is the setup of our main filesystem
> (using only the EC pool for data):
>
> --- POOLS ---
> POOL          ID   PGS   STORED   OBJECTS   USED      %USED  MAX AVAIL
> xxx_metadata  74    64   203 GiB   27.87M   608 GiB    3.05    6.3 TiB
> xxx_data_rep  76   256       0 B  357.92M       0 B       0    6.3 TiB
> xxx_data_ec   77  4096   2.0 PiB  852.56M   2.4 PiB   50.97    1.8 PiB
>
> 4. The MDS is a memory hog and mostly single-threaded. Metadata
> performance depends on CPU speed, and especially on the amount of RAM
> available: more RAM means more cached inode information.
>
> 5. Avoid workloads that keep too many files open at the same time. Each
> file being accessed requires a capability reservation on the MDS, which
> consumes a certain amount of memory. More clients with more open files
> -> more RAM needed.
>
> 6. Finally: in case of a failover, all clients have to reconnect to the
> newly active MDS. It will collect inode information for all open files
> during this phase. This can consume a lot of memory, and it will
> definitely take some time depending on the performance of the Ceph
> cluster. If you have too many files open, the MDS may run into a timeout
> and restart, resulting in a restart loop. I fixed this problem in the
> past by extending the timeout.
>
> Overall CephFS is a great system, but you need to know your current and
> future workloads to configure it accordingly. This is also true for any
> other shared filesystem.
>
> Best regards,
> Burkhard Linke

--
Alexander E. Patrakov