Re: Robust cephfs design/best practice

Hi Istvan,

I would like to add a few notes to what Burkhard mentioned already.

First, CephFS has a built-in feature that allows restricting access to
a certain directory:

ceph fs authorize cephfs client.public-only /public rw

This creates a key with the following caps:

caps mds = "allow rw path=/public"
caps mon = "allow r"
caps osd = "allow rw tag cephfs data=cephfs"

There is still a problem, though: the OSD cap above covers the whole
data pool, so a malicious client that guesses another file's inode
number can read or write its objects directly through the OSDs by
predicting their names. This can be thwarted by using RADOS namespaces,
but it's a bit cumbersome. First, before placing any file into this
directory, you have to assign a namespace to it (here "public", but
it's just a string, you can use any other string):

setfattr -n ceph.dir.layout.pool_namespace -v public /mnt/cephfs/public
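
A quick way to double-check the assignment:

getfattr -n ceph.dir.layout.pool_namespace /mnt/cephfs/public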

Then, adjust the key to have the following caps:

caps mds = "allow rw path=/public"
caps mon = "allow r"
caps osd "allow rw pool=cephfs_data namespace=public,allow rw
pool=cephfs_data namespace=public"

Second, CephFS has a feature where any directory can be snapshotted.
There is also an mgr module "snap_schedule" that allows scheduling
snapshot creation - e.g., creating daily or weekly snapshots of your
data. This module also allows setting retention policies. However, a
lot of users who tried it ended up with unsatisfactory performance:
when removing snapshots (scheduled or not), and actually during any
big removals, the MR_Finisher thread of the MDS often ends up
consuming 100% of a CPU core, sometimes impacting other operations.
The only advice that I have here is to snapshot only what you need and
only as often as you need.
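
If you still want scheduled snapshots, a minimal sketch looks like this
(the path, the daily schedule and the 7-day retention are only
illustrative values):

ceph mgr module enable snap_schedule
ceph fs snap-schedule add /public 1d
ceph fs snap-schedule retention add /public d 7
ceph fs snap-schedule status /public

Manual snapshots are simply directories created under the hidden .snap
directory, e.g.:

mkdir /mnt/cephfs/public/.snap/before-migration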

On Fri, Mar 15, 2024 at 4:28 PM Burkhard Linke
<Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 15.03.24 08:57, Szabo, Istvan (Agoda) wrote:
> > Hi,
> >
> > I'd like to add CephFS to our production object store/block storage cluster, so I'd like to collect hands-on experience: things that are good to know, to be careful about, or to avoid, etc. ... beyond the Ceph documentation.
>
>
> Just some aspects that might not be obvious at first sight:
>
>
> 1. CephFS does not perform per-user authorization on the MDS; POSIX
> permission checks are done by the clients. You therefore have to trust
> all your clients to behave correctly. If this is not possible, you can
> have an NFS server export CephFS (and use Kerberos authentication, GID
> management from e.g. LDAP/AD, etc.)
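>
> As a rough sketch of the NFS route (the cluster name, placement, pseudo
> path and fs name below are made up, and the exact "ceph nfs" syntax
> varies between Ceph releases):
>
> ceph nfs cluster create mynfs "2 hosta hostb"
> ceph nfs export create cephfs --cluster-id mynfs \
>     --pseudo-path /public --fsname cephfs --path /public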
>
>
> 2. CephFS behaves differently from e.g. NFS: it has a much stronger
> consistency model. Certain operations that are fine on NFS are an
> anti-pattern for CephFS, e.g. running hundreds of jobs on a compute
> cluster and redirecting their output to a single shared file. Your
> users will have to adapt.
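>
> A typical adaptation is to let every job write its own file and merge
> the results afterwards (illustrative only; $SLURM_JOB_ID stands in for
> whatever job id your scheduler provides):
>
> ./my_job > output/part.${SLURM_JOB_ID}.log
> cat output/part.*.log > combined.log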
>
>
> 3. CephFS maintains an object for each file in the first data pool. This
> object has at least one xattr attached that is crucial for disaster
> recovery. Your first data pool thus cannot be an EC pool. I'm not sure
> how often this value is updated (e.g. does it contain mtime?). If you
> plan to use an EC pool for data storage, you need three pools: a
> metadata pool, a replicated data pool as the first data pool, and the
> EC pool as an additional data pool. You can use file layout attributes
> to control which pool is used for data storage. This is the setup of
> our main filesystem (using only the EC pool for data):
>
> --- POOLS ---
> POOL           ID   PGS    STORED  OBJECTS     USED  %USED  MAX AVAIL
> xxx_metadata   74    64   203 GiB   27.87M  608 GiB   3.05    6.3 TiB
> xxx_data_rep   76   256       0 B  357.92M      0 B      0    6.3 TiB
> xxx_data_ec    77  4096   2.0 PiB  852.56M  2.4 PiB  50.97    1.8 PiB
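>
> Directing the data of a directory (and everything created beneath it)
> to the EC pool is then a matter of a file layout attribute - a sketch
> assuming the pool names above and a mount at /mnt/cephfs:
>
> setfattr -n ceph.dir.layout.pool -v xxx_data_ec /mnt/cephfs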
>
>
> 4. The MDS is a memory hog and mostly single-threaded. Metadata
> performance depends on CPU speed and especially on the amount of RAM
> available: more RAM means more inode information can be cached.
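>
> The main knob here is the MDS cache size, e.g. (the 16 GiB value is
> only an illustration, size it to your hardware):
>
> ceph config set mds mds_cache_memory_limit 17179869184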
>
>
> 5. Avoid workloads that keep too many files open at the same time. Each
> file being accessed requires a capability on the MDS, which consumes a
> certain amount of memory. More clients with more open files -> more RAM
> needed.
>
>
> 6. Finally: in case of a failover, all clients have to reconnect to the
> newly active MDS. It collects inode information for all open files
> during this phase. This can consume a lot of memory, and it will
> definitely take some time, depending on the performance of the Ceph
> cluster. If too many files are open, the MDS may run into a timeout and
> restart, resulting in a restart loop. I have fixed this problem in the
> past by extending the timeout.
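>
> One timeout that is often raised in this situation is the beacon grace
> period, e.g. (the 300-second value is only an illustration, and which
> timeout actually matters depends on where the failover gets stuck):
>
> ceph config set global mds_beacon_grace 300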
>
>
> Overall CephFS is a great system, but you need to know your current and
> future workloads to configure it accordingly. This is also true for any
> other shared filesystem.
>
>
> Best regards,
>
> Burkhard Linke
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx



-- 
Alexander E. Patrakov
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



