Re: Storage usage of CephFS-MDS

When a Ceph cluster is in recovery, it uses much more RAM than it does while running healthy.  The increase is often on the order of 4x (at least back in the filestore days; I'm not 100% certain about bluestore, but I would assume the same applies).  You have another thread on the ML where you are provisioned with less than half of the recommended memory for your cluster, and that could be impacting your recovery.  Are you noticing any OOM-killer messages during this recovery?  Are OSDs flapping up and down?  You would see that as additional peering in the status while you're recovering.
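For example (a sketch only; adjust OSD IDs and log sources for your hosts), you can check for OOM kills on the OSD hosts and watch for OSDs being marked down while the recovery runs:

    # look for OOM-killer activity in the kernel log on each OSD host
    dmesg -T | grep -iE 'out of memory|oom-killer|killed process'

    # watch the cluster log live; flapping OSDs show up as "marked down" / "boot" messages
    ceph -w

    # quick check of how many OSDs are currently up/in
    ceph osd stat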

You mentioned that you increased the max backfills.  What did you set it to?  I usually watch `ceph status` for slow requests and `iostat` on the OSD disks to see how far I can sanely raise max backfills for a given cluster, since the hardware variables make each cluster different.  Have you confirmed that the recovery sleep is indeed 0, or are you assuming it is?  You can check this by querying the OSD daemon directly (a sketch follows below).
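Roughly like this (a sketch; osd.2 is just an example ID, query any OSD via its admin socket on the host it runs on):

    # what the running daemon actually uses for backfills and recovery sleep
    ceph daemon osd.2 config get osd_max_backfills
    ceph daemon osd.2 config get osd_recovery_sleep
    ceph daemon osd.2 config get osd_recovery_sleep_ssd

    # raise max backfills at runtime on all OSDs, to a value your disks can tolerate
    ceph tell osd.* injectargs '--osd-max-backfills 4'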

RocksDB usage is known to scale with the number of objects.  For an object written once and never modified, you're likely to see around 6KB of RocksDB space used per object; if you modify the object regularly, that figure grows.  A safe guess for RocksDB partition sizing is 10GB per 1TB of storage, but that rule of thumb does not apply to systems with immense numbers of small objects.  You'll want to calculate it yourself from a guesstimate of how many objects you'll have and whether they'll be modified after they're written; 7KB/object is a safe number to plan with for objects that are not modified.  You should be able to work out the numbers for your environment easily enough (rough arithmetic below).
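As a back-of-the-envelope sketch with those assumptions (using the ~100,000,000 objects from your test and the 7KB/object guess for unmodified objects):

    100,000,000 objects x 7 KB/object = 700,000,000 KB, i.e. roughly 700 GB of RocksDB space

If objects get rewritten or modified regularly, plan for noticeably more than that.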

On Mon, Feb 26, 2018 at 6:01 AM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
Dear Cephalopodians,

I have to extend my question a bit - in our system with 105,000,000 objects in CephFS (mostly stabilized now after the stress-testing...),
I observe the following data distribution for the metadata pool:
# ceph osd df | head
ID  CLASS WEIGHT  REWEIGHT SIZE  USE    AVAIL %USE  VAR  PGS
  0   ssd 0.21829  1.00000  223G  9927M  213G  4.34 0.79   0
  1   ssd 0.21829  1.00000  223G  9928M  213G  4.34 0.79   0
  2   ssd 0.21819  1.00000  223G 77179M  148G 33.73 6.11 128
  3   ssd 0.21819  1.00000  223G 76981M  148G 33.64 6.10 128

osd.0 - osd.3 are all exclusively meant for cephfs-metadata, currently we use 4 replicas with failure domain OSD there.
I have reinstalled and reformatted osd.0 and osd.1 about 36 hours ago.

All 128 PGs in the metadata pool are backfilling (I have increased osd-max-backfills temporarily to speed things up for those OSDs).
However, they only managed to backfill < 10 GB in those 36 hours. I have not touched any of the other default settings concerning backfill
or recovery (but these are SSDs, so the sleeps should be 0).
The backfilling seems not to be limited by CPU, network, or disks.
"ceph -s" confirms a backfill performance of about 60-100 keys/s.
This metadata, as written before, is almost exclusively RocksDB:

    "bluefs": {
        "gift_bytes": 0,
        "reclaim_bytes": 0,
        "db_total_bytes": 84760592384,
        "db_used_bytes": 77289488384,

Is it normal that this kind of backfilling is so horrendously slow? Is there a way to speed it up?
At this rate, it will take almost two weeks for 77 GB of (meta)data.
Right now, the system is still in the testing phase, but we'd of course like to be able to add more MDSs and SSDs later without extensive backfilling periods.

Cheers,
        Oliver

Am 25.02.2018 um 19:26 schrieb Oliver Freyermuth:
> Dear Cephalopodians,
>
> as part of our stress test with 100,000,000 objects (all small files) we ended up with
> the following usage on the OSDs on which the metadata pool lives:
> # ceph osd df | head
> ID  CLASS WEIGHT  REWEIGHT SIZE  USE    AVAIL %USE  VAR  PGS
> [...]
>   2   ssd 0.21819  1.00000  223G 79649M  145G 34.81 6.62 128
>   3   ssd 0.21819  1.00000  223G 79697M  145G 34.83 6.63 128
>
> The cephfs-data cluster is mostly empty (5 % usage), but contains 100,000,000 small objects.
>
> Looking with:
> ceph daemon osd.2 perf dump
> I get:
>     "bluefs": {
>         "gift_bytes": 0,
>         "reclaim_bytes": 0,
>         "db_total_bytes": 84760592384,
>         "db_used_bytes": 78920024064,
>         "wal_total_bytes": 0,
>         "wal_used_bytes": 0,
>         "slow_total_bytes": 0,
>         "slow_used_bytes": 0,
> so it seems this is almost exclusively RocksDB usage.
>
> Is this expected?
> Is there a recommendation on how much MDS storage is needed for a CephFS with 450 TB?
>
> Cheers,
>       Oliver
>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
