Re: Storage usage of CephFS-MDS

Gregory Farnum <gfarnum@xxxxxxxxxx> · Mon, 26 Feb 2018 19:31:10 +0000

On Mon, Feb 26, 2018 at 11:26 AM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
On 26.02.2018 at 20:09, Oliver Freyermuth wrote:

> On 26.02.2018 at 19:56, Gregory Farnum wrote:

>>

>>

>> On Mon, Feb 26, 2018 at 8:25 AM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:

>>

>>     On 26.02.2018 at 16:59, Patrick Donnelly wrote:

>>     > On Sun, Feb 25, 2018 at 10:26 AM, Oliver Freyermuth

>>     > <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:

>>     >> Looking with:

>>     >> ceph daemon osd.2 perf dump

>>     >> I get:

>>     >>     "bluefs": {
>>     >>         "gift_bytes": 0,
>>     >>         "reclaim_bytes": 0,
>>     >>         "db_total_bytes": 84760592384,
>>     >>         "db_used_bytes": 78920024064,
>>     >>         "wal_total_bytes": 0,
>>     >>         "wal_used_bytes": 0,
>>     >>         "slow_total_bytes": 0,
>>     >>         "slow_used_bytes": 0,

>>     >> so it seems this is almost exclusively RocksDB usage.

>>     >>

>>     >> Is this expected?

>>     >

>>     > Yes. The directory entries are stored in the omap of the objects. This

>>     > will be stored in the RocksDB backend of Bluestore.

>>     >

>>     >> Is there a recommendation on how much MDS storage is needed for a CephFS with 450 TB?

>>     >

>>     > It seems in the above test you're using about 1KB per inode (file).

>>     > Using that you can extrapolate how much space the metadata pool needs

>>     > based on your file system usage. (If all you're doing is filling the

>>     > file system with empty files, of course you're going to need an

>>     > unusually large metadata pool.)

>>     >
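The extrapolation described above can be sketched as a back-of-the-envelope calculation. This is only an illustration: the function names and the `replicas` parameter are hypothetical, and whether the roughly 1 kB per inode figure already includes replication is an assumption you would need to check against your own pool.

```python
def estimate_metadata_bytes(num_inodes, bytes_per_inode=1024, replicas=1):
    """Rough metadata footprint: inode count x per-inode cost x replica count.

    bytes_per_inode defaults to the ~1 kB per inode figure from this thread;
    replicas is a hypothetical knob for pools that keep several copies.
    """
    return num_inodes * bytes_per_inode * replicas


def human(num_bytes):
    """Format a byte count with binary units for quick reading."""
    value = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if value < 1024 or unit == "TiB":
            return f"{value:.1f} {unit}"
        value /= 1024


if __name__ == "__main__":
    # Example: 100 million files, three copies of the metadata.
    print(human(estimate_metadata_bytes(100_000_000, replicas=3)))
```

The estimate grows linearly with the file count, so a file system dominated by tiny or empty files needs far more metadata space per stored byte than one holding large files.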

>>     Many thanks, this helps!

>>     We naturally hope our users will not do this; this stress test was a worst case.

>>     Still, the rough number (1 kB per inode) does indeed help a lot, as does the growth with file modifications

>>     that David laid out.

>>

>>     Is the slow backfilling also normal?

>>     Will the extra storage consumed by many file modifications be reclaimed at some point, i.e.

>>     is the database compacted, can one trigger that, or is there something like "SQL vacuum"?

>>

>>     To also answer David's questions in parallel:

>>     - Concerning the slow backfill, I am only talking about the "metadata OSDs".

>>       They are fully SSD backed, and have no separate device for block.db / WAL.

>>     - I adjusted backfills up to 128 for those metadata OSDs; the cluster is currently fully empty, i.e. no clients are doing anything.

>>       There are no slow requests.

>>       Since no clients are doing anything and the rest of the cluster is now clean (apart from the two backfilling OSDs),

>>       right now there is also no memory pressure at all.

>>       The "clean" OSDs are reading with 7 MB/s each, with 5 % CPU load each.

>>       The OSDs being backfilled have 3.3 % CPU load, and have about 250 kB/s of write throughput.

>>       Network traffic between the node with the clean OSDs and the "being-backfilled" OSDs is about 1.5 Mbit/s, while there is significantly more bandwidth available...

>>     - Checking sleeps with:

>>     # ceph -n osd.1 --show-config | grep sleep

>>     osd_recovery_sleep = 0.000000

>>     osd_recovery_sleep_hdd = 0.100000

>>     osd_recovery_sleep_hybrid = 0.025000

>>     osd_recovery_sleep_ssd = 0.000000

>>     shows there should be 0 sleep. Or is there another way to query?

>>

>>

>> Check if the OSDs are reporting their stores or their journals to be "rotational" via "ceph osd metadata"?

>

> I find:

>         "bluestore_bdev_model": "Micron_5100_MTFD",

>         "bluestore_bdev_partition_path": "/dev/sda2",

>         "bluestore_bdev_rotational": "0",

>         "bluestore_bdev_size": "239951482880",

>         "bluestore_bdev_type": "ssd",

> [...]

>         "rotational": "0"

>

> for all of them (obviously with different device paths).

> Also, they've been assigned the ssd device class automatically:

> # ceph osd df | head
> ID  CLASS WEIGHT  REWEIGHT SIZE  USE    AVAIL %USE  VAR  PGS
>   0   ssd 0.21829  1.00000  223G 11310M  212G  4.94 0.94   0
>   1   ssd 0.21829  1.00000  223G 11368M  212G  4.97 0.95   0
>   2   ssd 0.21819  1.00000  223G 76076M  149G 33.25 6.35 128
>   3   ssd 0.21819  1.00000  223G 76268M  148G 33.33 6.37 128

>

> So this should not be the reason...

>

Checking again with the nice "grep" expression from the other thread concerning bluestore backfilling...

# ceph osd metadata | grep 'id\|rotational'

yields:

        "id": 0,
        "bluefs_db_rotational": "0",
        "bluestore_bdev_rotational": "0",
        "journal_rotational": "1",
        "rotational": "0"
        "id": 1,
        "bluefs_db_rotational": "0",
        "bluestore_bdev_rotational": "0",
        "journal_rotational": "1",
        "rotational": "0"
        "id": 2,
        "bluefs_db_rotational": "0",
        "bluestore_bdev_rotational": "0",
        "journal_rotational": "1",
        "rotational": "0"
        "id": 3,
        "bluefs_db_rotational": "0",
        "bluestore_bdev_rotational": "0",
        "journal_rotational": "1",
        "rotational": "0"
        "id": 4,
        "bluefs_db_rotational": "0",
        "bluefs_slow_rotational": "1",
        "bluestore_bdev_rotational": "1",
        "journal_rotational": "1",
        "rotational": "1"

0-3 are pure SSDs, there is no separate block.db device.

Is "journal_rotational" really relevant for bluestore, though?

If so, detection seems broken...
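
The mismatch in the output above can be spotted mechanically. The following is a minimal sketch, assuming the complete JSON list printed by "ceph osd metadata" is available as text; the function name is hypothetical, and the embedded sample only repeats the values reported for OSDs 2 and 4 in this thread.

```python
import json

# Sample entries reduced to the fields shown in the thread (OSDs 2 and 4).
SAMPLE = """
[
  {"id": 2, "bluefs_db_rotational": "0", "bluestore_bdev_rotational": "0",
   "journal_rotational": "1", "rotational": "0"},
  {"id": 4, "bluefs_db_rotational": "0", "bluefs_slow_rotational": "1",
   "bluestore_bdev_rotational": "1", "journal_rotational": "1",
   "rotational": "1"}
]
"""


def journal_mismatches(osd_metadata):
    """Return ids of OSDs that report a rotational journal although every
    other *rotational field says the device is non-rotational."""
    flagged = []
    for osd in osd_metadata:
        others = {
            value
            for key, value in osd.items()
            if key.endswith("rotational") and key != "journal_rotational"
        }
        if osd.get("journal_rotational") == "1" and others == {"0"}:
            flagged.append(osd["id"])
    return flagged


if __name__ == "__main__":
    print(journal_mismatches(json.loads(SAMPLE)))
```

OSD 4 is not flagged because its main device really is rotational, so a rotational journal value is at least plausible there.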

For BlueStore it's using that config value to convey data about the WAL and db. As with that thread, check if your OS is lying (they often do) about the relevant block devices; at a quick skim the bluestore detection code looks correct to me.
-Greg
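
One way to check what the OS itself reports is to read the Linux sysfs attribute queue/rotational for each block device. This is a sketch under assumptions: it only shows the kernel's view and does not reproduce how BlueStore derives its own values, and the function name and the sysfs_block parameter are illustrative.

```python
from pathlib import Path


def rotational_flags(sysfs_block="/sys/block"):
    """Map each block device name to True (rotational) or False (non-rotational).

    Reads <sysfs_block>/<device>/queue/rotational, which the Linux kernel
    exposes as "1" for spinning disks and "0" otherwise. Devices without
    a readable attribute are skipped.
    """
    flags = {}
    root = Path(sysfs_block)
    if not root.is_dir():
        return flags
    for dev in sorted(root.iterdir()):
        try:
            text = (dev / "queue" / "rotational").read_text()
        except OSError:
            continue
        flags[dev.name] = text.strip() == "1"
    return flags


if __name__ == "__main__":
    for name, is_rotational in rotational_flags().items():
        print(f"{name}: {'rotational' if is_rotational else 'non-rotational'}")
```

Note that /sys/block lists whole disks (for example sda rather than sda2), so a partition path from "ceph osd metadata" has to be mapped back to its parent device before comparing.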
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com