Re: Storage usage of CephFS-MDS

Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> · Mon, 26 Feb 2018 20:26:14 +0100

On 26.02.2018 at 20:09, Oliver Freyermuth wrote:
> On 26.02.2018 at 19:56, Gregory Farnum wrote:
>>
>>
>> On Mon, Feb 26, 2018 at 8:25 AM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>>
>>     On 26.02.2018 at 16:59, Patrick Donnelly wrote:
>>     > On Sun, Feb 25, 2018 at 10:26 AM, Oliver Freyermuth
>>     > <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>>     >> Looking with:
>>     >> ceph daemon osd.2 perf dump
>>     >> I get:
>>     >>     "bluefs": {
>>     >>         "gift_bytes": 0,
>>     >>         "reclaim_bytes": 0,
>>     >>         "db_total_bytes": 84760592384,
>>     >>         "db_used_bytes": 78920024064,
>>     >>         "wal_total_bytes": 0,
>>     >>         "wal_used_bytes": 0,
>>     >>         "slow_total_bytes": 0,
>>     >>         "slow_used_bytes": 0,
>>     >> so it seems this is almost exclusively RocksDB usage.
>>     >>
>>     >> Is this expected?
>>     >
>>     > Yes. The directory entries are stored in the omap of the objects. This
>>     > will be stored in the RocksDB backend of Bluestore.
>>     >
>>     >> Is there a recommendation on how much MDS storage is needed for a CephFS with 450 TB?
>>     >
>>     > It seems in the above test you're using about 1KB per inode (file).
>>     > Using that you can extrapolate how much space the data pool needs
>>     > based on your file system usage. (If all you're doing is filling the
>>     > file system with empty files, of course you're going to need an
>>     > unusually large metadata pool.)
>>     >
>>     Many thanks, this helps!
>>     We naturally hope our users will not do this; this stress test was a worst case.
>>     Still, the rough number (1 kB per inode) does indeed help a lot, as does the note on the increase
>>     with file modifications, as laid out by David.
>>
>>     Is the slow backfilling also normal?
>>     Will such an increase in storage (caused by many file modifications) also shrink again at some point, i.e.
>>     is the database compacted, can one trigger that, is there something like an SQL "vacuum"?
>>
>>     To also answer David's questions in parallel:
>>     - Concerning the slow backfill, I am only talking about the "metadata OSDs".
>>       They are fully SSD backed, and have no separate device for block.db / WAL.
>>     - I adjusted backfills up to 128 for those metadata OSDs, the cluster is currently fully empty, i.e. no clients are doing anything.
>>       There are no slow requests.
>>       Since no clients are doing anything and the rest of the cluster is now clean (apart from the two backfilling OSDs),
>>       right now there is also no memory pressure at all.
>>       The "clean" OSDs are reading at 7 MB/s each, with 5 % CPU load each.
>>       The OSDs being backfilled have 3.3 % CPU load, and have about 250 kB/s of write throughput.
>>       Network traffic between the node with the clean OSDs and the "being-backfilled" OSDs is about 1.5 Mbit/s, while there is significantly more bandwidth available...
>>     - Checking sleeps with:
>>     # ceph -n osd.1 --show-config | grep sleep
>>     osd_recovery_sleep = 0.000000
>>     osd_recovery_sleep_hdd = 0.100000
>>     osd_recovery_sleep_hybrid = 0.025000
>>     osd_recovery_sleep_ssd = 0.000000
>>     shows there should be 0 sleep. Or is there another way to query?
>>
>>
>> Check if the OSDs are reporting their stores or their journals to be "rotational" via "ceph osd metadata"?
> 
> I find:
>         "bluestore_bdev_model": "Micron_5100_MTFD",
>         "bluestore_bdev_partition_path": "/dev/sda2",
>         "bluestore_bdev_rotational": "0",
>         "bluestore_bdev_size": "239951482880",
>         "bluestore_bdev_type": "ssd",
> [...]
>         "rotational": "0"
> 
> for all of them (obviously with different device paths). 
> Also, they've been assigned the ssd device class automatically: 
> # ceph osd df | head
> ID  CLASS WEIGHT  REWEIGHT SIZE  USE    AVAIL %USE  VAR  PGS
>   0   ssd 0.21829  1.00000  223G 11310M  212G  4.94 0.94   0
>   1   ssd 0.21829  1.00000  223G 11368M  212G  4.97 0.95   0
>   2   ssd 0.21819  1.00000  223G 76076M  149G 33.25 6.35 128
>   3   ssd 0.21819  1.00000  223G 76268M  148G 33.33 6.37 128
> 
> So this should not be the reason... 
> 

Checking again with the nice "grep" expression from the other thread concerning bluestore backfilling...
# ceph osd metadata | grep 'id\|rotational'
yields:
        "id": 0,
        "bluefs_db_rotational": "0",
        "bluestore_bdev_rotational": "0",
        "journal_rotational": "1",
        "rotational": "0"
        "id": 1,
        "bluefs_db_rotational": "0",
        "bluestore_bdev_rotational": "0",
        "journal_rotational": "1",
        "rotational": "0"
        "id": 2,
        "bluefs_db_rotational": "0",
        "bluestore_bdev_rotational": "0",
        "journal_rotational": "1",
        "rotational": "0"
        "id": 3,
        "bluefs_db_rotational": "0",
        "bluestore_bdev_rotational": "0",
        "journal_rotational": "1",
        "rotational": "0"
        "id": 4,
        "bluefs_db_rotational": "0",
        "bluefs_slow_rotational": "1",
        "bluestore_bdev_rotational": "1",
        "journal_rotational": "1",
        "rotational": "1"
OSDs 0-3 are pure SSDs with no separate block.db device, yet they report "journal_rotational": "1".
Is "journal_rotational" really relevant for bluestore, though?
If so, detection seems broken...

For comparison, OSD 4 is an HDD with block.db on an SSD.
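
To make this comparison less error-prone than eyeballing the grep output, here is a minimal Python sketch (not part of Ceph) that flags OSDs whose "journal_rotational" disagrees with "bluestore_bdev_rotational". It assumes the "ceph osd metadata" output is available as a JSON array of per-OSD objects with the string-valued fields shown above. The function name and the trimmed sample data are made up for illustration.

```python
import json


def find_journal_mismatches(metadata_json):
    """Return sorted OSD ids whose journal flag differs from the main device flag.

    Assumption: the input is a JSON array of objects carrying "id",
    "bluestore_bdev_rotational" and "journal_rotational", as in the
    grep output above. OSDs missing either field are skipped.
    """
    mismatched = []
    for osd in json.loads(metadata_json):
        bdev = osd.get("bluestore_bdev_rotational")
        journal = osd.get("journal_rotational")
        if bdev is None or journal is None:
            continue  # not enough information to compare
        if bdev != journal:
            mismatched.append(osd["id"])
    return sorted(mismatched)


# Trimmed, hypothetical sample built from the values quoted in this mail.
SAMPLE = json.dumps([
    {"id": 0, "bluestore_bdev_rotational": "0", "journal_rotational": "1", "rotational": "0"},
    {"id": 2, "bluestore_bdev_rotational": "0", "journal_rotational": "1", "rotational": "0"},
    {"id": 4, "bluestore_bdev_rotational": "1", "journal_rotational": "1", "rotational": "1"},
])

print(find_journal_mismatches(SAMPLE))
```

With the values from this cluster, the SSD-only OSDs are flagged and OSD 4 is not. The sketch only reports the inconsistency; whether a flagged OSD actually ends up using the HDD sleep is a separate question.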

Cheers,
	Oliver

>>
>> If that's being detected wrong, that would cause them to be using those sleeps.
>> -Greg
>>
>>
> 
> 
> 
> 
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
