Re: Storage usage of CephFS-MDS

On 26.02.2018 at 19:56, Gregory Farnum wrote:
> 
> 
> On Mon, Feb 26, 2018 at 8:25 AM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
> 
>     On 26.02.2018 at 16:59, Patrick Donnelly wrote:
>     > On Sun, Feb 25, 2018 at 10:26 AM, Oliver Freyermuth
>     > <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>     >> Looking with:
>     >> ceph daemon osd.2 perf dump
>     >> I get:
>     >>     "bluefs": {
>     >>         "gift_bytes": 0,
>     >>         "reclaim_bytes": 0,
>     >>         "db_total_bytes": 84760592384,
>     >>         "db_used_bytes": 78920024064,
>     >>         "wal_total_bytes": 0,
>     >>         "wal_used_bytes": 0,
>     >>         "slow_total_bytes": 0,
>     >>         "slow_used_bytes": 0,
>     >> so it seems this is almost exclusively RocksDB usage.
>     >>
>     >> Is this expected?
>     >
>     > Yes. The directory entries are stored in the omap of the objects. This
>     > will be stored in the RocksDB backend of Bluestore.
>     >
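(As an aside, for anyone reading along: those omap entries can be inspected directly with rados. The dirfrag objects in the metadata pool are named <directory inode in hex>.<fragment>, so the root directory is 1.00000000; the pool name "cephfs_metadata" below is just a placeholder for my setup.)

# list the omap keys (roughly one per directory entry) of the root directory's dirfrag object
# "cephfs_metadata" is an example pool name -- adjust to your cluster
rados -p cephfs_metadata listomapkeys 1.00000000
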
>     >> Is there a recommendation on how much MDS storage is needed for a CephFS with 450 TB?
>     >
>     > It seems in the above test you're using about 1KB per inode (file).
>     > Using that you can extrapolate how much space the metadata pool needs
>     > based on your file system usage. (If all you're doing is filling the
>     > file system with empty files, of course you're going to need an
>     > unusually large metadata pool.)
>     >
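(Writing the extrapolation down explicitly, with assumed numbers -- the 100 million files and 3x replication below are just an example, not our measured values:)

# back-of-the-envelope metadata pool sizing from the ~1 kB per inode figure
# assumptions: 100 million inodes, 3x replication on the metadata pool
python -c 'n=100e6; b=1024; r=3; print("%.0f GB raw, %.0f GB with replication" % (n*b/1e9, n*b*r/1e9))'
# -> 102 GB raw, 307 GB with replication
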
>     Many thanks, this helps!
>     We naturally hope our users will not do this; the stress test was a worst case.
>     But the rough number (1 kB per inode) does indeed help a lot, as does the expected increase
>     with file modifications as laid out by David.
> 
>     Is the slow backfilling also normal?
>     Will such an increase in storage (from many file modifications) be reduced again at some point, i.e.
>     is the database compacted automatically, can one trigger that manually, or is there something like an SQL "vacuum"?
> 
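(Side note: depending on the Ceph release, the OSD admin socket may offer a manual compaction command; whether it exists can be checked via the socket's help output. I have not verified this on our version, so the snippet below is only a sketch.)

# check whether this release exposes a compact command on the admin socket
ceph daemon osd.2 help | grep -i compact
# if it does, trigger a manual RocksDB/omap compaction (adds I/O load while running)
ceph daemon osd.2 compact
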
>     To also answer David's questions in parallel:
>     - Concerning the slow backfill, I am only talking about the "metadata OSDs".
>       They are fully SSD backed, and have no separate device for block.db / WAL.
>     - I adjusted backfills up to 128 for those metadata OSDs; the cluster is currently fully empty, i.e. no clients are doing anything.
>       There are no slow requests.
>       Since no clients are doing anything and the rest of the cluster is now clean (apart from the two backfilling OSDs),
>       right now there is also no memory pressure at all.
>       The "clean" OSDs are reading with 7 MB/s each, with 5 % CPU load each.
>       The OSDs being backfilled have 3.3 % CPU load, and have about 250 kB/s of write throughput.
>       Network traffic between the node with the clean OSDs and the "being-backfilled" OSDs is about 1.5 Mbit/s, while significantly more bandwidth is available...
>     - Checking sleeps with:
>     # ceph -n osd.1 --show-config | grep sleep
>     osd_recovery_sleep = 0.000000
>     osd_recovery_sleep_hdd = 0.100000
>     osd_recovery_sleep_hybrid = 0.025000
>     osd_recovery_sleep_ssd = 0.000000
>     shows there should be 0 sleep. Or is there another way to query?
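(For completeness, the admin socket also reports what the running daemon actually uses, which may differ from what --show-config derives from ceph.conf:)

# ask the running OSD itself for its effective values
ceph daemon osd.1 config get osd_recovery_sleep_ssd
ceph daemon osd.1 config show | grep sleep
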
> 
> 
> Check if the OSDs are reporting their stores or their journals to be "rotational" via "ceph osd metadata"?

I find:
        "bluestore_bdev_model": "Micron_5100_MTFD",
        "bluestore_bdev_partition_path": "/dev/sda2",
        "bluestore_bdev_rotational": "0",
        "bluestore_bdev_size": "239951482880",
        "bluestore_bdev_type": "ssd",
[...]
        "rotational": "0"

for all of them (obviously with different device paths). 
Also, they've been assigned the ssd device class automatically: 
# ceph osd df | head
ID  CLASS WEIGHT  REWEIGHT SIZE  USE    AVAIL %USE  VAR  PGS                           
  0   ssd 0.21829  1.00000  223G 11310M  212G  4.94 0.94   0                                       
  1   ssd 0.21829  1.00000  223G 11368M  212G  4.97 0.95   0                                       
  2   ssd 0.21819  1.00000  223G 76076M  149G 33.25 6.35 128                                           
  3   ssd 0.21819  1.00000  223G 76268M  148G 33.33 6.37 128

So this should not be the reason... 
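(In case it is useful to spot-check this across all OSDs at once, something like the following should work, assuming jq is available:)

# dump the rotational flags for all OSDs in one go (requires jq)
ceph osd metadata | jq '.[] | {id, bluestore_bdev_rotational, rotational}'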

> 
> If that's being detected wrong, that would cause them to be using those sleeps.
> -Greg
> 
> 
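(Should the sleeps turn out to be the culprit after all, I assume they could also be zeroed at runtime, e.g.:)

# override the recovery sleeps at runtime (takes effect immediately, not persistent across restarts)
ceph tell osd.* injectargs '--osd_recovery_sleep_hdd 0 --osd_recovery_sleep_hybrid 0'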

