On 26.02.2018 at 19:56, Gregory Farnum wrote:
> On Mon, Feb 26, 2018 at 8:25 AM Oliver Freyermuth
> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>
>     On 26.02.2018 at 16:59, Patrick Donnelly wrote:
>     > On Sun, Feb 25, 2018 at 10:26 AM, Oliver Freyermuth
>     > <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>     >> Looking with:
>     >>     ceph daemon osd.2 perf dump
>     >> I get:
>     >>     "bluefs": {
>     >>         "gift_bytes": 0,
>     >>         "reclaim_bytes": 0,
>     >>         "db_total_bytes": 84760592384,
>     >>         "db_used_bytes": 78920024064,
>     >>         "wal_total_bytes": 0,
>     >>         "wal_used_bytes": 0,
>     >>         "slow_total_bytes": 0,
>     >>         "slow_used_bytes": 0,
>     >> so it seems this is almost exclusively RocksDB usage.
>     >>
>     >> Is this expected?
>     >
>     > Yes. The directory entries are stored in the omap of the objects. This
>     > will be stored in the RocksDB backend of Bluestore.
>     >
>     >> Is there a recommendation on how much MDS storage is needed for a
>     >> CephFS with 450 TB?
>     >
>     > It seems in the above test you're using about 1 kB per inode (file).
>     > Using that, you can extrapolate how much space the metadata pool needs
>     > based on your file system usage. (If all you're doing is filling the
>     > file system with empty files, of course you're going to need an
>     > unusually large metadata pool.)
>
>     Many thanks, this helps!
>     We naturally hope our users will not do this; this stress test was a
>     worst case. But the rough number (1 kB per inode) does indeed help a
>     lot, as does the expected growth through file modifications, as laid
>     out by David.
>
>     Is the slow backfilling also normal?
>     Will such growth in storage (from many file modifications) be reduced
>     again at some point, i.e. is the database compacted / can one trigger
>     that / is there something like "VACUUM" in SQL?
>
>     To also answer David's questions in parallel:
>     - Concerning the slow backfill, I am only talking about the "metadata
>       OSDs". They are fully SSD-backed and have no separate device for
>       block.db / WAL.
>     - I adjusted backfills up to 128 for those metadata OSDs; the cluster
>       is currently fully idle, i.e. no clients are doing anything.
>       There are no slow requests.
>       Since no clients are doing anything and the rest of the cluster is
>       now clean (apart from the two backfilling OSDs), right now there is
>       also no memory pressure at all.
>       The "clean" OSDs are reading at 7 MB/s each, with 5 % CPU load each.
>       The OSDs being backfilled have 3.3 % CPU load and about 250 kB/s of
>       write throughput.
>       Network traffic between the node with the clean OSDs and the
>       "being-backfilled" OSDs is about 1.5 Mbit/s, while significantly
>       more bandwidth is available...
>     - Checking the sleeps with:
>           # ceph -n osd.1 --show-config | grep sleep
>           osd_recovery_sleep = 0.000000
>           osd_recovery_sleep_hdd = 0.100000
>           osd_recovery_sleep_hybrid = 0.025000
>           osd_recovery_sleep_ssd = 0.000000
>       shows there should be no sleep. Or is there another way to query?
>
> Check if the OSDs are reporting their stores or their journals to be
> "rotational" via "ceph osd metadata"?

I find:
    "bluestore_bdev_model": "Micron_5100_MTFD",
    "bluestore_bdev_partition_path": "/dev/sda2",
    "bluestore_bdev_rotational": "0",
    "bluestore_bdev_size": "239951482880",
    "bluestore_bdev_type": "ssd",
    [...]
    "rotational": "0"
for all of them (obviously with different device paths).
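To answer my own question about another way to query: the admin socket
should report the values the running OSD actually uses, including any
runtime overrides — a quick check, assuming the default admin socket
setup:

    # ceph daemon osd.1 config show | grep sleep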
Also, they have been assigned the "ssd" device class automatically:

    # ceph osd df | head
    ID CLASS WEIGHT  REWEIGHT SIZE  USE    AVAIL %USE  VAR  PGS
     0   ssd 0.21829  1.00000  223G 11310M  212G  4.94 0.94   0
     1   ssd 0.21829  1.00000  223G 11368M  212G  4.97 0.95   0
     2   ssd 0.21819  1.00000  223G 76076M  149G 33.25 6.35 128
     3   ssd 0.21819  1.00000  223G 76268M  148G 33.33 6.37 128

So this should not be the reason...

> If that's being detected wrong, that would cause them to be using
> those sleeps.
> -Greg
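For the archives: if the rotational flag ever were detected wrongly, I
believe the sleeps could also be forced off at runtime instead — a
sketch only, not needed here since detection looks correct:

    # ceph tell osd.* injectargs '--osd_recovery_sleep_hdd 0 --osd_recovery_sleep_hybrid 0'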
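And as a footnote to my compaction question above: as far as I
understand, the RocksDB of a BlueStore OSD can be compacted manually
while the OSD is stopped — a sketch, assuming osd.2 and the default
data path; I have not verified how much of db_used_bytes this actually
reclaims:

    # systemctl stop ceph-osd@2
    # ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-2 compact
    # systemctl start ceph-osd@2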