On 26.02.2018 at 20:09, Oliver Freyermuth wrote:
> On 26.02.2018 at 19:56, Gregory Farnum wrote:
>>
>>
>> On Mon, Feb 26, 2018 at 8:25 AM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>>
>> On 26.02.2018 at 16:59, Patrick Donnelly wrote:
>> > On Sun, Feb 25, 2018 at 10:26 AM, Oliver Freyermuth
>> > <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>> >> Looking with:
>> >> ceph daemon osd.2 perf dump
>> >> I get:
>> >> "bluefs": {
>> >> "gift_bytes": 0,
>> >> "reclaim_bytes": 0,
>> >> "db_total_bytes": 84760592384,
>> >> "db_used_bytes": 78920024064,
>> >> "wal_total_bytes": 0,
>> >> "wal_used_bytes": 0,
>> >> "slow_total_bytes": 0,
>> >> "slow_used_bytes": 0,
>> >> so it seems this is almost exclusively RocksDB usage.
>> >>
>> >> Is this expected?
>> >
>> > Yes. The directory entries are stored in the omap of the objects. This
>> > will be stored in the RocksDB backend of Bluestore.
>> >
>> >> Is there a recommendation on how much MDS storage is needed for a CephFS with 450 TB?
>> >
>> > It seems in the above test you're using about 1KB per inode (file).
>> > Using that you can extrapolate how much space the data pool needs
>> > based on your file system usage. (If all you're doing is filling the
>> > file system with empty files, of course you're going to need an
>> > unusually large metadata pool.)
>> >
>> Many thanks, this helps!
>> We naturally hope our users will not do this, this stress test was a worst case -
>> but the rough number (1 kB per inode) does indeed help a lot, and also the increase with modifications
>> of the file as laid out by David.
>>
>> Is the slow backfilling also normal?
>> Will such an increase in storage (by many file modifications) at some point also be reduced, i.e.
>> is the database compacted / can one trigger that / is there something like "SQL vacuum"?
>>
>> To also answer David's questions in parallel:
>> - Concerning the slow backfill, I am only talking about the "metadata OSDs".
>> They are fully SSD backed, and have no separate device for block.db / WAL.
>> - I adjusted backfills up to 128 for those metadata OSDs, the cluster is currently fully empty, i.e. no clients are doing anything.
>> There are no slow requests.
>> Since no clients are doing anything and the rest of the cluster is now clean (apart from the two backfilling OSDs),
>> right now there is also no memory pressure at all.
>> The "clean" OSDs are reading with 7 MB/s each, with 5 % CPU load each.
>> The OSDs being backfilled have 3.3 % CPU load, and have about 250 kB/s of write throughput.
>> Network traffic between the node with the clean OSDs and the "being-backfilled" OSDs is about 1.5 Mbit/s, while there is significantly more bandwidth available...
>> - Checking sleeps with:
>> # ceph -n osd.1 --show-config | grep sleep
>> osd_recovery_sleep = 0.000000
>> osd_recovery_sleep_hdd = 0.100000
>> osd_recovery_sleep_hybrid = 0.025000
>> osd_recovery_sleep_ssd = 0.000000
>> shows there should be 0 sleep. Or is there another way to query?
>>
>>
>> Check if the OSDs are reporting their stores or their journals to be "rotational" via "ceph osd metadata"?
>
> I find:
> "bluestore_bdev_model": "Micron_5100_MTFD",
> "bluestore_bdev_partition_path": "/dev/sda2",
> "bluestore_bdev_rotational": "0",
> "bluestore_bdev_size": "239951482880",
> "bluestore_bdev_type": "ssd",
> [...]
> "rotational": "0"
>
> for all of them (obviously with different device paths).
> Also, they've been assigned the ssd device class automatically:
> # ceph osd df | head
> ID CLASS  WEIGHT REWEIGHT SIZE    USE AVAIL  %USE  VAR PGS
>  0   ssd 0.21829  1.00000 223G 11310M  212G  4.94 0.94   0
>  1   ssd 0.21829  1.00000 223G 11368M  212G  4.97 0.95   0
>  2   ssd 0.21819  1.00000 223G 76076M  149G 33.25 6.35 128
>  3   ssd 0.21819  1.00000 223G 76268M  148G 33.33 6.37 128
>
> So this should not be the reason...
>

Checking again with the nice "grep" expression from the other thread concerning bluestore backfilling...
# ceph osd metadata | grep 'id\|rotational'
yields:
    "id": 0,
    "bluefs_db_rotational": "0",
    "bluestore_bdev_rotational": "0",
    "journal_rotational": "1",
    "rotational": "0"
    "id": 1,
    "bluefs_db_rotational": "0",
    "bluestore_bdev_rotational": "0",
    "journal_rotational": "1",
    "rotational": "0"
    "id": 2,
    "bluefs_db_rotational": "0",
    "bluestore_bdev_rotational": "0",
    "journal_rotational": "1",
    "rotational": "0"
    "id": 3,
    "bluefs_db_rotational": "0",
    "bluestore_bdev_rotational": "0",
    "journal_rotational": "1",
    "rotational": "0"
    "id": 4,
    "bluefs_db_rotational": "0",
    "bluefs_slow_rotational": "1",
    "bluestore_bdev_rotational": "1",
    "journal_rotational": "1",
    "rotational": "1"

0-3 are pure SSDs, there is no separate block.db device.
Is "journal_rotational" really relevant for bluestore, though? If so, detection seems broken...
For comparison, 4 is a HDD with block.db on an SSD.

Cheers,
Oliver

>>
>> If that's being detected wrong, that would cause them to be using those sleeps.
>> -Greg
>>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
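
For anyone wanting to run the same check across many OSDs without eyeballing grep output, here is a minimal sketch in Python. It assumes the output of "ceph osd metadata" has been parsed as a JSON array of per-OSD objects carrying the "id", "journal_rotational" and "rotational" keys shown above; the function name find_rotational_mismatches is made up for illustration, and the sample data simply mirrors the five OSDs listed in this mail.

```python
import json


def find_rotational_mismatches(osd_metadata):
    """Return (osd_id, journal_rotational, rotational) for every OSD
    whose journal_rotational flag disagrees with its rotational flag."""
    mismatches = []
    for osd in osd_metadata:
        journal = osd.get("journal_rotational")
        overall = osd.get("rotational")
        # Skip entries that lack either key instead of guessing a value.
        if journal is None or overall is None:
            continue
        if journal != overall:
            mismatches.append((osd["id"], journal, overall))
    return mismatches


# Sample data mirroring the values quoted in this mail (not live output).
sample = json.loads("""
[
  {"id": 0, "journal_rotational": "1", "rotational": "0"},
  {"id": 1, "journal_rotational": "1", "rotational": "0"},
  {"id": 2, "journal_rotational": "1", "rotational": "0"},
  {"id": 3, "journal_rotational": "1", "rotational": "0"},
  {"id": 4, "journal_rotational": "1", "rotational": "1"}
]
""")

for osd_id, journal, overall in find_rotational_mismatches(sample):
    print(f"osd.{osd_id}: journal_rotational={journal} but rotational={overall}")
```

With the sample above this flags OSDs 0 to 3 and leaves OSD 4 alone, which matches the observation that only the pure-SSD OSDs report a rotational journal. Whether that flag actually influences the recovery sleep on bluestore is exactly the open question here; the sketch only surfaces the disagreement.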