On 10/6/2021 2:16 PM, José H. Freidhof wrote:
Hi Igor,
yes, I have some OSD settings set :-) Here is my ceph config dump. Those
settings are from a Red Hat document for BlueStore devices.
Maybe it is that setting causing this problem? "advanced
mon_compact_on_trim false"???
OMG!!!
No - mon_compact_on_trim has nothing to do with BlueStore.
Highly likely it's bluestore_rocksdb_options which hurts...
Documentation tends to fall behind best practices... I would
strongly discourage you from using non-default settings unless it's
absolutely clear why they are necessary.
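A quick way to list everything that differs from the built-in defaults on a given OSD (a sketch, assuming osd.2 and access to its admin socket; it should print each changed option alongside its default):

ceph daemon osd.2 config diff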
Even at first glance, the following settings (just a few I'm fully
aware of) are suboptimal/not recommended:
rocksdb_perf
bluefs_sync_write
bluestore_csum_type
Not to mention bluestore_rocksdb_options, which hasn't seen much adoption
so far and apparently greatly alters RocksDB behavior...
So I would suggest reverting the rocksdb options back to default, running
the compaction, and, if it succeeds, monitoring the OSD for a while. If it
then works fine, apply the same to the other OSDs.
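A minimal sketch of that revert (option names taken from the config dump below; verify against your own dump first):

# drop the custom tunings so the OSDs fall back to defaults
ceph config rm osd bluestore_rocksdb_options
ceph config rm osd bluefs_sync_write
ceph config rm global rocksdb_perf
# restart the test OSD so the default options take effect
ceph orch daemon restart osd.2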
Hope this helps,
Igor
I will test it this afternoon... At the moment everything is
semi-productive and I need to repair one OSD node, because I think this is
the reason the OSDs crashed on that node, and the OSD container now
crashes with a dump while coming up.
I first need to replicate everything between all three nodes, then I can
take osd.2 offline and test your command. I will inform you later...
root@cd88-ceph-osdh-01:/# ceph config dump
WHO     MASK                      LEVEL     OPTION                                   VALUE            RO
global                            advanced  leveldb_max_open_files                   131072
global                            advanced  mon_compact_on_trim                      false
global                            dev       ms_crc_data                              false
global                            advanced  osd_deep_scrub_interval                  1209600.000000
global                            advanced  osd_max_scrubs                           16
global                            advanced  osd_scrub_load_threshold                 0.010000
global                            advanced  osd_scrub_max_interval                   1209600.000000
global                            advanced  osd_scrub_min_interval                   86400.000000
global                            advanced  perf                                     true
global                            advanced  rbd_readahead_disable_after_bytes        0
global                            advanced  rbd_readahead_max_bytes                  4194304
global                            advanced  rocksdb_perf                             true
global                            advanced  throttler_perf_counter                   false
mon                               advanced  auth_allow_insecure_global_id_reclaim    false
mon                               advanced  cluster_network                          10.50.50.0/24    *
mon                               advanced  mon_osd_down_out_interval                300
mon                               advanced  public_network                           10.50.50.0/24    *
mgr                               advanced  mgr/cephadm/container_init               True             *
mgr                               advanced  mgr/cephadm/device_enhanced_scan         true             *
mgr                               advanced  mgr/cephadm/migration_current            2                *
mgr                               advanced  mgr/cephadm/warn_on_stray_daemons        false            *
mgr                               advanced  mgr/cephadm/warn_on_stray_hosts          false            *
osd                               advanced  bluefs_sync_write                        true
osd                               dev       bluestore_cache_autotune                 true
osd                               dev       bluestore_cache_kv_ratio                 0.200000
osd                               dev       bluestore_cache_meta_ratio               0.800000
osd                               dev       bluestore_cache_size                     2147483648
osd                               dev       bluestore_cache_size_hdd                 2147483648
osd                               advanced  bluestore_csum_type                      none
osd                               dev       bluestore_extent_map_shard_max_size      200
osd                               dev       bluestore_extent_map_shard_min_size      50
osd                               dev       bluestore_extent_map_shard_target_size   100
osd                               advanced  bluestore_rocksdb_options                compression=kNoCompression,max_write_buffer_number=64,min_write_buffer_number_to_merge=32,recycle_log_file_num=64,compaction_style=kCompactionStyleLevel,write_buffer_size=4MB,target_file_size_base=4MB,max_background_compactions=64,level0_file_num_compaction_trigger=64,level0_slowdown_writes_trigger=128,level0_stop_writes_trigger=256,max_bytes_for_level_base=6GB,compaction_threads=32,flusher_threads=8,compaction_readahead_size=2MB    *
osd                               advanced  mon_osd_cache_size                       1024
osd                               dev       ms_crc_data                              false
osd                               advanced  osd_map_share_max_epochs                 5
osd                               advanced  osd_max_backfills                        1
osd                               dev       osd_max_pg_log_entries                   10
osd                               dev       osd_memory_cache_min                     3000000000
osd     host:cd133-ceph-osdh-01   basic     osd_memory_target                        5797322383
osd     host:cd133k-ceph-osdh-01  basic     osd_memory_target                        9402402385
osd     host:cd88-ceph-osdh-01    basic     osd_memory_target                        5797322096
osd                               advanced  osd_memory_target_autotune               true
osd                               dev       osd_min_pg_log_entries                   10
osd                               advanced  osd_op_num_shards                        8                *
osd                               advanced  osd_op_num_threads_per_shard             2                *
osd                               dev       osd_pg_log_dups_tracked                  10
osd                               dev       osd_pg_log_trim_min                      10
osd                               advanced  osd_recovery_max_active                  3
osd                               advanced  osd_recovery_max_single_start            1
osd                               advanced  osd_recovery_sleep                       0.000000
On Wed, Oct 6, 2021 at 12:55 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
Jose,
In fact, 48 GB is way more than enough for a WAL drive - usually the
write-ahead log tends to be 2-4 GB.
But in your case it's ~150 GB, while the DB itself is very small (146 MB!!!):
WAL   45 GiB   111 GiB   0 B   0 B   0 B   154 GiB   2400
DB    0 B      164 MiB   0 B   0 B   0 B   146 MiB   30
which means that there are some issues with RocksDB's WAL processing,
which need some troubleshooting...
Curious whether other OSDs are suffering from the same, and whether you
have any custom settings for your OSD(s)?
Additionally, you might want to try the following command to compact this
specific OSD manually and check whether this normalizes the DB layout -
the majority of data has to be at the DB level, not the WAL. Please share
the resulting layout (reported by the "ceph daemon osd.2 bluefs stats"
command) after the compaction is complete and the OSD is restarted.
The compaction command, to be applied on an offline OSD: "ceph-kvstore-tool
bluestore-kv <path-to-osd> compact"
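For example, on a containerized (cephadm) deployment something like this should work - a sketch only; the data path inside the shell may differ on your hosts:

# stop the OSD so RocksDB is no longer in use
ceph orch daemon stop osd.2
# enter a shell wired to that daemon's config and data
cephadm shell --name osd.2
# inside the shell: compact the offline OSD, then leave and restart
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-2 compact
ceph orch daemon start osd.2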
Even if the above works great, please refrain from applying that
compaction to every OSD - let's see how that "compacted" OSD evolves.
Would the WAL grow again or not?
Thanks,
Igor
On 10/6/2021 1:35 PM, José H. Freidhof wrote:
Hello Igor,
yes, the NVMe WAL partitions for the BlueStore device groups are only
48 GB each:
on each OSD node there is 1 NVMe (1 TB) split into 20 LVs of 48 GB each (block.wal)
on each OSD node there are 4 SSDs (1 TB each) split into 5 LVs of 175 GB each (block.db)
on each OSD node there are 20 HDDs (5.5 TB each) with 1 LV each (block, the data device)
each BlueStore OSD has one partition on NVMe, SSD, and HDD, as described
in the documentation:
https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/
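For reference, such a layout could be carved out with LVM roughly as follows - just a sketch; /dev/nvme0n1 and /dev/sdb are placeholder device names:

# 20 x 48 GB WAL LVs on the 1 TB NVMe
vgcreate ceph-wal /dev/nvme0n1
for i in $(seq 1 20); do lvcreate -L 48G -n wal-$i ceph-wal; done
# 5 x 175 GB DB LVs on each 1 TB SSD
vgcreate ceph-db-0 /dev/sdb
for i in $(seq 1 5); do lvcreate -L 175G -n db-$i ceph-db-0; done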
Is this too small, or can I adjust the max allocation on the WAL NVMe
device in the Ceph configuration?
I know that the SSD and NVMe are too small for those 5.5 TB disks... it's
only about 1% of the rotational disk.
I am new to Ceph and still learning, but we are in a little hurry because
our other datastores are old and full.
root@cd88-ceph-osdh-01:/# ceph daemon osd.2 bluestore bluefs device info
{
    "dev": {
        "device": "BDEV_WAL",
        "total": 48318377984,
        "free": 1044480,
        "bluefs_used": 48317333504
    },
    "dev": {
        "device": "BDEV_DB",
        "total": 187904811008,
        "free": 68757217280,
        "bluefs_used": 119147593728
    },
    "dev": {
        "device": "BDEV_SLOW",
        "total": 6001172414464,
        "free": 5624912359424,
        "bluefs_used": 0,
        "bluefs max available": 5624401231872
    }
}
root@cd88-ceph-osdh-01:/# ceph daemon osd.2 bluefs stats
0 : device size 0xb3ffff000 : using 0xb3ff00000(45 GiB)
1 : device size 0x2bbfffe000 : using 0x1bbeb00000(111 GiB)
2 : device size 0x57541c00000 : using 0x579b592000(350 GiB)
RocksDBBlueFSVolumeSelector: wal_total:45902462976, db_total:178509578240, slow_total:5701113793740, db_avail:103884521472
Usage matrix:
DEV/LEV   WAL       DB        SLOW   *      *      REAL      FILES
LOG       124 MiB   2.3 GiB   0 B    0 B    0 B    7.5 MiB   1
WAL       45 GiB    111 GiB   0 B    0 B    0 B    154 GiB   2400
DB        0 B       164 MiB   0 B    0 B    0 B    146 MiB   30
SLOW      0 B       0 B       0 B    0 B    0 B    0 B       0
TOTALS    45 GiB    113 GiB   0 B    0 B    0 B    0 B       2431
MAXIMUMS:
LOG       124 MiB   2.3 GiB   0 B    0 B    0 B    17 MiB
WAL       45 GiB    149 GiB   0 B    0 B    0 B    192 GiB
DB        0 B       762 MiB   0 B    0 B    0 B    741 MiB
SLOW      0 B       0 B       0 B    0 B    0 B    0 B
TOTALS    45 GiB    150 GiB   0 B    0 B    0 B    0 B
On Wed, Oct 6, 2021 at 11:45 AM Igor Fedotov <ifedotov@xxxxxxx> wrote:
Hey Jose,
it looks like your WAL volume is out of space, which looks weird given
its capacity of 48 GB.
Could you please share the output of the following commands:
ceph daemon osd.N bluestore bluefs device info
ceph daemon osd.N bluefs stats
Thanks,
Igor
On 10/6/2021 12:24 PM, José H. Freidhof wrote:
Hello everyone,
we have a running Ceph Pacific 16.2.5 cluster, and I found these messages
in the service logs of the OSD daemons.
We have three OSD nodes... each node has 20 OSDs as BlueStore with
NVMe/SSD/HDD.
Is this a bug, or do I maybe have some settings wrong?
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:25.821+0000 7f38eebd4700 1 bluefs _allocate unable to allocate 0x100000 on bdev 0, allocator name bluefs-wal, allocator type hybrid, capacity 0xb40000000, block size 0x100000, free 0xff000, fragmentation 0, allocated 0x0
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:29.857+0000 7f38eebd4700 1 bluefs _allocate unable to allocate 0x100000 on bdev 0, allocator name bluefs-wal, allocator type hybrid, capacity 0xb40000000, block size 0x100000, free 0xff000, fragmentation 0, allocated 0x0
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.073+0000 7f38eebd4700 1 bluefs _allocate unable to allocate 0x400000 on bdev 0, allocator name bluefs-wal, allocator type hybrid, capacity 0xb40000000, block size 0x100000, free 0xff000, fragmentation 0, allocated 0x0
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.405+0000 7f38eebd4700 1 bluefs _allocate unable to allocate 0x100000 on bdev 0, allocator name bluefs-wal, allocator type hybrid, capacity 0xb40000000, block size 0x100000, free 0xff000, fragmentation 0, allocated 0x0
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.465+0000 7f38eebd4700 1 bluefs _allocate unable to allocate 0x100000 on bdev 0, allocator name bluefs-wal, allocator type hybrid, capacity 0xb40000000, block size 0x100000, free 0xff000, fragmentation 0, allocated 0x0
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.529+0000 7f38eebd4700 1 bluefs _allocate unable to allocate 0x100000 on bdev 0, allocator name bluefs-wal, allocator type hybrid, capacity 0xb40000000, block size 0x100000, free 0xff000, fragmentation 0, allocated 0x0
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.545+0000 7f38eebd4700 4 rocksdb: [db_impl/db_impl_write.cc:1668] [L] New memtable created with log file: #9588. Immutable memtables: 1.
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.545+0000 7f38eebd4700 1 bluefs _allocate unable to allocate 0x100000 on bdev 0, allocator name bluefs-wal, allocator type hybrid, capacity 0xb40000000, block size 0x100000, free 0xff000, fragmentation 0, allocated 0x0
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.545+0000 7f3905c02700 4 rocksdb: (Original Log Time 2021/10/06-09:17:30.547575) [db_impl/db_impl_compaction_flush.cc:2198] Calling FlushMemTableToOutputFile with column family [L], flush slots available 1, compaction slots available 1, flush slots scheduled 1, compaction slots scheduled 0
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.545+0000 7f3905c02700 4 rocksdb: [flush_job.cc:321] [L] [JOB 5709] Flushing memtable with next log file: 9587
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.545+0000 7f3905c02700 4 rocksdb: [flush_job.cc:321] [L] [JOB 5709] Flushing memtable with next log file: 9588
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.545+0000 7f3905c02700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1633511850547916, "job": 5709, "event": "flush_started", "num_memtables": 2, "num_entries": 4146, "num_deletes": 0, "total_data_size": 127203926, "memory_usage": 130479920, "flush_reason": "Write Buffer Full"}
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.545+0000 7f3905c02700 4 rocksdb: [flush_job.cc:350] [L] [JOB 5709] Level-0 flush table #9589: started
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.557+0000 7f3905c02700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1633511850559292, "cf_name": "L", "job": 5709, "event": "table_file_creation", "file_number": 9589, "file_size": 3249934, "table_properties": {"data_size": 3247855, "index_size": 1031, "index_partitions": 0, "top_level_index_size": 0, "index_key_is_user_key": 0, "index_value_is_delta_encoded": 0, "filter_size": 197, "raw_key_size": 1088, "raw_average_key_size": 16, "raw_value_size": 3246252, "raw_average_value_size": 47739, "num_data_blocks": 36, "num_entries": 68, "num_deletions": 32, "num_merge_operands": 0, "num_range_deletions": 0, "format_version": 0, "fixed_key_len": 0, "filter_policy": "rocksdb.BuiltinBloomFilter", "column_family_name": "L", "column_family_id": 10, "comparator": "leveldb.BytewiseComparator", "merge_operator": "nullptr", "prefix_extractor_name": "nullptr", "property_collectors": "[]", "compression": "NoCompression", "compression_options": "window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; ", "creation_time": 1633511730, "oldest_key_time": 1633511730, "file_creation_time": 1633511850}}
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.557+0000 7f3905c02700 4 rocksdb: [flush_job.cc:401] [L] [JOB 5709] Level-0 flush table #9589: 3249934 bytes OK
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.557+0000 7f3905c02700 4 rocksdb: (Original Log Time 2021/10/06-09:17:30.559362) [memtable_list.cc:447] [L] Level-0 commit table #9589 started
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.557+0000 7f3905c02700 4 rocksdb: (Original Log Time 2021/10/06-09:17:30.559583) [memtable_list.cc:503] [L] Level-0 commit table #9589: memtable #1 done
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.557+0000 7f3905c02700 4 rocksdb: (Original Log Time 2021/10/06-09:17:30.559586) [memtable_list.cc:503] [L] Level-0 commit table #9589: memtable #2 done
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.557+0000 7f3905c02700 4 rocksdb: (Original Log Time 2021/10/06-09:17:30.559601) EVENT_LOG_v1 {"time_micros": 1633511850559593, "job": 5709, "event": "flush_finished", "output_compression": "NoCompression", "lsm_state": [8, 1, 0, 0, 0, 0, 0], "immutable_memtables": 0}
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.557+0000 7f3905c02700 4 rocksdb: (Original Log Time 2021/10/06-09:17:30.559638) [db_impl/db_impl_compaction_flush.cc:205] [L] Level summary: files[8 1 0 0 0 0 0] max score 1.00
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.557+0000 7f38fb3ed700 4 rocksdb: [compaction/compaction_job.cc:1676] [L] [JOB 5710] Compacting 8@0 + 1@1 files to L1, score 1.00
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.557+0000 7f38fb3ed700 4 rocksdb: [compaction/compaction_job.cc:1680] [L] Compaction start summary: Base version 3090 Base level 0, inputs: [9589(3173KB) 9586(4793KB) 9583(1876KB) 9580(194KB) 9576(6417KB) 9573(1078KB) 9570(405KB) 9567(29KB)], [9564(1115KB)]
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.557+0000 7f38fb3ed700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1633511850559956, "job": 5710, "event": "compaction_started", "compaction_reason": "LevelL0FilesNum", "files_L0": [9589, 9586, 9583, 9580, 9576, 9573, 9570, 9567], "files_L1": [9564], "score": 1, "input_data_size": 19542092}
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.581+0000 7f38fb3ed700 4 rocksdb: [compaction/compaction_job.cc:1349] [L] [JOB 5710] Generated table #9590: 36 keys, 3249524 bytes
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.581+0000 7f38fb3ed700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1633511850582987, "cf_name": "L", "job": 5710, "event": "table_file_creation", "file_number": 9590, "file_size": 3249524, "table_properties": {"data_size": 3247449, "index_size": 1031, "index_partitions": 0, "top_level_index_size": 0, "index_key_is_user_key": 0, "index_value_is_delta_encoded": 0, "filter_size": 197, "raw_key_size": 576, "raw_average_key_size": 16, "raw_value_size": 3246252, "raw_average_value_size": 90173, "num_data_blocks": 36, "num_entries": 36, "num_deletions": 0, "num_merge_operands": 0, "num_range_deletions": 0, "format_version": 0, "fixed_key_len": 0, "filter_policy": "rocksdb.BuiltinBloomFilter", "column_family_name": "L", "column_family_id": 10, "comparator": "leveldb.BytewiseComparator", "merge_operator": "nullptr", "prefix_extractor_name": "nullptr", "property_collectors": "[]", "compression": "NoCompression", "compression_options": "window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; ", "creation_time": 1633471854, "oldest_key_time": 0, "file_creation_time": 1633511850}}
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.581+0000 7f38fb3ed700 4 rocksdb: [compaction/compaction_job.cc:1415] [L] [JOB 5710] Compacted 8@0 + 1@1 files to L1 => 3249524 bytes
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.581+0000 7f38fb3ed700 4 rocksdb: (Original Log Time 2021/10/06-09:17:30.583469) [compaction/compaction_job.cc:760] [L] compacted to: files[0 1 0 0 0 0 0] max score 0.01, MB/sec: 846.1 rd, 140.7 wr, level 1, files in(8, 1) out(1) MB in(17.5, 1.1) out(3.1), read-write-amplify(1.2) write-amplify(0.2) OK, records in: 376, records dropped: 340 output_compression: NoCompression
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.581+0000 7f38fb3ed700 4 rocksdb: (Original Log Time 2021/10/06-09:17:30.583498) EVENT_LOG_v1 {"time_micros": 1633511850583485, "job": 5710, "event": "compaction_finished", "compaction_time_micros": 23098, "compaction_time_cpu_micros": 20039, "output_level": 1, "num_output_files": 1, "total_output_size": 3249524, "num_input_records": 376, "num_output_records": 36, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 1, 0, 0, 0, 0, 0]}
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.581+0000 7f38fb3ed700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1633511850583615, "job": 5710, "event": "table_file_deletion", "file_number": 9589}
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.581+0000 7f38fb3ed700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1633511850583648, "job": 5710, "event": "table_file_deletion", "file_number": 9586}
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.581+0000 7f38fb3ed700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1633511850583675, "job": 5710, "event": "table_file_deletion", "file_number": 9583}
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.581+0000 7f38fb3ed700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1633511850583709, "job": 5710, "event": "table_file_deletion", "file_number": 9580}
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.581+0000 7f38fb3ed700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1633511850583739, "job": 5710, "event": "table_file_deletion", "file_number": 9576}
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.581+0000 7f38fb3ed700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1633511850583769, "job": 5710, "event": "table_file_deletion", "file_number": 9573}
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.581+0000 7f38fb3ed700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1633511850583804, "job": 5710, "event": "table_file_deletion", "file_number": 9570}
cd88-ceph-osdh-01 bash[6283]: debug 2021-10-06T09:17:30.581+0000 7f38fb3ed700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1633511850583835, "job": 5710, "event": "table_file_deletion", "file_number": 9567}
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
--
Kind regards,
-
José H. Freidhof
Reyerhütterstrasse 130b
41065 Mönchengladbach
eMail: harald.freidhof@xxxxxxxxx
mobile: +49 (0) 1523 – 717 7801