On 10/1/2018 5:03 PM, Sergey Malinin wrote:
Before I received your response, I had already
added 20GB to the OSD (by expanding the LV followed by
bluefs-bdev-expand) and ran "ceph-kvstore-tool
bluestore-kv <path> compact", however it still
needs more space.
Is that because I didn't update DB size with
set-label-key?
In mimic you need to run both the "bluefs-bdev-expand"
and "set-label-key" commands to commit bluefs volume
expansion.
Unfortunately the latter command doesn't handle the
"size" label properly. That's why you might need to
backport and rebuild with the commits mentioned below.
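For example, with a patched ceph-bluestore-tool the
missing step would look something like this (the
block.db path and the size value are placeholders for
your actual DB volume and its new size in bytes):
# ceph-bluestore-tool set-label-key --dev /var/lib/ceph/osd/ceph-1/block.db -k size -v <new-db-size-in-bytes>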
What exactly is the label-key that needs to be
updated? I couldn't find which one relates to
the DB:
# ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1
inferring bluefs devices from bluestore path
{
    "/var/lib/ceph/osd/ceph-1/block": {
        "osd_uuid": "f8f122ee-70a6-4c54-8eb0-9b42205b1ecc",
        "size": 471305551872,
        "btime": "2018-07-31 03:06:43.751243",
        "description": "main",
        "bluefs": "1",
        "ceph_fsid": "7d320499-5b3f-453e-831f-60d4db9a4533",
        "kv_backend": "rocksdb",
        "magic": "ceph osd volume v026",
        "mkfs_done": "yes",
        "osd_key": "XXX",
        "ready": "ready",
        "whoami": "1"
    }
}
The 'size' label, but your output is for the block
(aka slow) device only.
It should return labels for db/wal devices as well
(block.db and block.wal symlinks respectively). It
works for me in master; can't verify with mimic at
the moment though...
Here is output for master:
# bin/ceph-bluestore-tool show-label --path dev/osd0
inferring bluefs devices from bluestore path
{
    "dev/osd0/block": {
        "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75",
        "size": 21474836480,
        "btime": "2018-09-10 15:55:09.044039",
        "description": "main",
        "bluefs": "1",
        "ceph_fsid": "56eddc15-11b9-4e0b-9192-e391fbae551c",
        "kv_backend": "rocksdb",
        "magic": "ceph osd volume v026",
        "mkfs_done": "yes",
        "osd_key": "AQCsaZZbYTxXJBAAe3jJI4p6WbMjvA8CBBUJbA==",
        "ready": "ready",
        "whoami": "0"
    },
    "dev/osd0/block.wal": {
        "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75",
        "size": 1048576000,
        "btime": "2018-09-10 15:55:09.044985",
        "description": "bluefs wal"
    },
    "dev/osd0/block.db": {
        "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75",
        "size": 1048576000,
        "btime": "2018-09-10 15:55:09.044469",
        "description": "bluefs db"
    }
}
You can try the --dev option instead of --path, e.g.
ceph-bluestore-tool show-label --dev
<path-to-block.db>
On 1.10.2018, at 16:48, Igor Fedotov
<ifedotov@xxxxxxx> wrote:
This looks like a sort of deadlock: BlueFS
needs some additional space to replay the log
left after the crash, which happens during BlueFS
open. But such space (on the slow device, since
the DB is full) is gifted in the background by
the bluefs rebalance procedure, which only runs
after the open.
Hence the OSDs are stuck permanently crashing...
The only way to recover that I can suggest for now
is to expand the DB volumes. You can do that with
LVM tools if you have any spare space for that.
Once resized you'll need ceph-bluestore-tool to
make BlueFS aware of the expansion (the
bluefs-bdev-expand command) and finally to update
the DB volume size label with the set-label-key command.
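Roughly, the whole sequence might look like the
sketch below; the VG/LV names are hypothetical,
20G is just an example, and the OSD path is a
placeholder for yours:
# lvextend -L +20G /dev/<vg-name>/<db-lv-name>
# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-1
# ceph-bluestore-tool set-label-key --dev /var/lib/ceph/osd/ceph-1/block.db -k size -v <new-db-size-in-bytes>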
The set-label-key step is a bit tricky for mimic -
you might need to backport
https://github.com/ceph/ceph/pull/22085/commits/ffac450da5d6e09cf14b8363b35f21819b48f38b
and rebuild ceph-bluestore-tool. Alternatively
you can backport
https://github.com/ceph/ceph/pull/22085/commits/71c3b58da4e7ced3422bce2b1da0e3fa9331530b
so that bluefs expansion and the label update
occur in a single step.
I'll do these backports upstream, but it will
take some time for them to pass all the
procedures and get into an official mimic release.
I'll also file a ticket to fix the original issue.
Thanks,
Igor
On 10/1/2018 3:28 PM, Sergey Malinin wrote:
These are LVM bluestore NVMe SSDs created with
"ceph-volume --lvm prepare --bluestore
/dev/nvme0n1p3", i.e. without specifying wal/db
devices.
OSDs were created with
bluestore_min_alloc_size_ssd=4096, another
modified setting is
bluestore_cache_kv_max=1073741824
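For completeness, those were set in ceph.conf
before the OSDs were created, along these lines:
[osd]
bluestore_min_alloc_size_ssd = 4096
bluestore_cache_kv_max = 1073741824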
DB/block usage collected by the prometheus module
for the 3 failed OSDs and the 1 that survived:
ceph_bluefs_db_total_bytes{ceph_daemon="osd.0"} 65493008384.0
ceph_bluefs_db_total_bytes{ceph_daemon="osd.1"} 49013587968.0
ceph_bluefs_db_total_bytes{ceph_daemon="osd.2"} 76834406400.0  --> this one has survived
ceph_bluefs_db_total_bytes{ceph_daemon="osd.3"} 63726157824.0
ceph_bluefs_db_used_bytes{ceph_daemon="osd.0"} 65217232896.0
ceph_bluefs_db_used_bytes{ceph_daemon="osd.1"} 48944381952.0
ceph_bluefs_db_used_bytes{ceph_daemon="osd.2"} 68093476864.0
ceph_bluefs_db_used_bytes{ceph_daemon="osd.3"} 63632834560.0
ceph_osd_stat_bytes{ceph_daemon="osd.0"} 471305551872.0
ceph_osd_stat_bytes{ceph_daemon="osd.1"} 471305551872.0
ceph_osd_stat_bytes{ceph_daemon="osd.2"} 471305551872.0
ceph_osd_stat_bytes{ceph_daemon="osd.3"} 471305551872.0
ceph_osd_stat_bytes_used{ceph_daemon="osd.0"} 222328213504.0
ceph_osd_stat_bytes_used{ceph_daemon="osd.1"} 214472544256.0
ceph_osd_stat_bytes_used{ceph_daemon="osd.2"} 163603996672.0
ceph_osd_stat_bytes_used{ceph_daemon="osd.3"} 212806815744.0
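That is, osd.1's DB was 48944381952 / 49013587968
≈ 99.9% full at the last sample, while its main
device was only 214472544256 / 471305551872 ≈
45.5% used.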
The first crashed OSD was doing DB compaction; the
others crashed shortly after, during backfilling.
The workload was "ceph-data-scan scan_inodes"
filling the metadata pool located on these OSDs at
a rate close to 10k objects/second.
Here is a log excerpt from the first crash:
2018-10-01 03:27:12.762 7fbf16dd6700  0 bluestore(/var/lib/ceph/osd/ceph-1) _balance_bluefs_freespace no allocate on 0x80000000 min_alloc_size 0x1000
2018-10-01 03:27:12.886 7fbf1e5e5700  4 rocksdb: [/build/ceph-13.2.2/src/rocksdb/db/compaction_job.cc:1166] [default] [JOB 24] Generated table #89741: 106356 keys, 68110589 bytes
2018-10-01 03:27:12.886 7fbf1e5e5700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1538353632892744, "cf_name": "default", "job": 24, "event": "table_file_creation", "file_number": 89741, "file_size": 68110589, "table_properties": {"data_size": 67112903, "index_size": 579319, "filter_size": 417316, "raw_key_size": 6733561, "raw_average_key_size": 63, "raw_value_size": 60994583, "raw_average_value_size": 573, "num_data_blocks": 16336, "num_entries": 106356, "filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "14444", "kMergeOperands": "0"}}
2018-10-01 03:27:12.934 7fbf1e5e5700  4 rocksdb: [/build/ceph-13.2.2/src/rocksdb/db/compaction_job.cc:1166] [default] [JOB 24] Generated table #89742: 23214 keys, 16352315 bytes
2018-10-01 03:27:12.934 7fbf1e5e5700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1538353632938670, "cf_name": "default", "job": 24, "event": "table_file_creation", "file_number": 89742, "file_size": 16352315, "table_properties": {"data_size": 16116986, "index_size": 139894, "filter_size": 94386, "raw_key_size": 1470883, "raw_average_key_size": 63, "raw_value_size": 14775006, "raw_average_value_size": 636, "num_data_blocks": 3928, "num_entries": 23214, "filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "90", "kMergeOperands": "0"}}
2018-10-01 03:27:13.042 7fbf1e5e5700  1 bluefs _allocate failed to allocate 0x4100000 on bdev 1, free 0x1a00000; fallback to bdev 2
2018-10-01 03:27:13.042 7fbf1e5e5700 -1 bluefs _allocate failed to allocate 0x4100000 on bdev 2, dne
2018-10-01 03:27:13.042 7fbf1e5e5700 -1 bluefs _flush_range allocated: 0x0 offset: 0x0 length: 0x40ea9f1
2018-10-01 03:27:13.046 7fbf1e5e5700 -1 /build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7fbf1e5e5700 time 2018-10-01 03:27:13.048298
/build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: 1663: FAILED assert(0 == "bluefs enospc")
 ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7fbf2d4fe5c2]
 2: (()+0x26c787) [0x7fbf2d4fe787]
 3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1ab4) [0x5619325114b4]
 4: (BlueRocksWritableFile::Flush()+0x3d) [0x561932527c1d]
 5: (rocksdb::WritableFileWriter::Flush()+0x1b9) [0x56193271c399]
 6: (rocksdb::WritableFileWriter::Sync(bool)+0x3b) [0x56193271d42b]
 7: (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status const&, rocksdb::CompactionJob::SubcompactionState*, rocksdb::RangeDelAggregator*, CompactionIterationStats*, rocksdb::Slice const*)+0x3db) [0x56193276098b]
 8: (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x7d9) [0x561932763da9]
 9: (rocksdb::CompactionJob::Run()+0x314) [0x561932765504]
 10: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*)+0xc54) [0x5619325b5c44]
 11: (rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0x397) [0x5619325b8557]
 12: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x97) [0x5619325b8cd7]
 13: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x266) [0x5619327a5e36]
 14: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x47) [0x5619327a5fb7]
 15: (()+0xbe733) [0x7fbf2b500733]
 16: (()+0x76db) [0x7fbf2bbf86db]
 17: (clone()+0x3f) [0x7fbf2abbc88f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
On 1.10.2018, at 15:01, Igor Fedotov
<ifedotov@xxxxxxx> wrote:
Hi Sergey,
could you please provide more details on your
OSDs?
What are sizes for DB/block devices?
Do you have any modifications in BlueStore
config settings?
Can you share the stats you're referring to?
Thanks,
Igor
On 10/1/2018 12:29 PM, Sergey Malinin wrote:
Hello,
3 of 4 NVMe OSDs crashed at the same time on
assert(0 == "bluefs enospc") and no longer start.
Stats collected just before the crash show that
ceph_bluefs_db_used_bytes is at 100%.
Although the OSDs have over 50% free space, it
is not being reallocated for DB usage.
2018-10-01 12:18:06.744 7f1d6a04d240  1 bluefs _allocate failed to allocate 0x100000 on bdev 1, free 0x0; fallback to bdev 2
2018-10-01 12:18:06.744 7f1d6a04d240 -1 bluefs _allocate failed to allocate 0x100000 on bdev 2, dne
2018-10-01 12:18:06.744 7f1d6a04d240 -1 bluefs _flush_range allocated: 0x0 offset: 0x0 length: 0xa8700
2018-10-01 12:18:06.748 7f1d6a04d240 -1 /build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7f1d6a04d240 time 2018-10-01 12:18:06.746800
/build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: 1663: FAILED assert(0 == "bluefs enospc")
 ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f1d6146f5c2]
 2: (()+0x26c787) [0x7f1d6146f787]
 3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1ab4) [0x5586b22684b4]
 4: (BlueRocksWritableFile::Flush()+0x3d) [0x5586b227ec1d]
 5: (rocksdb::WritableFileWriter::Flush()+0x1b9) [0x5586b2473399]
 6: (rocksdb::WritableFileWriter::Sync(bool)+0x3b) [0x5586b247442b]
 7: (rocksdb::BuildTable(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, rocksdb::ImmutableCFOptions const&, rocksdb::MutableCFOptions const&, rocksdb::EnvOptions const&, rocksdb::TableCache*, rocksdb::InternalIterator*, std::unique_ptr<rocksdb::InternalIterator, std::default_delete<rocksdb::InternalIterator> >, rocksdb::FileMetaData*, rocksdb::InternalKeyComparator const&, std::vector<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> >, std::allocator<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> > > > const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned long, std::allocator<unsigned long> >, unsigned long, rocksdb::SnapshotChecker*, rocksdb::CompressionType, rocksdb::CompressionOptions const&, bool, rocksdb::InternalStats*, rocksdb::TableFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, int, unsigned long, unsigned long, rocksdb::Env::WriteLifeTimeHint)+0x1e24) [0x5586b249ef94]
 8: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, rocksdb::ColumnFamilyData*, rocksdb::MemTable*, rocksdb::VersionEdit*)+0xcb7) [0x5586b2321457]
 9: (rocksdb::DBImpl::RecoverLogFiles(std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long*, bool)+0x19de) [0x5586b232373e]
 10: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x5d4) [0x5586b23242f4]
 11: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool)+0x68b) [0x5586b232559b]
 12: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0x22) [0x5586b2326e72]
 13: (RocksDBStore::do_open(std::ostream&, bool, std::vector<KeyValueDB::ColumnFamily, std::allocator<KeyValueDB::ColumnFamily> > const*)+0x170c) [0x5586b220219c]
 14: (BlueStore::_open_db(bool, bool)+0xd8e) [0x5586b218ee1e]
 15: (BlueStore::_mount(bool, bool)+0x4b7) [0x5586b21bf807]
 16: (OSD::init()+0x295) [0x5586b1d673c5]
 17: (main()+0x268d) [0x5586b1c554ed]
 18: (__libc_start_main()+0xe7) [0x7f1d5ea2db97]
 19: (_start()+0x2a) [0x5586b1d1d7fa]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com