Re: mimic: 3/4 OSDs crashed on "bluefs enospc"

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



I have rebuilt the tool, but none of my OSDs no matter dead or alive have any symlinks other than 'block' pointing to LVM.
I adjusted main device size but it looks like it needs even more space for db compaction. After executing bluefs-bdev-expand OSD fails to start, however 'fsck' and 'repair' commands finished successfully.

2018-10-01 18:02:39.755 7fc9226c6240  1 freelist init
2018-10-01 18:02:39.763 7fc9226c6240  1 bluestore(/var/lib/ceph/osd/ceph-1) _open_alloc opening allocation metadata
2018-10-01 18:02:40.907 7fc9226c6240  1 bluestore(/var/lib/ceph/osd/ceph-1) _open_alloc loaded 285 GiB in 2249899 extents
2018-10-01 18:02:40.951 7fc9226c6240 -1 bluestore(/var/lib/ceph/osd/ceph-1) _reconcile_bluefs_freespace bluefs extra 0x[6d6f000000~50c800000]
2018-10-01 18:02:40.951 7fc9226c6240  1 stupidalloc 0x0x55d053fb9180 shutdown
2018-10-01 18:02:40.963 7fc9226c6240  1 freelist shutdown
2018-10-01 18:02:40.963 7fc9226c6240  4 rocksdb: [/build/ceph-13.2.2/src/rocksdb/db/db_impl.cc:252] Shutdown: canceling all background work
2018-10-01 18:02:40.967 7fc9226c6240  4 rocksdb: [/build/ceph-13.2.2/src/rocksdb/db/db_impl.cc:397] Shutdown complete
2018-10-01 18:02:40.971 7fc9226c6240  1 bluefs umount
2018-10-01 18:02:40.975 7fc9226c6240  1 stupidalloc 0x0x55d053883800 shutdown
2018-10-01 18:02:40.975 7fc9226c6240  1 bdev(0x55d053c32e00 /var/lib/ceph/osd/ceph-1/block) close
2018-10-01 18:02:41.267 7fc9226c6240  1 bdev(0x55d053c32a80 /var/lib/ceph/osd/ceph-1/block) close
2018-10-01 18:02:41.443 7fc9226c6240 -1 osd.1 0 OSD:init: unable to mount object store
2018-10-01 18:02:41.443 7fc9226c6240 -1  ** ERROR: osd init failed: (5) Input/output error


> On 1.10.2018, at 18:09, Igor Fedotov <ifedotov@xxxxxxx> wrote:
> 
> Well, actually you can avoid bluestore-tool rebuild.
> 
> You'll need to edit the first chunk of blocks.db where labels are stored. (Please make a backup first!!!)
> 
> Size label is stored at offset 0x52 and is 8 bytes long - little-endian 64bit integer encoding. (Please verify that old value at this offset exactly corresponds to you original volume size and/or 'size' label reported by ceph-bluestore-tool).
> 
> So you have to put new DB volume size there. Or you can send the first 4K chunk (e.g. extracted with dd) along with new DB volume size (in bytes) to me and I'll do that for you.
> 
> 
> Thanks,
> 
> Igor
> 
> 
> On 10/1/2018 5:32 PM, Igor Fedotov wrote:
>> 
>> 
>> On 10/1/2018 5:03 PM, Sergey Malinin wrote:
>>> Before I received your response, I had already added 20GB to the OSD (by epanding LV followed by bluefs-bdev-expand) and ran "ceph-kvstore-tool bluestore-kv <path> compact", however it still needs more space.
>>> Is that because I didn't update DB size with set-label-key?
>> In mimic you need to run both "bluefs-bdev-expand" and "set-label-key" command to commit bluefs volume expansion.
>> Unfortunately the last command doesn't handle "size" label properly. That's why you might need to backport and rebuild with the mentioned commits.
>> 
>>> What exactly is the label-key that needs to be updated, as I couldn't find which one is related to DB:
>>> 
>>> # ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1
>>> inferring bluefs devices from bluestore path
>>> {
>>>      "/var/lib/ceph/osd/ceph-1/block": {
>>>          "osd_uuid": "f8f122ee-70a6-4c54-8eb0-9b42205b1ecc",
>>>          "size": 471305551872,
>>>          "btime": "2018-07-31 03:06:43.751243",
>>>          "description": "main",
>>>          "bluefs": "1",
>>>          "ceph_fsid": "7d320499-5b3f-453e-831f-60d4db9a4533",
>>>          "kv_backend": "rocksdb",
>>>          "magic": "ceph osd volume v026",
>>>          "mkfs_done": "yes",
>>>          "osd_key": "XXX",
>>>          "ready": "ready",
>>>          "whoami": "1"
>>>      }
>>> }
>> 'size' label but your output is for block(aka slow) device.
>> 
>> It should return labels for db/wal devices as well (block.db and block.wal symlinks respectively). It works for me in master, can't verify with mimic at the moment though..
>> Here is output for master:
>> 
>> # bin/ceph-bluestore-tool show-label --path dev/osd0
>> inferring bluefs devices from bluestore path
>> {
>>     "dev/osd0/block": {
>>         "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75",
>>         "size": 21474836480,
>>         "btime": "2018-09-10 15:55:09.044039",
>>         "description": "main",
>>         "bluefs": "1",
>>         "ceph_fsid": "56eddc15-11b9-4e0b-9192-e391fbae551c",
>>         "kv_backend": "rocksdb",
>>         "magic": "ceph osd volume v026",
>>         "mkfs_done": "yes",
>>         "osd_key": "AQCsaZZbYTxXJBAAe3jJI4p6WbMjvA8CBBUJbA==",
>>         "ready": "ready",
>>         "whoami": "0"
>>     },
>>     "dev/osd0/block.wal": {
>>         "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75",
>>         "size": 1048576000,
>>         "btime": "2018-09-10 15:55:09.044985",
>>         "description": "bluefs wal"
>>     },
>>     "dev/osd0/block.db": {
>>         "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75",
>>         "size": 1048576000,
>>         "btime": "2018-09-10 15:55:09.044469",
>>         "description": "bluefs db"
>>     }
>> }
>> 
>> 
>> You can try --dev option instead of --path, e.g.
>> ceph-bluestore-tool show-label --dev <path-to-block.db>
>> 
>> 
>>> 
>>>> On 1.10.2018, at 16:48, Igor Fedotov <ifedotov@xxxxxxx> wrote:
>>>> 
>>>> This looks like a sort of deadlock when BlueFS needs some additional space to replay the log left after the crash. Which happens during BlueFS open.
>>>> 
>>>> But such a space (at slow device as DB is full) is gifted in background during bluefs rebalance procedure which will occur after the open.
>>>> 
>>>> Hence OSDs stuck in permanent crashing..
>>>> 
>>>> The only way to recover I can suggest for now is to expand DB volumes. You can do that with lvm tools if you have any spare space for that.
>>>> 
>>>> Once resized you'll need ceph-bluestore-tool to indicate volume expansion to BlueFS (bluefs-bdev-expand command ) and finally update DB volume size label with  set-label-key command.
>>>> 
>>>> The latter is a bit tricky for mimic - you might need to backport https://github.com/ceph/ceph/pull/22085/commits/ffac450da5d6e09cf14b8363b35f21819b48f38b
>>>> 
>>>> and rebuild ceph-bluestore-tool. Alternatively you can backport https://github.com/ceph/ceph/pull/22085/commits/71c3b58da4e7ced3422bce2b1da0e3fa9331530b
>>>> 
>>>> then bluefs expansion and label updates will occur in a single step.
>>>> 
>>>> I'll do these backports in upstream but this will take some time to pass all the procedures and get into official mimic release.
>>>> 
>>>> Will fire a ticket to fix the original issue as well.
>>>> 
>>>> 
>>>> Thanks,
>>>> 
>>>> Igor
>>>> 
>>>> 
>>>> On 10/1/2018 3:28 PM, Sergey Malinin wrote:
>>>>> These are LVM bluestore NVMe SSDs created with "ceph-volume --lvm prepare --bluestore /dev/nvme0n1p3" i.e. without specifying wal/db devices.
>>>>> OSDs were created with bluestore_min_alloc_size_ssd=4096, another modified setting is bluestore_cache_kv_max=1073741824
>>>>> 
>>>>> DB/block usage collected by prometheus module for 3 failed and 1 survived OSDs:
>>>>> 
>>>>> ceph_bluefs_db_total_bytes{ceph_daemon="osd.0"} 65493008384.0
>>>>> ceph_bluefs_db_total_bytes{ceph_daemon="osd.1"} 49013587968.0
>>>>> ceph_bluefs_db_total_bytes{ceph_daemon="osd.2"} 76834406400.0 --> this one has survived
>>>>> ceph_bluefs_db_total_bytes{ceph_daemon="osd.3"} 63726157824.0
>>>>> 
>>>>> ceph_bluefs_db_used_bytes{ceph_daemon="osd.0"} 65217232896.0
>>>>> ceph_bluefs_db_used_bytes{ceph_daemon="osd.1"} 48944381952.0
>>>>> ceph_bluefs_db_used_bytes{ceph_daemon="osd.2"} 68093476864.0
>>>>> ceph_bluefs_db_used_bytes{ceph_daemon="osd.3"} 63632834560.0
>>>>> 
>>>>> ceph_osd_stat_bytes{ceph_daemon="osd.0"} 471305551872.0
>>>>> ceph_osd_stat_bytes{ceph_daemon="osd.1"} 471305551872.0
>>>>> ceph_osd_stat_bytes{ceph_daemon="osd.2"} 471305551872.0
>>>>> ceph_osd_stat_bytes{ceph_daemon="osd.3"} 471305551872.0
>>>>> 
>>>>> ceph_osd_stat_bytes_used{ceph_daemon="osd.0"} 222328213504.0
>>>>> ceph_osd_stat_bytes_used{ceph_daemon="osd.1"} 214472544256.0
>>>>> ceph_osd_stat_bytes_used{ceph_daemon="osd.2"} 163603996672.0
>>>>> ceph_osd_stat_bytes_used{ceph_daemon="osd.3"} 212806815744.0
>>>>> 
>>>>> 
>>>>> First crashed OSD was doing DB compaction, others crashed shortly after during backfilling. Workload was "ceph-data-scan scan_inodes" filling metadata pool located on these OSDs at the rate close to 10k objects/second.
>>>>> Here is the log excerpt of the first crash occurrence:
>>>>> 
>>>>> 2018-10-01 03:27:12.762 7fbf16dd6700  0 bluestore(/var/lib/ceph/osd/ceph-1) _balance_bluefs_freespace no allocate on 0x80000000 min_alloc_size 0x1000
>>>>> 2018-10-01 03:27:12.886 7fbf1e5e5700  4 rocksdb: [/build/ceph-13.2.2/src/rocksdb/db/compaction_job.cc:1166] [default] [JOB 24] Generated table #89741: 106356 keys, 68110589 bytes
>>>>> 2018-10-01 03:27:12.886 7fbf1e5e5700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1538353632892744, "cf_name": "default", "job": 24, "event": "table_file_creation", "file_number": 89741, "file_size": 68110589, "table_properties": {"data_size": 67112903, "index_size": 579319, "filter_size": 417316, "raw_key_size": 6733561, "raw_average_key_size": 63, "raw_value_size": 60994583, "raw_average_value_size": 573, "num_data_blocks": 16336, "num_entries": 106356, "filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "14444", "kMergeOperands": "0"}}
>>>>> 2018-10-01 03:27:12.934 7fbf1e5e5700  4 rocksdb: [/build/ceph-13.2.2/src/rocksdb/db/compaction_job.cc:1166] [default] [JOB 24] Generated table #89742: 23214 keys, 16352315 bytes
>>>>> 2018-10-01 03:27:12.934 7fbf1e5e5700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1538353632938670, "cf_name": "default", "job": 24, "event": "table_file_creation", "file_number": 89742, "file_size": 16352315, "table_properties": {"data_size": 16116986, "index_size": 139894, "filter_size": 94386, "raw_key_size": 1470883, "raw_average_key_size": 63, "raw_value_size": 14775006, "raw_average_value_size": 636, "num_data_blocks": 3928, "num_entries": 23214, "filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "90", "kMergeOperands": "0"}}
>>>>> 2018-10-01 03:27:13.042 7fbf1e5e5700  1 bluefs _allocate failed to allocate 0x4100000 on bdev 1, free 0x1a00000; fallback to bdev 2
>>>>> 2018-10-01 03:27:13.042 7fbf1e5e5700 -1 bluefs _allocate failed to allocate 0x4100000 on bdev 2, dne
>>>>> 2018-10-01 03:27:13.042 7fbf1e5e5700 -1 bluefs _flush_range allocated: 0x0 offset: 0x0 length: 0x40ea9f1
>>>>> 2018-10-01 03:27:13.046 7fbf1e5e5700 -1 /build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7fbf1e5e5700 time 2018-10-01 03:27:13.048298
>>>>> /build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: 1663: FAILED assert(0 == "bluefs enospc")
>>>>> 
>>>>>   ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
>>>>>   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7fbf2d4fe5c2]
>>>>>   2: (()+0x26c787) [0x7fbf2d4fe787]
>>>>>   3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1ab4) [0x5619325114b4]
>>>>>   4: (BlueRocksWritableFile::Flush()+0x3d) [0x561932527c1d]
>>>>>   5: (rocksdb::WritableFileWriter::Flush()+0x1b9) [0x56193271c399]
>>>>>   6: (rocksdb::WritableFileWriter::Sync(bool)+0x3b) [0x56193271d42b]
>>>>>   7: (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status const&, rocksdb::CompactionJob::SubcompactionState*, rocksdb::RangeDelAggregator*, CompactionIterationStats*, rocksdb::Slice const*)+0x3db) [0x56193276098b]
>>>>>   8: (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x7d9) [0x561932763da9]
>>>>>   9: (rocksdb::CompactionJob::Run()+0x314) [0x561932765504]
>>>>>   10: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*)+0xc54) [0x5619325b5c44]
>>>>>   11: (rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0x397) [0x5619325b8557]
>>>>>   12: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x97) [0x5619325b8cd7]
>>>>>   13: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x266) [0x5619327a5e36]
>>>>>   14: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x47) [0x5619327a5fb7]
>>>>>   15: (()+0xbe733) [0x7fbf2b500733]
>>>>>   16: (()+0x76db) [0x7fbf2bbf86db]
>>>>>   17: (clone()+0x3f) [0x7fbf2abbc88f]
>>>>>   NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>> 
>>>>> 
>>>>>> On 1.10.2018, at 15:01, Igor Fedotov <ifedotov@xxxxxxx> wrote:
>>>>>> 
>>>>>> Hi Sergey,
>>>>>> 
>>>>>> could you please provide more details on your OSDs ?
>>>>>> 
>>>>>> What are sizes for DB/block devices?
>>>>>> 
>>>>>> Do you have any modifications in BlueStore config settings?
>>>>>> 
>>>>>> Can you share stats you're referring to?
>>>>>> 
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Igor
>>>>>> 
>>>>>> 
>>>>>> On 10/1/2018 12:29 PM, Sergey Malinin wrote:
>>>>>>> Hello,
>>>>>>> 3 of 4 NVME OSDs crashed at the same time on assert(0 == "bluefs enospc") and no longer start.
>>>>>>> Stats collected just before crash show that ceph_bluefs_db_used_bytes is 100% used. Although OSDs have over 50% of free space, it is not reallocated for DB usage.
>>>>>>> 
>>>>>>> 2018-10-01 12:18:06.744 7f1d6a04d240  1 bluefs _allocate failed to allocate 0x100000 on bdev 1, free 0x0; fallback to bdev 2
>>>>>>> 2018-10-01 12:18:06.744 7f1d6a04d240 -1 bluefs _allocate failed to allocate 0x100000 on bdev 2, dne
>>>>>>> 2018-10-01 12:18:06.744 7f1d6a04d240 -1 bluefs _flush_range allocated: 0x0 offset: 0x0 length: 0xa8700
>>>>>>> 2018-10-01 12:18:06.748 7f1d6a04d240 -1 /build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7f1d6a04d240 time 2018-10-01 12:18:06.746800
>>>>>>> /build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: 1663: FAILED assert(0 == "bluefs enospc")
>>>>>>> 
>>>>>>>   ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
>>>>>>>   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f1d6146f5c2]
>>>>>>>   2: (()+0x26c787) [0x7f1d6146f787]
>>>>>>>   3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1ab4) [0x5586b22684b4]
>>>>>>>   4: (BlueRocksWritableFile::Flush()+0x3d) [0x5586b227ec1d]
>>>>>>>   5: (rocksdb::WritableFileWriter::Flush()+0x1b9) [0x5586b2473399]
>>>>>>>   6: (rocksdb::WritableFileWriter::Sync(bool)+0x3b) [0x5586b247442b]
>>>>>>>   7: (rocksdb::BuildTable(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, rocksdb::ImmutableCFOptions const&, rocksdb::MutableCFOptions const&, rocksdb::EnvOptions const&, rock
>>>>>>> sdb::TableCache*, rocksdb::InternalIterator*, std::unique_ptr<rocksdb::InternalIterator, std::default_delete<rocksdb::InternalIterator> >, rocksdb::FileMetaData*, rocksdb::InternalKeyComparator const&, std::vector<std::unique_ptr<
>>>>>>> rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> >, std::allocator<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> > > > co
>>>>>>> nst*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned long, std::allocator<unsigned long> >, unsigned long, rocksdb::SnapshotChecker*, rocksdb::Compression
>>>>>>> Type, rocksdb::CompressionOptions const&, bool, rocksdb::InternalStats*, rocksdb::TableFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, int, unsigned long, unsigned long, rocksdb
>>>>>>> ::Env::WriteLifeTimeHint)+0x1e24) [0x5586b249ef94]
>>>>>>>   8: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, rocksdb::ColumnFamilyData*, rocksdb::MemTable*, rocksdb::VersionEdit*)+0xcb7) [0x5586b2321457]
>>>>>>>   9: (rocksdb::DBImpl::RecoverLogFiles(std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long*, bool)+0x19de) [0x5586b232373e]
>>>>>>>   10: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x5d4) [0x5586b23242f4]
>>>>>>>   11: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescri
>>>>>>> ptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool)+0x68b) [0x5586b232559b]
>>>>>>>   12: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor
>>>>>>>>> const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0x22) [0x5586b2326e72]
>>>>>>>   13: (RocksDBStore::do_open(std::ostream&, bool, std::vector<KeyValueDB::ColumnFamily, std::allocator<KeyValueDB::ColumnFamily> > const*)+0x170c) [0x5586b220219c]
>>>>>>>   14: (BlueStore::_open_db(bool, bool)+0xd8e) [0x5586b218ee1e]
>>>>>>>   15: (BlueStore::_mount(bool, bool)+0x4b7) [0x5586b21bf807]
>>>>>>>   16: (OSD::init()+0x295) [0x5586b1d673c5]
>>>>>>>   17: (main()+0x268d) [0x5586b1c554ed]
>>>>>>>   18: (__libc_start_main()+0xe7) [0x7f1d5ea2db97]
>>>>>>>   19: (_start()+0x2a) [0x5586b1d1d7fa]
>>>>>>>   NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>>>> 
>>>>>>> _______________________________________________
>>>>>>> ceph-users mailing list
>>>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Ceph Dev]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux