Re: Ceph Pacific bluefs enospc bug with newly created OSDs

Quincy brings support for a 4K allocation unit but doesn't start using it
immediately. Instead, it falls back to 4K when bluefs is unable to allocate
more space with the default unit size. And even this mode isn't permanent:
bluefs attempts to bring larger allocation units back from time to time.
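
If you want to double-check what hint bluefs is configured with (just a
sketch, assuming osd.5 and the central config database; the 4K fallback
itself is internal to bluefs, not a setting you toggle):

# show the configured bluefs allocation unit for this OSD; Quincy only
# drops below this value when an allocation at this size fails
ceph config get osd.5 bluefs_shared_alloc_size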


Thanks,

Igor

On 22/06/2023 00:04, Fox, Kevin M wrote:
Does Quincy automatically switch existing things to 4k, or do you need to create a new OSD to get the 4k size?

Thanks,
Kevin

________________________________________
From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: Wednesday, June 21, 2023 5:56 AM
To: Carsten Grommel; ceph-users@xxxxxxx
Subject:  Re: Ceph Pacific bluefs enospc bug with newly created OSDs

Hi Carsten,

please also note a workaround to bring the OSDs back for e.g. data
recovery - set bluefs_shared_alloc_size to 32768.
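
A minimal sketch of that (assuming osd.5 is the affected OSD and you use
the central config database; adapt to your deployment):

# pin the smaller bluefs allocation unit for the affected OSD only,
# then try to start the daemon again
ceph config set osd.5 bluefs_shared_alloc_size 32768
systemctl start ceph-osd@5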

This will hopefully allow the OSD to start up so you can pull data out of it.
But I would discourage you from using such OSDs long term, as fragmentation
might evolve and this workaround will become ineffective as well.
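
If you want to keep an eye on how bad fragmentation already is on such an
OSD, something like this should work over the admin socket (a sketch,
assuming osd.5 is running; the score is roughly 0..1, higher meaning more
fragmented):

# query the block allocator's fragmentation score via the admin socket
ceph daemon osd.5 bluestore allocator score block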

Please do not apply this change to healthy OSDs as it's irreversible.


BTW, having two namespaces on an NVMe drive is a good alternative to Logical
Volumes if for some reason one needs two "physical" disks for an OSD setup...
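
Roughly like this with nvme-cli, just as an illustration (assuming the drive
supports namespace management; /dev/nvme0, controller id 0 and the block
counts are placeholders that must be worked out for the actual drive):

# create two namespaces of half the drive each (sizes in 4K blocks) and
# attach them; nsids 1 and 2 are assumed to be assigned in order
nvme create-ns /dev/nvme0 --nsze=468843606 --ncap=468843606 --flbas=0
nvme attach-ns /dev/nvme0 --namespace-id=1 --controllers=0
nvme create-ns /dev/nvme0 --nsze=468843606 --ncap=468843606 --flbas=0
nvme attach-ns /dev/nvme0 --namespace-id=2 --controllers=0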

Thanks,

Igor

On 21/06/2023 11:41, Carsten Grommel wrote:
Hi Igor,

thank you for your answer!

first of all Quincy does have a fix for the issue, see
https://tracker.ceph.com/issues/53466 (and its Quincy counterpart
https://tracker.ceph.com/issues/58588)
Thank you, I somehow missed that release, good to know!

SSD or HDD? Standalone or shared DB volume? I presume the latter... What
is disk size and current utilization?

Please share ceph-bluestore-tool's bluefs-bdev-sizes command output if
possible
We use 4 TB NVMe SSDs, shared DB yes, mainly Micron with some Dell
and Samsung in this cluster:

Micron_7400_MTFDKCB3T8TDZ_214733D291B1 cloud5-1561:nvme5n1  osd.5

All disks are at ~88% utilization. I noticed that at around 92% our
disks tend to run into this bug.

Here are some bluefs-bdev-sizes from different OSDs on different hosts
in this cluster:

ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-36/

inferring bluefs devices from bluestore path

1 : device size 0x37e3ec00000 : using 0x2e1b3900000(2.9 TiB)

ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-24/

inferring bluefs devices from bluestore path

1 : device size 0x37e3ec00000 : using 0x2d4e318d000(2.8 TiB)

ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-5/

inferring bluefs devices from bluestore path

1 : device size 0x37e3ec00000 : using 0x2f2da93d000(2.9 TiB)

Generally, given my assumption that DB volume is currently collocated
and you still want to stay on Pacific, you might want to consider
redeploying OSDs with a standalone DB volume setup.

Just create a large enough additional LV on top of the same physical disk
(2x the current DB size seems to be a pretty conservative estimate for that
volume's size) and put the DB there...

Separating the DB from the main disk would result in much less fragmentation
at the DB volume and hence work around the problem. The cost would be having
some extra spare space at the DB volume unavailable for user data.
I guess that makes sense, so the suggestion would be to deploy the
OSD and DB on the same NVMe but with different logical volumes, or
to update to Quincy.

Thank you!

Carsten

*From: *Igor Fedotov <igor.fedotov@xxxxxxxx>
*Date: *Tuesday, 20 June 2023 at 12:48
*To: *Carsten Grommel <c.grommel@xxxxxxxxxxxx>, ceph-users@xxxxxxx
<ceph-users@xxxxxxx>
*Subject: *Re:  Ceph Pacific bluefs enospc bug with newly
created OSDs

Hi Carsten,

first of all Quincy does have a fix for the issue, see
https://tracker.ceph.com/issues/53466 (and its Quincy counterpart
https://tracker.ceph.com/issues/58588)

Could you please share a bit more info on OSD disk layout?

SSD or HDD? Standalone or shared DB volume? I presume the latter... What
is disk size and current utilization?

Please share ceph-bluestore-tool's bluefs-bdev-sizes command output if
possible


Generally, given my assumption that DB volume is currently collocated
and you still want to stay on Pacific, you might want to consider
redeploying OSDs with a standalone DB volume setup.

Just create a large enough additional LV on top of the same physical disk
(2x the current DB size seems to be a pretty conservative estimate for that
volume's size) and put the DB there...

Separating the DB from the main disk would result in much less fragmentation
at the DB volume and hence work around the problem. The cost would be having
some extra spare space at the DB volume unavailable for user data.
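
A minimal sketch of such a redeploy (the VG/LV names "cephvg", "data-osd5"
and "db-osd5" and the 120G DB size are purely illustrative; size the DB LV
from your actual DB usage):

# carve a dedicated DB LV out of the same VG that holds the data LV
lvcreate -L 120G -n db-osd5 cephvg
# recreate the OSD with data and DB on separate logical volumes
ceph-volume lvm create --osd-id 5 --data cephvg/data-osd5 --block.db cephvg/db-osd5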


Hope this helps,

Igor


On 20/06/2023 10:29, Carsten Grommel wrote:
Hi all,

we are experiencing the “bluefs enospc bug” again after redeploying
all OSDs of our Pacific cluster.
I know that our cluster is a bit too utilized at the moment with
87.26 % raw usage, but this still should not happen afaik.
We never had this problem with previous Ceph versions, and right now
I am kind of out of ideas on how to tackle these crashes.
Compacting the database did not help in the past either, and redeploying
seems to be no help in the long run. For documentation,
I used these commands to redeploy the OSDs:
systemctl stop ceph-osd@${OSDNUM}
ceph osd destroy --yes-i-really-mean-it ${OSDNUM}
blkdiscard ${DEVICE}
sgdisk -Z ${DEVICE}
dmsetup remove ${DMDEVICE}
ceph-volume lvm create --osd-id ${OSDNUM} --data ${DEVICE}

Any ideas or possible solutions for this? I am not yet ready to
upgrade our clusters to Quincy; also, I presume that this bug is
still present in Quincy as well?
Follow our cluster information:

Crash Info:
ceph crash info
2023-06-19T21:23:51.285180Z_ac4105d7-cb09-45c8-a6e3-8a6bb6727b25
{
      "assert_condition": "abort",
      "assert_file": "/build/ceph/src/os/bluestore/BlueFS.cc",
      "assert_func": "int BlueFS::_flush_range(BlueFS::FileWriter*,
uint64_t, uint64_t)",
      "assert_line": 2810,
      "assert_msg": "/build/ceph/src/os/bluestore/BlueFS.cc: In
function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t,
uint64_t)' thread 7fd561810100 time
2023-06-19T23:23:51.261617+0200\n/build/ceph/src/os/bluestore/BlueFS.cc:
2810: ceph_abort_msg(\"bluefs enospc\")\n",
      "assert_thread_name": "ceph-osd",
      "backtrace": [
"/lib/x86_64-linux-gnu/libpthread.so.0(+0x12730) [0x7fd56225f730]",
          "gsignal()",
          "abort()",
          "(ceph::__ceph_abort(char const*, int, char const*,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&)+0x1a7) [0x557bb3c65762]",
"(BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned
long)+0x1175) [0x557bb42e7945]",
          "(BlueFS::_flush(BlueFS::FileWriter*, bool, bool*)+0xa1)
[0x557bb42e7ad1]",
          "(BlueFS::_flush(BlueFS::FileWriter*, bool,
std::unique_lock<std::mutex>&)+0x2e) [0x557bb42f803e]",
"(BlueRocksWritableFile::Append(rocksdb::Slice const&)+0x11b)
[0x557bb431134b]",
"(rocksdb::LegacyWritableFileWrapper::Append(rocksdb::Slice const&,
rocksdb::IOOptions const&, rocksdb::IODebugContext*)+0x44)
[0x557bb478e602]",
"(rocksdb::WritableFileWriter::WriteBuffered(char const*, unsigned
long)+0x333) [0x557bb4956feb]",
"(rocksdb::WritableFileWriter::Append(rocksdb::Slice const&)+0x5d1)
[0x557bb4955569]",
"(rocksdb::BlockBasedTableBuilder::WriteRawBlock(rocksdb::Slice
const&, rocksdb::CompressionType, rocksdb::BlockHandle*, bool)+0x11d)
[0x557bb4b142e1]",
"(rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::Slice const&,
rocksdb::BlockHandle*, bool)+0x7d6) [0x557bb4b140ca]",
"(rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::BlockBuilder*,
rocksdb::BlockHandle*, bool)+0x48) [0x557bb4b138e0]",
"(rocksdb::BlockBasedTableBuilder::Flush()+0x9a) [0x557bb4b13890]",
"(rocksdb::BlockBasedTableBuilder::Add(rocksdb::Slice const&,
rocksdb::Slice const&)+0x192) [0x557bb4b133c8]",
"(rocksdb::BuildTable(std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*,
rocksdb::FileSystem*, rocksdb::ImmutableCFOptions const&,
rocksdb::MutableCFOptions const&, rocksdb::FileOptions const&,
rocksdb::TableCache*, rocksdb::InternalIteratorBase<rocksdb::Slice>*,
std::vector<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator,
std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> >,
std::allocator<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator,
std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> > > >,
rocksdb::FileMetaData*, rocksdb::InternalKeyComparator const&,
std::vector<std::unique_ptr<rocksdb::IntTblPropCollectorFactory,
std::default_delete<rocksdb::IntTblPropCollectorFactory> >,
std::allocator<std::unique_ptr<rocksdb::IntTblPropCollectorFactory,
std::default_delete<rocksdb::IntTblPropCollectorFactory> > > > const*,
unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&, std::vector<unsigned long,
std::allocator<unsigned long> >, unsigned long,
rocksdb::SnapshotChecker*, rocksdb::CompressionType, unsigned long,
rocksdb::CompressionOptions const&, bool, rocksdb::InternalStats*,
rocksdb::TableFileCreationReason, rocksdb::EventLogger*, int,
rocksdb::Env::IOPriority, rocksdb::TableProperties*, int, unsigned
long, unsigned long, rocksdb::Env::WriteLifeTimeHint, unsigned
long)+0x773) [0x557bb4a9aa7d]",
"(rocksdb::DBImpl::WriteLevel0TableForRecovery(int,
rocksdb::ColumnFamilyData*, rocksdb::MemTable*,
rocksdb::VersionEdit*)+0x5de) [0x557bb4824676]",
"(rocksdb::DBImpl::RecoverLogFiles(std::vector<unsigned long,
std::allocator<unsigned long> > const&, unsigned long*, bool,
bool*)+0x1aa0) [0x557bb48232d0]",
"(rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor,
std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool,
bool, unsigned long*)+0x158a) [0x557bb4820846]",
"(rocksdb::DBImpl::Open(rocksdb::DBOptions const&,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&,
std::vector<rocksdb::ColumnFamilyDescriptor,
std::allocator<rocksdb::ColumnFamilyDescriptor> > const&,
std::vector<rocksdb::ColumnFamilyHandle*,
std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool,
bool)+0x679) [0x557bb4825b25]",
          "(rocksdb::DB::Open(rocksdb::DBOptions const&,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&,
std::vector<rocksdb::ColumnFamilyDescriptor,
std::allocator<rocksdb::ColumnFamilyDescriptor> > const&,
std::vector<rocksdb::ColumnFamilyHandle*,
std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0x52)
[0x557bb4824efa]",
"(RocksDBStore::do_open(std::ostream&, bool, bool,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&)+0xdaf) [0x557bb473b85f]",
          "(BlueStore::_open_db(bool, bool, bool)+0x44b)
[0x557bb41ec20b]",
          "(BlueStore::_open_db_and_around(bool, bool)+0x2ef)
[0x557bb425288f]",
          "(BlueStore::_mount()+0x9c) [0x557bb42551ec]",
          "(OSD::init()+0x38a) [0x557bb3d568da]",
          "main()",
          "__libc_start_main()",
          "_start()"
      ],
      "ceph_version": "16.2.11",
      "crash_id":
"2023-06-19T21:23:51.285180Z_ac4105d7-cb09-45c8-a6e3-8a6bb6727b25",
      "entity_name": "osd.39",
      "os_id": "10",
      "os_name": "Debian GNU/Linux 10 (buster)",
      "os_version": "10 (buster)",
      "os_version_id": "10",
      "process_name": "ceph-osd",
      "stack_sig":
"23f90145bebe39074210d4a79260e8977aec6b1c4d963740d1a04c3ddd4756a4",
      "timestamp": "2023-06-19T21:23:51.285180Z",
      "utsname_hostname": "cloud5-1567",
      "utsname_machine": "x86_64",
      "utsname_release": "5.10.144+1-ph",
      "utsname_sysname": "Linux",
      "utsname_version": "#1 SMP Mon Sep 26 07:02:56 UTC 2022"
}

Utilization:
ceph df
--- RAW STORAGE ---
CLASS    SIZE     AVAIL   USED     RAW USED  %RAW USED
ssd      168 TiB  21 TiB  146 TiB   146 TiB      87.26
TOTAL    168 TiB  21 TiB  146 TiB   146 TiB      87.26

--- POOLS ---
POOL                       ID   PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics       1     1  4.7 MiB       48   14 MiB      0    2.1 TiB
cephstor5                   2  2048   52 TiB   14.27M  146 TiB  95.89    2.1 TiB
cephfs_cephstor5_data       3    32   95 MiB  118.52k  1.4 GiB   0.02    2.1 TiB
cephfs_cephstor5_metadata   4    16  352 MiB      166  1.0 GiB   0.02    2.1 TiB
Versions:
ceph versions
{
      "mon": {
          "ceph version 16.2.11
(3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 3
      },
      "mgr": {
          "ceph version 16.2.11
(3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 3
      },
      "osd": {
          "ceph version 16.2.11
(3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 48
      },
      "mds": {
          "ceph version 16.2.11
(3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 3
      },
      "overall": {
          "ceph version 16.2.11
(3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 57
      }
}



Kind regards
Carsten Grommel

-------------------------------
Profihost GmbH
Expo Plaza 1
30539 Hannover
Deutschland

Tel.: +49 (511) 5151 8181 | Fax.: +49 (511) 5151 8282
URL: http://www.profihost.com/ | E-Mail: info@xxxxxxxxxxxxx
Registered office: Hannover, VAT ID DE249338561
Commercial register: Amtsgericht Hannover, registration no. HRB 222926
Managing directors: Marc Zocher, Dr. Claus Boyens, Daniel Hagemeier
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



