Hi,

this morning I woke up to a degraded test Ceph cluster (managed by Rook, but that does not really change anything for the question I'm about to ask). After checking the logs I found that BlueStore on one of the OSDs had run out of space.

Some cluster details:

  ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)

It runs on 3 small OSDs, 10GB each. `ceph osd df` reported RAW USE of about 4.5GB on every OSD, happily showing about 5.5GB of AVAIL. Yet:

debug -9> 2020-09-22T20:23:15.421+0000 7f29e9798f40 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1600806195424423, "job": 1, "event": "recovery_started", "log_files": [347, 350]}
debug -8> 2020-09-22T20:23:15.421+0000 7f29e9798f40 4 rocksdb: [db/db_impl_open.cc:583] Recovering log #347 mode 0
debug -7> 2020-09-22T20:23:16.465+0000 7f29e9798f40 4 rocksdb: [db/db_impl_open.cc:583] Recovering log #350 mode 0
debug -6> 2020-09-22T20:23:18.689+0000 7f29e9798f40 1 bluefs _allocate failed to allocate 0x17a2360 on bdev 1, free 0x390000; fallback to bdev 2
debug -5> 2020-09-22T20:23:18.689+0000 7f29e9798f40 1 bluefs _allocate unable to allocate 0x17a2360 on bdev 2, free 0xffffffffffffffff; fallback to slow device expander
debug -4> 2020-09-22T20:23:18.689+0000 7f29e9798f40 -1 bluestore(/var/lib/ceph/osd/ceph-0) allocate_bluefs_freespace failed to allocate on 0x39a20000 min_size 0x17b0000 > allocated total 0x6250000 bluefs_shared_alloc_size 0x10000 allocated 0x0 available 0x 12ee32000
debug -3> 2020-09-22T20:23:18.689+0000 7f29e9798f40 -1 bluefs _allocate failed to expand slow device to fit +0x17a2360
debug -2> 2020-09-22T20:23:18.689+0000 7f29e9798f40 -1 bluefs _flush_range allocated: 0x0 offset: 0x0 length: 0x17a2360
debug -1> 2020-09-22T20:23:18.693+0000 7f29e9798f40 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.4/rpm/el8/BUILD/ceph-15.2.4/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7f29e9798f40 time 2020-09-22T20:23:18.690014+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.4/rpm/el8/BUILD/ceph-15.2.4/src/os/bluestore/BlueFS.cc: 2696: ceph_abort_msg("bluefs enospc")

So, my question would be: how could I have prevented that? In the monitoring I have (Prometheus) the OSDs look healthy and have plenty of space, yet in reality they do not. What command (and Prometheus metric) would help me understand the actual BlueStore usage? Or am I missing something?

Oh, and I "fixed" the cluster by giving the broken osd.0 a larger 15GB volume; the other 2 OSDs still run on 10GB volumes.

Thanks in advance for any thoughts.

--
With best regards,
Ivan Kurnosov
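
P.S. For the archives, here is what I have been poking at since the crash. This is my best guess at the right counters, not authoritative, so please correct me if there is a better source. The OSD admin socket exposes BlueFS space usage through perf counters (the jq filter just picks out the bluefs section of the dump):

  ceph daemon osd.0 perf dump | jq .bluefs

  # Fields that look relevant on my 15.2.4:
  #   db_total_bytes / db_used_bytes     - space BlueFS has vs. uses for the DB
  #   slow_total_bytes / slow_used_bytes - spillover onto the slow device
  #   wal_total_bytes / wal_used_bytes   - separate WAL, if one exists

On a colocated OSD like mine these would presumably have shown the BlueFS DB allocation filling up long before `ceph osd df` AVAIL went anywhere near zero (the META column of `ceph osd df` also seems to track this metadata usage).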
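
P.P.S. On the Prometheus side, the mgr prometheus module appears to export the same counters per OSD. The metric names below are what I would expect from the counter names above, so double-check them against your /metrics endpoint before alerting on them:

  # alert well before BlueFS hits ENOSPC and aborts the OSD
  ceph_bluefs_db_used_bytes / ceph_bluefs_db_total_bytes > 0.9

  # any spillover to the slow device is an early warning on its own
  ceph_bluefs_slow_used_bytes > 0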