Hi Igor,

I just want to thank you for taking the time to help with this issue.

On 3/18/20 5:30 AM, Igor Fedotov wrote:
>>> Most probably you will need additional 30GB of free space per each OSD
>>> if going this way. So please let me know if you can afford this.
>> Well I had already increased 709's initial space from 106GB to 200GB and
>> now I gave it 10GB more but it still can not actually resize. Here is
>> the relevant information I think but the full logs is here[0]. I then
>> did it with 30G (now total of 240G) and it still failed[1]. I am out of
>> space without some additional hardware in this node though I have an
>> idea. If I knew what size it is (and what space it needs for recovery
>> this would be very helpful).
>
> There is no much sense in increasing main device for this specific OSD
> (and similarly failing one, i.e. OSDs mentioning RocksDB recovery in
> back trace) at this point.
>
> It's in the "deadlock" state I mentioned before. And hence expand is
> unable to proceed.
>
> I'm checking some workarounds to get out of this state at the moment.
> Still in progress though.
>
> What I meant before is that you need more available space if workaround
> would be the assignment of a new standalone DB volume. It's a
> questionable way so I'm trying other ways for now.

I had to go forward with this and migrate the DB to a separate volume,
as this was impacting production. Before doing it for real, I did some
testing by making a copy (dd'ing to a new LVM volume) of one of the OSDs
and running through the steps there.

Essential steps, for someone who might come across this in the future
(understanding this is a Nautilus cluster; you need to do this for each
affected OSD):

1) Create a new logical volume to hold the DB

lvcreate -L30G -n db-20-6 /dev/ceph-db-vol04

2) Migrate the DB off the main device with ceph-bluestore-tool

ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-715 \
    --dev-target /dev/ceph-db-vol04/db-20-6 \
    --devs-source /var/lib/ceph/osd/ceph-715/block

3) Make sure the block devices are owned by the ceph user

chown -h ceph:ceph /var/lib/ceph/osd/ceph-715/block.db
chown ceph:ceph /dev/ceph-db-vol04/db-20-6

4) Run a ceph-bluestore-tool repair

ceph-bluestore-tool --log-level 30 --path /var/lib/ceph/osd/ceph-715 --command repair

5) Test compaction

ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-715 compact

6) Start the OSD

systemctl start ceph-osd@715

The lesson is that BlueStore does not handle running out of capacity like
this well with the simple co-located RocksDB-and-data configuration. I
think that even for fast disks such as NVMe you should always create a
separate DB partition, because this deadlock scenario is very problematic
unless you have additional storage on hand (or can add some quickly).

It seems that we had about 2 million CephFS log segments that were behind
on trimming. I am not sure where these segments are kept, but I am
guessing in the MDS metadata pool, which seems to be what drove this
out-of-space issue: we are now down to about 17% used in the NVMe device
class, whereas we were at 100% during this incident.

Thanks,
derek

--
Derek T. Yarnell
Director of Computing Facilities
University of Maryland Institute for Advanced Computer Studies
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
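P.S. If you have more than a couple of OSDs in this state, the six steps
above loop easily. The following is only a sketch of what I did by hand,
not something I ran verbatim: the OSD IDs, the ceph-db-vol04 volume group
and the db-$osd LV naming are placeholders for our layout, the 30G size is
just what worked for us, and it assumes the affected OSDs are already down
(ours were, since they would not start).

#!/bin/bash
# Sketch only -- adjust the OSD IDs, VG name and LV naming to your layout.
# Assumes the listed OSDs are already stopped.
set -e
for osd in 709 715; do
    path=/var/lib/ceph/osd/ceph-$osd
    db_lv=/dev/ceph-db-vol04/db-$osd

    # 1) create a new LV to hold the DB
    lvcreate -L30G -n "db-$osd" /dev/ceph-db-vol04

    # 2) migrate the DB off the co-located main device
    ceph-bluestore-tool bluefs-bdev-migrate --path "$path" \
        --dev-target "$db_lv" --devs-source "$path/block"

    # 3) fix ownership of the new block.db symlink and the LV
    chown -h ceph:ceph "$path/block.db"
    chown ceph:ceph "$db_lv"

    # 4) repair, 5) compact, 6) bring the OSD back up
    ceph-bluestore-tool --log-level 30 --path "$path" --command repair
    ceph-kvstore-tool bluestore-kv "$path" compact
    systemctl start "ceph-osd@$osd"
done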