Hi Igor,

I just want to thank you for taking the time to help with this issue.

On 3/18/20 5:30 AM, Igor Fedotov wrote:
>>> Most probably you will need additional 30GB of free space per each OSD
>>> if going this way. So please let me know if you can afford this.
>> Well I had already increased 709's initial space from 106GB to 200GB and
>> now I gave it 10GB more but it still can not actually resize. Here is
>> the relevant information I think but the full logs is here[0]. I then
>> did it with 30G (now total of 240G) and it still failed[1]. I am out of
>> space without some additional hardware in this node though I have an
>> idea. If I knew what size it is (and what space it needs for recovery
>> this would be very helpful).
>
> There is no much sense in increasing main device for this specific OSD
> (and similarly failing one, i.e. OSDs mentioning RocksDB recovery in
> back trace) at this point.
>
> It's in the "deadlock" state I mentioned before. And hence expand is
> unable to proceed.
>
> I'm checking some workarounds to get out of this state at the moment.
> Still in progress though.
>
> What I meant before is that you need more available space if workaround
> would be the assignment of a new standalone DB volume. It's a
> questionable way so I'm trying other ways for now.

I had to go forward with this and migrate the DB to a separate volume,
as this was impacting production. Before doing it for real, I did some
testing by making a copy (dd'ing to a new LVM volume) of one of the OSDs
and running through the steps there.

Essential steps, for someone who might come across this in the future
(understanding this is a Nautilus cluster; you need to do this for each
affected OSD):

1) Create a new logical volume to hold the DB

lvcreate -L30G -n db-20-6 /dev/ceph-db-vol04

2) Migrate the DB off the main device with ceph-bluestore-tool

ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-715 \
    --dev-target /dev/ceph-db-vol04/db-20-6 \
    --devs-source /var/lib/ceph/osd/ceph-715/block

3) Make sure the block devices are owned by the ceph user

chown -h ceph:ceph /var/lib/ceph/osd/ceph-715/block.db
chown ceph:ceph /dev/ceph-db-vol04/db-20-6

4) Run a ceph-bluestore-tool repair

ceph-bluestore-tool --log-level 30 --path /var/lib/ceph/osd/ceph-715 --command repair

5) Test compaction

ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-715 compact

6) Start the OSD

systemctl start ceph-osd@715

The lesson is that BlueStore does not handle running out of capacity like
this well with the simple co-located RocksDB-and-data configuration. I
think that even for fast disks such as NVMe you should always create a
separate DB partition, because this deadlock scenario is very problematic
unless you have additional storage on hand (or can add some quickly).

It seems that we had about 2 million CephFS log segments that were behind
on trimming. I am not sure where these segments are kept, but I am
guessing in the MDS metadata pool, which seems to be what drove this
out-of-space issue: we are now down to about 17% used in the NVMe device
class, whereas we were at 100% during this incident.

Thanks,
derek

--
Derek T. Yarnell
Director of Computing Facilities
University of Maryland Institute for Advanced Computer Studies
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
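P.S. If you have more than a couple of OSDs in this state, the six steps
above loop easily. The following is only a sketch of what I did by hand,
not something I ran verbatim: the OSD IDs, the ceph-db-vol04 volume group
and the db-$osd LV naming are placeholders for our layout, the 30G size is
just what worked for us, and it assumes the affected OSDs are already down
(ours were, since they would not start).

#!/bin/bash
# Sketch only -- adjust the OSD IDs, VG name and LV naming to your layout.
# Assumes the listed OSDs are already stopped.
set -e
for osd in 709 715; do
    path=/var/lib/ceph/osd/ceph-$osd
    db_lv=/dev/ceph-db-vol04/db-$osd

    # 1) create a new LV to hold the DB
    lvcreate -L30G -n "db-$osd" /dev/ceph-db-vol04

    # 2) migrate the DB off the co-located main device
    ceph-bluestore-tool bluefs-bdev-migrate --path "$path" \
        --dev-target "$db_lv" --devs-source "$path/block"

    # 3) fix ownership of the new block.db symlink and the LV
    chown -h ceph:ceph "$path/block.db"
    chown ceph:ceph "$db_lv"

    # 4) repair, 5) compact, 6) bring the OSD back up
    ceph-bluestore-tool --log-level 30 --path "$path" --command repair
    ceph-kvstore-tool bluestore-kv "$path" compact
    systemctl start "ceph-osd@$osd"
done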