Hello,

The way Wido explained is the correct way, I won't deny that. However, last year we had problems with our SSD disks: they did not perform well, so we decided to replace all of them. Because the replacement done by Ceph caused high load and downtime on the clients (which was the reason we wanted to replace the disks in the first place), we did it the rsync way, and we did not encounter any problems with that.

It is very important to flush the journal before syncing and to correct the journal symlink before starting the new disk. Also make sure you disarm the old disk: since it has the same ID, you will run into a lot of problems if you re-enable that disk by accident.

So yes, it is possible, but it is very dangerous and not recommended.

Attached is the script we used to assist with the migration (we were on hammer back then). I'm not sure it is the latest version we have. It formats a disk with the ceph-disk prepare command, mounts it the 'ceph' way, and then prints a series of commands to execute manually.

And again, a big warning: use at your own risk.

regards,
mart

On 12/16/2016 09:46 PM, Brian :: wrote:
> Given that you are all SSD, I would do exactly what Wido said -
> gracefully remove the OSD and gracefully bring up the OSD on the new SSD.
>
> Let Ceph do what it's designed to do. The rsync idea looks great on
> paper - not sure what issues you will run into in practice.
>
> On Fri, Dec 16, 2016 at 12:38 PM, Alessandro Brega
> <alessandro.brega1@xxxxxxxxx> wrote:
>> 2016-12-16 10:19 GMT+01:00 Wido den Hollander <wido@xxxxxxxx>:
>>>
>>>> On 16 December 2016 at 9:49, Alessandro Brega
>>>> <alessandro.brega1@xxxxxxxxx> wrote:
>>>>
>>>> 2016-12-16 9:33 GMT+01:00 Wido den Hollander <wido@xxxxxxxx>:
>>>>
>>>>>> On 16 December 2016 at 9:26, Alessandro Brega
>>>>>> <alessandro.brega1@xxxxxxxxx> wrote:
>>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> I'm running a ceph cluster on the 0.94.9-1trusty release, on XFS, for
>>>>>> RBD only. I'd like to replace some SSDs because they are close to
>>>>>> their TBW.
>>>>>>
>>>>>> I know I can simply shut down the OSD, replace the SSD, restart the
>>>>>> OSD, and Ceph will take care of the rest. However, I don't want to do
>>>>>> it this way, because it leaves my cluster in a degraded state for the
>>>>>> duration of the rebalance/backfilling.
>>>>>>
>>>>>> I'm thinking about this process:
>>>>>> 1. keep the old OSD running
>>>>>> 2. copy all data from the current OSD folder to the new OSD folder
>>>>>>    (using rsync)
>>>>>> 3. shut down the old OSD
>>>>>> 4. redo step 2 to pick up the latest changes
>>>>>> 5. restart the OSD with the new folder
>>>>>>
>>>>>> Are there any issues with this approach? Do I need any special rsync
>>>>>> flags (rsync -avPHAX --delete-during)?
>>>>>>
>>>>> Indeed, X for transferring xattrs, but also make sure that the
>>>>> partitions are GPT with the proper GUIDs.
>>>>>
>>>>> I would never go for this approach in a running setup. Since it's an
>>>>> SSD cluster I wouldn't worry about the rebalance and would just have
>>>>> Ceph do the work for you.
>>>>>
>>>> Why not, if it's completely safe? It's much faster (local copy), doesn't
>>>> put load on the network (local copy), much safer (2-3 minutes of
>>>> degraded time instead of 1-2 hours for a 2TB SSD), and it's really
>>>> simple (2 rsync commands). Thank you.
>>>>
>>> I wouldn't say it is completely safe, hence my remark. If you copy, indeed
>>> make sure you copy all the xattrs, but also make sure the partition tables
>>> match.
>>>
>>> That way it should work, but it's not a 100% guarantee.
>>>
>> Ok, thanks! Can a ceph dev confirm? I do not want to lose any data ;)
>>
>> Alessandro
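A minimal sanity check along the lines Wido mentions above - that the new data partition carries the usual Ceph OSD GPT type GUID and that the xattrs actually survived the rsync - could look roughly like the sketch below. It is not part of the attached script; the file name, the sgdisk/getfattr parsing and the idea of comparing one sampled object file between the old and new OSD are assumptions, so adapt and test it before relying on it.

#!/usr/bin/env python
# check_new_osd.py -- hypothetical pre-cutover check, NOT part of the attached migration script.
import subprocess
import sys

# Well-known GPT type GUID that ceph-disk assigns to OSD data partitions.
CEPH_OSD_TYPE_GUID = "4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D"

def partition_type_guid(disk, partnum=1):
    # Parse 'sgdisk -i <n> <disk>' output and return the partition type GUID.
    out = subprocess.check_output(["sgdisk", "-i", str(partnum), disk])
    for line in out.splitlines():
        if line.startswith("Partition GUID code:"):
            return line.split(":", 1)[1].split()[0]
    return None

def xattr_dump(path):
    # Dump the user.* xattrs of one file so the old and new copy can be compared.
    out = subprocess.check_output(["getfattr", "-d", "--absolute-names", path])
    return out.split("\n", 1)[-1]   # drop the leading '# file: ...' line

if __name__ == "__main__":
    # usage: check_new_osd.py <new disk, e.g. /dev/sdo> <object file on old osd> <same file on new mount>
    disk, old_file, new_file = sys.argv[1], sys.argv[2], sys.argv[3]

    guid = partition_type_guid(disk)
    if guid != CEPH_OSD_TYPE_GUID:
        print "WARNING: %s1 has type GUID %s, not the ceph OSD data type" % (disk, guid)

    if xattr_dump(old_file) != xattr_dump(new_file):
        print "WARNING: xattrs differ between %s and %s (check rsync -X)" % (old_file, new_file)
    else:
        print "xattrs match for the sampled file"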
#!/usr/bin/env python

import argparse
import os
import stat
import sys
import re
from subprocess import call

#### WARNING ####
#### THIS IS A VERY DANGEROUS SCRIPT. NO GUARANTEES THIS WILL WORK FOR YOU ####

print "Please read and understand the script before executing"
sys.exit(1)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-d", "--destination", type=str, required=True,
                        help="destination disk")
    parser.add_argument("-s", "--source", type=str, required=True,
                        help="source osd number")
    parser.add_argument("--force", help="force migration", action="store_true")

    force = False
    parted = False

    args = parser.parse_args()
    if args.force:
        force = True

    osd = args.source
    disk_id = args.destination
    disk = '/dev/' + disk_id

    # First we are going to check that the provided disk is indeed a block
    # device and has an empty partition table.
    print 'Examining disk: %s' % (disk)

    # Does the device exist?
    if not os.path.exists(disk):
        print 'Abort: disk device not found'
        sys.exit(1)

    mode = os.stat(disk).st_mode
    if not stat.S_ISBLK(mode):
        print 'Abort: disk device is not a block device'
        sys.exit(1)

    if not re.match('^sd[a-z]{1,2}$', disk_id):
        print 'Abort: disk is not a full disk. Did you provide a partition or lvm device?'
        sys.exit(1)

    disk_check = disk + '1'
    if os.path.exists(disk_check):
        if force:
            parted = True
        else:
            print 'Abort: there are already partitions on this disk'
            print '       please zap the partition table if you'
            print '       want to use this disk'
            sys.exit(1)

    # Ok. Disk is fine.
    # Check the source osd.
    print 'Examining osd: %s' % (osd)

    if not re.match('^[0-9]+$', osd):
        print 'Abort: osd is not a numeric value'
        sys.exit(1)

    osd_path = '/var/lib/ceph/osd/ceph-' + osd
    if not os.path.isdir(osd_path):
        print 'Abort: path for osd not found'
        sys.exit(1)

    if not os.path.isfile(osd_path + '/whoami'):
        print 'Abort: whoami file not found for osd'
        sys.exit(1)

    # Ok. Looks fine.

    tmp_mount = '/mnt/ceph-' + osd
    if not os.path.exists(tmp_mount):
        os.mkdir(tmp_mount)
    if not os.path.isdir(tmp_mount):
        print 'Abort: failed to make tmp mountpoint: %s' % (tmp_mount)
        sys.exit(1)

    # Prepare the disk (suppress activation so the new OSD is not started automatically).
    call(['ceph-disk', 'suppress-activate', disk])
    if not parted:
        call(['ceph-disk', 'prepare', disk])

    # Mount the new data partition on the temporary mountpoint.
    print "Mounting disk %s at %s" % (disk, tmp_mount)
    part = disk + '1'
    call(['mount', '-o', 'rw,noatime,attr2,inode64,noquota', part, tmp_mount])
    print "OK"

    # Print the commands that still have to be executed manually.
    print " You should start rsync now"
    # store usage of this disk
    print "   df -h | grep ceph-%s > /opt/df-%s" % (osd, osd)
    # stop the running osd
    print "   stop ceph-osd id=%s" % (osd)
    # flush the journal
    print "   ceph-osd --flush-journal -i %s" % (osd)
    # sync, without overwriting the journal symlink & uuid
    print "   rsync -av -HAX --delete --exclude 'fsid' --exclude 'journal' --exclude 'journal_uuid' %s %s" % (osd_path + '/', tmp_mount + '/')
    # disarm the old disk
    print "   cd %s && mv whoami whoami.old" % (osd_path)
    print "   cd %s && mv active active.old" % (osd_path)
    # unmount the old & new disk
    print "   cd ~ && umount %s" % (osd_path)
    print "   umount %s" % (tmp_mount)
    # mount the new disk on the normal path
    print "   mount -o rw,noatime,attr2,inode64,noquota %s %s" % (part, osd_path)
    # recreate the journal
    print "   ceph-osd -i %s --mkjournal" % (osd)
    # start the osd again
    print "   start ceph-osd id=%s" % (osd)
    print " "

# 1910  ceph-disk suppress-activate /dev/sdo
# 1911  ceph-disk suppress-activate /dev/sdp
# 1912  ceph-disk prepare /dev/sdo
# 1913  ceph-disk prepare /dev/sdp
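(Hypothetical usage, assuming you save the attachment as migrate_osd.py and remove the safety exit near the top: run it as root with the source OSD number and the bare destination device name, e.g. "python migrate_osd.py -s 12 -d sdo". It then prepares /dev/sdo, mounts /dev/sdo1 on /mnt/ceph-12, and prints the stop/flush-journal/rsync/remount/mkjournal commands for you to run by hand.)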