Hello,

On Mon, 08 Sep 2014 09:53:58 -0700 JIten Shah wrote:

> 
> On Sep 6, 2014, at 8:22 PM, Christian Balzer <chibi at gol.com> wrote:
> 
> > 
> > Hello,
> > 
> > On Sat, 06 Sep 2014 10:28:19 -0700 JIten Shah wrote:
> > 
> >> Thanks Christian. Replies inline.
> >> On Sep 6, 2014, at 8:04 AM, Christian Balzer <chibi at gol.com> wrote:
> >> 
> >>> 
> >>> Hello,
> >>> 
> >>> On Fri, 05 Sep 2014 15:31:01 -0700 JIten Shah wrote:
> >>> 
> >>>> Hello Cephers,
> >>>> 
> >>>> We created a ceph cluster with 100 OSD, 5 MON and 1 MDS and most
> >>>> of the stuff seems to be working fine, but we are seeing some
> >>>> degradation on the OSDs due to lack of space on the OSDs.
> >>> 
> >>> Please elaborate on that degradation.
> >> 
> >> The degradation happened on a few OSDs because they got filled up
> >> quickly. They were not the same size as the other OSDs. Now I want
> >> to remove these OSDs and re-add them with the correct size to match
> >> the others.
> > 
> > Alright, that's a good idea, uniformity helps. ^^
> > 
> >>> 
> >>>> Is there a way to resize the
> >>>> OSD without bringing the cluster down?
> >>>> 
> >>> 
> >>> Define both "resize" and "cluster down".
> >> 
> >> Basically I want to remove the OSDs with the incorrect size and
> >> re-add them with a size matching the other OSDs.
> >>> 
> >>> As in, resizing how?
> >>> Are your current OSDs on disks/LVMs that are not fully used and
> >>> thus could be grown?
> >>> What is the size of your current OSDs?
> >> 
> >> The size of the current OSDs is 20GB and we do have more unused
> >> space on the disk, so we can make the LVM bigger and increase the
> >> size of the OSDs. I agree that we need to have all the disks the
> >> same size and I am working towards that. Thanks.
> >>> 
> > 
> > OK, so your OSDs are backed by LVM.
> > A curious choice, any particular reason to do so?
> 
> We already had LVMs carved out for some other project and were not
> using them, so we decided to put the OSDs on those LVMs.
> 
I see. ^^
You might want to do things quite a bit differently with your next
cluster, given the things you're learning from this one.

> > 
> > Either way, in theory you could grow things in place, obviously first
> > the LVM and then the underlying filesystem. Both ext4 and xfs support
> > online growing, so the OSD can keep running the whole time.
> > If you're unfamiliar with these things, play with them on a test
> > machine first.
> > 
> > Now for the next step we will really need to know how you deployed
> > ceph and the result of "ceph osd tree" (not all 100 OSDs are needed,
> > a sample of a "small" and "big" OSD is sufficient).
> 
> Fixed all the sizes so all of them weigh 1:
> 
> [jshah at pv11p04si-mzk001 ~]$ ceph osd tree
> # id    weight  type name                       up/down reweight
> -1      99      root default
> -2      1         host pv11p04si-mslave0005
> 0       1           osd.0                       up      1
> -3      1         host pv11p04si-mslave0006
> 1       1           osd.1                       up      1
> -4      1         host pv11p04si-mslave0007
> 2       1           osd.2                       up      1
> -5      1         host pv11p04si-mslave0008
> 3       1           osd.3                       up      1
> -6      1         host pv11p04si-mslave0009
> 4       1           osd.4                       up      1
> -7      1         host pv11p04si-mslave0010
> 5       1           osd.5                       up      1
> 

Alright then, your cluster already thinks all OSDs are the same size,
even if they're not.
So go ahead with what I wrote below: grow the LVs to the size of the
others, grow the filesystem, and you should be done.
No further activity needed, zero impact to the cluster.
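For example, a minimal sketch of the online grow, assuming xfs and the
usual OSD mountpoint -- the VG/LV name (cephvg/osd0) and the 40G target
size below are made-up placeholders, substitute your own:

# grow the logical volume backing the OSD to the common size
lvextend -L 40G /dev/cephvg/osd0

# grow the filesystem while the OSD stays mounted and running
xfs_growfs /var/lib/ceph/osd/ceph-0     # xfs takes the mountpoint
# resize2fs /dev/cephvg/osd0            # ext4 equivalent, also online

Afterwards "df -h" on the OSD host and "ceph -s" should confirm the new
size and that the cluster stayed healthy throughout.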
> > Depending on the results (it will probably have varying weights
> > depending on the size and a reweight value of 1 for all) you will
> > need to adjust the weight of the grown OSD in question accordingly
> > with "ceph osd crush reweight".
> > That step will incur data movement, so do it one OSD at a time.
> > 
> >>> The normal way of growing a cluster is to add more OSDs.
> >>> Preferably of the same size and same performance disks.
> >>> This will not only simplify things immensely but also make them a
> >>> lot more predictable.
> >>> This of course depends on your use case and usage patterns, but
> >>> often when running out of space you're also running out of other
> >>> resources like CPU, memory or IOPS of the disks involved. So adding
> >>> more instead of growing them is most likely the way forward.
> >>> 
> >>> If you were to replace actual disks with larger ones, take them
> >>> (the OSDs) out one at a time and re-add them. If you're using
> >>> ceph-deploy, it will use the disk size as the basic weight; if
> >>> you're doing things manually, make sure to specify that
> >>> size/weight accordingly.
> >>> Again, you do want to do this for all disks to keep things uniform.
> >>> 
> >>> If your cluster (pools really) is set to a replica size of at
> >>> least 2 (risky!) or 3 (as per the Firefly default), taking a
> >>> single OSD out would of course never bring the cluster down.
> >>> However taking an OSD out and/or adding a new one will cause data
> >>> movement that might impact your cluster's performance.
> >>> 
> >> 
> >> We have a current replica size of 2 with 100 OSDs. How many can I
> >> lose without affecting the performance? I understand the impact of
> >> data movement.
> >> 
> > 
> > Unless your LVMs are in turn living on a RAID, a replica of 2 with
> > 100 OSDs is begging Murphy for a double disk failure. I'm also
> > curious how many actual physical disks those OSDs live on and how
> > many physical hosts are in your cluster.
> 
> We have 1 physical disk on each host and 1 OSD per host. So we have
> 100 physical hosts for OSDs and 5 physical hosts for MON + MDS.
> 
That's a very large cluster for a very small dataset. ^o^
It also means that you can't lose more than one host (and/or disk) at a
time, too.
So you really will want to increase replication to 3, unless this is
for testing only and you don't care about the data. (See the P.S. at
the end of this mail for a sketch of the relevant commands.)

Christian

> > So again, you can't lose more than one OSD at a time w/o losing data.
> > 
> > The performance impact of losing a single OSD out of 100 should be
> > small, especially given the size of your OSDs. However w/o knowing
> > your actual cluster (hardware and otherwise) don't expect anybody
> > here to make accurate predictions.
> > 
> > Christian
> > 
> >> --Jiten
> >> 
> >>> Regards,
> >>> 
> >>> Christian
> >>> -- 
> >>> Christian Balzer        Network/Systems Engineer
> >>> chibi at gol.com          Global OnLine Japan/Fusion Communications
> >>> http://www.gol.com/
> >> 
> > 
> > -- 
> > Christian Balzer        Network/Systems Engineer
> > chibi at gol.com          Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> 

-- 
Christian Balzer        Network/Systems Engineer
chibi at gol.com          Global OnLine Japan/Fusion Communications
http://www.gol.com/
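P.S.: For reference, a rough sketch of the commands touched on above.
The OSD id, pool name and weight are placeholders -- adjust them to
your cluster, and let recovery settle between steps:

# bump the CRUSH weight of a single grown/replaced OSD, one at a time
ceph osd crush reweight osd.3 1.0

# raise replication from 2 to 3, per pool
ceph osd lspools
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2

# watch the resulting data movement until the cluster is HEALTH_OK again
ceph -w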