Hello,

On Mon, 08 Sep 2014 09:53:58 -0700 JIten Shah wrote:

> 
> On Sep 6, 2014, at 8:22 PM, Christian Balzer <chibi at gol.com> wrote:
> 
> > 
> > Hello,
> > 
> > On Sat, 06 Sep 2014 10:28:19 -0700 JIten Shah wrote:
> > 
> >> Thanks Christian. Replies inline.
> >> On Sep 6, 2014, at 8:04 AM, Christian Balzer <chibi at gol.com> wrote:
> >> 
> >>> 
> >>> Hello,
> >>> 
> >>> On Fri, 05 Sep 2014 15:31:01 -0700 JIten Shah wrote:
> >>> 
> >>>> Hello Cephers,
> >>>> 
> >>>> We created a ceph cluster with 100 OSD, 5 MON and 1 MDS and most
> >>>> of the stuff seems to be working fine, but we are seeing some
> >>>> degradation on the OSDs due to lack of space on the OSDs.
> >>> 
> >>> Please elaborate on that degradation.
> >> 
> >> The degradation happened on a few OSDs because they got filled up
> >> quickly. They were not the same size as the other OSDs. Now I want
> >> to remove these OSDs and re-add them with the correct size to match
> >> the others.
> > 
> > Alright, that's a good idea, uniformity helps. ^^
> > 
> >>> 
> >>>> Is there a way to resize the
> >>>> OSD without bringing the cluster down?
> >>>> 
> >>> 
> >>> Define both "resize" and "cluster down".
> >> 
> >> Basically I want to remove the OSDs with the incorrect size and
> >> re-add them with a size matching the other OSDs.
> >>> 
> >>> As in, resizing how?
> >>> Are your current OSDs on disks/LVMs that are not fully used and
> >>> thus could be grown?
> >>> What is the size of your current OSDs?
> >> 
> >> The size of the current OSDs is 20GB and we do have more unused
> >> space on the disk, so we can make the LVM bigger and increase the
> >> size of the OSDs. I agree that we need to have all the disks the
> >> same size and I am working towards that. Thanks.
> >>> 
> > 
> > OK, so your OSDs are backed by LVM.
> > A curious choice, any particular reason to do so?
> 
> We already had LVMs carved out for some other project and were not
> using them, so we decided to put the OSDs on those LVMs.
> 
I see. ^^
You might want to do things quite a bit differently with your next
cluster, given the things you're learning from this one.

> > 
> > Either way, in theory you could grow things in place, obviously first
> > the LVM and then the underlying filesystem. Both ext4 and xfs support
> > online growing, so the OSD can keep running the whole time.
> > If you're unfamiliar with these things, play with them on a test
> > machine first.
> > 
> > Now for the next step we will really need to know how you deployed
> > ceph and the result of "ceph osd tree" (not all 100 OSDs are needed,
> > a sample of a "small" and "big" OSD is sufficient).
> 
> Fixed all the sizes so all of them weigh 1:
> 
> [jshah at pv11p04si-mzk001 ~]$ ceph osd tree
> # id    weight  type name                       up/down reweight
> -1      99      root default
> -2      1         host pv11p04si-mslave0005
> 0       1           osd.0                       up      1
> -3      1         host pv11p04si-mslave0006
> 1       1           osd.1                       up      1
> -4      1         host pv11p04si-mslave0007
> 2       1           osd.2                       up      1
> -5      1         host pv11p04si-mslave0008
> 3       1           osd.3                       up      1
> -6      1         host pv11p04si-mslave0009
> 4       1           osd.4                       up      1
> -7      1         host pv11p04si-mslave0010
> 5       1           osd.5                       up      1
> 

Alright then, your cluster already thinks all OSDs are the same size,
even if they're not.
So go ahead with what I wrote below: grow the LVs to the size of the
others, grow the filesystem, and you should be done.
No further activity needed, zero impact to the cluster.
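For example, a minimal sketch of the online grow, assuming xfs and the
usual OSD mountpoint -- the VG/LV name (cephvg/osd0) and the 40G target
size below are made-up placeholders, substitute your own:

# grow the logical volume backing the OSD to the common size
lvextend -L 40G /dev/cephvg/osd0

# grow the filesystem while the OSD stays mounted and running
xfs_growfs /var/lib/ceph/osd/ceph-0     # xfs takes the mountpoint
# resize2fs /dev/cephvg/osd0            # ext4 equivalent, also online

Afterwards "df -h" on the OSD host and "ceph -s" should confirm the new
size and that the cluster stayed healthy throughout.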
> > Depending on the results (it will probably have varying weights
> > depending on the size and a reweight value of 1 for all) you will
> > need to adjust the weight of the grown OSD in question accordingly
> > with "ceph osd crush reweight".
> > That step will incur data movement, so do it one OSD at a time.
> > 
> >>> The normal way of growing a cluster is to add more OSDs.
> >>> Preferably of the same size and same performance disks.
> >>> This will not only simplify things immensely but also make them a
> >>> lot more predictable.
> >>> This of course depends on your use case and usage patterns, but
> >>> often when running out of space you're also running out of other
> >>> resources like CPU, memory or IOPS of the disks involved. So adding
> >>> more instead of growing them is most likely the way forward.
> >>> 
> >>> If you were to replace actual disks with larger ones, take them
> >>> (the OSDs) out one at a time and re-add them. If you're using
> >>> ceph-deploy, it will use the disk size as the basic weight; if
> >>> you're doing things manually, make sure to specify that
> >>> size/weight accordingly.
> >>> Again, you do want to do this for all disks to keep things uniform.
> >>> 
> >>> If your cluster (pools really) is set to a replica size of at
> >>> least 2 (risky!) or 3 (as per the Firefly default), taking a
> >>> single OSD out would of course never bring the cluster down.
> >>> However taking an OSD out and/or adding a new one will cause data
> >>> movement that might impact your cluster's performance.
> >>> 
> >> 
> >> We have a current replica size of 2 with 100 OSDs. How many can I
> >> lose without affecting the performance? I understand the impact of
> >> data movement.
> >> 
> > 
> > Unless your LVMs are in turn living on a RAID, a replica of 2 with
> > 100 OSDs is begging Murphy for a double disk failure. I'm also
> > curious how many actual physical disks those OSDs live on and how
> > many physical hosts are in your cluster.
> 
> We have 1 physical disk on each host and 1 OSD per host. So we have
> 100 physical hosts for OSDs and 5 physical hosts for MON + MDS.
> 
That's a very large cluster for a very small dataset. ^o^
It also means that you can't lose more than one host (and/or disk) at a
time, too.
So you really will want to increase replication to 3, unless this is
for testing only and you don't care about the data. (See the P.S. at
the end of this mail for a sketch of the relevant commands.)

Christian

> > So again, you can't lose more than one OSD at a time w/o losing data.
> > 
> > The performance impact of losing a single OSD out of 100 should be
> > small, especially given the size of your OSDs. However w/o knowing
> > your actual cluster (hardware and otherwise) don't expect anybody
> > here to make accurate predictions.
> > 
> > Christian
> > 
> >> --Jiten
> >> 
> >>> Regards,
> >>> 
> >>> Christian
> >>> -- 
> >>> Christian Balzer        Network/Systems Engineer
> >>> chibi at gol.com          Global OnLine Japan/Fusion Communications
> >>> http://www.gol.com/
> >> 
> > 
> > -- 
> > Christian Balzer        Network/Systems Engineer
> > chibi at gol.com          Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> 

-- 
Christian Balzer        Network/Systems Engineer
chibi at gol.com          Global OnLine Japan/Fusion Communications
http://www.gol.com/
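P.S.: For reference, a rough sketch of the commands touched on above.
The OSD id, pool name and weight are placeholders -- adjust them to
your cluster, and let recovery settle between steps:

# bump the CRUSH weight of a single grown/replaced OSD, one at a time
ceph osd crush reweight osd.3 1.0

# raise replication from 2 to 3, per pool
ceph osd lspools
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2

# watch the resulting data movement until the cluster is HEALTH_OK again
ceph -w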