My understanding of growing file systems is the same as yours: they can only
grow at the end, not the beginning. In addition to that, having partition 2
sit before partition 1 just cries out to me to be fixed, but that is purely
aesthetic.

Because the weights of the drives will be different, there will be some
additional data movement (probably minimized if you are using straw2).
Setting noout will prevent Ceph from shuffling data around while you are
making the changes. When you bring the OSD back in, it should receive only
the PGs that were on it before, which minimizes data movement in the
cluster. But because you are adding 800 GB, it will want to take a few more
PGs, so some shuffling in the cluster is inevitable.

I don't know how well it would work, but you could bring all the reformatted
OSDs in at their current weight and then, when you have them all re-done,
edit the crush map to set the weights right (roughly the commands sketched
below). Ideally the ratio would stay the same, so no (or very little) data
movement would occur. Due to an error in the straw algorithm, there is still
the potential for large amounts of data movement with small weight changes.

As to your question about adding the disk before the rebalance is completed,
it will be fine to do so. Ceph will complete the PGs that are currently
being relocated, but compute new locations based on the new disk. This may
result in a PG that just finished moving being relocated again. The cluster
will still perform and not lose data.

About keeping OSD IDs: I only know that if you don't have gaps in your OSD
numbering (from OSDs that were retired and not replaced), then removing an
OSD and recreating it will give it the same number, since the lowest
available ID is the one freed by the OSD being replaced. I don't know
whether saving off the files before wiping the OSD will preserve its
identity.
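For what it's worth, the weight juggling shouldn't require decompiling the
crush map by hand; something along these lines should do it (the OSD id and
the weight values here are only placeholders for your actual numbers):

    # rebuilt OSD comes back up; pin it to its old weight for now
    ceph osd crush reweight osd.12 0.27

    # later, once every OSD has been reformatted, set the real weights
    ceph osd crush reweight osd.12 1.09

You can check the resulting weights with "ceph osd tree".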
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1


On Wed, Sep 16, 2015 at 10:12 AM, John-Paul Robinson <jpr@xxxxxxx> wrote:
> Christian,
>
> Thanks for the feedback.
>
> I guess I'm wondering about step 4, "clobber partition, leaving data
> intact, and grow partition and the file system as needed".
>
> My understanding of xfs_growfs is that the free space must be at the end
> of the existing file system. In this case the existing partition starts
> around the 800GB mark on the disk and extends to the end of the disk. My
> goal is to add the first 800GB of the disk to that partition so it can
> become a single data partition.
>
> Note that my volumes are not LVM based, so I can't extend the volume by
> incorporating the free space at the start of the disk.
>
> Am I misunderstanding something about file system grow commands?
>
> Regarding your comments on the impact to the cluster of a downed OSD: I
> have lost OSDs and the impact is minimal (acceptable).
>
> My concern is around taking an OSD down, having the cluster initiate
> recovery, and then bringing that same OSD back into the cluster in an
> empty state. Are the placement groups that originally had data on this
> OSD already remapped by this point (even if they aren't fully recovered),
> so that bringing the empty replacement OSD on-line simply causes a
> different set of placement groups to be mapped onto it to achieve the
> rebalance?
>
> Thanks,
>
> ~jpr
>
> On 09/16/2015 08:37 AM, Christian Balzer wrote:
>> Hello,
>>
>> On Wed, 16 Sep 2015 07:21:26 -0500 John-Paul Robinson wrote:
>>
>>> The move journal, partition resize, grow file system approach would
>>> work nicely if the spare capacity were at the end of the disk.
>>>
>> That shouldn't matter, you can "safely" lose your journal in controlled
>> circumstances.
>>
>> This would also be an ideal time to put your journals on SSDs. ^o^
>>
>> Roughly (you do have a test cluster, don't you? Or at least try this
>> with just one OSD):
>>
>> 1. set noout just to be sure.
>> 2. stop the OSD
>> 3. "ceph-osd -i osdnum --flush-journal" for warm fuzzies (see man page
>> or --help)
>> 4. clobber your partitions in a way that leaves you with an intact data
>> partition, grow that and the FS in it as desired.
>> 5. re-init the journal with "ceph-osd -i osdnum --mkjournal"
>> 6. start the OSD and rejoice.
>>
>> More below.
>>
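
Putting Christian's steps together, the whole cycle for one OSD would look
roughly like this; the OSD id is a placeholder and the stop/start commands
depend on your distro and init system, so treat this as a sketch rather
than a recipe:

    ceph osd set noout                  # keep Ceph from marking the OSD out
    service ceph stop osd.12            # or: stop ceph-osd id=12 (upstart)
    ceph-osd -i 12 --flush-journal      # flush the journal to the data store
    # repartition the disk and grow the data filesystem here
    ceph-osd -i 12 --mkjournal          # re-initialize the journal
    service ceph start osd.12           # or: start ceph-osd id=12
    ceph osd unset noout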