Re: question on reusing OSD

Hello,

On Wed, 16 Sep 2015 07:21:26 -0500 John-Paul Robinson wrote:

> The move-journal, resize-partition, grow-filesystem approach would
> work nicely if the spare capacity were at the end of the disk.
>
That shouldn't matter; you can "safely" lose your journal under controlled
circumstances.

This would also be an ideal time to put your journals on SSDs. ^o^

Roughly (you do have a test cluster, don't you? Or at least try this with
just one OSD first); a sketch of the whole sequence follows the list:

1. set noout just to be sure.
2. stop the OSD
3. "ceph-osd -i osdnum --flush-journal" for warm fuzzies (see man page or
--help)
4. clobber your partitions in a way that leaves you with an intact data
partition, grow that and the FS in it as desired.
5. re-init the journal with "ceph-osd -i osdnum --mkjournal"
6. start the OSD and rejoice. 
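
Strung together it might look something like the following. A minimal sketch
only, not a drop-in script: the OSD number, the init invocation and the mount
point are assumptions for illustration, and I'm assuming XFS as the data
filesystem.

    ID=12                                   # hypothetical OSD number
    ceph osd set noout                      # 1. keep CRUSH from marking it out
    service ceph stop osd.$ID               # 2. stop the OSD (sysvinit syntax,
                                            #    adjust for your init system)
    ceph-osd -i $ID --flush-journal         # 3. drain the journal for warm fuzzies
    # 4. repartition so the data partition stays intact and grows, then grow
    #    the filesystem in it, e.g. for a mounted XFS:
    xfs_growfs /var/lib/ceph/osd/ceph-$ID
    ceph-osd -i $ID --mkjournal             # 5. re-init the journal
    service ceph start osd.$ID              # 6. start the OSD and rejoice
    ceph osd unset noout                    # once it is back up and in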
 
More below.

> Unfortunately, the gdisk (0.8.1) end-of-disk location bug caused the
> journal to be placed at the 800GB mark, leaving the largest remaining
> partition at the end of the disk.  I'm assuming the gdisk bug was caused
> by overflowing a 32-bit int during the "1000M before end of disk" offset
> calculation.  When it computed the end of the disk for journal placement
> on disks >2TB, it dropped everything above 2TiB (2^32 sectors) and was
> left with only the 800GB value, so it put the journal there.  After gdisk
> created the journal at the 800GB mark (splitting the disk),
> ceph-disk-prepare told gdisk to take the largest remaining partition for
> data, using the 2TB partition at the end.
> 
> Here's an example of the buggy partitioning:
> 
>     crowbar@da0-36-9f-0e-28-2c:~$ sudo gdisk -l /dev/sdd
>     GPT fdisk (gdisk) version 0.8.8
> 
>     Partition table scan:
>       MBR: protective
>       BSD: not present
>       APM: not present
>       GPT: present
> 
>     Found valid GPT with protective MBR; using GPT.
>     Disk /dev/sdd: 5859442688 sectors, 2.7 TiB
>     Logical sector size: 512 bytes
>     Disk identifier (GUID): 6F76BD12-05D6-4FA2-A132-CAC3E1C26C81
>     Partition table holds up to 128 entries
>     First usable sector is 34, last usable sector is 5859442654
>     Partitions will be aligned on 2048-sector boundaries
>     Total free space is 1562425343 sectors (745.0 GiB)
> 
>     Number  Start (sector)    End (sector)  Size       Code  Name
>        1      1564475392      5859442654   2.0 TiB     FFFF  ceph data
>        2      1562425344      1564475358   1001.0 MiB  FFFF  ceph journal
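
(Aside: the numbers in that listing are consistent with a 32-bit truncation;
the last usable sector modulo 2^32 is exactly the end sector of the "ceph
journal" partition. Quick shell arithmetic, nothing Ceph-specific:

    $ echo $(( 5859442654 % (1 << 32) ))   # last usable sector, truncated to 32 bits
    1564475358                             # == the journal partition's end sector above

gdisk then placed the ~1000 MiB journal just below that bogus "end of disk",
i.e. at the 800GB mark.)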
> 
> 
> 
> I assume I could still do a disk-level relocation of the data with dd,
> shifting all my content toward the front of the disk and then growing
> the file system to the end, but this would take a significant amount of
> time, much more than a quick restart of the OSD.
> 
> This leaves me the option of setting noout and hoping for the best (no
> other failures) during my somewhat lengthy dd data movement, or taking
> my OSD down and letting the cluster begin repairing the redundancy.
> 
> If I follow the second option of normal OSD loss repair, my disk
> repartition step would be fast and I could bring the OSD back up rather
> quickly.  Does taking an OSD out of service, erasing it and bringing the
> same OSD back into service present any undue stress to the cluster?
> 
"Undue" is such a nicely ambiguous word.
Recovering/backfilling an OSD will stress your cluster, especially
considering that you're not using SSDs and are running a positively
ancient version of Ceph.

Make sure to set all appropriate recovery/backfill options to their
minimum.
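
For example (option names from memory, so double-check what a release that
old actually supports; osd.0 is just a stand-in, repeat per OSD or use your
config management):

    # live, on a running OSD:
    ceph tell osd.0 injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
    # and persistently in ceph.conf under [osd]:
    #   osd max backfills = 1
    #   osd recovery max active = 1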

OTOH your cluster should be able to handle the loss of an OSD w/o melting
down, and given the presumed age of your cluster you must have had OSD
failures before.
How did it fare then?

I have one cluster where losing an OSD would be just background noise,
while another one would be seriously impacted by such a loss (I'm working
on correcting that).

Regards,

Christian
> I'd prefer to use the second option if I can because I'm likely to
> repeat this in the near future in order to add encryption to these disks.
> 
> ~jpr
> 
> On 09/15/2015 06:44 PM, Lionel Bouton wrote:
> > On 16/09/2015 01:21, John-Paul Robinson wrote:
> >> Hi,
> >>
> >> I'm working to correct a partitioning error from when our cluster was
> >> first installed (ceph 0.56.4, ubuntu 12.04).  This left us with 2TB
> >> partitions for our OSDs, instead of the 2.8TB actually available on
> >> disk, a 29% space hit.  (The error was due to a gdisk bug that
> >> mis-computed the end of the disk during the ceph-disk-prepare and
> >> placed the journal at the 2TB mark instead of the true end of the
> >> disk at 2.8TB. I've updated gdisk to a newer release that works
> >> correctly.)
> >>
> >> I'd like to fix this problem by taking my existing 2TB OSDs offline
> >> one at a time, repartitioning them and then bringing them back into
> >> the cluster.  Unfortunately I can't just grow the partitions, so the
> >> repartition will be destructive.
> > Hum, why should it be? If the journal is at the 2TB mark, you should be
> > able to:
> > - stop the OSD,
> > - flush the journal, (ceph-osd -i <osdid> --flush-journal),
> > - unmount the data filesystem (might be superfluous but the kernel
> > seems to cache the partition layout when a partition is active),
> > - remove the journal partition,
> > - extend the data partition,
> > - place the journal partition at the end of the drive (in fact you
> > probably want to write a precomputed partition layout in one go).
> > - mount the data filesystem, resize it online,
> > - ceph-osd -i <osdid> --mkjournal (assuming your setup can find the
> > partition again automatically without reconfiguration)
> > - start the OSD
> >
> > If you script this you should not have to use noout: the OSD should
> > come back in a matter of seconds and the impact on the storage network
> > should be minimal.
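
For what Lionel describes, the repartitioning step could look roughly like
the sketch below. Purely illustrative: the sector numbers, OSD number and
device are made up, and it assumes the data partition sits at the start of
the disk with the journal/free space after it, which, as John-Paul noted at
the top of this mail, is not the layout he actually ended up with.

    ID=12                                  # hypothetical OSD number
    DEV=/dev/sdd                           # hypothetical device
    sgdisk -d 2 $DEV                       # drop the old journal partition
    sgdisk -d 1 -n 1:2048:5857394687 -c 1:"ceph data" $DEV
                                           # recreate the data partition with the
                                           # SAME start sector, now ending ~1000MiB
                                           # short of the end of the disk
    sgdisk -n 2:5857394688:5859442654 -c 2:"ceph journal" $DEV
    # (restore the original partition type GUIDs with -t if your setup needs them)
    partprobe $DEV                         # make the kernel re-read the table
    mount ${DEV}1 /var/lib/ceph/osd/ceph-$ID   # remount the data filesystem
    xfs_growfs /var/lib/ceph/osd/ceph-$ID      # grow it online (assuming XFS)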
> >
> > Note that the start of the disk is where you get the best sequential
> > reads/writes. Given that most data accesses are random and all journal
> > accesses are sequential, I put the journal at the start of the disk
> > when data and journal share the same platters.
> >
> > Best regards,
> >
> > Lionel
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



