Hello,

On Wed, 16 Sep 2015 07:21:26 -0500 John-Paul Robinson wrote:

> The move journal, partition resize, grow file system approach would
> work nicely if the spare capacity were at the end of the disk.
>
That shouldn't matter, you can "safely" lose your journal in controlled
circumstances.

This would also be an ideal time to put your journals on SSDs. ^o^

Roughly (you do have a test cluster, don't you? Or at least try this with
just one OSD first):

1. Set noout, just to be sure.
2. Stop the OSD.
3. Run "ceph-osd -i <osdnum> --flush-journal" for warm fuzzies (see the man
   page or --help).
4. Clobber your partitions in a way that leaves you with an intact data
   partition, then grow that partition and the filesystem in it as desired.
5. Re-init the journal with "ceph-osd -i <osdnum> --mkjournal".
6. Start the OSD and rejoice. (A rough, consolidated shell sketch of these
   steps is at the end of this mail.)

More below.

> Unfortunately, the gdisk (0.8.1) end-of-disk location bug caused the
> journal to be placed at the 800GB mark, leaving the largest remaining
> free region at the end of the disk. I'm assuming the gdisk bug was
> caused by overflowing a 32-bit int during the -1000M offset-from-end-of-disk
> calculation. When it computed the end of the disk for the journal
> placement on disks >2TB, it dropped the 2TB part of the size and was left
> only with the 800GB value, putting the journal there. After gdisk
> created the journal at the 800GB mark (splitting the disk),
> ceph-disk-prepare told gdisk to use the largest remaining free region for
> data, which was the 2TB region at the end of the disk.
>
> Here's an example of the buggy partitioning:
>
> crowbar@da0-36-9f-0e-28-2c:~$ sudo gdisk -l /dev/sdd
> GPT fdisk (gdisk) version 0.8.8
>
> Partition table scan:
>   MBR: protective
>   BSD: not present
>   APM: not present
>   GPT: present
>
> Found valid GPT with protective MBR; using GPT.
> Disk /dev/sdd: 5859442688 sectors, 2.7 TiB
> Logical sector size: 512 bytes
> Disk identifier (GUID): 6F76BD12-05D6-4FA2-A132-CAC3E1C26C81
> Partition table holds up to 128 entries
> First usable sector is 34, last usable sector is 5859442654
> Partitions will be aligned on 2048-sector boundaries
> Total free space is 1562425343 sectors (745.0 GiB)
>
> Number  Start (sector)  End (sector)  Size        Code  Name
>    1        1564475392    5859442654  2.0 TiB     FFFF  ceph data
>    2        1562425344    1564475358  1001.0 MiB  FFFF  ceph journal
>
> I assume I could still do a disk-level relocation of the data using dd,
> shifting all my content forward on the disk and then growing the file
> system to the end, but this would take a significant amount of time,
> far more than a quick restart of the OSD.
>
> This leaves me the option of setting noout and hoping for the best (no
> other failures) during my somewhat lengthy dd data movement, or taking my
> OSD down and letting the cluster begin repairing the redundancy.
>
> If I follow the second option of normal OSD-loss repair, my disk
> repartition step would be fast and I could bring the OSD back up rather
> quickly. Does taking an OSD out of service, erasing it and bringing the
> same OSD back into service present any undue stress to the cluster?
>
Undue is such a nicely ambiguous word.

Recovering/backfilling an OSD will stress your cluster, especially
considering that you're not using SSDs and are running a positively ancient
version of Ceph.
Make sure to set all appropriate recovery/backfill options to their minimum.

OTOH your cluster should be able to handle the loss of an OSD w/o melting
down, and given the presumed age of your cluster you must have had OSD
failures before. How did it fare then?
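To be concrete about the recovery/backfill options mentioned above, the
knobs are roughly the ones below. This is a sketch from memory; exact
option names and the "tell" syntax have changed between releases, so
verify against your running 0.56 config (e.g. "config show" on the OSD
admin socket) before relying on it.

  # Throttle recovery/backfill on the running OSDs (newer syntax shown;
  # older releases used "ceph osd tell \* injectargs ..." instead).
  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'

  # And persist the same in ceph.conf so OSDs keep it across restarts:
  # [osd]
  #     osd max backfills = 1
  #     osd recovery max active = 1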
I have one cluster where losing an OSD would be just background noise, while
another one would be seriously impacted by such a loss (working on
correcting that).

Regards,

Christian

> I'd prefer to use the second option if I can, because I'm likely to
> repeat this in the near future in order to add encryption to these disks.
>
> ~jpr
>
> On 09/15/2015 06:44 PM, Lionel Bouton wrote:
> > On 16/09/2015 01:21, John-Paul Robinson wrote:
> >> Hi,
> >>
> >> I'm working to correct a partitioning error from when our cluster was
> >> first installed (ceph 0.56.4, ubuntu 12.04). This left us with 2TB
> >> partitions for our OSDs, instead of the 2.8TB actually available on
> >> disk, a 29% space hit. (The error was due to a gdisk bug that
> >> mis-computed the end of the disk during the ceph-disk-prepare run and
> >> placed the journal at the 2TB mark instead of the true end of the
> >> disk at 2.8TB. I've updated gdisk to a newer release that works
> >> correctly.)
> >>
> >> I'd like to fix this problem by taking my existing 2TB OSDs offline
> >> one at a time, repartitioning them and then bringing them back into
> >> the cluster. Unfortunately I can't just grow the partitions, so the
> >> repartitioning will be destructive.
> >
> > Hum, why should it be? If the journal is at the 2TB mark, you should be
> > able to:
> > - stop the OSD,
> > - flush the journal (ceph-osd -i <osdid> --flush-journal),
> > - unmount the data filesystem (might be superfluous, but the kernel
> >   seems to cache the partition layout while a partition is active),
> > - remove the journal partition,
> > - extend the data partition,
> > - place the journal partition at the end of the drive (in fact you
> >   probably want to write a precomputed partition layout in one go),
> > - mount the data filesystem and resize it online,
> > - ceph-osd -i <osdid> --mkjournal (assuming your setup can find the
> >   partition again automatically without reconfiguration),
> > - start the OSD.
> >
> > If you script this you should not have to use noout: the OSD should
> > come back in a matter of seconds and the impact on the storage network
> > should be minimal.
> >
> > Note that the start of the disk is where you get the best sequential
> > reads/writes. Given that most data accesses are random and all journal
> > accesses are sequential, I put the journal at the start of the disk when
> > data and journal are sharing the same platters.
> >
> > Best regards,
> >
> > Lionel
>

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
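For completeness, here is Christian's 1-6 and Lionel's in-place repartition
pulled together into one rough, untested sketch. Device name, OSD id, mount
point and sizes are placeholders; it assumes an XFS data filesystem and only
applies when the data partition sits at the start of the disk with the spare
capacity after it (with the buggy layout shown further up, where the data
sits at the end of the disk, the content would first have to be moved with
dd). Try it on a single OSD or a test cluster first.

  # All names and sizes below are placeholders; adjust to your layout.
  DEV=/dev/sdd
  OSD=12
  MNT=/var/lib/ceph/osd/ceph-$OSD

  ceph osd set noout                  # keep the OSD from being marked out while down
  service ceph stop osd.$OSD          # or "stop ceph-osd id=$OSD", depending on init
  ceph-osd -i $OSD --flush-journal    # drain the journal before destroying it

  # Note the type codes / partition GUIDs so they can be re-applied afterwards
  # (sgdisk --typecode / --partition-guid) if your setup relies on ceph-disk/udev.
  sgdisk --info=1 $DEV
  sgdisk --info=2 $DEV

  umount $MNT                         # the kernel may cache the layout while mounted

  # Rebuild the table: the data partition keeps its original start sector
  # (2048 here) and grows to ~1 GiB short of the end; the journal takes the rest.
  sgdisk --delete=2 --delete=1 $DEV
  sgdisk --new=1:2048:-1024M $DEV
  sgdisk --new=2:0:0 $DEV
  partprobe $DEV                      # make the kernel pick up the new layout

  mount ${DEV}1 $MNT
  xfs_growfs $MNT                     # online grow; resize2fs handles mounted ext4 too

  ceph-osd -i $OSD --mkjournal        # recreate the journal on the new partition
  service ceph start osd.$OSD
  ceph osd unset noout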