Re: Shared WAL/DB device partition for multiple OSDs?

Note that instead of including the step to use the UUID in the OSD creation, as in [1] below, I opted to separate it out in those instructions.  That was to simplify the commands and to give people an idea of how to fix their OSDs if they created them using the device name instead of the UUID.  It would be simpler to just create the OSD using the partuuid in the first place.  Also not mentioned in my previous response: if you would like your OSDs to be encrypted at rest, you should add --dmcrypt to the ceph-volume command (it is included in the example below).

[1] # Create the OSD
echo "/dev/sdb /dev/nvme0n1p2
/dev/sdc /dev/nvme0n1p3" | while read hdd db; do
  # Match on the bare partition name (e.g. nvme0n1p2); the by-partuuid symlinks point at ../../<partition>
  uuid=$(ls -l /dev/disk/by-partuuid/ | awk '/'$(basename $db)'$/ {print $9}')
  ceph-volume lvm create --bluestore --dmcrypt --data $hdd --block.db /dev/disk/by-partuuid/$uuid
done
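
If you want to double-check which db device each OSD ended up using, ceph-volume can report it:

# Shows the data and block.db devices associated with each OSD
ceph-volume lvm list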

On Fri, May 11, 2018 at 12:46 PM David Turner <drakonstein@xxxxxxxxx> wrote:
This thread is off in left field and needs to be brought back to how things work.

While multiple OSDs can use the same device for block/wal partitions, they each need their own partition.  osd.0 could use nvme0n1p1, osd.1 could use nvme0n1p2, and so on.  You cannot use the same partition for each OSD.  Ceph-volume will not create the db/wal partitions for you; you need to create the partitions to be used by the OSDs manually.  There is no need to put a filesystem on top of the wal/db partition.  That is wasted overhead that will only slow things down.
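
In other words (a rough sketch, with illustrative device names):

# Each OSD pairs its own data disk with its own db partition -- never share one partition
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1
ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/nvme0n1p2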

Back to the original email.

> Or do I need to use osd-db=/dev/nvme0n1p2 for data=/dev/sdb and
> osd-db=/dev/nvme0n1p3 for data=/dev/sdc, and so on?
This is what you need to do, but as mentioned above, you need to create the partitions for --block-db yourself.  You talked about having a 10GB partition for this, but the general recommendation for block.db partitions is roughly 10GB per 1TB of OSD.  If your OSD is a 4TB disk, you should be looking at something closer to a 40GB block.db partition.  If your block.db partition is too small, then once it fills up it will spill over onto the data device and slow things down.
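
As a quick back-of-the-envelope check (just the rule of thumb above written out):

# ~10G of block.db per 1T of OSD data, i.e. roughly 1%
osd_size_tb=4
echo "suggested block.db size: ~$((osd_size_tb * 10))G"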


> And just to make sure - if I specify "--osd-db", I don't need
> to set "--osd-wal" as well, since the WAL will end up on the
> DB partition automatically, correct?
This is correct.  The wal will automatically be placed on the db if not otherwise specified.
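
For example (a minimal sketch, device names made up):

# Only --block.db is given; the wal is co-located on the same partition
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p2
# Only add --block.wal if you want the wal on a *different* device than the db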


I don't use ceph-deploy, but the process for creating the OSDs should be something like this.  After the OSDs are created, it is a good idea to make sure that each OSD is not referencing its db partition by the /dev/nvme0n1p2 style device name, as that can change across reboots if you have multiple NVMe devices.

# Make sure the disks are clean and ready to use as an OSD
for hdd in /dev/sd{b..c}; do
  ceph-volume lvm zap $hdd --destroy
done

# Create the nvme db partitions (assuming 10G size for a 1TB OSD)
for partition in {2..3}; do
  sgdisk --new=${partition}:0:+10G --change-name=${partition}:'ceph db' /dev/nvme0n1
done
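
If you want to sanity-check the partition layout before creating the OSDs, sgdisk can print it:

sgdisk -p /dev/nvme0n1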

# Create the OSD
echo "/dev/sdb /dev/nvme0n1p2
/dev/sdc /dev/nvme0n1p3" | while read hdd db; do
  ceph-volume lvm create --bluestore --data $hdd --block.db $db
done

# Fix the OSDs to look for the block.db partition by UUID instead of its device name.
for db in /var/lib/ceph/osd/*/block.db; do
  dev=$(readlink $db | grep -Eo nvme[[:digit:]]+n[[:digit:]]+p[[:digit:]]+ || echo false)
  if [[ "$dev" != false ]]; then
    uuid=$(ls -l /dev/disk/by-partuuid/ | awk '/'${dev}'$/ {print $9}')
    ln -sf /dev/disk/by-partuuid/$uuid $db
  fi
done
systemctl restart ceph-osd.target
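
Afterwards, a quick check that each OSD now resolves its block.db through /dev/disk/by-partuuid/ (rather than a bare nvme device name) could look like:

for db in /var/lib/ceph/osd/*/block.db; do
  echo "$db -> $(readlink $db)"
done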

On Fri, May 11, 2018 at 10:59 AM João Paulo Sacchetto Ribeiro Bastos <joaopaulosr95@xxxxxxxxx> wrote:
Actually, if you go to https://ceph.com/community/new-luminous-bluestore/ you will see that the DB/WAL work on an XFS partition, while the data itself goes on a raw block device.

Also, I gave you the wrong flag in the last mail: where I said --osd-db, it should be --block-db.

On Fri, May 11, 2018 at 11:51 AM Oliver Schulz <oliver.schulz@xxxxxxxxxxxxxx> wrote:
Hi,

thanks for the advice! I'm a bit confused now, though. ;-)
I thought DB and WAL were supposed to go on raw block
devices, not file systems?


Cheers,

Oliver


On 11.05.2018 16:01, João Paulo Sacchetto Ribeiro Bastos wrote:
> Hello Oliver,
>
> As far as I know, you can use the same DB device for about 4 or 5
> OSDs; you just need to be aware of the free space. I'm also building a
> bluestore cluster, and our DB and WAL will share a single SSD of about
> 480GB serving 4 OSD HDDs of 4TB each. As for the sizes, it's just a
> feeling, because I haven't yet found any clear rule about how to measure
> the requirements.
>
> * The only concern that took me some time to realize is that you should
> create an XFS partition if you're using ceph-deploy, because if you don't it
> will simply give you a RuntimeError that doesn't offer any hint about what's
> going on.
>
> So, answering your question, you could do something like:
> $ ceph-deploy osd create --bluestore --data=/dev/sdb --block-db
> /dev/nvme0n1p1 $HOSTNAME
> $ ceph-deploy osd create --bluestore --data=/dev/sdc --block-db
> /dev/nvme0n1p1 $HOSTNAME
>
> On Fri, May 11, 2018 at 10:35 AM Oliver Schulz
> <oliver.schulz@xxxxxxxxxxxxxx <mailto:oliver.schulz@xxxxxxxxxxxxxx>> wrote:
>
>     Dear Ceph Experts,
>
>     I'm trying to set up some new OSD storage nodes, now with
>     bluestore (our existing nodes still use filestore). I'm
>     a bit unclear on how to specify WAL/DB devices: Can
>     several OSDs share one WAL/DB partition? So, can I do
>
>           ceph-deploy osd create --bluestore --osd-db=/dev/nvme0n1p2 --data=/dev/sdb HOSTNAME
>
>           ceph-deploy osd create --bluestore --osd-db=/dev/nvme0n1p2 --data=/dev/sdc HOSTNAME
>
>           ...
>
>     Or do I need to use osd-db=/dev/nvme0n1p2 for data=/dev/sdb and
>     osd-db=/dev/nvme0n1p3 for data=/dev/sdc, and so on?
>
>     And just to make sure - if I specify "--osd-db", I don't need
>     to set "--osd-wal" as well, since the WAL will end up on the
>     DB partition automatically, correct?
>
>
>     Thanks for any hints,
>
>     Oliver
>     _______________________________________________
>     ceph-users mailing list
>     ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
>     http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
>
> João Paulo Sacchetto Ribeiro Bastos
> +55 31 99279-7092
>
--

João Paulo Bastos
DevOps Engineer at Mav Tecnologia
Belo Horizonte - Brazil
+55 31 99279-7092

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
