Re: Shared WAL/DB device partition for multiple OSDs?

Jacob DeGlopper <jacob@xxxxxxxx> · Fri, 11 May 2018 13:05:53 -0400



    Thanks, this is useful in general.  I have a semi-related
      question:
    Given an OSD server with multiple SSDs or NVMe devices, is there
      an advantage to putting the WAL/DB on a different device of the
      same speed?  For example, data on sda1 with its WAL/DB on sdb1,
      and data on sdb2 with its WAL/DB on sda2?
        -- jacob

    
    On 05/11/2018 12:46 PM, David Turner
      wrote:

    
      This thread is off in left field and needs to be
        brought back to how things work.
        

        While multiple OSDs can use the same device for their DB/WAL
          partitions, each OSD needs its own partition.  osd.0 could
          use nvme0n1p1, osd.2 could use nvme0n1p2, etc.  You cannot
          use the same partition for more than one OSD.  ceph-volume
          will not create the DB/WAL partitions for you; you need to
          create the partitions used by the OSD manually.  There is no
          need to put a filesystem on top of the partition for the
          WAL/DB.  That is wasted overhead that will slow things down.
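
        As a quick sanity check of the "one partition per OSD" rule, here is a minimal sketch that reads a planned layout and complains if any DB partition is listed twice. The helper name check_unique_db is made up for illustration, it is not a Ceph tool, and the device names are only examples.

```shell
#!/usr/bin/env bash
# check_unique_db (hypothetical helper): reads "data_device db_partition"
# pairs on stdin and returns non-zero if any DB partition is assigned to
# more than one OSD.
check_unique_db() {
  local dups
  # Take the second column (the DB partition) and list values seen twice.
  dups=$(awk 'NF >= 2 {print $2}' | sort | uniq -d)
  if [ -n "$dups" ]; then
    echo "DB partition used more than once: $dups" >&2
    return 1
  fi
}

# Example layout: two data disks, each with its own DB partition.
printf '%s\n' "/dev/sdb /dev/nvme0n1p2" "/dev/sdc /dev/nvme0n1p3" \
  | check_unique_db && echo "layout ok"
```

        It only checks the plan you feed it; it does not look at what is actually on the disks.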
        

        Back to the original email.
        

        > Or do I need to use osd-db=/dev/nvme0n1p2 for data=""
        > osd-db=/dev/nvme0n1p3 for data="" and so on?

          This is what you need to do, but as said above, you need to
          create the partitions for --block-db yourself.  You talked
          about having a 10GB partition for this, but the general
          recommendation for block.db partitions is 10GB per 1TB of
          OSD.  If your OSD is a 4TB disk, you should be looking at
          something closer to a 40GB block.db partition.  If your
          block.db partition is too small, then once it fills up it
          will spill over onto the data volume and slow things down.
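
          To make the arithmetic concrete, here is a rough sketch of the 10GB-per-1TB rule of thumb. The function names are made up for illustration, and the 480GB / 4TB figures are the ones mentioned further down in this thread. It uses whole numbers only and ignores any space you want to leave free on the device.

```shell
#!/usr/bin/env bash
# Rule of thumb from this thread: about 10 GB of block.db per 1 TB of OSD data.
# Both function names are hypothetical helpers, not Ceph commands.

# db_size_gb OSD_TB: suggested block.db size in GB for an OSD of OSD_TB terabytes.
db_size_gb() {
  local osd_tb=$1
  echo $(( osd_tb * 10 ))
}

# osds_per_db_device DEVICE_GB OSD_TB: how many block.db partitions of the
# suggested size fit on a DB device of DEVICE_GB gigabytes.
osds_per_db_device() {
  local device_gb=$1 osd_tb=$2
  echo $(( device_gb / $(db_size_gb "$osd_tb") ))
}

db_size_gb 4              # 4 TB OSD          -> prints 40
osds_per_db_device 480 4  # 480 GB SSD, 4 TB  -> prints 12
```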

          
          > And just to make sure - if I specify "--osd-db", I don't need
          > to set "--osd-wal" as well, since the WAL will end up on the
          > DB partition automatically, correct?

        
        This is correct.  The wal will automatically be placed on
          the db if not otherwise specified.
        

        I don't use ceph-deploy, but the process for creating the
          OSDs should be something like this.  After the OSDs are
          created, it is a good idea to make sure that the OSD is not
          looking for the db partition by a device name such as
          /dev/nvme0n1p2, as that can change on reboots if you have
          multiple nvme devices.
        

        # Make sure the disks are clean and ready to use as an OSD
        for hdd in /dev/sd{b..c}; do
          ceph-volume lvm zap $hdd --destroy
        done
        

        # Create the nvme db partitions (assuming 10G size for a 1TB OSD)
        for partition in {2..3}; do
          # -n partnum:start:end creates the partition, -c partnum:name labels it
          sgdisk -n "$partition:0:+10G" -c "$partition:ceph db" /dev/nvme0n1
        done
        

        # Create the OSD
        echo "/dev/sdb /dev/nvme0n1p2
        /dev/sdc /dev/nvme0n1p3" | while read hdd db; do
          ceph-volume lvm create --bluestore --data "$hdd" --block.db "$db"
        done
        

        # Fix the OSDs to look for the block.db partition by UUID
        # instead of its device name.
        for db in /var/lib/ceph/osd/*/block.db; do
          # Extract the nvme partition name (e.g. nvme0n1p2) from the symlink target
          dev=$(readlink "$db" | grep -Eo 'nvme[[:digit:]]+n[[:digit:]]+p[[:digit:]]+' || echo false)
          if [[ "$dev" != false ]]; then
            # Find the by-partuuid entry that points at this partition
            uuid=$(ls -l /dev/disk/by-partuuid/ | awk '/'"${dev}"'$/ {print $9}')
            ln -sf "/dev/disk/by-partuuid/$uuid" "$db"
          fi
        done
        systemctl restart ceph-osd.target
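
        If you want to double-check the result afterwards, something like the following should list any block.db symlink that still points at a plain device name. This is only a sketch: list_unstable_db_links is a made-up helper name, and it assumes the OSD directories live under /var/lib/ceph/osd as in the loop above.

```shell
#!/usr/bin/env bash
# list_unstable_db_links [ROOT] (hypothetical helper): prints every
# ROOT/*/block.db symlink whose target is not under /dev/disk/by-partuuid/.
# Prints nothing if all links already use a partition UUID.
list_unstable_db_links() {
  local root=${1:-/var/lib/ceph/osd} link target
  for link in "$root"/*/block.db; do
    # Skip OSDs without a separate block.db (and the unmatched glob itself).
    [ -L "$link" ] || continue
    target=$(readlink "$link")
    case $target in
      /dev/disk/by-partuuid/*) ;;           # stable name, nothing to report
      *) echo "$link -> $target" ;;         # still a device name like /dev/nvme0n1p2
    esac
  done
}

list_unstable_db_links /var/lib/ceph/osd
```

        It only reads the symlinks and changes nothing, so it is safe to run before and after the fix.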
        
      
        On Fri, May 11, 2018 at 10:59 AM João Paulo
          Sacchetto Ribeiro Bastos <joaopaulosr95@xxxxxxxxx>
          wrote:

        
          Actually, if you go to https://ceph.com/community/new-luminous-bluestore/ you
            will see that DB/WAL work on an XFS partition, while the data
            itself goes on a raw block device.
            

            Also, I told you the wrong command in the last mail.
              When I said --osd-db, it should have been --block-db.
          
          
            On Fri, May 11, 2018 at 11:51 AM Oliver
              Schulz <oliver.schulz@xxxxxxxxxxxxxx>
              wrote:

            
            Hi,

              thanks for the advice! I'm a bit confused now, though. ;-)
              I thought DB and WAL were supposed to go on raw block
              devices, not file systems?

              Cheers,
              Oliver

              On 11.05.2018 16:01, João Paulo Sacchetto Ribeiro Bastos wrote:

              > Hello Oliver,
              >
              > As far as I know, you can use the same DB device for about 4 or 5
              > OSDs; you just need to be aware of the free space. I'm also developing a
              > bluestore cluster, and our DB and WAL will be on the same SSD of about
              > 480GB serving 4 OSD HDDs of 4 TB each. About the sizes, it's just a
              > feeling, because I couldn't yet find any clear rule about how to measure
              > the requirements.
              >
              > * The only concern that took me some time to realize is that you should
              > create an XFS partition if using ceph-deploy, because if you don't it will
              > simply give you a RuntimeError that doesn't give any hint about what's
              > going on.
              >
              > So, answering your question, you could do something like:
              > $ ceph-deploy osd create --bluestore --data="" --block-db /dev/nvme0n1p1 $HOSTNAME
              > $ ceph-deploy osd create --bluestore --data="" --block-db /dev/nvme0n1p1 $HOSTNAME
              >

              > On Fri, May 11, 2018 at 10:35 AM Oliver Schulz
              > <oliver.schulz@xxxxxxxxxxxxxx> wrote:
              >
              >     Dear Ceph Experts,
              >
              >     I'm trying to set up some new OSD storage nodes, now with
              >     bluestore (our existing nodes still use filestore). I'm
              >     a bit unclear on how to specify WAL/DB devices: Can
              >     several OSDs share one WAL/DB partition? So, can I do
              >
              >           ceph-deploy osd create --bluestore --osd-db=/dev/nvme0n1p2 --data="" HOSTNAME
              >
              >           ceph-deploy osd create --bluestore --osd-db=/dev/nvme0n1p2 --data="" HOSTNAME
              >
              >           ...
              >
              >     Or do I need to use osd-db=/dev/nvme0n1p2 for data=""
              >     osd-db=/dev/nvme0n1p3 for data="" and so on?
              >
              >     And just to make sure - if I specify "--osd-db", I don't need
              >     to set "--osd-wal" as well, since the WAL will end up on the
              >     DB partition automatically, correct?
              >
              >
              >     Thanks for any hints,
              >
              >     Oliver

              >     _______________________________________________
              >     ceph-users mailing list
              >     ceph-users@xxxxxxxxxxxxxx
              >     http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

              > 

              > -- 
              >
              > João Paulo Sacchetto Ribeiro Bastos
              > +55 31 99279-7092

              > 


            
          -- 

          
              João
                  Paulo Bastos

                  DevOps Engineer at Mav Tecnologia

                  Belo Horizonte - Brazil

                +55 31 99279-7092
            
          

        