Hi,
On 5/29/19 5:23 AM, Frank Yu wrote:
Hi Jake,
I have the same question about the size of DB/WAL for OSDs. My situation: 12 OSDs per OSD node, 8 TB (maybe 12 TB later) per OSD, one Intel NVMe SSD (Optane P4800X, 375 GB) per OSD node, which means DB/WAL can use about 30 GB per 8 TB OSD. I mainly use CephFS to serve an HPC cluster for ML.
(I plan to move the CephFS metadata to a pool backed by the NVMe SSDs. BTW, does this improve performance a lot? Any comparisons?)
We have a similar setup, but with 24 disks and 2x P4800X. And the
375 GB NVMe drives are _not_ large enough:
2019-05-29 07:00:00.000108 mon.bcf-03 [WRN] overall HEALTH_WARN
BlueFS spillover detected on 22 OSD(s)
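(In case anyone wants to check their own cluster: "ceph health detail" should list the affected OSDs for this warning individually; the per-OSD numbers further below are from the bluefs perf counters.)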
root@bcf-10:~# parted /dev/nvme0n1 print
Model: NVMe Device (nvme)
Disk /dev/nvme0n1: 375GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
Number  Start   End     Size    File system  Name  Flags
 1      1049kB  31.1GB  31.1GB
 2      31.1GB  62.3GB  31.1GB
 3      62.3GB  93.4GB  31.1GB
 4      93.4GB  125GB   31.1GB
 5      125GB   156GB   31.1GB
 6      156GB   187GB   31.1GB
 7      187GB   218GB   31.1GB
 8      218GB   249GB   31.1GB
 9      249GB   280GB   31.1GB
10      280GB   311GB   31.1GB
11      311GB   343GB   31.1GB
12      343GB   375GB   32.6GB
The second NVMe drive has the same partition layout. The twelfth
partition is actually large enough to hold all the DB data, but the
other 11 partitions on this drive are a little bit too small. I'm
still trying to calculate the exact sweet spot...
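A rough back-of-the-envelope calculation, assuming the RocksDB defaults
that Ceph ships (max_bytes_for_level_base = 256 MB, level size
multiplier 10), so I may well be off here:

  L1  ~  256 MB
  L2  ~ 2.56 GB
  L3  ~ 25.6 GB
  sum ~ 28.4 GB, plus WAL and compaction overhead

As far as I understand, a RocksDB level only stays on the fast device
if it fits there completely, so ~31 GB is right at the edge, which
would explain why some partition sizes work and others don't.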
With 24 OSDs and only two of them having a just-large-enough DB
partition, I end up with 22 OSDs not fully using their DB partition
and spilling over onto the slow disk... exactly as reported by Ceph.
Details for one of the affected OSDs:
"bluefs": {
"gift_bytes": 0,
"reclaim_bytes": 0,
"db_total_bytes": 31138504704,
"db_used_bytes": 2782912512,
"wal_total_bytes": 0,
"wal_used_bytes": 0,
"slow_total_bytes": 320062095360,
"slow_used_bytes": 5838471168,
"num_files": 135,
"log_bytes": 13295616,
"log_compactions": 9,
"logged_bytes": 338104320,
"files_written_wal": 2,
"files_written_sst": 5066,
"bytes_written_wal": 375879721287,
"bytes_written_sst": 227201938586,
"bytes_written_slow": 65162240000,
"max_bytes_wal": 0,
"max_bytes_db": 5265940480,
"max_bytes_slow": 7540310016
},
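(These numbers are from the bluefs section of the OSD perf counters;
something like the following should print them for all OSDs on a node,
assuming jq is installed and the default admin socket paths:)

# dump the bluefs counters for every OSD on this node
for sock in /var/run/ceph/ceph-osd.*.asok; do
    echo "== $sock =="
    ceph daemon "$sock" perf dump | jq .bluefs
done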
Maybe it's just a matter of shifting some megabytes. We are about
to deploy more of these nodes, so I would be grateful if anyone
can comment on the correct size of the DB partitions. Otherwise
I'll have to use a RAID-0 over the two drives.
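(In case it matters for the sizing discussion: the OSDs are created
with an explicit DB partition, e.g. something like the following --
device names are just placeholders:

ceph-volume lvm prepare --data /dev/sdb --block.db /dev/nvme0n1p1
)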
Regards,
Burkhard