I think I have figured out the issue.
POOL      SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
images    28523G               3.0   68779G        1.2441                1000                warn
My images pool is 28523G with a replication level of 3, and there is a total of 68779G of raw capacity.
According to the documentation http://docs.ceph.com/docs/master/rados/operations/placement-groups/
"SIZE is the amount of data stored in the pool. TARGET SIZE, if present, is the amount of data the administrator has specified that they expect to eventually be stored in this pool. The system uses the larger of the two values for its calculation.
RATE is the multiplier for the pool that determines how much raw storage capacity is consumed. For example, a 3 replica pool will have a ratio of 3.0, while a k=4,m=2 erasure coded pool will have a ratio of 1.5.
RAW CAPACITY is the total amount of raw storage capacity on the OSDs that are responsible for storing this pool’s (and perhaps other pools’) data. RATIO is the ratio of that total capacity that this pool is consuming (i.e., ratio = size * rate / raw capacity)."
So ratio = 28523G * 3.0 / 68779G = 1.2441x
So I'm oversubscribing by 1.2441x, thus the warning.
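To double-check that, here's a quick Python sketch (nothing but the arithmetic, with the numbers copied from the autoscale-status output above) reproducing the documented formula:

    # ratio = SIZE * RATE / RAW CAPACITY, per the placement-groups docs
    size_g = 28523          # SIZE of the images pool, in GiB
    rate = 3.0              # RATE for a 3x replicated pool
    raw_capacity_g = 68779  # RAW CAPACITY of the OSDs backing the pool, in GiB

    ratio = size_g * rate / raw_capacity_g
    print(round(ratio, 4))  # 1.2441, matching the RATIO column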
But ... looking at the output of ceph df:
POOL ID STORED OBJECTS USED %USED MAX AVAIL
images 3 9.3 TiB 2.82M 28 TiB 57.94 6.7 TiB
I believe the 9.3 TiB is the amount of data actually written (thin provisioned), versus a fully provisioned 28 TiB?
The raw capacity of the cluster is sitting at about 50% used.
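The numbers at least hang together; here is a quick sanity check in Python (figures taken from the ceph df output quoted further down in the thread):

    # 9.3 TiB STORED at 3x replication is roughly the 28 TiB shown as USED,
    # and raw usage across the cluster is just under half.
    stored_tib = 9.3       # STORED for the images pool
    replicas = 3           # pool replication size
    print(round(stored_tib * replicas, 1))         # 27.9, about the 28 TiB USED

    raw_used_tib = 32      # RAW USED across all OSDs
    raw_total_tib = 67     # total raw capacity
    print(round(raw_used_tib / raw_total_tib, 4))  # 0.4776, about 48% used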
Shouldn't the ratio instead be STORED (from ceph df) * RATE (from ceph osd pool autoscale-status) / RAW CAPACITY, since Ceph uses thin provisioning for rbd?
Otherwise, this ratio will only work for people who don't thin provision, which goes against what Ceph is doing with rbd:
http://docs.ceph.com/docs/master/rbd/
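To make the comparison concrete, here is the same arithmetic with STORED in place of SIZE (this is only the calculation I'm suggesting, not what the autoscaler actually does today):

    # Proposed: ratio = STORED * RATE / RAW CAPACITY
    stored_g = 9.3 * 1024   # STORED for images, converted from TiB to GiB
    rate = 3.0              # replication multiplier
    raw_capacity_g = 68779  # RAW CAPACITY, in GiB

    ratio = stored_g * rate / raw_capacity_g
    print(round(ratio, 4))  # 0.4154, in line with the ~48% raw usage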
On Wed, May 1, 2019 at 11:44 AM Joe Ryner <jryner@xxxxxxxx> wrote:
I have found a little more information. When I turn off the pg_autoscaler the warning goes away; turn it back on and the warning comes back. I have run the following:

# ceph osd pool autoscale-status
POOL      SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
images    28523G               3.0   68779G        1.2441                1000                warn
locks     676.5M               3.0   68779G        0.0000                8                   warn
rbd       0                    3.0   68779G        0.0000                8                   warn
data      0                    3.0   68779G        0.0000                8                   warn
metadata  3024k                3.0   68779G        0.0000                8                   warn

# ceph df
RAW STORAGE:
    CLASS  SIZE    AVAIL    USED     RAW USED  %RAW USED
    hdd    51 TiB  26 TiB   24 TiB   24 TiB    48.15
    ssd    17 TiB  8.5 TiB  8.1 TiB  8.1 TiB   48.69
    TOTAL  67 TiB  35 TiB   32 TiB   32 TiB    48.28

POOLS:
    POOL      ID  STORED   OBJECTS  USED     %USED  MAX AVAIL
    data      0   0 B      0        0 B      0      6.7 TiB
    metadata  1   6.3 KiB  21       3.0 MiB  0      6.7 TiB
    rbd       2   0 B      2        0 B      0      6.7 TiB
    images    3   9.3 TiB  2.82M    28 TiB   57.94  6.7 TiB
    locks     4   215 MiB  517      677 MiB  0      6.7 TiB

It looks to me like the numbers for the images pool are not right in the autoscale-status output. Below is the osd crush tree:

# ceph osd crush tree
ID  CLASS  WEIGHT    (compat)  TYPE NAME
 -1        66.73337            root default
 -3        22.28214  22.28214      rack marack
 -8         7.27475   7.27475          host abacus
 19   hdd   1.81879   1.81879              osd.19
 20   hdd   1.81879   1.42563              osd.20
 21   hdd   1.81879   1.81879              osd.21
 50   hdd   1.81839   1.81839              osd.50
-10         7.76500   6.67049          host gold
  7   hdd   0.86299   0.83659              osd.7
  9   hdd   0.86299   0.78972              osd.9
 10   hdd   0.86299   0.72031              osd.10
 14   hdd   0.86299   0.65315              osd.14
 15   hdd   0.86299   0.72586              osd.15
 22   hdd   0.86299   0.80528              osd.22
 23   hdd   0.86299   0.63741              osd.23
 24   hdd   0.86299   0.77718              osd.24
 25   hdd   0.86299   0.72499              osd.25
 -5         7.24239   7.24239          host hassium
  0   hdd   1.80800   1.52536              osd.0
  1   hdd   1.80800   1.65421              osd.1
 26   hdd   1.80800   1.65140              osd.26
 51   hdd   1.81839   1.81839              osd.51
 -2        21.30070  21.30070      rack marack2
-12         7.76999   8.14474          host hamms
 27   ssd   0.86299   0.99367              osd.27
 28   ssd   0.86299   0.95961              osd.28
 29   ssd   0.86299   0.80768              osd.29
 30   ssd   0.86299   0.86893              osd.30
 31   ssd   0.86299   0.92583              osd.31
 32   ssd   0.86299   1.00227              osd.32
 33   ssd   0.86299   0.73099              osd.33
 34   ssd   0.86299   0.80766              osd.34
 35   ssd   0.86299   1.04811              osd.35
 -7         5.45636   5.45636          host parabola
  5   hdd   1.81879   1.81879              osd.5
 12   hdd   1.81879   1.81879              osd.12
 13   hdd   1.81879   1.81879              osd.13
 -6         2.63997   3.08183          host radium
  2   hdd   0.87999   1.05594              osd.2
  6   hdd   0.87999   1.10501              osd.6
 11   hdd   0.87999   0.92088              osd.11
 -9         5.43439   5.43439          host splinter
 16   hdd   1.80800   1.80800              osd.16
 17   hdd   1.81839   1.81839              osd.17
 18   hdd   1.80800   1.80800              osd.18
-11        23.15053  23.15053      rack marack3
-13         8.63300   8.98921          host helm
 36   ssd   0.86299   0.71931              osd.36
 37   ssd   0.86299   0.92601              osd.37
 38   ssd   0.86299   0.79585              osd.38
 39   ssd   0.86299   1.08521              osd.39
 40   ssd   0.86299   0.89500              osd.40
 41   ssd   0.86299   0.92351              osd.41
 42   ssd   0.86299   0.89690              osd.42
 43   ssd   0.86299   0.92480              osd.43
 44   ssd   0.86299   0.84467              osd.44
 45   ssd   0.86299   0.97795              osd.45
-40         7.27515   7.89609          host samarium
 46   hdd   1.81879   1.90242              osd.46
 47   hdd   1.81879   1.86723              osd.47
 48   hdd   1.81879   1.93404              osd.48
 49   hdd   1.81879   2.19240              osd.49
 -4         7.24239   7.24239          host scandium
  3   hdd   1.80800   1.76680              osd.3
  4   hdd   1.80800   1.80800              osd.4
  8   hdd   1.80800   1.80800              osd.8
 52   hdd   1.81839   1.81839              osd.52

Any ideas?

On Wed, May 1, 2019 at 9:32 AM Joe Ryner <jryner@xxxxxxxx> wrote:

Hi,

I have an old ceph cluster and have upgraded recently from Luminous to Nautilus.
After converting to Nautilus I decided it was time to convert to bluestore. Before I converted the cluster was healthy, but afterwards I have a HEALTH_WARN:

# ceph health detail
HEALTH_WARN 1 subtrees have overcommitted pool target_size_bytes; 1 subtrees have overcommitted pool target_size_ratio
POOL_TARGET_SIZE_BYTES_OVERCOMMITTED 1 subtrees have overcommitted pool target_size_bytes
    Pools ['data', 'metadata', 'rbd', 'images', 'locks'] overcommit available storage by 1.244x due to target_size_bytes 0 on pools []
POOL_TARGET_SIZE_RATIO_OVERCOMMITTED 1 subtrees have overcommitted pool target_size_ratio
    Pools ['data', 'metadata', 'rbd', 'images', 'locks'] overcommit available storage by 1.244x due to target_size_ratio 0.000 on pools []

I started with a target_size_ratio of 0.85 on the images pool and reduced it to 0 to hopefully get the warning to go away. The cluster seems to be running fine; I just can't figure out what the problem is and how to make the message go away. I restarted the monitors this morning in hopes of fixing it. Anyone have any ideas?

Thanks in advance
Joe Ryner
Associate Director
Center for the Application of Information Technologies (CAIT) - http://www.cait.org
Western Illinois University - http://www.wiu.edu
P: (309) 298-1804
F: (309) 298-2806
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com