Hi everyone,
we recently upgraded our production cluster to this version:
ceph version 14.2.3-349-g7b1552ea82
(7b1552ea827cf5167b6edbba96dd1c4a9dc16937) nautilus (stable)
We then activated pg_autoscaler for two pools that had an unsuitable
pg_num, and the result was satisfying.
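For reference, it was enabled roughly like this (the mgr module plus
the per-pool mode; the pool name is just a placeholder):

ceph mgr module enable pg_autoscaler
ceph osd pool set <pool> pg_autoscale_mode on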
However, after the rebalance had finished the cluster became laggy. We
noticed that two out of three MONs showed much higher CPU usage than
usual; according to `top` the MON processes consumed more than 100% CPU.
Restarting the MON services and disabling pg_autoscaler resolved the
issue. I've read that the balancer module can cause higher load on
the MGR daemon; could this be related?
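In case it helps, the recovery steps were essentially the following
(hostname and pool name are placeholders):

systemctl restart ceph-mon@<mon-host>
ceph osd pool set <pool> pg_autoscale_mode off

(or alternatively `ceph mgr module disable pg_autoscaler` to turn the
autoscaler off globally).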
Another thing worth mentioning is the autoscaler's confusing size
calculation. After the pg numbers had been corrected we got these
warnings about overcommitted pools:
1 subtrees have overcommitted pool target_size_bytes
1 subtrees have overcommitted pool target_size_ratio
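(Side note: if I understand the documentation correctly, those targets
can be adjusted or cleared per pool with something like

ceph osd pool set <pool> target_size_bytes 0
ceph osd pool set <pool> target_size_ratio 0

but that is not my main concern here.)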
The images pool was responsible for those warnings. The confusing part
was that autoscale-status sometimes displayed the SIZE of that pool as
more than 14 TB:
ceph osd pool autoscale-status
POOL    SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
images  14399G               3.0   33713G        1.2813                1.0   128                 on
And a couple of minutes later the same pool was suddenly reported with only around 4 TB:
ceph osd pool autoscale-status
POOL    SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
images  4112G                3.0   33713G        0.3659                1.0   128                 on
There seems to be some kind of inconsistency here. The actual used
storage of this pool according to `ceph df` is:
POOLS:
    POOL     ID  STORED   OBJECTS  USED    %USED  MAX AVAIL
    images    1  4.1 TiB    1.01M  12 TiB  49.73    4.1 TiB
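If I read the RATIO column correctly it is simply SIZE * RATE / RAW
CAPACITY, so both snapshots are at least internally consistent with
whatever SIZE the autoscaler reported at that moment:

14399G * 3.0 / 33713G = 1.2813
 4112G * 3.0 / 33713G = 0.3659

The jump therefore seems to be in the SIZE value itself; the lower
value roughly matches STORED from `ceph df`, while I cannot map the
higher value to anything in that output.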
Has anyone experienced something similar? Are these known issues?
Regards,
Eugen