An update on this for the record:
To fully solve this I've had to destroy each OSD and create it again,
one by one. I could have done it one host at a time, but I preferred
to stay on the safe side in case something else went wrong.
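For reference, the per-OSD cycle I followed was roughly the sketch below
(this assumes a plain systemd + ceph-volume deployment; the OSD id and
/dev/sdX are placeholders, and I waited for the cluster to go back to
HEALTH_OK before moving on to the next OSD):
# drain the OSD and wait until it is safe to remove
ceph osd out 1
while ! ceph osd safe-to-destroy osd.1; do sleep 10; done
# stop it, destroy it and recreate it reusing the same id
systemctl stop ceph-osd@1
ceph osd destroy 1 --yes-i-really-mean-it
ceph-volume lvm zap /dev/sdX --destroy
ceph-volume lvm create --osd-id 1 --data /dev/sdX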
The values for num_pgmeta_omap (which I still don't know the meaning
of) on the new OSDs were similar to other clusters (I've seen roughly
from 300000 to 700000), so I believe the characteristics of the data in
the cluster do not determine how big or small num_pgmeta_omap should be.
One thing I've noticed is that /bad/ or /damaged/ OSDs (i.e. those
showing high CPU usage and poor performance during the trim operation)
took much longer to calculate their histogram, even if their
num_pgmeta_omap was low:
(/bad OSD/):
# time ceph daemon osd.1 calc_objectstore_db_histogram | grep "num_pgmeta_omap"
"num_pgmeta_omap": 673208,
real 1m14,549s
user 0m0,075s
sys 0m0,025s
(/good new OSD/):
# time ceph daemon osd.1 calc_objectstore_db_histogram | grep "num_pgmeta_omap"
"num_pgmeta_omap": 434298,
real 0m18,022s
user 0m0,078s
sys 0m0,023s
Maybe it is worth checking that histogram from time to time as a way to
measure the OSD "health"?
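In case it is useful, something as simple as the loop below is what I
have in mind (it has to run on each OSD host, since it talks to the
local admin sockets; the socket path is the default one):
# report num_pgmeta_omap and how long the histogram takes for each local OSD
for sock in /var/run/ceph/ceph-osd.*.asok; do
    id=$(basename "$sock" .asok)    # e.g. ceph-osd.1
    id=${id#ceph-osd.}              # e.g. 1
    start=$(date +%s)
    omap=$(ceph daemon osd."$id" calc_objectstore_db_histogram | grep num_pgmeta_omap)
    echo "osd.$id: $omap ($(( $(date +%s) - start ))s)"
done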
Again, thanks everyone.
On 1/30/23 18:18, Victor Rodriguez wrote:
On 1/30/23 15:15, Ana Aviles wrote:
Hi,
Josh already suggested this, but I will suggest it one more time. We had
similar behaviour upgrading from Nautilus to Pacific. In our case,
compacting the OSDs did the trick.
Thanks for chiming in! Unfortunately, in my case neither an online
compaction (ceph tell osd.ID compact) nor an offline repair
(ceph-bluestore-tool repair --path /var/lib/ceph/osd/OSD_ID) helps.
Compactions do seem to compact some amount. I think the OSD log dumps
information about the size of rocksdb. It went from this:
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
L0    0/0    0.00 KB   0.0  0.0   0.0  0.0   3.9   3.9  0.0   1.0   0.0  62.2   64.81   61.59   89   0.728    0     0
L1    3/0  132.84 MB   0.5  7.0   3.9  3.1   5.0   2.0  0.0   1.3  63.8  46.1  112.11  108.52   23   4.874  56M 7276K
L2   12/0  690.99 MB   0.8  6.5   1.8  4.7   5.6   0.9  0.1   3.2  21.4  18.5  310.78  307.14   28  11.099 165M 3077K
L3   54/0    3.37 GB   0.1  0.9   0.3  0.6   0.5  -0.1  0.0   1.6  35.9  20.2   24.84   24.49    4   6.210  24M   15M
Sum  69/0    4.17 GB   0.0 14.4   6.0  8.3  15.1   6.7  0.1   3.8  28.7  30.1  512.54  501.74  144   3.559 246M   26M
Int   0/0    0.00 KB   0.0  0.8   0.3  0.5   0.6   0.1  0.0  14.1  27.5  20.7   31.13   30.73    4   7.783  18M 4086K
To this:
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
L0    2/0   72.42 MB   0.5  0.0   0.0  0.0   0.1   0.1  0.0   1.0   0.0  63.2    1.14    0.84    2   0.572    0     0
L3   48/0    3.10 GB   0.1  0.0   0.0  0.0   0.0   0.0  0.0   0.0   0.0   0.0    0.00    0.00    0   0.000    0     0
Sum  50/0    3.17 GB   0.0  0.0   0.0  0.0   0.1   0.1  0.0   1.0   0.0  63.2    1.14    0.84    2   0.572    0     0
Int   0/0    0.00 KB   0.0  0.0   0.0  0.0   0.0   0.0  0.0   0.0   0.0   0.0    0.00    0.00    0   0.000    0     0
Still, it feels "too big" compared to some other OSDs in other
similarly sized clusters, which makes me think there's some kind of
"garbage" making the trim go crazy.
For us there was no performance impact running the compaction (ceph
daemon osd.0 compact), although we ran them in batches and not all at
once on all OSDs, just in case. Also, there is no need to restart OSDs
for this operation.
Yes, compacting had no perceived impact on client performance, just
somewhat higher CPU usage for the OSD process.
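For reference, a batched online compaction like the one you describe can
be scripted roughly like this (just a sketch; the pause between OSDs is
arbitrary):
# compact every OSD in the cluster one at a time, pausing between them
for id in $(ceph osd ls); do
    echo "compacting osd.$id"
    ceph tell osd."$id" compact
    sleep 60    # arbitrary pause to spread out the extra CPU/IO load
done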
Does anyone know, by any chance, the meaning of "num_pgmeta_omap" in
the ceph daemon osd.ID calc_objectstore_db_histogram output? As I
mentioned, the OSDs in this cluster have very different values in that
field, while all other clusters have much more similar values:
osd.0: "num_pgmeta_omap": 17526766,
osd.1: "num_pgmeta_omap": 2653379,
osd.2: "num_pgmeta_omap": 12358703,
osd.3: "num_pgmeta_omap": 6404975,
osd.6: "num_pgmeta_omap": 19845318,
osd.7: "num_pgmeta_omap": 6043083,
osd.12: "num_pgmeta_omap": 18666776,
osd.13: "num_pgmeta_omap": 615846,
osd.14: "num_pgmeta_omap": 13190188,
Thanks a lot!
--
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx