Re: Very slow snaptrim operations blocking client I/O

Hi,
we've encountered the same issue after upgrading to Octopus on one of our
RBD clusters, and now it has reappeared after the autoscaler lowered the PGs
from 8k to 2k for the RBD pool.

What we've done in the past:
- recreated all OSDs after our 2nd incident with slow ops in a single week
after the Ceph upgrade (early September)
- upgraded the OS from CentOS 7 to Ubuntu Focal after the third incident
(December)
- offline-compacted all OSDs a week ago, because we had some (~500) very old
snapshots lying around and hoped that snaptrim works faster when the
RocksDB is freshly compacted (see the sketch below)
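
The offline compaction itself was done roughly like this per OSD (just a
sketch, assuming a package-based deployment with ceph-osd@<ID> systemd units
and the default data and socket paths; adjust to your setup):

ID=1   # repeat for every OSD on the host
systemctl stop ceph-osd@"$ID"
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-"$ID" compact
systemctl start ceph-osd@"$ID"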

After the first incident we had roughly three months of smooth sailing; now,
roughly three months later, we are experiencing these slow ops again.
This time it might be because we have some OSDs (2 TB SSD) with very few
PGs (~30) and some OSDs (8 TB SSD) with a lot of PGs (~120).
I will try to compact all OSDs and check whether it stops again, but I think I
need to bump the PGs back up to 4k, because the problem started again when the
autoscaler lowered the PGs.
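
If I do bump the PGs, the plan is roughly the following (a sketch; "rbd"
stands in for our pool name, and turning the autoscaler off for that pool is
my assumption, to keep it from shrinking the pool again afterwards):

ceph osd pool set rbd pg_autoscale_mode off
ceph osd pool set rbd pg_num 4096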

And from our data (Prometheus), the apply latency goes up to 9 seconds and
it mostly hits the 8 TB disks.
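
(We get those numbers from the mgr prometheus module; a query like the one
below is roughly what we look at. The Prometheus hostname and the metric name
ceph_osd_apply_latency_ms are from our setup and may differ in yours.)

# top 10 OSDs by apply latency, via the Prometheus HTTP API
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=topk(10, ceph_osd_apply_latency_ms)'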

I am currently running "time ceph daemon osd.ID
calc_objectstore_db_histogram" for all OSDs and get very mixed values, but
none of the timings is in the <1 minute range.
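
In case anyone wants to reproduce this, I'm simply looping over the local OSD
admin sockets, roughly like this (a sketch, assuming the default
/var/run/ceph socket paths and the default "ceph" cluster name):

for sock in /var/run/ceph/ceph-osd.*.asok; do
    id=$(basename "$sock" .asok); id=${id#ceph-osd.}   # extract the OSD id
    echo "== osd.$id =="
    time ceph daemon "osd.$id" calc_objectstore_db_histogram | grep num_pgmeta_omap
done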


On Wed, Feb 15, 2023 at 16:42 Victor Rodriguez <
vrodriguez@xxxxxxxxxxxxx> wrote:

> An update on this for the record:
>
> To fully solve this I've had to destroy each OSD and create it again,
> one by one. I could have done it one host at a time, but I preferred
> to be on the safe side just in case something else went wrong.
>
> The values for num_pgmeta_omap (which I still don't know the meaning of) on
> the new OSDs were similar to other clusters (I've seen from roughly 300000
> to 700000), so I believe the characteristics of the data in the
> cluster do not determine how big or small num_pgmeta_omap should be.
>
> One thing I've noticed is that "bad" or "damaged" OSDs (i.e. those
> showing high CPU usage and poor performance during the trim operation)
> took much more time to calculate their histogram, even if their
> num_pgmeta_omap was low:
>
> (bad OSD):
> # time ceph daemon osd.1 calc_objectstore_db_histogram | grep
> "num_pgmeta_omap"
>      "num_pgmeta_omap": 673208,
>
> real    1m14,549s
> user    0m0,075s
> sys    0m0,025s
>
> (good new OSD):
> #  time ceph daemon osd.1 calc_objectstore_db_histogram | grep
> "num_pgmeta_omap"
>      "num_pgmeta_omap": 434298,
>
> real    0m18,022s
> user    0m0,078s
> sys    0m0,023s
>
>
> Maybe it is worth checking that histogram from time to time as a way to
> measure OSD "health"?
>
> Again, thanks everyone.
>
>
>
> On 1/30/23 18:18, Victor Rodriguez wrote:
> >
> > On 1/30/23 15:15, Ana Aviles wrote:
> >> Hi,
> >>
> >> Josh already suggested it, but I will suggest it one more time. We had
> >> similar behaviour upgrading from Nautilus to Pacific. In our case,
> >> compacting the OSDs did the trick.
> >
> > Thanks for chiming in! Unfortunately, in my case neither an online
> > compaction (ceph tell osd.ID compact) nor an offline repair
> > (ceph-bluestore-tool repair --path /var/lib/ceph/osd/OSD_ID) helps.
> > Compactions do seem to compact some amount. I think the OSD log
> > dumps information about the size of RocksDB. It went from this:
> >
> > ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> >   L0      0/0    0.00 KB   0.0   0.0   0.0   0.0    3.9    3.9   0.0   1.0    0.0   62.2    64.81    61.59    89   0.728      0      0
> >   L1      3/0  132.84 MB   0.5   7.0   3.9   3.1    5.0    2.0   0.0   1.3   63.8   46.1   112.11   108.52    23   4.874    56M  7276K
> >   L2     12/0  690.99 MB   0.8   6.5   1.8   4.7    5.6    0.9   0.1   3.2   21.4   18.5   310.78   307.14    28  11.099   165M  3077K
> >   L3     54/0    3.37 GB   0.1   0.9   0.3   0.6    0.5   -0.1   0.0   1.6   35.9   20.2    24.84    24.49     4   6.210    24M    15M
> >  Sum     69/0    4.17 GB   0.0  14.4   6.0   8.3   15.1    6.7   0.1   3.8   28.7   30.1   512.54   501.74   144   3.559   246M    26M
> >  Int      0/0    0.00 KB   0.0   0.8   0.3   0.5    0.6    0.1   0.0  14.1   27.5   20.7    31.13    30.73     4   7.783    18M  4086K
> >
> > To this:
> >
> > ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> >   L0      2/0   72.42 MB   0.5   0.0   0.0   0.0    0.1    0.1   0.0   1.0    0.0   63.2     1.14     0.84     2   0.572      0      0
> >   L3     48/0    3.10 GB   0.1   0.0   0.0   0.0    0.0    0.0   0.0   0.0    0.0    0.0     0.00     0.00     0   0.000      0      0
> >  Sum     50/0    3.17 GB   0.0   0.0   0.0   0.0    0.1    0.1   0.0   1.0    0.0   63.2     1.14     0.84     2   0.572      0      0
> >  Int      0/0    0.00 KB   0.0   0.0   0.0   0.0    0.0    0.0   0.0   0.0    0.0    0.0     0.00     0.00     0   0.000      0      0
> >
> > Still, it feels "too big" compared to some other OSDs in other
> > similarly sized clusters, making me think that there's some kind of
> > "garbage" making the trim go crazy.
> >
> >
> >> For us there was no performance impact from running the compaction (ceph
> >> daemon osd.0 compact), although we ran them in batches and not all
> >> at once on all OSDs, just in case. Also, no need to restart OSDs for
> >> this operation.
> >
> > Yes, compacting had no perceived impact on client performance, just
> > some higher CPU usage for the OSD process.
> >
> >
> > Does anyone know, by any chance, the meaning of "num_pgmeta_omap" in the
> > ceph daemon osd.ID calc_objectstore_db_histogram output? As I
> > mentioned, the OSDs in this cluster have very different values in that
> > field, while all other clusters have much more similar values:
> >
> > osd.0:     "num_pgmeta_omap": 17526766,
> > osd.1:     "num_pgmeta_omap": 2653379,
> > osd.2:     "num_pgmeta_omap": 12358703,
> > osd.3:     "num_pgmeta_omap": 6404975,
> > osd.6:     "num_pgmeta_omap": 19845318,
> > osd.7:     "num_pgmeta_omap": 6043083,
> > osd.12:    "num_pgmeta_omap": 18666776,
> > osd.13:    "num_pgmeta_omap": 615846,
> > osd.14:    "num_pgmeta_omap": 13190188,
> >
> > Thanks a lot!
> >
> >
> >


-- 
The self-help group "UTF-8 problems" will, as an exception, meet in the
large hall this time.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



