Re: Very slow snaptrim operations blocking client I/O

Hi Victor,

out of curiosity, did you recently upgrade the cluster to Octopus? We and others have observed this behaviour when following one of the two possible routes for upgrading OSDs. There was a thread, "Octopus OSDs extremely slow during upgrade from mimic", which seems to have been lost in the recent mailing list outage. If it is relevant, I could copy the pieces I still have into this thread.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Victor Rodriguez <vrodriguez@xxxxxxxxxxxxx>
Sent: 29 January 2023 22:40:46
To: ceph-users@xxxxxxx
Subject:  Re: Very slow snaptrim operations blocking client I/O

Looks like this is going to take a few days. I hope I can keep enough
performance available for the VMs by tuning osd_snap_trim_sleep_ssd.
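
In case it is useful to anyone following along, this is roughly how I'm
adjusting the sleep at runtime (the value of 1 second is just what I'm
experimenting with, not a recommendation):

  # persist the setting for all OSDs, takes effect without restarts
  ceph config set osd osd_snap_trim_sleep_ssd 1

  # or inject it into the running daemons for a quick test
  ceph tell osd.* injectargs '--osd_snap_trim_sleep_ssd=1'

  # check what a given OSD is actually using
  ceph config show osd.0 osd_snap_trim_sleep_ssd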

I'm wondering: after that long snaptrim process you went through, was
your cluster stable again, and did snapshots/snaptrims work properly?


On 1/29/23 16:01, Matt Vandermeulen wrote:
> I should have explicitly stated that during the recovery it was still
> quite bumpy for customers.  Some snaptrims were very quick, some took
> what felt like a really long time.  This was, however, a cluster with a
> very large number of volumes and a long, long history of snapshots.
> I'm not sure how our case compares to a single large volume with a big
> snapshot.
>
>
>
> On 2023-01-28 20:45, Victor Rodriguez wrote:
>> On 1/29/23 00:50, Matt Vandermeulen wrote:
>>> I've observed a similar horror when upgrading a cluster from
>>> Luminous to Nautilus, which had the same effect of an overwhelming
>>> amount of snaptrim making the cluster unusable.
>>>
>>> In our case, we held its hand by setting all OSDs to have zero max
>>> trimming PGs, unsetting nosnaptrim, and then slowly enabling
>>> snaptrim a few OSDs at a time.  It was painful to babysit but it
>>> allowed the cluster to catch up without falling over.
>>
>>
>> That's an interesting approach! Thanks!
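>>
>> If I end up trying that, I guess the knobs involved would be something
>> like the following (I'm assuming the "max trimming PGs" you mention is
>> osd_pg_max_concurrent_snap_trims; please correct me if you meant a
>> different option):
>>
>>   # keep trims from starting anywhere, then lift the global flag
>>   ceph config set osd osd_pg_max_concurrent_snap_trims 0
>>   ceph osd unset nosnaptrim
>>
>>   # re-enable trimming on a few OSDs at a time
>>   ceph tell osd.0 injectargs '--osd_pg_max_concurrent_snap_trims=1'
>>   ceph tell osd.1 injectargs '--osd_pg_max_concurrent_snap_trims=1'
>>
>> If a value of 0 is not accepted on my release, I'd fall back to a very
>> high osd_snap_trim_sleep_ssd as the global throttle instead.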
>>
>> Preliminary tests suggest that running snaptrim on even a single PG
>> of a single OSD still makes the cluster barely usable. I have to
>> increase osd_snap_trim_sleep_ssd to ~1 for the cluster to remain
>> usable, at roughly a third of its normal performance. After a while a
>> few PGs got trimmed, and it feels like some of them are harder to trim
>> than others, as some need a higher osd_snap_trim_sleep_ssd value to
>> let the cluster perform.
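>>
>> To keep an eye on progress I'm basically watching which PGs are in the
>> snaptrim or snaptrim_wait states, with something like:
>>
>>   # which PGs are trimming / waiting to trim
>>   ceph pg ls snaptrim
>>   ceph pg ls snaptrim_wait
>>   # rough count of PGs still involved
>>   ceph pg dump pgs 2>/dev/null | grep -c snaptrim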
>>
>> I don't know how long this is going to take... Maybe recreating the
>> OSDs and dealing with the rebalance is a better option?
>>
>> There's something ugly going on here... I would really like to put my
>> finger on it.
>>
>>
>>> On 2023-01-28 19:43, Victor Rodriguez wrote:
>>>> After some investigation this is what I'm seeing:
>>>>
>>>> - OSD processes get stuck at 100% CPU or more if I ceph osd unset
>>>> nosnaptrim, and they stay at 100% CPU even if I ceph osd set
>>>> nosnaptrim again. They have stayed like that for at least 26 hours.
>>>> Some quick benchmarks don't show a reduction in cluster performance.
>>>>
>>>> - Restarting an OSD lowers its CPU usage to typical levels, as
>>>> expected, but it also usually brings some other OSD on a different
>>>> host back to typical levels.
>>>>
>>>> - All OSDs in this cluster take quite a while to start: between 35
>>>> and 70 seconds depending on the OSD. That's clearly much longer than
>>>> any OSD in any of my other clusters.
>>>>
>>>> - I believe the size of the RocksDB database is dumped in the OSD
>>>> log when an automatic compaction is triggered. The "sum" sizes for
>>>> these OSDs range between 2.5 and 5.1 GB. That's way bigger than
>>>> those in any other cluster I have.
>>>>
>>>> - ceph daemon osd.* calc_objectstore_db_histogram is giving values
>>>> for num_pgmeta_omap (I don't know what it is) that are way bigger
>>>> than on any of my other clusters for some OSDs (the exact commands
>>>> I'm using are at the end of this message). Also, the values are not
>>>> similar among the OSDs which hold the same PGs:
>>>>
>>>> osd.0:   "num_pgmeta_omap": 17526766,
>>>> osd.1:   "num_pgmeta_omap": 2653379,
>>>> osd.2:   "num_pgmeta_omap": 12358703,
>>>> osd.3:   "num_pgmeta_omap": 6404975,
>>>> osd.6:   "num_pgmeta_omap": 19845318,
>>>> osd.7:   "num_pgmeta_omap": 6043083,
>>>> osd.12:  "num_pgmeta_omap": 18666776,
>>>> osd.13:  "num_pgmeta_omap": 615846,
>>>> osd.14:  "num_pgmeta_omap": 13190188,
>>>>
>>>> - Compacting the OSDs barely reduces the RocksDB size and does not
>>>> reduce num_pgmeta_omap at all.
>>>>
>>>> - This is the only cluster I have where some RBD images are mounted
>>>> directly by clients, that is, they are not disks for QEMU/Proxmox
>>>> VMs. Maybe I have something misconfigured related to this? This
>>>> cluster is at least two and a half years old and never had this
>>>> issue with snaptrims before.
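>>>>
>>>> For reference, these are the kinds of commands I'm using for the
>>>> numbers above (the OSD id and path are just examples):
>>>>
>>>>   # per-OSD omap histogram via the admin socket
>>>>   ceph daemon osd.0 calc_objectstore_db_histogram
>>>>
>>>>   # trigger an online RocksDB compaction on one OSD
>>>>   ceph daemon osd.0 compact
>>>>
>>>>   # offline compaction with the OSD stopped, as an alternative
>>>>   ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-0 compact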
>>>>
>>>> Thanks in advance!
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



