Re: Migration Nautilus to Pacific : Very high latencies (EC profile)


 



We don't have any that wouldn't have the problem. That said, we've already
got a PR out for the 16.2.8 issue we encountered, so I would expect a
relatively quick update assuming no issues are found during testing.

On Tue, May 17, 2022 at 1:21 PM Wesley Dillingham <wes@xxxxxxxxxxxxxxxxx>
wrote:

> What was the largest cluster that you upgraded that didn't exhibit the new
> issue in 16.2.8? Thanks.
>
> Respectfully,
>
> *Wes Dillingham*
> wes@xxxxxxxxxxxxxxxxx
> LinkedIn <http://www.linkedin.com/in/wesleydillingham>
>
>
> On Tue, May 17, 2022 at 10:24 AM David Orman <ormandj@xxxxxxxxxxxx> wrote:
>
>> We had an issue with our original fix in PR 45963, which was resolved in
>> https://github.com/ceph/ceph/pull/46096. It includes the fix as well as
>> handling for upgraded clusters. This is in the 16.2.8 release. I'm not sure
>> if it will resolve your problem (or help mitigate it) but it would be worth
>> trying.
>>
>> Heads-up on 16.2.8, though: see the release thread; we ran into an issue
>> with it on our larger clusters: https://tracker.ceph.com/issues/55687
>>
>> On Tue, May 17, 2022 at 3:44 AM BEAUDICHON Hubert (Acoss) <
>> hubert.beaudichon@xxxxxxxx> wrote:
>>
>> > Hi Josh,
>> >
>> > I'm working with Stéphane and I'm the "ceph admin" (big words ^^) in our
>> > team.
>> > So, yes, as part of the upgrade we've done the offline repair to split the
>> > omap by pool.
>> > The quick fix is, as far as I know, still disabled in the default
>> > configuration.
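>> >
>> > (For reference, a quick way to confirm the current default, assuming the
>> > standard Pacific option name; this only reads the config database:
>> >
>> >     # "false" means the quick fix will not run automatically at OSD start
>> >     ceph config get osd bluestore_fsck_quick_fix_on_mount
>> > )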
>> >
>> > On the I/O and CPU load, we haven't seen a really big change between
>> > Nautilus and Pacific, just an increase in disk latency; in the end, the
>> > "ceph read operation" metric dropped from 20K to 5K or less.
>> >
>> > But yes, a lot of slow ops were showing up as time passed.
>> >
>> > At this point, we have taken one of our data nodes completely out and
>> > recreated 5 of its 8 OSD daemons from scratch (DB on SSD, data on spinning
>> > drive).
>> > The result seems very good at this moment (we're seeing better metrics
>> > than under Nautilus).
>> >
>> > Since the recreation, I have changed 3 parameters (a sketch of the
>> > equivalent commands follows the list):
>> > bdev_async_discard => osd : true
>> > bdev_enable_discard => osd : true
>> > bdev_aio_max_queue_depth => osd: 8192
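>> >
>> > (A rough sketch of how these can be applied via the config database; the
>> > restart target assumes a non-containerized, systemd-managed deployment:
>> >
>> >     ceph config set osd bdev_async_discard true
>> >     ceph config set osd bdev_enable_discard true
>> >     ceph config set osd bdev_aio_max_queue_depth 8192
>> >     # bdev-level options are read at startup, so restart the OSDs afterwards
>> >     systemctl restart ceph-osd.target
>> > )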
>> >
>> > The first two have been extremely helpful for our SSD pool: even with
>> > enterprise-grade SSDs, the "trim" seems to have rejuvenated our pool.
>> > The last one was set in response to messages from the newly created OSDs:
>> > "bdev(0x55588e220400 <path to block>) aio_submit retries XX"
>> > After changing it and restarting the OSD process, the messages were gone,
>> > and it seems to have had a beneficial effect on our data node.
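>> >
>> > (A quick way to double-check both points, assuming a systemd/package-based
>> > deployment and using osd.12 purely as an example id:
>> >
>> >     # value actually in effect on a running OSD, via its admin socket
>> >     ceph daemon osd.12 config get bdev_aio_max_queue_depth
>> >     # confirm the retries message no longer appears after the restart
>> >     journalctl -u ceph-osd@12 | grep 'aio_submit retries'
>> > )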
>> >
>> > I've seen that 16.2.8 came out yesterday, but I'm a little confused by:
>> > [Revert] bluestore: set upper and lower bounds on rocksdb omap iterators
>> > (pr#46092, Neha Ojha)
>> > bluestore: set upper and lower bounds on rocksdb omap iterators
>> > (pr#45963, Cory Snyder)
>> >
>> > (these two lines seem related to
>> > https://tracker.ceph.com/issues/55324).
>> >
>> > One step forward, one step backward?
>> >
>> > Hubert Beaudichon
>> >
>> >
>> > -----Original Message-----
>> > From: Josh Baergen <jbaergen@xxxxxxxxxxxxxxxx>
>> > Sent: Monday, May 16, 2022 16:56
>> > To: stéphane chalansonnet <schalans@xxxxxxxxx>
>> > Cc: ceph-users@xxxxxxx
>> > Subject: Re: Migration Nautilus to Pacific : Very high
>> > latencies (EC profile)
>> >
>> > Hi Stéphane,
>> >
>> > On Sat, May 14, 2022 at 4:27 AM stéphane chalansonnet
>> > <schalans@xxxxxxxxx> wrote:
>> > > After a successful update from Nautilus to Pacific on Centos8.5, we
>> > > observed some high latencies on our cluster.
>> >
>> > As a part of this upgrade, did you also migrate the OSDs to sharded
>> > rocksdb column families? This would have been done by setting bluestore's
>> > "quick fix on mount" setting to true or by issuing a "ceph-bluestore-tool
>> > repair" offline, perhaps in response to a BLUESTORE_NO_PER_POOL_OMAP
>> > warning post-upgrade.
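>> >
>> > (For concreteness, a sketch of the two usual ways that conversion gets
>> > triggered; the OSD id and data path below are only illustrative:
>> >
>> >     # either: let each OSD convert itself on its next start
>> >     ceph config set osd bluestore_fsck_quick_fix_on_mount true
>> >     # or: run the repair offline while the OSD is stopped
>> >     ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-12
>> > )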
>> >
>> > I ask because I'm wondering if you're hitting
>> > https://tracker.ceph.com/issues/55324, for which there is a fix coming in
>> > 16.2.8. If you inspect the nodes and disks involved in your EC pool, are
>> > you seeing high read or write I/O? High CPU usage?
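>> >
>> > (Nothing Ceph-specific is needed for that check; for example, on the OSD
>> > hosts backing the EC pool, something like:
>> >
>> >     # per-disk utilization and latency, refreshed every 5 seconds
>> >     iostat -x 5
>> >     # per-process CPU usage of the ceph-osd daemons
>> >     top -c
>> > )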
>> >
>> > Josh
>> >
>>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



