Re: Please discuss about Slow Peering

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



If using jumbo frames, also ensure that they're consistently enabled on all OS instances and network devices. 

> On May 16, 2024, at 09:30, Frank Schilder <frans@xxxxxx> wrote:
> 
> This is a long shot: if you are using octopus, you might be hit by this pglog-dup problem: https://docs.clyso.com/blog/osds-with-unlimited-ram-growth/. They don't mention slow peering explicitly in the blog, but its also a consequence because the up+acting OSDs need to go through the PG_log during peering.
> 
> We are also using octopus and I'm not sure if we have ever seen slow ops caused by peering alone. It usually happens when a disk cannot handle load under peering. We have, unfortunately, disks that show random latency spikes (firmware update pending). You can try to monitor OPS latencies for your drives when peering and look for something that sticks out. People on this list were reporting quite bad results for certain infamous NVMe brands. If you state your model numbers, someone else might recognize it.
> 
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> ________________________________________
> From: 서민우 <smw940219@xxxxxxxxx>
> Sent: Thursday, May 16, 2024 7:39 AM
> To: ceph-users@xxxxxxx
> Subject:  Please discuss about Slow Peering
> 
> Env:
> - OS: Ubuntu 20.04
> - Ceph Version: Octopus 15.0.0.1
> - OSD Disk: 2.9TB NVMe
> - BlockStorage (Replication 3)
> 
> Symptom:
> - Peering when OSD's node up is very slow. Peering speed varies from PG to
> PG, and some PG may even take 10 seconds. But, there is no log for 10
> seconds.
> - I checked the effect of client VM's. Actually, Slow queries of mysql
> occur at the same time.
> 
> There are Ceph OSD logs of both Best and Worst.
> 
> Best Peering Case (0.5 Seconds)
> 2024-04-11T15:32:44.693+0900 7f108b522700  1 osd.7 pg_epoch: 27368 pg[6.8]
> state<Start>: transitioning to Primary
> 2024-04-11T15:32:45.165+0900 7f108f52a700  1 osd.7 pg_epoch: 27371 pg[6.8]
> state<Started/Primary/Peering>: Peering, affected_by_map, going to Reset
> 2024-04-11T15:32:45.165+0900 7f108f52a700  1 osd.7 pg_epoch: 27371 pg[6.8]
> start_peering_interval up [7,6,11] -> [6,11], acting [7,6,11] -> [6,11],
> acting_primary 7 -> 6, up_primary 7 -> 6, role 0 -> -1, features acting
> 2024-04-11T15:32:45.165+0900 7f108f52a700  1 osd.7 pg_epoch: 27377 pg[6.8]
> state<Start>: transitioning to Primary
> 2024-04-11T15:32:45.165+0900 7f108f52a700  1 osd.7 pg_epoch: 27377 pg[6.8]
> start_peering_interval up [6,11] -> [7,6,11], acting [6,11] -> [7,6,11],
> acting_primary 6 -> 7, up_primary 6 -> 7, role -1 -> 0, features acting
> 
> Worst Peering Case (11.6 Seconds)
> 2024-04-11T15:32:45.169+0900 7f108b522700  1 osd.7 pg_epoch: 27377 pg[30.20]
> state<Start>: transitioning to Stray
> 2024-04-11T15:32:45.169+0900 7f108b522700  1 osd.7 pg_epoch: 27377 pg[30.20]
> start_peering_interval up [0,1] -> [0,7,1], acting [0,1] -> [0,7,1],
> acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting
> 2024-04-11T15:32:46.173+0900 7f108b522700  1 osd.7 pg_epoch: 27378 pg[30.20]
> state<Start>: transitioning to Stray
> 2024-04-11T15:32:46.173+0900 7f108b522700  1 osd.7 pg_epoch: 27378 pg[30.20]
> start_peering_interval up [0,7,1] -> [0,7,1], acting [0,7,1] -> [0,1],
> acting_primary 0 -> 0, up_primary 0 -> 0, role 1 -> -1, features acting
> 2024-04-11T15:32:57.794+0900 7f108b522700  1 osd.7 pg_epoch: 27390 pg[30.20]
> state<Start>: transitioning to Stray
> 2024-04-11T15:32:57.794+0900 7f108b522700  1 osd.7 pg_epoch: 27390 pg[30.20]
> start_peering_interval up [0,7,1] -> [0,7,1], acting [0,1] -> [0,7,1],
> acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting
> 
> *I wish to know about*
> - Why some PG's take 10 seconds until Peering finishes.
> - Why Ceph log is quiet during peering.
> - Is this symptom intended in Ceph.
> 
> *And please give some advice,*
> - Is there any way to improve peering speed?
> - Or, Is there a way to not affect the client when peering occurs?
> 
> P.S
> - I checked the symptoms in the following environments.
> -> Octopus Version, Reef Version, Cephadm, Ceph-Ansible
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx




[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Ceph Dev]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux