Hi everyone,
I'm managing a Ceph Quincy 17.2.5 cluster, which we plan to upgrade to
version 17.2.7, composed and configured as follows:
- 16 identical nodes: 256 GB RAM, 32 CPU cores (64 threads), 12 x
rotational HDD (block) + 4 x SATA SSD (RocksDB/WAL)
- Erasure Code 11+4 (Jerasure)
- 10 x S3 RGW on dedicated nodes (5 physical nodes)
- 3 x full SSD dedicated nodes for replicated S3 pools
- 2 x 10 Gbit Public network (LACP) + 2 x 10 Gbit cluster network (LACP)
- On all nodes: Ubuntu 20.04.4 LTS, fully updated
- Ceph deployed in containers on Docker CE (docker-ce
5:20.10.17~3-0~ubuntu-focal).
All pools, except the EC data pool, are configured with replication 3
and placed on dedicated SSD devices on the 3 dedicated SSD nodes to
guarantee the necessary performance.
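For completeness, a minimal sketch of how the pool placement and the EC
profile can be double-checked (the rule and profile names below are
placeholders for ours):
  # Pool -> crush rule / size / pg_num overview
  ceph osd pool ls detail
  # Device class used by the replicated SSD rule (placeholder rule name)
  ceph osd crush rule dump replicated_ssd
  # k/m values and plugin of the EC profile (placeholder profile name)
  ceph osd erasure-code-profile get ec-11-4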
We are encountering a constant but random problem with the availability
of bucket data, together with many slow_ops on the rotational OSDs (data
pool), not caused by saturation of the physical devices nor by a
shortage of CPU/RAM on any node.
The slow_ops are sometimes reported for requests hitting seemingly
random PGs.
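To back that up, a minimal sketch of the kind of checks involved (the
iostat interval is just an example, and it must be run locally on the
node hosting the suspect OSD):
  # Per-OSD commit/apply latency as seen by Ceph
  ceph osd perf
  # Raw device utilization and await times on a storage node
  iostat -x 5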
The cluster is currently in the recovery/rebalance phase, rebuilding 3
HDD OSDs that we had to recreate from scratch (all 3 HDDs are physically
in the same node).
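Since the rebuild itself competes with client I/O, we are also keeping
an eye on the scheduler settings; a minimal sketch, assuming the Quincy
default mClock scheduler is in use on our OSDs:
  # Which op queue / mClock profile the OSDs are using
  ceph config get osd osd_op_queue
  ceph config get osd osd_mclock_profile
  # Overall recovery / backfill progress
  ceph status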
Analyzing the events, we observed the following in the ops status of
some OSDs impacted by slow_ops:
20/05/2024 10:19:
"description": "osd_op(client.186021790.0:57620 29.258s0 29:1a5928ea:::31497ca8-e7d6-4e53-b150-91f9ac02ac67.246100.6329_storage%2ffirstMemories%2f1010543%2f:head [getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e481205)",
"initiated_at": "2024-05-16T15:03:06.015956+0000",
"age": 963.83795990199997,
"duration": 963.83819621700002,
"type_data": {
    "flag_point": "delayed",
    "client_info": {
        "client": "client.186021790",
        "client_addr": "10.151.11.11:0/3913909849",
        "tid": 57620
    },
    "events": [
        {
            "event": "initiated",
            "time": "2024-05-16T15:03:06.015956+0000",
            "duration": 0
        },
        {
            "event": "throttled",
            "time": "2024-05-16T15:03:06.015956+0000",
            "duration": 0
        },
        {
            "event": "header_read",
            "time": "2024-05-16T15:03:06.015954+0000",
            "duration": 4294967295.9999986
        },
        {
            "event": "all_read",
            "time": "2024-05-16T15:03:06.015961+0000",
            "duration": 7.2300000000000002e-06
        },
        {
            "event": "dispatched",
            "time": "2024-05-16T15:03:06.015962+0000",
            "duration": 1.063e-06
        },
        {
            "event": "queued_for_pg",
            "time": "2024-05-16T15:03:06.015966+0000",
            "duration": 3.332e-06
        },
        {
            "event": "reached_pg",
            "time": "2024-05-16T15:03:06.015992+0000",
            "duration": 2.6078e-05
        },
        {
            "event": "waiting for readable",
            "time": "2024-05-16T15:03:06.016002+0000",
            "duration": 1.0348e-05
        }
    ]
}
}
],
"num_ops": 6
}
###########
{
    "event": "reached_pg",
    "time": "2024-05-16T12:43:11.694220+0000",
    "duration": 480.97258642200001
},
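For reference, dumps like the ones above can be collected from the OSD
admin socket; a minimal sketch (osd.42 is a placeholder id, and the
commands must run where the admin socket is reachable, e.g. inside the
OSD container):
  # Ops currently in flight on the OSD
  ceph daemon osd.42 dump_ops_in_flight
  # Recently completed ops that were flagged as slow
  ceph daemon osd.42 dump_historic_slow_ops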
In essence, the operations remain suspended in this condition while
trying to access some PGs:
"event": "waiting for readable"
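As far as I understand, "waiting for readable" means the PG primary is
waiting for a valid read lease, which usually points at laggy heartbeats
between the acting OSDs or at clock problems rather than at disk load; a
minimal sketch of the checks I plan to run (the PG id is the one from
the dump above):
  # Clock synchronization between the monitors
  ceph time-sync-status
  # PGs currently in read-lease-related states
  ceph pg dump pgs_brief | grep -E 'laggy|wait'
  # Detailed state of the PG named in the slow op above
  ceph pg 29.258 query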
We intend to upgrade to 17.2.7 as soon as the recovery/rebalance is
completed.
Does anyone have any idea what checks I could do to analyze the problem
more thoroughly?
I can't tell whether the problem is related to the use of EC, or whether
the data written to some buckets is in a "non-standard" condition that
makes access wait for some reason.
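If it helps, these are the RGW-side checks I can run on a suspect bucket
(the bucket name is a placeholder, since I only see the bucket marker in
the dump above):
  # Size, object count and index shard layout of the bucket
  radosgw-admin bucket stats --bucket=<bucket-name>
  # Objects-per-shard fill level across all buckets
  radosgw-admin bucket limit check
  # Index consistency check (without --fix for now)
  radosgw-admin bucket check --bucket=<bucket-name>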
Thank you all for your kindness.
Regards,
Andrea Martra
--
Andrea Martra
+39 393 9048451
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx