Hi David,

There’s a good amount of backstory to our configuration, but I’m happy to report I found the source of my problem. We were applying some “optimizations” for our 10GbE network via sysctl, including disabling net.ipv4.tcp_sack. Re-enabling net.ipv4.tcp_sack resolved the issue.

Thanks,
Tom

From: David Turner [mailto:david.turner@xxxxxxxxxxxxxxxx]
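For reference, SACK can be re-enabled at runtime and persisted via a sysctl drop-in. This is a minimal sketch; the drop-in file name is an example, not something from the thread:

```shell
# Re-enable TCP selective acknowledgments immediately (requires root)
sysctl -w net.ipv4.tcp_sack=1

# Verify the current value
sysctl net.ipv4.tcp_sack

# Persist across reboots (file name is a placeholder; any /etc/sysctl.d/ drop-in works)
echo "net.ipv4.tcp_sack = 1" > /etc/sysctl.d/99-tcp-sack.conf
```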
Why are you running RAID6 OSDs? Ceph's strength is having many individual OSDs that can fail and be replaced. With your processors and RAM, you should be running each disk as its own OSD; that would make much better use of your dual-processor setup. Ceph is sized for roughly one core per OSD, so extra cores are more or less wasted in the storage node. With only two storage nodes, you can't take advantage of many of Ceph's benefits. Your setup looks much better suited to a Gluster cluster than a Ceph cluster. I don't know what your needs are, but that's what it looks like from here.
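On recent Ceph releases, the one-OSD-per-disk layout David describes is typically set up with ceph-volume. A sketch, with placeholder device names (the thread never names the devices):

```shell
# Create one OSD per physical disk instead of one per RAID6 set.
# /dev/sdb and /dev/sdc are placeholder device names.
ceph-volume lvm create --data /dev/sdb
ceph-volume lvm create --data /dev/sdc
```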
From: Helander, Thomas [Thomas.Helander@xxxxxxxxxxxxxx]

Hi David,

Thanks for the quick response and suggestion. I do have just a basic network config (one network, no VLANs) and am able to ping between the storage servers using both hostnames and IPs.

Thanks,
Tom

From: David Turner [mailto:david.turner@xxxxxxxxxxxxxxxx]
This could be explained by your OSDs not being able to communicate with each other. We have two VLANs between our storage nodes, the public and private (cluster) networks for Ceph to use. We once added two new nodes in a new rack on new switches, and as soon as we added a single OSD from one of them to the cluster, peering never finished and we had a lot of blocked requests that never went away.
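A few standard checks for the peering and blocked-request symptoms described above; these run against a live cluster, and the angle-bracket addresses are placeholders:

```shell
# Show blocked requests and unhealthy placement groups
ceph health detail

# List PGs stuck inactive (e.g. stuck in peering)
ceph pg dump_stuck inactive

# Confirm which public/cluster networks Ceph is configured to use
ceph --show-config | grep -E 'public_network|cluster_network'

# Basic reachability check between storage nodes on each network
ping -c 3 <other-node-public-ip>
ping -c 3 <other-node-cluster-ip>
```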
From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of Helander, Thomas [Thomas.Helander@xxxxxxxxxxxxxx]

Hi,

I’m running a three-server cluster (one monitor, two OSD servers) and am having a problem where, after adding the second OSD server, my read rate drops significantly and eventually the reads stall (writes are improved as expected). Attached is a log of the rados benchmarks for the two configurations, and below is my hardware configuration. I’m not using replicas (capacity is more important than uptime for our use case) and am using a single 10GbE network. The pool (rbd) is configured with 128 placement groups.

I’ve checked the CPU utilization of the ceph-osd processes and they all hover around 10% until the stall. After the stall, the CPU usage is 0% and the disks all show zero operations via iostat. Iperf reports 9.9Gb/s between the monitor and OSD servers. I’m looking for any advice/help on how to identify the source of this issue, as my attempts so far have proven fruitless.

Monitor server:
- 2x E5-2680V3
- 32GB DDR4
- 2x 4TB HDD in RAID1 on an Avago/LSI 3108 with CacheVault, configured as write-back
- 10GbE

OSD servers:
- 2x E5-2680V3
- 128GB DDR4
- 2x 8+2 RAID6 using 8TB SAS12 drives on an Avago/LSI 9380 controller with CacheVault, configured as write-back (each RAID6 is an OSD)
- 10GbE

Thanks,
Tom Helander
KLA-Tencor
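The attached benchmarks were presumably produced with rados bench. A typical invocation against the pool named in the message (the 60-second duration is an assumption, not stated in the thread):

```shell
# Write benchmark against the rbd pool; keep the objects for the read test
rados bench -p rbd 60 write --no-cleanup

# Sequential read benchmark over the objects just written
rados bench -p rbd 60 seq

# Remove the benchmark objects afterwards
rados -p rbd cleanup
```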
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com