From: Alex Gorbachev [mailto:ag@xxxxxxxxxxxxxxxxxxx]
On Thu, Jun 29, 2017 at 10:30 AM Nick Fisk <nick@xxxxxxxxxx> wrote:
Nick, are you using any network aggregation, LACP? Can you drop to the simplest possible configuration to make sure there's nothing on the network switch side?

Hi Alex,

The OSD nodes use an active/backup bond, and the active NIC on each one goes into the same switch. The NFS gateways are currently VMs, but again the hypervisor is using a NIC on the same switch. The cluster and public networks are VLANs on the same NIC, and I don’t get any alerts from monitoring/pacemaker to suggest there are comms issues. But I will look into getting some ping logs done to see if they reveal anything.

Do you check the ceph.log for any anomalies?

Yep, completely clean.

Any occurrences on OSD nodes, anything in their OSD logs or syslogs?

Not that I can see. I’m using cache tiering, so all IO travels through a few OSDs. I guess this might make it easier to try and see what’s going on. But the random nature of it means it’s not always easy to catch.

Any odd page cache settings on the clients?

The only customizations on the clients are readahead, some TCP tunings and min_free_kbytes.

Alex
--
Alex Gorbachev
Storcium
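
For the ping logs mentioned above, a minimal sketch of one way to collect and scan them. This is not something from the thread: the function names, the log file name and the 5 ms threshold are all hypothetical, and the parsing assumes Linux-style ping output with a `time=... ms` field on each reply line.

```python
import re
import subprocess
import sys
from datetime import datetime

# Matches the round-trip time in lines such as
# "64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=0.212 ms"
RTT_RE = re.compile(r"time[=<]([\d.]+)\s*ms")


def parse_rtt_ms(line):
    """Return the RTT in milliseconds, or None if the line has no RTT."""
    match = RTT_RE.search(line)
    return float(match.group(1)) if match else None


def slow_replies(lines, threshold_ms=5.0):
    """Return (line_number, rtt_ms) for replies slower than threshold_ms.

    The 5 ms default is an arbitrary illustration, not a recommendation.
    """
    slow = []
    for number, line in enumerate(lines, 1):
        rtt = parse_rtt_ms(line)
        if rtt is not None and rtt > threshold_ms:
            slow.append((number, rtt))
    return slow


def log_ping(host, count=60, logfile="ping.log"):
    """Run ping and append each output line to logfile with a timestamp."""
    proc = subprocess.Popen(
        ["ping", "-c", str(count), host],
        stdout=subprocess.PIPE,
        text=True,
    )
    with open(logfile, "a") as handle:
        for line in proc.stdout:
            handle.write(f"{datetime.now().isoformat()} {line}")
    return proc.wait()


if __name__ == "__main__":
    # Usage (hypothetical): python pinglog.py <osd-node-address>
    if len(sys.argv) > 1:
        log_ping(sys.argv[1])
```

Running this from an NFS gateway against each OSD node, then passing the logged lines to `slow_replies`, would show whether latency spikes line up with the random stalls. Dropped packets would show up as gaps in `icmp_seq` rather than as slow replies, so those need a separate check.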
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com