Re: [Gluster-devel] Gluster + Infiniband + 3.x kernel -> hard crash?

On Thu, Apr 7, 2016 at 2:02 AM, Glomski, Patrick <patrick.glomski@xxxxxxxxxxxxx> wrote:
We run Gluster 3.7 in a distributed-replicated setup. InfiniBand (TCP transport) links the Gluster peers together, and clients use the Ethernet interface.

This setup is stable running CentOS 6.x and using the most recent infiniband drivers provided by Mellanox. Uptime was 170 days when we took it down to wipe the systems and update to CentOS 7.

When the exact same setup is loaded onto a CentOS 7 machine (minor setup differences, but basically the same; setup is handled by ansible), the peers will (seemingly randomly) experience a hard crash and need to be power-cycled. There is no output on the screen and nothing in the logs. After rebooting, the peer reconnects, heals whatever files it missed, and everything is happy again. Maximum uptime for any given peer is 20 days. Thanks to the replication, clients maintain connectivity, but from a system administration perspective it's driving me crazy!

We run other storage servers with the same InfiniBand and CentOS 7 setup, except that they use NFS instead of Gluster. NFS shares are served through InfiniBand to some machines and Ethernet to others.

Is it possible that Gluster's (and only Gluster's) use of the InfiniBand kernel module to send TCP packets to its peers on a 3.x kernel is causing the system to hard crash?

Please note that Gluster is only a "userspace" consumer of InfiniBand. So, at least in "theory", it shouldn't result in a kernel panic. However, InfiniBand also allows userspace programs to do some things that can normally be done only by the kernel (like pinning pages to a specific address). I am not very familiar with the internals of InfiniBand and hence cannot authoritatively comment on whether a kernel panic is possible or impossible. Someone with an understanding of InfiniBand internals would be in a better position to comment on this.
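
For readers unfamiliar with what "pinning pages" looks like from userspace, here is a minimal sketch, assuming a Linux host with Python 3. It uses the standard mlock(2) call through ctypes to ask the kernel to keep one page resident in RAM. This is only an analogy: it is not how Gluster or the InfiniBand libraries register memory, and RDMA memory registration goes further than mlock does. The function name pin_one_page is hypothetical and chosen for illustration.

```python
import ctypes
import mmap
import os


def pin_one_page():
    """Try to lock one anonymous page into RAM; return True on success."""
    # Assumption: a Unix-like system where CDLL(None) exposes libc symbols.
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    libc.munlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

    # An anonymous mapping is page-aligned, so it covers exactly one page.
    page = mmap.mmap(-1, mmap.PAGESIZE)
    buf = (ctypes.c_char * mmap.PAGESIZE).from_buffer(page)
    addr = ctypes.addressof(buf)

    # mlock(2) returns 0 on success; it can fail if RLIMIT_MEMLOCK is too low.
    ok = libc.mlock(addr, mmap.PAGESIZE) == 0
    if ok:
        libc.munlock(addr, mmap.PAGESIZE)
    else:
        print("mlock failed:", os.strerror(ctypes.get_errno()))

    # Drop the ctypes view before closing, otherwise close() raises BufferError.
    del buf
    page.close()
    return ok


if __name__ == "__main__":
    print("page locked:", pin_one_page())
```

The point of the sketch is that the request originates in userspace but the work is done by the kernel, which is why a userspace-only consumer can still exercise kernel and driver code paths.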


Pretty specific problem and it doesn't make much sense to me, but that's sure where the evidence seems to point.

Anyone running CentOS 7 gluster arrays with infiniband out there to confirm that it works fine for them? Gluster devs care to chime in with a better theory? I'd love for this random crashing to stop.

Thanks,
Patrick

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel



--
Raghavendra G
