OOM issue in openstack Cinder - GlusterFS CI env

Deepak Shetty <dpkshetty@xxxxxxxxx> · Sat, 21 Feb 2015 22:00:18 +0530

Hi All,
  I am looking for some help from the glusterfs side with the Out of Memory (OOM) issue
we are seeing when using GlusterFS as a storage backend for openstack Cinder (block storage service).

    openstack has an upstream CI env managed by the openstack infra team, where we added a new job that creates a devstack env (openstack all-in-one for newbies) and configures the block service (Cinder) with GlusterFS as the storage backend. Once set up, the CI job runs openstack tempest (the integration test suite of openstack), which does API-level testing of the whole openstack env.

    As part of that testing, ~1.5 to 2 hours into the run, the tempest job (VM) hits OOM and the kernel oom-killer kills the process using the most memory to reduce memory pressure.

    The tempest job is based on CentOS 7 and uses glusterfs 3.6.2 as the storage backend for openstack Cinder.

    The openstack-dev thread at http://thread.gmane.org/gmane.comp.cloud.openstack.devel/46861 has details, including links to the logs captured from the tempest jobs.

Per the openstack infra folks, they have other jobs based on CentOS 7 that don't hit this issue. The only change we are adding is configuring cinder with glusterfs, and that is when this happens, so right now glusterfs is in the spotlight for causing this.

I am looking through the logs, trying to correlate syslog, dstat, and tempest info to figure out the state of the system and what was happening at and before the OOM to get any clues, but I wanted to start this thread on gluster-devel to see if others can pitch in with their ideas to accelerate the debugging and help find the root cause.
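
As a rough illustration of the syslog side of that correlation, here is a minimal Python sketch that pulls oom-killer events out of a list of syslog lines. It assumes the kernel messages use the usual "invoked oom-killer" and "Killed process <pid> (<name>)" wording, which can differ between kernel versions. The sample lines, the host name "node1" and the process name "exampled" are made up for illustration and are not taken from the actual job logs.

```python
import re

# Assumption: message wording as commonly seen in kernel logs of this era;
# adjust the patterns if the kernel in use prints something different.
INVOKED = re.compile(r"(?P<proc>\S+) invoked oom-killer")
KILLED = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
    r"(?: total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB)?"
)

# Hypothetical sample lines, NOT from the real CI logs.
SAMPLE_LINES = [
    "Feb 20 16:10:01 node1 kernel: exampled invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0",
    "Feb 20 16:10:01 node1 kernel: Out of memory: Kill process 1234 (exampled) score 500 or sacrifice child",
    "Feb 20 16:10:01 node1 kernel: Killed process 1234 (exampled) total-vm:2000000kB, anon-rss:1500000kB, file-rss:0kB",
]


def find_oom_events(lines):
    """Return a list of dicts describing oom-killer events found in lines."""
    events = []
    for lineno, line in enumerate(lines, start=1):
        killed = KILLED.search(line)
        if killed:
            rss = killed.group("anon_rss")
            events.append({
                "line": lineno,
                "type": "killed",
                "pid": int(killed.group("pid")),
                "name": killed.group("name"),
                # anon-rss is optional in the pattern, so it may be missing
                "anon_rss_kb": int(rss) if rss is not None else None,
            })
            continue
        invoked = INVOKED.search(line)
        if invoked:
            events.append({
                "line": lineno,
                "type": "invoked",
                "name": invoked.group("proc"),
            })
    return events


if __name__ == "__main__":
    for event in find_oom_events(SAMPLE_LINES):
        print(event)
```

The timestamps on the matching lines can then be lined up by hand against the dstat and tempest logs for the same window.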

Also pasting the relevant part of the chat log I had with the infra folks ...

Feb 20 21:46:28 <sdague>        deepakcs: you are at 70% wait time at the end of that

Feb 20 21:46:37 <sdague>        so your io system is just gone bonkers

Feb 20 21:47:14 <fungi> sdague: that would explain why the console login prompt and ssh daemon both stopped working, and the df loop i had going in my second ssh session hung around the same time
Feb 20 21:47:26 <sdague>        yeh, dstat even says it's skipping ticks there
Feb 20 21:47:29 <sdague>        for that reason

Feb 20 21:47:48 <fungi> likely complete i/o starvation for an extended period at around that timeframe
Feb 20 21:48:05 <fungi> that would also definitely cause jenkins to give up on the worker if it persisted for very long at all

Feb 20 21:48:09 <sdague>        yeh, cached memory is down to double digit M

Feb 20 21:49:21 <sdague>        deepakcs: so, honestly, what it means to me is that glusterfs may be too inefficient to function in this environment
Feb 20 21:49:34 <sdague>        because it's kind of a constrained environment
