cp taking 100% cpu and never terminating

Mickey Mazarick <mic@xxxxxxxxxxxxxxxxxx> · Sun, 11 May 2008 17:43:44 -0400

Something odd is happening when I run a shell script with cp commands in
it. It happens infrequently, but when it does I have to reboot the system
to get my processor back. I'm never tarring or copying more than 50 MB of
data.

It either hangs on a command like:
cp --reply=yes /usr/src/linux-${kernver}/.config /tftpboot/node_root/boot/config-${kernver}
or
tar cf - etc | gzip > /tftpboot/node_root/drbl_ssi/template_etc.tgz

When I run top I see:
 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
1603 root      20   0 54160 1616  508 R  100  0.0  33:02.72 cp
(100% cpu time)

I'm unable to kill that process in any way. I can kill the shell script
that spawned it, but the cp process keeps running.
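
For anyone trying to reproduce this, here is a minimal sketch of how the
state of the stuck process could be captured before rebooting. It assumes
a Linux /proc filesystem; proc_state is a hypothetical helper name, not
something from my script.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not part of the original script):
# print the scheduler state of a process, taken from the "State:" line
# of /proc/<pid>/status on Linux.
proc_state() {
    # $1 is the PID to inspect; prints e.g. "R (running)" or "D (disk sleep)"
    sed -n 's/^State:[[:space:]]*//p' "/proc/$1/status"
}

# Demo: show the state of the current shell.
proc_state "$$"
```

Running proc_state 1603 against the cp process shown in top should tell
whether it is spinning in state R or blocked in state D, which may help
narrow down where it is stuck.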

I see the below errors on the client:
2008-05-11 17:02:32 E [client-protocol.c:1238:client_flush] system1: : returning EBADFD
2008-05-11 17:02:32 E [afr.c:2623:afr_flush_cbk] afr1: (path=/scripts/gluster/afrheal.sh child=system1) op_ret=-1 op_errno=77
2008-05-11 17:02:32 W [client-protocol.c:1296:client_close] system1: no valid fd found, returning
2008-05-11 17:02:32 W [client-protocol.c:1296:client_close] system-ns1: no valid fd found, returning

My client and server spec files are identical to the ones at:
http://www.gluster.org/docs/index.php/Simple_High_Availability_Storage_with_GlusterFS_1.3

This happens equally over ib-verbs and tcp transports.
