Something odd happens when I run a shell script that contains cp and tar
commands. It happens infrequently, but when it does I have to reboot the
system to get my processor back. I'm never tarring or copying more than
50 MB of data. The script hangs on a command such as:
cp --reply=yes /usr/src/linux-${kernver}/.config \
    /tftpboot/node_root/boot/config-${kernver}
or
tar cf - etc | gzip > /tftpboot/node_root/drbl_ssi/template_etc.tgz
When I run top, I see:
 PID USER PR NI  VIRT  RES SHR S %CPU %MEM    TIME+ COMMAND
1603 root 20  0 54160 1616 508 R  100  0.0 33:02.72 cp
(100% CPU time)
I'm unable to kill that process in any way. I can kill the shell script
that spawned it, but the cp process keeps running.
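In case it helps with diagnosis, the state of the stuck process could be
captured with something like the sketch below. The helper name
proc_snapshot is made up for illustration; it only wraps ps with the
standard pid, stat, wchan and comm output fields, and it assumes a ps that
accepts -o and -p.

```shell
# Hypothetical helper (not part of GlusterFS): print the PID, process
# state, kernel wait channel and command name for a given PID.
proc_snapshot() {
    pid="$1"
    # A trailing "=" after each field name suppresses the header line.
    ps -o pid= -o stat= -o wchan= -o comm= -p "$pid"
}

# Example: proc_snapshot 1603   (the PID of the cp shown by top)
```

A process that stays in state R at 100% CPU and ignores signals is most
likely looping inside a system call, so the state and wait channel may help
narrow down where it is stuck.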
I see the following errors on the client:
2008-05-11 17:02:32 E [client-protocol.c:1238:client_flush] system1: : returning EBADFD
2008-05-11 17:02:32 E [afr.c:2623:afr_flush_cbk] afr1: (path=/scripts/gluster/afrheal.sh child=system1) op_ret=-1 op_errno=77
2008-05-11 17:02:32 W [client-protocol.c:1296:client_close] system1: no valid fd found, returning
2008-05-11 17:02:32 W [client-protocol.c:1296:client_close] system-ns1: no valid fd found, returning
My client and server spec files are identical to the ones at:
http://www.gluster.org/docs/index.php/Simple_High_Availability_Storage_with_GlusterFS_1.3
This happens with both the ib-verbs and tcp transports.