I'm still seeing the problem described below. It happens mostly over the
ibverbs transport and only very rarely over tcp. It is intermittent, but
occurs quite frequently over ibverbs. The hung process uses all the
processing power of a single core on the client machine. I can rerun
the command, but eventually the machine locks up with every processor
stuck in a cp or tar command. We see it on both kernel 2.6.18 and 2.6.24.
Has anyone there been able to replicate it?
Thanks!
-Mickey Mazarick
Mickey Mazarick wrote:
Something odd is happening when I run a shell script with cp commands
in it. This happens infrequently, but I have to reboot the system to
get my processor back. I'm never tarring or copying more than 50 MB
of data.
It either hangs on a command like:
cp --reply=yes /usr/src/linux-${kernver}/.config \
    /tftpboot/node_root/boot/config-${kernver}
or
tar cf - etc | gzip > /tftpboot/node_root/drbl_ssi/template_etc.tgz
When I run top I see:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1603 root      20   0 54160 1616  508 R  100  0.0  33:02.72 cp
(100% cpu time)
I'm unable to kill that process in any way, but I can kill the shell
script that spawned it. The cp process keeps running afterwards.
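For anyone trying to reproduce this, here is a minimal sketch for capturing the state of a process that refuses to die, so the output can be attached to a report. It assumes a Linux /proc filesystem; the function name snapshot_proc is hypothetical and not part of any GlusterFS tooling.

```shell
#!/bin/sh
# Sketch: snapshot what the kernel reports about a stuck process.
# Usage: snapshot_proc PID   (e.g. the PID that top shows at 100% CPU)
snapshot_proc() {
    pid=$1
    # Bail out if the process does not exist (or /proc is not mounted).
    [ -d "/proc/$pid" ] || { echo "no such process: $pid" >&2; return 1; }

    # Scheduler state: R = running, D = uninterruptible sleep, S = sleeping
    grep '^State:' "/proc/$pid/status"

    # Kernel symbol the task is waiting in, when the kernel exposes one
    if [ -r "/proc/$pid/wchan" ]; then
        printf 'wchan: %s\n' "$(cat "/proc/$pid/wchan")"
    fi

    # Open file descriptors, to see which files on the mount are involved
    ls -l "/proc/$pid/fd" 2>/dev/null
}
```

Running snapshot_proc 1603 while the copy is hung would show whether the process is spinning (state R) or blocked (state D), and which files it still has open.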
I see the below errors on the client:
2008-05-11 17:02:32 E [client-protocol.c:1238:client_flush] system1: : returning EBADFD
2008-05-11 17:02:32 E [afr.c:2623:afr_flush_cbk] afr1: (path=/scripts/gluster/afrheal.sh child=system1) op_ret=-1 op_errno=77
2008-05-11 17:02:32 W [client-protocol.c:1296:client_close] system1: no valid fd found, returning
2008-05-11 17:02:32 W [client-protocol.c:1296:client_close] system-ns1: no valid fd found, returning
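On Linux, errno 77 is EBADFD ("File descriptor in bad state"), so the op_errno=77 reported by afr_flush_cbk appears to be the same EBADFD that client_flush returns. A small sketch for pulling op_errno values out of a client log and naming them follows; the function names are hypothetical and the table covers only a few Linux errno values.

```shell
#!/bin/sh
# Sketch: extract op_errno values from GlusterFS client log text and name them.
# The table is deliberately partial; values are the Linux ones.
errno_name() {
    case "$1" in
        5)   echo EIO ;;
        9)   echo EBADF ;;
        77)  echo EBADFD ;;
        107) echo ENOTCONN ;;
        *)   echo "errno-$1" ;;
    esac
}

# Read log text on stdin; print each distinct op_errno with its name.
decode_op_errno() {
    sed -n 's/.*op_errno=\([0-9][0-9]*\).*/\1/p' | sort -un |
    while read -r n; do
        printf '%s %s\n' "$n" "$(errno_name "$n")"
    done
}
```

Feeding the afr_flush_cbk line above through decode_op_errno prints "77 EBADFD".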
My client and server specs are identical to:
http://www.gluster.org/docs/index.php/Simple_High_Availability_Storage_with_GlusterFS_1.3
This happens equally over ib-verbs and tcp transports.