Re: Re: cp taking 100% cpu and never terminating

"Raghavendra G" <raghavendra.hg@xxxxxxxxx> · Mon, 16 Jun 2008 07:51:05 +0400

Hi Mickey,
Is it possible to attach to the glusterfs process using gdb while cp is hung
and get a backtrace?
# ps aux | grep -i glusterfs
# gdb -p <glusterfs-process-id>
and in gdb,
(gdb) bt

It would also be helpful if you could get a backtrace of cp.
# ps aux | grep -i cp
# gdb -p <cp-process-id>
(gdb) bt
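
If an interactive gdb session is awkward while the machine is under load, the
same backtraces can probably be captured non-interactively. This is only a
minimal sketch: the function name dump_bt, the GDB override variable and the
bt-<pid>.txt file name are made up for illustration, and it assumes gdb is
installed and that you run it as root. It uses gdb's -batch and -ex options
and "thread apply all bt" so that every thread is covered, not just the
current one.

```shell
#!/bin/sh
# Sketch: save a backtrace of every thread of one process to a file.
# dump_bt, GDB and the bt-<pid>.txt name are illustrative, not part of glusterfs.

dump_bt() {
    pid=$1
    outdir=${2:-.}                      # optional output directory, default: cwd
    # -batch makes gdb run the given commands and exit (detaching from the
    # process); -ex supplies the command to run after attaching with -p.
    "${GDB:-gdb}" -p "$pid" -batch -ex "thread apply all bt" \
        > "$outdir/bt-$pid.txt" 2>&1
}

# Example use (uncomment to run):
#   for pid in $(pgrep -x glusterfs); do dump_bt "$pid"; done
#   for pid in $(pgrep -x cp);        do dump_bt "$pid"; done
```

The resulting bt-<pid>.txt files can then be attached to a reply on the list.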

Also, I am curious to know what the --reply option to cp does.

regards,

On Sun, Jun 15, 2008 at 12:12 AM, Mickey Mazarick <mic@xxxxxxxxxxxxxxxxxx>
wrote:

> I'm still seeing the problem described below. It happens mostly over the
> ibverbs transport and very infrequently over tcp. The problem is
> intermittent, but happens quite frequently over ibverbs. It uses all the
> processing power of a single core on the client machine. I can repeat the
> command, but eventually the machine locks up with all processors busy
> running a cp or tar command. We see it on both kernel 2.6.18 and 2.6.24.
> Has anyone there been able to replicate it?
>
> Thanks!
> -Mickey Mazarick
>
>
>
> Mickey Mazarick wrote:
>
>> Something odd is happening when I run a shell script with cp commands in
>> it. This happens infrequently, but I have to reboot the system to get my
>> processor back. I'm never tarring or copying more than 50 MB of data.
>>
>> It either hangs on a command like:
>> cp --reply=yes /usr/src/linux-${kernver}/.config
>> /tftpboot/node_root/boot/config-${kernver}
>> or
>> tar cf - etc | gzip > /tftpboot/node_root/drbl_ssi/template_etc.tgz
>>
>> When I run top I see:
>>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>> 1603 root      20   0 54160 1616  508 R  100  0.0  33:02.72 cp
>> (100% cpu time)
>>
>> I'm unable to kill that process in any way, but I can kill the shell
>> script that spawned it. The cp command keeps running.
>>
>> I see the below errors on the client:
>> 2008-05-11 17:02:32 E [client-protocol.c:1238:client_flush] system1: :
>> returning EBADFD
>> 2008-05-11 17:02:32 E [afr.c:2623:afr_flush_cbk] afr1:
>> (path=/scripts/gluster/afrheal.sh child=system1) op_ret=-1 op_errno=77
>> 2008-05-11 17:02:32 W [client-protocol.c:1296:client_close] system1: no
>> valid fd found, returning
>> 2008-05-11 17:02:32 W [client-protocol.c:1296:client_close] system-ns1: no
>> valid fd found, returning
>>
>> My client and server specs are identical to:
>>
>> http://www.gluster.org/docs/index.php/Simple_High_Availability_Storage_with_GlusterFS_1.3
>>
>> This happens equally over ib-verbs and tcp transports.
>>
>>
>

-- 
Raghavendra G

A centipede was happy quite, until a toad in fun,
Said, "Pray, which leg comes after which?",
This raised his doubts to such a pitch,
He fell flat into the ditch,
Not knowing how to run.
-Anonymous