On Mon, 2014-05-12 at 13:43 -0400, Dave Jones wrote:
> heh, I knew I'd forget something. Hopefully "cc'ing the trinity list"
> was the only thing this time around..


Hi Dave,

I gave this a spin on a system of mine here.

I'm consistently ending up with the watchdog spinning at 100% CPU.

strace shows it spinning on calls to kill():

kill(17833, SIG_0)                      = -1 ESRCH (No such process)
kill(17833, SIG_0)                      = -1 ESRCH (No such process)
kill(17833, SIG_0)                      = -1 ESRCH (No such process)
kill(17833, SIG_0)                      = -1 ESRCH (No such process)
...

Which gdb agrees with:

(gdb) bt
#0  0x1001c790 in kill@plt ()
#1  0x10001984 in __check_main () at watchdog.c:158
#2  0x10010510 in check_main_alive () at watchdog.c:185
#3  watchdog () at watchdog.c:407
#4  init_watchdog () at watchdog.c:484
#5  0x10001d04 in main (argc=1, argv=<optimized out>) at trinity.c:128


It's looping around:

183			while (shm->mainpid != 0) {
(gdb) n
185				ret = __check_main();
(gdb)
186				if (ret == TRUE) {
(gdb)
183			while (shm->mainpid != 0) {
(gdb)
185				ret = __check_main();
(gdb)
186				if (ret == TRUE) {
(gdb)
183			while (shm->mainpid != 0) {
(gdb)
185				ret = __check_main();
(gdb)
186				if (ret == TRUE) {


shm->mainpid is 17833, which agrees with strace, and that process is indeed
no longer running.

We are bailing out of __check_main() before clearing shm->mainpid because we
see that we are already exiting.

        if (ret == -1) {
                /* Are we already exiting ? */
                if (shm->exit_reason != STILL_RUNNING)
                        return FALSE;

                /* No. Check what happened. */
                if (errno == ESRCH) {


161			if (shm->exit_reason != STILL_RUNNING)
(gdb) print shm->exit_reason
$6 = EXIT_FORK_FAILURE

It looks like the only other place shm->mainpid is written is in
trinity.c:main(), which is dead. So we are stuck forever as far as I can tell.


The last thing in trinity.log is:

[main] couldn't create child! (Cannot allocate memory)

From main.c:69:

	output(0, "couldn't create child! (%s)\n", strerror(errno));
	shm->exit_reason = EXIT_FORK_FAILURE;
	exit(EXIT_FAILURE);


So we exited directly and didn't let the code in main() clear shm->mainpid.
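
To make the skipped step concrete, here is a minimal sketch of a failure
path that records the exit reason and also clears mainpid before exiting.
The struct, the enum and the helper name mark_main_exiting() are stand-ins
made up for illustration, not trinity's real definitions, and I have not
checked whether clearing mainpid this early is safe while children are
still running.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, cut-down model of the shared state discussed above. */
enum exit_reasons { STILL_RUNNING = 0, EXIT_FORK_FAILURE };

struct shm_s {
	pid_t mainpid;                  /* the watchdog loops while this is non-zero */
	enum exit_reasons exit_reason;
};

/*
 * Hypothetical helper: record why main is exiting and clear mainpid in one
 * place, so a loop of the form "while (shm->mainpid != 0)" can terminate.
 * The caller is expected to call exit() right after this returns.
 */
static void mark_main_exiting(struct shm_s *shm, enum exit_reasons reason)
{
	assert(shm != NULL);
	shm->exit_reason = reason;
	shm->mainpid = 0;
}
```

With something like this, the fork failure path would call
mark_main_exiting(shm, EXIT_FORK_FAILURE) just before exit(EXIT_FAILURE).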

Not sure what the correct fix is. We could drop the check of shm->exit_reason
in __check_main(), but presumably that is there for a good reason.
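
A less drastic variant might be to keep the exit_reason check but look at
ESRCH first, so a vanished main always gets its pid cleared. The following
is only a sketch under that assumption: main_is_gone() and
spawn_and_reap_child() are hypothetical names, the struct is a cut-down
model rather than trinity's real shm layout, and the logging and other
state updates done by the real __check_main() are left out.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical, cut-down model of the shared state discussed above. */
enum exit_reasons { STILL_RUNNING = 0, EXIT_FORK_FAILURE };

struct shm_s {
	pid_t mainpid;
	enum exit_reasons exit_reason;
};

/*
 * Returns true once main is known to be gone. Unlike the code quoted
 * above, ESRCH is handled before (and regardless of) exit_reason, so
 * mainpid is cleared even when main exited with a reason already set.
 */
static bool main_is_gone(struct shm_s *shm)
{
	assert(shm != NULL);

	if (shm->mainpid <= 0)
		return true;

	/* Signal 0 delivers nothing; it only checks whether the pid exists. */
	if (kill(shm->mainpid, 0) == -1 && errno == ESRCH) {
		shm->mainpid = 0;
		return true;
	}
	return false;
}

/* Demo helper: returns the pid of a child that has exited and been reaped,
 * or -1 if fork() failed. */
static pid_t spawn_and_reap_child(void)
{
	pid_t pid = fork();

	if (pid == 0)
		_exit(0);
	if (pid > 0)
		waitpid(pid, NULL, 0);
	return pid;
}
```

Feeding main_is_gone() the pid from spawn_and_reap_child() with exit_reason
set to EXIT_FORK_FAILURE models the situation in this report: the pid is
cleared and a "while (shm->mainpid != 0)" loop would end instead of
spinning. Pid reuse could still fool the kill() probe in either version.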

cheers

