Am 03.10.2013 21:33, schrieb Toralf Förster: > On 10/03/2013 09:20 PM, Richard Weinberger wrote: >> Am 03.10.2013 21:16, schrieb Toralf Förster: >>> On 10/03/2013 09:04 PM, Richard Weinberger wrote: >>>> Am 03.10.2013 20:54, schrieb Toralf Förster: >>>>> On 10/02/2013 09:55 PM, Richard Weinberger wrote: >>>>>> On Wed, Oct 2, 2013 at 8:30 PM, Toralf Förster <toralf.foerster@xxxxxx> wrote: >>>>>>> Running trinity (1 process, no victim files, just "$>trinity -C1) for a longer time >>>>>>> within a 32 bit user mode linux image with a recent git kernel (host: 3.11.3 guest 3.12-rc3-g...) >>>>>>> yields into this konsole message : >>>>>>> >>>>>>> * Starting local >>>>>>> net.core.warnings = 0 [ ok ] >>>>>>> BUG: soft lockup - CPU#0 stuck for 23s! [trinity-child0:2031] >>>>>>> >>>>>>> >>>>>>> and at the host t1 of the "linux"-processes eats all CPU cycles at 1 CPU core. >>>>>>> 2 subsequent made back traces made with >>>>>>> >>>>>>> $> sudo gdb /home/tfoerste/devel/linux/linux 28144 -n -batch -ex bt >>>>>>> >>>>>>> shows nearly a similar position around __get_user_pages() - both are attached. >>>>>>> >>>>>>> I'm not surprised that trinity harms a systems - I'm just wondering whether this particular picture is >>>>>>> expected or if it points to an issue. >>>>>>> >>>>>>> >>>>>>> FWIW the last lines of trinity log were : >>>>>>> >>>>>>> >>>>>>> [2031] [94] setsid() = 2031 >>>>>>> [2031] [95] setresgid(rgid=0xffff33e3, egid=0xffffff93, sgid=0x22000040) = -1 (Operation not permitted) >>>>>>> [2031] [96] vmsplice(fd=5, iov=0x85501e0, nr_segs=300, flags=9) = 0x3000 >>>>>>> [2031] [97] setresuid(ruid=0x80549193, euid=0xc61041e0, suid=0xff19b6fa) = -1 (Operation not permitted) >>>>>>> [2031] [98] setpriority(which=0xff010000, who=0xf3737373, niceval=0x8088960c) = -1 (Invalid argument) >>>>>>> [2031] [99] socketcall(call=1, args=0x8550200) = -1 (Address family not supported by protocol) >>>>>>> [2031] [100] access(filename="�", mode=2017) = -1 (Invalid argument) >>>>>>> [2031] [101] getgroups(gidsetsize=0, grouplist=0x80d0000[page_rand]) = 3 >>>>>>> [2031] [102] msync(start=0xc0100220, len=0, flags=3) = -1 (Invalid argument) >>>>>>> [2031] [103] sigpending(set=0x40025000) = 0 >>>>>>> [2031] [104] signalfd4(ufd=383, user_mask=1, sizemask=0xa4200000, flags=0x80800) = -1 (Invalid argument) >>>>>>> [2031] [105] sendfile(out_fd=383, in_fd=382, offset=0, count=4096) = -1 (Invalid argument) >>>>>>> [2031] [106] fanotify_mark(fanotify_fd=382, flags=5, mask=0x8000023, dfd=382, pathname="/proc/1092/task/1092/fdinfo/68") = -1 (Invalid argument) >>>>>>> [2031] [107] wait4(upid=1, stat_addr=4, options=0xd761979b, ru=8) = -1 (Invalid argument) >>>>>>> [2031] [108] sigpending(set=0x80ca000[page_zeros]) = 0 >>>>>>> [2031] [109] setresuid(ruid=0xefffd6fc, euid=0x1bf4c92f, suid=0xffff2e33) = -1 (Operation not permitted) >>>>>>> [2031] [110] munlock(addr=0x40025000, len=34) = 0 >>>>>>> [2031] [111] timer_delete(timer_id=0xffffffdc) = -1 (Invalid argument) >>>>>>> [2031] [112] sched_get_priority_max(policy=0x10000040) = -1 (Invalid argument) >>>>>>> [2031] [113] syslog(type=0xc1000000, buf=1, len=0x82a5) = -1 (Operation not permitted) >>>>>>> [2031] [114] setpriority(which=0xc4c806c6, who=0xffffff01, niceval=0xffff0682) = -1 (Invalid argument) >>>>>>> [2031] [115] getgroups16(gidsetsize=0xfffe, grouplist=1) = -1 (Bad address) >>>>>>> [2031] [116] rename(oldname=4, newname=8) = -1 (Bad address) >>>>>>> [2031] [117] inotify_init() = 654 >>>>>>> [2031] [118] getgid() = 100 >>>>>>> [2031] [119] fstatat64(dfd=382, filename="/sys/devices/virtual/net/sit0/duplex", statbuf=0, flag=0xb545d727) = -1 (Invalid argument) >>>>>>> [2031] [120] unlinkat(dfd=382, pathname="/proc/sys/net/ipv4/neigh/default/retrans_time", flag=0xc00ef76) = -1 (Invalid argument) >>>>>>> [2031] [121] timerfd_create(clockid=0, flags=0) = 655 >>>>>>> [2031] [122] munlock(addr=4, len=0x3fff) = -1 (Cannot allocate memory) >>>>>>> [2031] [123] fremovexattr(fd=382, name=0) = -1 (Bad address) >>>>>>> [2031] [124] sched_get_priority_min(policy=0xff58bfef) = -1 (Invalid argument) >>>>>>> [2031] [125] mq_timedreceive(mqdes=397, u_msg_ptr=4, msg_len=5245, u_msg_prio=0xc0100220, u_abs_timeout=0xc0100220) = -1 (Bad address) >>>>>>> [2031] [126] chdir(filename="/proc/116/net/ptype") = -1 (Not a directory) >>>>>>> [2031] [127] ssetmask(newmask=0x88000092) = 0 >>>>>>> [2031] [128] statfs(pathname="/proc/6/mounts", buf=0) = -1 (Bad address) >>>>>>> [2031] [129] fchown16(fd=397, user=104, group=0x94100000) = -1 (Operation not permitted) >>>>>>> [2031] [130] fchdir(fd=397) = -1 (Not a directory) >>>>>>> [2031] [131] mkdir(pathname="/proc/1092/task/1092/fdinfo/316", mode=525) = -1 (File exists) >>>>>>> [2031] [132] fsetxattr(fd=386, name=0x856f158, value=0x8571160, size=0, flags=0) = -1 (Numerical result out of range) >>>>>>> [2031] [133] io_setup(nr_events=4095, ctxp=0x40266000) ^CKilled by signal 2. >>>>>> >>>>>> Reading your gdb backtraces show that schedule_timeout() got called >>>>>> with a negative value. >>>>>> Looks like an integer overflow. >>>>>> The soft-lockup might also origin from that (very big integer which >>>>>> did not overflow jet) >>>>>> >>>>> >>>>> If the culprit is solved by this patch I'd like to send it out. But I'm >>>>> unsure whether it catches the culprit or if it just covers the root cause. >>>> >>>> I fear your Patch will not fix the issue. >>>> >>>> Does the issue only trigger on 32bit UMLs? >>> No diea, I do only have a 32 bit system here (both host and client). >>> >>>> How long does it take till trinity hits it? >>> a command like >>> >>> $> ssh tfoerste@trinity "rm -rf t3; mkdir t3; cd t3; trinity -C4" >>> >>> usually needs 10 till 15 min to trigger the issue. With just 1 trinity >>> task (-C1) however it needs often a hour or more. >> >> That's good. :-) >> You can place some printk()s into balance_dirty_pages() and observe the values >> of period, max_pause, min_pause, etc... >> Maybe this will give us a clue. > > Will do it again. I did it before already but period was the only thing > which becames otherwise negative - > and because pages_dirtied was often in the range of 20 - 30 and perioad > around -3000 Did you print it before or after if (current->dirty_paused_when) pause -= now - current->dirty_paused_when; I'm also wondering why if (pause < min_pause) { did not trigger or did it? > I just wondered if HZ is casts as a short int in the calculation ? > BTW printk doesn't showed anything in the syslog (ok, I used > printk(KERN_ERR ...) syslog or dmesg? Thanks, //richard -- To unsubscribe from this list: send the line "unsubscribe trinity" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html