Re: Oopses and invalid addresses under Hatari

Michael Schmitz <schmitzmic@xxxxxxxxx> · Fri, 12 Apr 2019 18:52:00 +1200

Hi Eero,

Am 12.04.2019 um 11:03 schrieb Eero Tamminen:
Hi,

On 4/11/19 5:10 AM, Michael Schmitz wrote:
[...]
OK, I decided to bite the bullet and modify bus_error030() to allow
falling through to do_page_fault if an invalid page read happens while
page faults are disabled.
[...]
Resulting syslog:
[...]

Summary:

* Stack is always shown, but call trace following it is always empty.
  Is call trace explicitly disabled for m68k task list?

No, must be a 030 thing. The output on 060 does show a call trace (at 
least for normal processes).

* Following threads didn't fault:
----------------------------------------------------------------
[31197.540000]   task                PC stack   pid father
[31197.550000] init            S    0     1      0 0x00000000
[31197.620000] kthreadd        S    0     2      0 0x00000000
[31198.020000] ksoftirqd/0     R  running task        0 7      2
[31198.080000] kdevtmpfs       S    0     8      2 0x00000000
[31198.280000] oom_reaper      S    0    12      2 0x00000000
[31198.760000] kswapd0         S    0   200      2 0x00000000
[31198.950000] jbd2/hda3-8     S    0   794      2 0x00000000
[31199.120000] portmap         S    0   982      1 0x00000000
[31199.180000] syslogd         S    0  1070      1 0x00000000
[31199.220000] klogd           R  running task        0 1076      1
[31199.280000] gpm             S    0  1086      1 0x00000000
[31199.350000] inetd           S    0  1091      1 0x00000000
[31199.410000] lpd             S    0  1095      1 0x00000000
[31199.470000] sshd            S    0  1101      1 0x00000000
[31199.520000] rpc.statd       S    0  1106      1 0x00000000
[31199.560000] atd             S    0  1111      1 0x00000000
[31199.610000] cron            S    0  1114      1 0x00000000
[31199.670000] getty           S    0  1120      1 0x00000000
[31199.720000] getty           S    0  1121      1 0x00000000
[31199.790000] getty           S    0  1122      1 0x00000000
[31199.850000] getty           S    0  1123      1 0x00000000
[31199.920000] getty           S    0  1124      1 0x00000000
[31199.980000] getty           S    0  1125      1 0x00000000
[31200.160000] sshd            S    0  1304   1101 0x00000000
[31200.220000] sshd            S    0  1306   1304 0x00000000
[31200.270000] bash            S    0  1307   1306 0x00000000
[31200.330000] bash            R  running task        0  1308 1307

User space processes.

----------------------------------------------------------------

* Following threads did fault:
----------------------------------------------------------------
[31197.680000] kworker/0:0     I    0     3      2 0x00000000
[31197.750000] Workqueue:    (null) (events)
[31197.800000] kworker/0:0H    I    0     4      2 0x00000000
[31197.930000] mm_percpu_wq    I    0     6      2 0x00000000
[31198.160000] kworker/u2:1    I    0     9      2 0x00000000
[31198.230000] Workqueue:    (null) (events_unbound)
[31198.330000] kworker/0:1     I    0    13      2 0x00000000
[31198.450000] writeback       I    0    95      2 0x00000000
[31198.510000] Workqueue:    (null) (flush-3:0)
[31198.570000] crypto          I    0    97      2 0x00000000
[31198.660000] kblockd         I    0    99      2 0x00000000
[31198.720000] Workqueue:    (null) (kblockd)
[31198.820000] kworker/0:1H    I    0   761      2 0x00000000
[31198.890000] Workqueue:    (null) (kblockd)
[31199.010000] ext4-rsv-conver I    0   795      2 0x00000000
[31200.050000] kworker/u2:0    I    0  1272      2 0x00000000
[31200.390000] kworker/u2:2    I    0  1310      2 0x00000000
[31200.460000] Workqueue:    (null) (events_unbound)
----------------------------------------------------------------

Kernel tasks.

=> *All* of them are kernel threads (kthreadd children) in 'I' state
   ('I' = interrupt context?)

Unlikely - may be interruptible sleep.

* *There are always two faults*

Looking at the kthread_probe_data() code:
----------------------------------------------------------------
void *kthread_probe_data(struct task_struct *task)
{
        struct kthread *kthread = to_kthread(task);
        void *data = NULL;

        probe_kernel_read(&data, &kthread->data, sizeof(data));
        return data;
}
----------------------------------------------------------------

And void print_worker_info() code:
----------------------------------------------------------------
void print_worker_info(const char *log_lvl, struct task_struct *task)
{
        work_func_t *fn = NULL;
        char name[WQ_NAME_LEN] = { };
        char desc[WORKER_DESC_LEN] = { };
        struct pool_workqueue *pwq = NULL;
        struct workqueue_struct *wq = NULL;
        struct worker *worker;
...
        worker = kthread_probe_data(task);
...
        probe_kernel_read(&fn, &worker->current_func, sizeof(fn));
        probe_kernel_read(&pwq, &worker->current_pwq, sizeof(pwq));
        probe_kernel_read(&wq, &pwq->wq, sizeof(wq));
        probe_kernel_read(name, wq->name, sizeof(name) - 1);
        probe_kernel_read(desc, worker->desc, sizeof(desc) - 1);
...
        if (fn || name[0] || desc[0]) {
                printk("%sWorkqueue: %s %pf", log_lvl, name, fn);
                if (strcmp(name, desc))
                        pr_cont(" (%s)", desc);
                pr_cont("\n");
        }
}
----------------------------------------------------------------

From the task output, we know that faulting items with workqueue
have empty "name" & "desc", but non-NULL "current_func".

Everything is initialized to NULLs, so if data fetch fails,
values are NULLs.

From backtraces, we know that at least one of the 2 faults is
from probe_kernel_read().

If variable would be NULL:
* task/kthread -> lots of faults
* worker -> 3 faults, two in probe, and one in above function
* current_func / fn -> no issues
* current_pwq / pwq -> 2 faults, one from probe
* wq -> 1 fault in above function
* name/desc -> can't be NULL

=> I think the problem is that 'I' kthreads have NULL "current_pwq".

Confirmed by the patch you attached so your analysis seems right.

Ones with workqueues just have "current_func" set, others don't.

Why that would affect / fault only on 030?

The 040/060 bus error trap may not force a bus error bypassing 
do_page_fault() in the same way the 030 handler does. I haven't yet 
looked at the 040/060 handler. Did I mention I really don't do memory 
management stuff?

The real question is - why are these fields NULL in the first place? And 
are they NULL only on 030?

Attached patch fixes the Oops for me.

I guess __probe_kernel_read() was meant to make checking for NULL 
pointers obsolete in these functions (where fields may well be NULL 
depending on context). I don't think your patch would be accepted, when 
a fix in the 030 fault handler does the job just as well.

Cheers,

	Michael Schmitz

    - Eero