Re: [PATCH v2 1/1] drm/i915/reset: Fix error_state_read ptr + offset use

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 3/1/2022 1:37 PM, John Harrison wrote:
On 2/25/2022 22:27, Alan Previn wrote:
...
This fixes a kernel page fault can happen when
multiple tests are running concurrently in a loop
and one is producing engine resets and consuming
the i915 error_state dump while the other is
forcing full GT resets. (takes a while to trigger).
Does need a fixes tag given that it is fixing a bug in an earlier patch. Especially when it is a kernel BUG.
E.g.:
13:23> dim fixes 0e39037b31655
Fixes: 0e39037b3165 ("drm/i915: Cache the error string")

okay, will add that.

diff --git a/drivers/gpu/drm/i915/i915_sysfs.c b/drivers/gpu/drm/i915/i915_sysfs.c
index a4d1759375b9..c40e51298066 100644
--- a/drivers/gpu/drm/i915/i915_sysfs.c
+++ b/drivers/gpu/drm/i915/i915_sysfs.c
@@ -432,7 +432,7 @@ static ssize_t error_state_read(struct file *filp, struct kobject *kobj,
      struct device *kdev = kobj_to_dev(kobj);
      struct drm_i915_private *i915 = kdev_minor_to_i915(kdev);
      struct i915_gpu_coredump *gpu;
-    ssize_t ret;
+    ssize_t ret = 0;
        gpu = i915_first_error_state(i915);
      if (IS_ERR(gpu)) {
@@ -444,8 +444,10 @@ static ssize_t error_state_read(struct file *filp, struct kobject *kobj,
          const char *str = "No error state collected\n";
          size_t len = strlen(str);
  -        ret = min_t(size_t, count, len - off);
-        memcpy(buf, str + off, ret);
+        if (off < len) {
+            ret = min_t(size_t, count, len - off);
+            memcpy(buf, str + off, ret);
+        }
So the problem is that the error dump disappeared while it was being read? That seems like it cause more problems than just this out-of-range access. E.g. what if the dump was freed and a new one created? The entity doing the partial reads would end up with half of one dump and half of the next.

I think we should at least add a FIXME comment to the code that fast recycling dumps could cause inconsistent sysfs reads.

I guess you could do something like add a unique id to the gpu coredump structure. Then, when a partial sysfs read occurs starting at zero (i.e. a new read), take a note of the id somewhere (e.g. inside the i915 structure). When the next non-zero read comes in, compare the save id with the current coredump's id and return an error if there is a mis-match.

Not sure if this would be viewed as an important enough problem to be worth fixing. I'd be happy with just a FIXME comment for now.
For now I shall add a FIXME additional checks might impact IGT's that currently dump and check the error state. I would assume with that additional check in kernel, we would need to return a specific error value like -ENODATA or -ENOLINK and handle it accordingly in the igt.

I would also change the test to be 'if (off)' rather than 'if (off < len)'. Technically, the user could have read the first 10 bytes of a coredump and then gets "tate collected\n" as the remainder, for example. If we ensure that 'off' is zero then we know that this is a fresh read from scratch.

John.

I'm a little confused, did u mean: in the case we dont have a gpu-state to report, and then the user offset is zero (i.e. "if (!off)" ) then we copy the string vs if we dont have a gpu-state to report and the user-offset is non-zero, then we return an error (like the -ENOLINK?). If thats what you meant, then it does mean we are assuming that (in the case we dont have a gpu-state) the user will never come in with a first-time-read where the user's buffer size of less than the string length and be expected continue to read the rest of the string-length. This i guess is okay since even 6 chars are enough to say "No err".  :)
      }
        return ret;




[Index of Archives]     [AMD Graphics]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux