Re: [PATCH v2 1/1] drm/i915/reset: Fix error_state_read ptr + offset use

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 3/9/2022 5:18 PM, John Harrison wrote:
On 3/8/2022 11:47, Teres Alexis, Alan Previn wrote:
On 3/1/2022 1:37 PM, John Harrison wrote:
On 2/25/2022 22:27, Alan Previn wrote:
...
This fixes a kernel page fault can happen when
multiple tests are running concurrently in a loop
and one is producing engine resets and consuming
the i915 error_state dump while the other is
forcing full GT resets. (takes a while to trigger).
Does need a fixes tag given that it is fixing a bug in an earlier patch. Especially when it is a kernel BUG.
E.g.:
13:23> dim fixes 0e39037b31655
Fixes: 0e39037b3165 ("drm/i915: Cache the error string")

okay, will add that.

diff --git a/drivers/gpu/drm/i915/i915_sysfs.c b/drivers/gpu/drm/i915/i915_sysfs.c
index a4d1759375b9..c40e51298066 100644
--- a/drivers/gpu/drm/i915/i915_sysfs.c
+++ b/drivers/gpu/drm/i915/i915_sysfs.c
@@ -432,7 +432,7 @@ static ssize_t error_state_read(struct file *filp, struct kobject *kobj,
      struct device *kdev = kobj_to_dev(kobj);
      struct drm_i915_private *i915 = kdev_minor_to_i915(kdev);
      struct i915_gpu_coredump *gpu;
-    ssize_t ret;
+    ssize_t ret = 0;
        gpu = i915_first_error_state(i915);
      if (IS_ERR(gpu)) {
@@ -444,8 +444,10 @@ static ssize_t error_state_read(struct file *filp, struct kobject *kobj,
          const char *str = "No error state collected\n";
          size_t len = strlen(str);
  -        ret = min_t(size_t, count, len - off);
-        memcpy(buf, str + off, ret);
+        if (off < len) {
+            ret = min_t(size_t, count, len - off);
+            memcpy(buf, str + off, ret);
+        }
So the problem is that the error dump disappeared while it was being read? That seems like it cause more problems than just this out-of-range access. E.g. what if the dump was freed and a new one created? The entity doing the partial reads would end up with half of one dump and half of the next.

I think we should at least add a FIXME comment to the code that fast recycling dumps could cause inconsistent sysfs reads.

I guess you could do something like add a unique id to the gpu coredump structure. Then, when a partial sysfs read occurs starting at zero (i.e. a new read), take a note of the id somewhere (e.g. inside the i915 structure). When the next non-zero read comes in, compare the save id with the current coredump's id and return an error if there is a mis-match.

Not sure if this would be viewed as an important enough problem to be worth fixing. I'd be happy with just a FIXME comment for now.
For now I shall add a FIXME additional checks might impact IGT's that currently dump and check the error state. I would assume with that additional check in kernel, we would need to return a specific error value like -ENODATA or -ENOLINK and handle it accordingly in the igt.
If an an extra check against returning invalid data affects an existing IGT then that IGT is already broken. The check would to prevent userland reading the first half of one capture and then getting the second half of a completely different one. If that is already happening then the returned data is meaningless gibberish and even if the IGT says it is passing, it really isn't.



I would also change the test to be 'if (off)' rather than 'if (off < len)'. Technically, the user could have read the first 10 bytes of a coredump and then gets "tate collected\n" as the remainder, for example. If we ensure that 'off' is zero then we know that this is a fresh read from scratch.

John.

I'm a little confused, did u mean: in the case we dont have a gpu-state to report, and then the user offset is zero (i.e. "if (!off)" ) then we copy the string vs if we dont have a gpu-state to report and the user-offset is non-zero, then we return an error (like the -ENOLINK?). If thats what you meant, then it does mean we are assuming that (in the case we dont have a gpu-state) the user will never come in with a first-time-read where the user's buffer size of less than the string length and be expected continue to read the rest of the string-length. This i guess is okay since even 6 chars are enough to say "No err". :)
Sorry, yes. That was meant to say 'if (!off)'.

Hmm, I guess technically the user could be issuing single byte reads. User's can be evil.

Okay. Maybe just add a FIXME about needing to check for changed/missing/new error state since last read if the offset is non-zero and otherwise leave it as is.

John.

Sounds good - will do. Apologies on the tardiness with the responses.

      }
        return ret;





[Index of Archives]     [AMD Graphics]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux