Re: [PATCH v2 1/1] drm/i915/reset: Fix error_state_read ptr + offset use

"Teres Alexis, Alan Previn" <alan.previn.teres.alexis@xxxxxxxxx> · Wed, 9 Mar 2022 17:45:11 -0800

On 3/9/2022 5:18 PM, John Harrison wrote:
On 3/8/2022 11:47, Teres Alexis, Alan Previn wrote:
On 3/1/2022 1:37 PM, John Harrison wrote:
On 2/25/2022 22:27, Alan Previn wrote:
...
This fixes a kernel page fault can happen when
multiple tests are running concurrently in a loop
and one is producing engine resets and consuming
the i915 error_state dump while the other is
forcing full GT resets. (takes a while to trigger).
Does need a fixes tag given that it is fixing a bug in an earlier 
patch. Especially when it is a kernel BUG.
E.g.:
13:23> dim fixes 0e39037b31655
Fixes: 0e39037b3165 ("drm/i915: Cache the error string")

okay, will add that.

diff --git a/drivers/gpu/drm/i915/i915_sysfs.c 
b/drivers/gpu/drm/i915/i915_sysfs.c
index a4d1759375b9..c40e51298066 100644
--- a/drivers/gpu/drm/i915/i915_sysfs.c
+++ b/drivers/gpu/drm/i915/i915_sysfs.c
@@ -432,7 +432,7 @@ static ssize_t error_state_read(struct file 
*filp, struct kobject *kobj,
      struct device *kdev = kobj_to_dev(kobj);
      struct drm_i915_private *i915 = kdev_minor_to_i915(kdev);
      struct i915_gpu_coredump *gpu;
-    ssize_t ret;
+    ssize_t ret = 0;
        gpu = i915_first_error_state(i915);
      if (IS_ERR(gpu)) {
@@ -444,8 +444,10 @@ static ssize_t error_state_read(struct file 
*filp, struct kobject *kobj,
          const char *str = "No error state collected\n";
          size_t len = strlen(str);
  -        ret = min_t(size_t, count, len - off);
-        memcpy(buf, str + off, ret);
+        if (off < len) {
+            ret = min_t(size_t, count, len - off);
+            memcpy(buf, str + off, ret);
+        }
So the problem is that the error dump disappeared while it was being 
read? That seems like it cause more problems than just this 
out-of-range access. E.g. what if the dump was freed and a new one 
created? The entity doing the partial reads would end up with half 
of one dump and half of the next.

I think we should at least add a FIXME comment to the code that fast 
recycling dumps could cause inconsistent sysfs reads.

I guess you could do something like add a unique id to the gpu 
coredump structure. Then, when a partial sysfs read occurs starting 
at zero (i.e. a new read), take a note of the id somewhere (e.g. 
inside the i915 structure). When the next non-zero read comes in, 
compare the save id with the current coredump's id and return an 
error if there is a mis-match.

Not sure if this would be viewed as an important enough problem to 
be worth fixing. I'd be happy with just a FIXME comment for now.
For now I shall add a FIXME additional checks might impact IGT's that 
currently dump and check the error state. I would assume with that 
additional check in kernel, we would need to return a specific error 
value like -ENODATA or -ENOLINK and handle it accordingly in the igt.
If an an extra check against returning invalid data affects an 
existing IGT then that IGT is already broken. The check would to 
prevent userland reading the first half of one capture and then 
getting the second half of a completely different one. If that is 
already happening then the returned data is meaningless gibberish and 
even if the IGT says it is passing, it really isn't.



I would also change the test to be 'if (off)' rather than 'if (off < 
len)'. Technically, the user could have read the first 10 bytes of a 
coredump and then gets "tate collected\n" as the remainder, for 
example. If we ensure that 'off' is zero then we know that this is a 
fresh read from scratch.

John.

I'm a little confused, did u mean: in the case we dont have a 
gpu-state to report, and then the user offset is zero (i.e. "if 
(!off)" ) then we copy the string vs if we dont have a gpu-state to 
report and the user-offset is non-zero, then we return an error (like 
the -ENOLINK?). If thats what you meant, then it does mean we are 
assuming that (in the case we dont have a gpu-state) the user will 
never come in with a first-time-read where the user's buffer size of 
less than the string length and be expected continue to read the rest 
of the string-length. This i guess is okay since even 6 chars are 
enough to say "No err". :)
Sorry, yes. That was meant to say 'if (!off)'.

Hmm, I guess technically the user could be issuing single byte reads. 
User's can be evil.

Okay. Maybe just add a FIXME about needing to check for 
changed/missing/new error state since last read if the offset is 
non-zero and otherwise leave it as is.

John.

Sounds good - will do. Apologies on the tardiness with the responses.

      }
        return ret;