On 3/8/2022 11:47, Teres Alexis, Alan Previn wrote:
On 3/1/2022 1:37 PM, John Harrison wrote:
On 2/25/2022 22:27, Alan Previn wrote:
...
This fixes a kernel page fault can happen when
multiple tests are running concurrently in a loop
and one is producing engine resets and consuming
the i915 error_state dump while the other is
forcing full GT resets. (takes a while to trigger).
Does need a fixes tag given that it is fixing a bug in an earlier
patch. Especially when it is a kernel BUG.
E.g.:
13:23> dim fixes 0e39037b31655
Fixes: 0e39037b3165 ("drm/i915: Cache the error string")
okay, will add that.
diff --git a/drivers/gpu/drm/i915/i915_sysfs.c
b/drivers/gpu/drm/i915/i915_sysfs.c
index a4d1759375b9..c40e51298066 100644
--- a/drivers/gpu/drm/i915/i915_sysfs.c
+++ b/drivers/gpu/drm/i915/i915_sysfs.c
@@ -432,7 +432,7 @@ static ssize_t error_state_read(struct file
*filp, struct kobject *kobj,
struct device *kdev = kobj_to_dev(kobj);
struct drm_i915_private *i915 = kdev_minor_to_i915(kdev);
struct i915_gpu_coredump *gpu;
- ssize_t ret;
+ ssize_t ret = 0;
gpu = i915_first_error_state(i915);
if (IS_ERR(gpu)) {
@@ -444,8 +444,10 @@ static ssize_t error_state_read(struct file
*filp, struct kobject *kobj,
const char *str = "No error state collected\n";
size_t len = strlen(str);
- ret = min_t(size_t, count, len - off);
- memcpy(buf, str + off, ret);
+ if (off < len) {
+ ret = min_t(size_t, count, len - off);
+ memcpy(buf, str + off, ret);
+ }
So the problem is that the error dump disappeared while it was being
read? That seems like it cause more problems than just this
out-of-range access. E.g. what if the dump was freed and a new one
created? The entity doing the partial reads would end up with half of
one dump and half of the next.
I think we should at least add a FIXME comment to the code that fast
recycling dumps could cause inconsistent sysfs reads.
I guess you could do something like add a unique id to the gpu
coredump structure. Then, when a partial sysfs read occurs starting
at zero (i.e. a new read), take a note of the id somewhere (e.g.
inside the i915 structure). When the next non-zero read comes in,
compare the save id with the current coredump's id and return an
error if there is a mis-match.
Not sure if this would be viewed as an important enough problem to be
worth fixing. I'd be happy with just a FIXME comment for now.
For now I shall add a FIXME additional checks might impact IGT's that
currently dump and check the error state. I would assume with that
additional check in kernel, we would need to return a specific error
value like -ENODATA or -ENOLINK and handle it accordingly in the igt.
If an an extra check against returning invalid data affects an existing
IGT then that IGT is already broken. The check would to prevent userland
reading the first half of one capture and then getting the second half
of a completely different one. If that is already happening then the
returned data is meaningless gibberish and even if the IGT says it is
passing, it really isn't.
I would also change the test to be 'if (off)' rather than 'if (off <
len)'. Technically, the user could have read the first 10 bytes of a
coredump and then gets "tate collected\n" as the remainder, for
example. If we ensure that 'off' is zero then we know that this is a
fresh read from scratch.
John.
I'm a little confused, did u mean: in the case we dont have a
gpu-state to report, and then the user offset is zero (i.e. "if
(!off)" ) then we copy the string vs if we dont have a gpu-state to
report and the user-offset is non-zero, then we return an error (like
the -ENOLINK?). If thats what you meant, then it does mean we are
assuming that (in the case we dont have a gpu-state) the user will
never come in with a first-time-read where the user's buffer size of
less than the string length and be expected continue to read the rest
of the string-length. This i guess is okay since even 6 chars are
enough to say "No err". :)
Sorry, yes. That was meant to say 'if (!off)'.
Hmm, I guess technically the user could be issuing single byte reads.
User's can be evil.
Okay. Maybe just add a FIXME about needing to check for
changed/missing/new error state since last read if the offset is
non-zero and otherwise leave it as is.
John.
}
return ret;