Hi Johannes
Thanks for the response.
On 2/8/2022 12:35 PM, Johannes Berg wrote:
> On Tue, 2022-02-08 at 11:44 -0800, Abhinav Kumar wrote:
>> There are cases where depending on the size of the devcoredump and the speed
>> at which the usermode reads the dump, it can take longer than the current 5 mins
>> timeout.
>> This can lead to incomplete dumps as the device is deleted once the timeout expires.
>> One example is below where it took 6 mins for the devcoredump to be completely read.
>> 04:22:24.668 23916 23994 I HWDeviceDRM::DumpDebugData: Opening /sys/class/devcoredump/devcd6/data
>> 04:28:35.377 23916 23994 W HWDeviceDRM::DumpDebugData: Freeing devcoredump node
>> Increase the timeout to 10 mins to accommodate system delays and large coredump
>> sizes.
> No real objection, I guess, but can the data actually disappear *while*
> the sysfs file is open?!
> Or did it take 5 minutes to open the file?
> If the former, maybe we should fix that too (or instead)?
> johannes
It opened the file right away but could not finish reading.
The device gets deleted, so the corresponding /data node disappears too
(the data node lives under devcd*/data):
static void devcd_del(struct work_struct *wk)
{
	struct devcd_entry *devcd;

	devcd = container_of(wk, struct devcd_entry, del_wk.work);

	device_del(&devcd->devcd_dev);
	put_device(&devcd->devcd_dev);
}
Are you suggesting we implement logic like this:
a) If usermode has started reading the data but has not finished yet
(we can detect the former with something like devcd->data_read_ongoing
= 1, and we know it has finished when it acks, at which point we clear
the flag), then in the del_wk timeout handler we push the delete timer
out by another TIMEOUT to give usermode time to finish reading the data?
b) If usermode acks, we clear the flag and delete the device
as usual.
But there is a corner case here:
c) If usermode starts the read but then crashes for some reason, the
timer will expire and try to delete the device, but it will detect that
usermode is still reading and will keep the device. How do we detect
this case?
That's why I thought the easier way for now might be to increase
the timeout.
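
To make the corner case concrete, here is a rough userspace model of the
flow in a) to c). It is only a sketch, not a patch: struct devcd_model,
its fields other than data_read_ongoing, and the model_* helpers are all
hypothetical names made up for illustration, and nothing like them exists
in devcoredump.c today.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in devcoredump.c. */
struct devcd_model {
	bool data_read_ongoing;		/* set once usermode starts reading */
	bool deleted;			/* true once the device has been removed */
	unsigned int rearm_count;	/* how often the delete timer was pushed out */
};

/* Usermode opened the data node and started reading. */
static void model_read_start(struct devcd_model *d)
{
	d->data_read_ongoing = true;
}

/* b) Usermode acks: clear the flag and delete the device as usual. */
static void model_ack(struct devcd_model *d)
{
	d->data_read_ongoing = false;
	d->deleted = true;
}

/*
 * a) Timeout handler: if a read is still in progress, re-arm the timer
 * instead of deleting. Returns true when the timer was re-armed.
 */
static bool model_timeout(struct devcd_model *d)
{
	if (d->deleted)
		return false;

	if (d->data_read_ongoing) {
		d->rearm_count++;
		return true;
	}

	d->deleted = true;
	return false;
}
```

With this model, a reader that calls model_read_start() and then never
calls model_ack() (case c) makes model_timeout() re-arm on every expiry,
so the device is never deleted.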