On Thu, 2020-10-08 at 23:53 +0300, Sakari Ailus wrote:
> Hi Tsuchiya,
>
> On Thu, Oct 08, 2020 at 10:17:03PM +0900, Tsuchiya Yuto wrote:
> > Hi, I'm one of the people who are trying to get ipu3 cameras working on
> > regular PCs that came with Windows OS.
> >
> > I found that the ipu3-cio2 driver causes the kernel to hang on getting
> > device topology (like "media-ctl -p -d /dev/media0" or capturing images
> > with libcamera) when the kernel option "Initialize kernel stack variables
> > at function entry" is above "strong" ("CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF=y").
> >
> > I noticed this issue because Arch Linux sets this option to "very strong"
> > ("CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL=y").
> >
> > This issue happens even without sensor drivers or cio2-bridge driver
> > currently being developed [1]. So, I think this issue is reproducible
> > easily on regular PCs equipped with the IPU3 system as well.
> >
> > The way the kernel crashes varies slightly from series to series:
> > - The latest stable (v5.8.y) and rc (v5.9-rcx)
> >   When this issue happened, the kernel just hangs. No journal log after
> >   the hang.
> > - The latest LTS (v5.4.y)
> >   When this issue happened, the kernel shows the following oops:
> >
> >     BUG: stack guard page was hit at 00000000486e5acd (stack is 000000006e2c667d..0000000010408970)
> >     kernel stack overflow (double-fault): 0000 [#1] SMP PTI
> >     CPU: 2 PID: 2535 Comm: media-ctl Tainted: G C 5.4.69-1-lts #1
> >     Hardware name: Microsoft Corporation Surface Book/Surface Book, BIOS 92.3192.768 03.24.2020
> >     RIP: 0010:cio2_subdev_get_fmt+0x2c/0x180 [ipu3_cio2]
> >
> >   I added the full oops at the bottom of this mail.
> >
> > According to the description of the kernel option, it seems that the
> > uninitialized variables are used somewhere in the cio2_subdev_get_fmt()
> > [ipu3_cio2.c] ?
> >
> > Steps to reproduce:
> > 1. Build the kernel with the option set to
> >    "strong" ("CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF=y") or
> >    "very strong" ("CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL=y").
> > 2. Boot with the kernel and try to get the device topology by the command
> >    like the following:
> >
> >     $ media-ctl -p -d /dev/media0
> >
> > 3. The kernel just hangs on the 5.8 and 5.9-rc, or prints the oops on 5.4
> >
> > What I found so far:
> > I tried print debug like the following:
> >
> >     1241 static int cio2_subdev_get_fmt(struct v4l2_subdev *sd,
> >     1242                                struct v4l2_subdev_pad_config *cfg,
> >     1243                                struct v4l2_subdev_format *fmt)
> >     1244 {
> >     1245     struct cio2_queue *q = container_of(sd, struct cio2_queue, subdev);
> >     1246     struct v4l2_subdev_format format;
> >     1247     int ret;
> >     1248
> >     1249     pr_info("DEBUG: %s() called\n", __func__);
> >     1250     pr_info("DEBUG: msleep()\n");
> >     1251     msleep(1000);
> >     1252
> >     1253     if (fmt->which == V4L2_SUBDEV_FORMAT_TRY) {
> >     1254         pr_info("DEBUG: Passed %s() %d\n", __func__, __LINE__);
> >     1255         fmt->format = *v4l2_subdev_get_try_format(sd, cfg, fmt->pad);
> >     1256         return 0;
> >     1257     }
> >     1258
> >     1259     pr_info("DEBUG: Passed %s() %d\n", __func__, __LINE__);
> >     1260
> >     1261     if (fmt->pad == CIO2_PAD_SINK) {
> >     1262         pr_info("DEBUG: Passed %s() %d\n", __func__, __LINE__);
> >     1263         format.which = V4L2_SUBDEV_FORMAT_ACTIVE;
> >     1264         ret = v4l2_subdev_call(sd, pad, get_fmt, NULL,
> >     1265                                &format);
> >
> >     $ media-ctl -p -d /dev/media0
> >     Media controller API version 5.9.0
> >
> >     Media device information
> >     ------------------------
> >     driver          ipu3-cio2
> >     model           Intel IPU3 CIO2
> >     serial
> >     bus info        PCI:0000:00:14.3
> >     hw revision     0x0
> >     driver version  5.9.0
> >
> >
> >     Device topology
> >     - entity 1: ipu3-csi2 0 (2 pads, 1 link)
> >                 type V4L2 subdev subtype Unknown flags 0
> >                 device node name /dev/v4l-subdev0
> >             pad0: Sink
> >     # [output stopped here]
> >
> >     $ dmesg -xw
> >     [ 871.807563] kernel: DEBUG: cio2_subdev_get_fmt() called
> >     [ 871.807566] kernel: DEBUG: msleep()
> >     [ 872.821254] kernel: DEBUG: Passed cio2_subdev_get_fmt() 1259
> >     [ 872.821258] kernel: DEBUG: Passed cio2_subdev_get_fmt() 1262
> >     # [...] (same output repeatedly)
> >     [ 986.313536] kernel: DEBUG: cio2_subdev_get_fmt() called
> >     [ 986.313538] kernel: DEBUG: msleep()
> >     [ 987.326899] kernel: DEBUG: Passed cio2_subdev_get_fmt() 1259
> >     [ 987.326904] kernel: DEBUG: Passed cio2_subdev_get_fmt() 1262
> >     [ 987.326908] kernel: DEBUG: cio2_subdev_get_fmt() called
> >     [ 987.326910] kernel: DEBUG: msleep()
> >     (then, system hanged)
> >
> > So, it looks like the following loop is happening there:
> > 1. cio2_subdev_get_fmt() calls v4l2_subdev_call()
> > 2. v4l2_subdev_call() internally calls cio2_subdev_get_fmt() again
> >
> > Does anyone have any ideas what's happening?
>
> First of all, thank you for a very thorough and informative bug report. It
> looks like a driver bug indeed.

Thank you for your patch! I'm glad that the bug report is informative
enough to find what is happening.

> I don't know how this has escaped review and testing earlier though. It's
> so clear.
>
> Anyway, I hope the patchset I just sent fixes it for you. Please let me
> know if there are issues.

I tried the v2 version of your patchset with the option set to "very strong"
("CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL=y"). I can confirm that the patchset
fixed the system hang on v5.9-rc8.

Thank you again.
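
For reference, the loop described in the quoted report (the get_fmt handler
calling get_fmt on the same subdev it was invoked for) can be modeled in plain
user-space C. This is only a simplified sketch, not the driver code: struct
subdev, subdev_call_get_fmt(), buggy_get_fmt(), count_reentries() and MAX_DEPTH
are hypothetical names made up for illustration, and the depth guard exists
only so that the model terminates. The kernel has no such guard, so the real
recursion runs until the stack is exhausted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel types (not the V4L2 API). */
struct subdev;

struct subdev_ops {
	int (*get_fmt)(struct subdev *sd, int *depth);
};

struct subdev {
	const struct subdev_ops *ops;
};

/* Artificial limit so that this model terminates instead of overflowing. */
#define MAX_DEPTH 8

/* Models a v4l2_subdev_call()-style dispatch: call through sd's own ops table. */
static int subdev_call_get_fmt(struct subdev *sd, int *depth)
{
	assert(sd != NULL && sd->ops != NULL);
	return sd->ops->get_fmt(sd, depth);
}

/* Models the suspected bug: the handler dispatches get_fmt on the very same
 * subdev it was called for, so it re-enters itself. */
static int buggy_get_fmt(struct subdev *sd, int *depth)
{
	if (++(*depth) >= MAX_DEPTH)
		return -1;	/* assumption: guard added only for this demo */
	return subdev_call_get_fmt(sd, depth);
}

static const struct subdev_ops buggy_ops = { .get_fmt = buggy_get_fmt };

/* Returns how many times the handler was entered for a single outer call. */
static int count_reentries(void)
{
	struct subdev sd = { .ops = &buggy_ops };
	int depth = 0;

	subdev_call_get_fmt(&sd, &depth);
	return depth;
}
```

Calling count_reentries() shows that one outer call enters the handler
repeatedly until the artificial limit stops it, which matches the repeating
"called" / "Passed" pattern in the dmesg output above.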