Hi Tsuchiya,

On Thu, Oct 08, 2020 at 10:17:03PM +0900, Tsuchiya Yuto wrote:
> Hi, I'm one of the people who are trying to get ipu3 cameras working on
> regular PCs that came with Windows OS.
>
> I found that the ipu3-cio2 driver causes the kernel to hang when getting
> the device topology (e.g. "media-ctl -p -d /dev/media0" or capturing
> images with libcamera) when the kernel option "Initialize kernel stack
> variables at function entry" is set to "strong"
> ("CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF=y") or above.
>
> I noticed this issue because Arch Linux sets this option to "very strong"
> ("CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL=y").
>
> This issue happens even without the sensor drivers or the cio2-bridge
> driver that is currently being developed [1], so I think it is easily
> reproducible on other regular PCs equipped with IPU3 as well.
>
> The way the kernel fails varies slightly from series to series:
> - The latest stable (v5.8.y) and rc (v5.9-rcx)
>   When this issue happens, the kernel just hangs, with no journal log
>   after the hang.
> - The latest LTS (v5.4.y)
>   When this issue happens, the kernel prints the following oops:
>
>   BUG: stack guard page was hit at 00000000486e5acd (stack is 000000006e2c667d..0000000010408970)
>   kernel stack overflow (double-fault): 0000 [#1] SMP PTI
>   CPU: 2 PID: 2535 Comm: media-ctl Tainted: G C 5.4.69-1-lts #1
>   Hardware name: Microsoft Corporation Surface Book/Surface Book, BIOS 92.3192.768 03.24.2020
>   RIP: 0010:cio2_subdev_get_fmt+0x2c/0x180 [ipu3_cio2]
>
> I added the full oops at the bottom of this mail.
>
> Going by the description of the kernel option, it seems that
> uninitialized variables are used somewhere in cio2_subdev_get_fmt()
> [ipu3-cio2.c]?
>
> Steps to reproduce:
> 1. Build the kernel with the option set to
>    "strong" ("CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF=y") or
>    "very strong" ("CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL=y").
> 2. Boot that kernel and try to get the device topology with a command
>    like the following:
>
>    $ media-ctl -p -d /dev/media0
>
> 3. The kernel just hangs on 5.8 and 5.9-rc, or prints the oops on 5.4.
>
> What I found so far:
> I added debug prints like the following:
>
> 1241 static int cio2_subdev_get_fmt(struct v4l2_subdev *sd,
> 1242                                struct v4l2_subdev_pad_config *cfg,
> 1243                                struct v4l2_subdev_format *fmt)
> 1244 {
> 1245         struct cio2_queue *q = container_of(sd, struct cio2_queue, subdev);
> 1246         struct v4l2_subdev_format format;
> 1247         int ret;
> 1248
> 1249         pr_info("DEBUG: %s() called\n", __func__);
> 1250         pr_info("DEBUG: msleep()\n");
> 1251         msleep(1000);
> 1252
> 1253         if (fmt->which == V4L2_SUBDEV_FORMAT_TRY) {
> 1254                 pr_info("DEBUG: Passed %s() %d\n", __func__, __LINE__);
> 1255                 fmt->format = *v4l2_subdev_get_try_format(sd, cfg, fmt->pad);
> 1256                 return 0;
> 1257         }
> 1258
> 1259         pr_info("DEBUG: Passed %s() %d\n", __func__, __LINE__);
> 1260
> 1261         if (fmt->pad == CIO2_PAD_SINK) {
> 1262                 pr_info("DEBUG: Passed %s() %d\n", __func__, __LINE__);
> 1263                 format.which = V4L2_SUBDEV_FORMAT_ACTIVE;
> 1264                 ret = v4l2_subdev_call(sd, pad, get_fmt, NULL,
> 1265                                        &format);
>
> $ media-ctl -p -d /dev/media0
> Media controller API version 5.9.0
>
> Media device information
> ------------------------
> driver          ipu3-cio2
> model           Intel IPU3 CIO2
> serial
> bus info        PCI:0000:00:14.3
> hw revision     0x0
> driver version  5.9.0
>
> Device topology
> - entity 1: ipu3-csi2 0 (2 pads, 1 link)
>             type V4L2 subdev subtype Unknown flags 0
>             device node name /dev/v4l-subdev0
>         pad0: Sink
> # [output stopped here]
>
> $ dmesg -xw
> [ 871.807563] kernel: DEBUG: cio2_subdev_get_fmt() called
> [ 871.807566] kernel: DEBUG: msleep()
> [ 872.821254] kernel: DEBUG: Passed cio2_subdev_get_fmt() 1259
> [ 872.821258] kernel: DEBUG: Passed cio2_subdev_get_fmt() 1262
> # [...] (same output repeatedly)
> [ 986.313536] kernel: DEBUG: cio2_subdev_get_fmt() called
> [ 986.313538] kernel: DEBUG: msleep()
> [ 987.326899] kernel: DEBUG: Passed cio2_subdev_get_fmt() 1259
> [ 987.326904] kernel: DEBUG: Passed cio2_subdev_get_fmt() 1262
> [ 987.326908] kernel: DEBUG: cio2_subdev_get_fmt() called
> [ 987.326910] kernel: DEBUG: msleep()
> (then the system hung)
>
> So it looks like the following loop is happening there:
> 1. cio2_subdev_get_fmt() calls v4l2_subdev_call()
> 2. v4l2_subdev_call() internally calls cio2_subdev_get_fmt() again
>
> Does anyone have any idea what's happening?

First of all, thank you for a very thorough and informative bug report.

It does look like a driver bug. I don't know how this escaped review and
testing earlier, though; it's so obvious.

Anyway, I hope the patchset I just sent fixes it for you. Please let me
know if you still see issues.

--
Kind regards,

Sakari Ailus
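The following standalone C program models the loop described in the report. It is a minimal sketch, not driver code and not the fix from the patchset. It makes two assumptions. First, the driver's sink pad is pad 0 (CIO2_PAD_SINK). Second, the quoted function sets only format.which, never format.pad. STRUCTLEAK_BYREF zero-initializes on-stack structures that are passed by reference, and format is passed to v4l2_subdev_call() by address. Under that option, format.pad would then always be 0, so the nested get_fmt call would take the sink path again. Without the option, the field holds whatever was left on the stack. The names fake_format, get_fmt(), subdev_call_get_fmt() and the MAX_DEPTH guard are all hypothetical. The guard exists only so the sketch terminates, and the V4L2 core's own argument checks are not modelled.

```c
#include <assert.h>

/* Hypothetical pad numbers; the sketch assumes the sink pad is pad 0. */
enum { PAD_SINK = 0, PAD_SOURCE = 1 };
enum { FORMAT_TRY = 0, FORMAT_ACTIVE = 1 };

/* Hypothetical stand-in for struct v4l2_subdev_format (only the fields used). */
struct fake_format {
	unsigned int which;
	unsigned int pad;
};

/* Not in the real code: stops the sketch before the stack would overflow. */
#define MAX_DEPTH 16

static int get_fmt(const struct fake_format *fmt, unsigned int local_pad,
		   int depth);

/*
 * Plays the role of v4l2_subdev_call(sd, pad, get_fmt, NULL, &format):
 * it dispatches straight back into the same subdev's get_fmt operation.
 */
static int subdev_call_get_fmt(const struct fake_format *fmt,
			       unsigned int local_pad, int depth)
{
	return get_fmt(fmt, local_pad, depth + 1);
}

/*
 * Mirrors the ACTIVE path of the quoted cio2_subdev_get_fmt(). local_pad is
 * the value the on-stack format.pad ends up holding because the function
 * never assigns it: 0 when STRUCTLEAK_BYREF zeroes the variable, a leftover
 * value otherwise. Returns the nesting depth at which a call returned
 * without recursing, or -1 if MAX_DEPTH was reached.
 */
static int get_fmt(const struct fake_format *fmt, unsigned int local_pad,
		   int depth)
{
	struct fake_format format;

	assert(depth >= 0);
	if (depth >= MAX_DEPTH)
		return -1;

	if (fmt->pad != PAD_SINK)
		return depth; /* in this sketch, non-sink pads make no nested call */

	format.which = FORMAT_ACTIVE; /* set explicitly, as in the driver */
	format.pad = local_pad;       /* never set by the driver itself */
	return subdev_call_get_fmt(&format, local_pad, depth);
}
```

Under these assumptions, get_fmt() on a sink-pad request with local_pad = 0 keeps re-entering itself until the artificial limit and returns -1. That matches the repeating "called / 1259 / 1262" pattern in the dmesg output. With a non-zero leftover value such as 7, it returns 1 after a single nested call.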