On Sun, Apr 04, 2021 at 02:34:08AM +0000, Al Viro wrote: > On Thu, Apr 01, 2021 at 07:11:12PM +0000, Al Viro wrote: > > > > I _think_ I see what the issue is. It seems that an assumption made in > > > this commit might be wrong and we're missing a mnt_add_count() bump that > > > we would otherwise have gotten if we've moved the failure handling into > > > the unlazy helpers themselves. > > > > > > Al, does that sound plausible? > > > > mnt_add_count() on _what_? Failure in legitimize_links() ends up with > > nd->path.mnt zeroed, in both callers. So which vfsmount would be > > affected? It looks to me like it's the vfsmount of the nd->root after we called chroot or pivot_root. > > Could you turn that WARN_ON(count < 0) into > if (WARN_ON(count < 0)) > printk(KERN_ERR "id = %d, dev = %s, count = %d\n", > mnt->mnt_id, > mnt->mnt_sb->s_id, > count); > add system("cat /proc/self/mountinfo"); right after sandbox_common() > call and try to reproduce that? Sorry for not replying to your earlier mail but I've been debugging this too. My current theory is that it's related to LOOKUP_ROOT_GRABBED when LOOKUP_CACHED is specified _possibly_ with an interaction how create_io_thread() is created with CLONE_FS. The reproducer requires you either have called pivot_root() or chroot() in order for the failure to happen. So I think the fact that we skip legitimize_root() when LOOKUP_CACHED is set might figure into this. I can keep digging. Funny enough I already placed a printk statement into the place you wanted one too so I just amended mine. Here's what you get: If pivot pivot_root() is used before the chroot() you get: [ 637.464555] AAAA: count(-1) | mnt_mntpoint(/) | mnt->mnt.mnt_root(/) | id(579) | dev(tmpfs) if you only call chroot, i.e. make the pivot_root() branch a simple if (true) you get: [ 955.206117] AAAA: count(-2) | mnt_mntpoint(/) | mnt->mnt.mnt_root(/) | id(580) | dev(tmpfs) The cat /proc/self/mountinfo is for the id(579) below: 514 513 8:2 / / rw,relatime - ext4 /dev/sda2 rw 515 514 0:5 / /dev rw,nosuid,noexec,relatime - devtmpfs udev rw,size=302716k,nr_inodes=75679,mode=755 516 515 0:26 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=000 517 515 0:28 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw 518 515 0:48 / /dev/hugepages rw,relatime - hugetlbfs hugetlbfs rw,pagesize=2M 519 515 0:21 / /dev/mqueue rw,nosuid,nodev,noexec,relatime - mqueue mqueue rw 520 514 0:27 / /run rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw,size=62152k,mode=755 521 520 0:29 / /run/lock rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw,size=5120k 522 520 0:49 / /run/lxd_agent rw,relatime - tmpfs tmpfs rw,size=51200k,mode=700 523 520 0:59 / /run/user/1000 rw,nosuid,nodev,relatime - tmpfs tmpfs rw,size=62148k,nr_inodes=15537,mode=700,uid=1000,gid=1000 524 514 0:24 / /sys rw,nosuid,nodev,noexec,relatime - sysfs sysfs rw 525 524 0:6 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime - securityfs securityfs rw 526 524 0:30 / /sys/fs/cgroup ro,nosuid,nodev,noexec - tmpfs tmpfs ro,size=4096k,nr_inodes=1024,mode=755 527 526 0:31 /../../.. /sys/fs/cgroup/unified rw,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw 528 526 0:32 /../../.. /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,xattr,name=systemd 529 526 0:36 /../../.. /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,memory 530 526 0:37 /.. /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpu,cpuacct 531 526 0:38 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,hugetlb 549 526 0:39 /.. /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,blkio 550 526 0:40 /../../.. /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer 551 526 0:41 /../../.. /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,pids 552 526 0:42 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,net_cls,net_prio 553 526 0:43 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,perf_event 554 526 0:44 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpuset,clone_children 555 526 0:45 /.. /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,devices 556 526 0:46 / /sys/fs/cgroup/rdma rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,rdma 557 524 0:33 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime - pstore pstore rw 558 524 0:34 / /sys/firmware/efi/efivars rw,nosuid,nodev,noexec,relatime - efivarfs efivarfs rw 559 524 0:35 / /sys/fs/bpf rw,nosuid,nodev,noexec,relatime - bpf none rw,mode=700 560 524 0:7 / /sys/kernel/debug rw,nosuid,nodev,noexec,relatime - debugfs debugfs rw 561 524 0:12 / /sys/kernel/tracing rw,nosuid,nodev,noexec,relatime - tracefs tracefs rw 562 524 0:51 / /sys/fs/fuse/connections rw,nosuid,nodev,noexec,relatime - fusectl fusectl rw 563 524 0:20 / /sys/kernel/config rw,nosuid,nodev,noexec,relatime - configfs configfs rw 564 514 0:25 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw 565 564 0:47 / /proc/sys/fs/binfmt_misc rw,relatime - autofs systemd-1 rw,fd=29,pgrp=0,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=31453 566 565 0:52 / /proc/sys/fs/binfmt_misc rw,nosuid,nodev,noexec,relatime - binfmt_misc binfmt_misc rw 567 514 0:50 / /home/ubuntu/src/compiled rw,relatime - virtiofs lxd_lxc rw 568 514 8:1 / /boot/efi rw,relatime - vfat /dev/sda1 rw,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro 569 514 0:57 / /var/lib/lxcfs rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other > > I really wonder what mount is it happening to. BTW, how painful would > it be to teach syzcaller to turn those cascades of > NONFAILING(*(uint8_t*)0x20000080 = 0x12); > NONFAILING(*(uint8_t*)0x20000081 = 0); > NONFAILING(*(uint16_t*)0x20000082 = 0); > NONFAILING(*(uint32_t*)0x20000084 = 0xffffff9c); > NONFAILING(*(uint64_t*)0x20000088 = 0); > NONFAILING(*(uint64_t*)0x20000090 = 0x20000180); > NONFAILING(memcpy((void*)0x20000180, "./file0\000", 8)); > NONFAILING(*(uint32_t*)0x20000098 = 0); > NONFAILING(*(uint32_t*)0x2000009c = 0x80); > NONFAILING(*(uint64_t*)0x200000a0 = 0x23456); > .... > NONFAILING(syz_io_uring_submit(r[1], r[2], 0x20000080, 0)); > into something more readable? Bloody annoyance every time... Sure, I can > manually translate it into > struct io_uring_sqe *sqe = (void *)0x20000080; > char *s = (void *)0x20000180; > memset(sqe, '\0', sizeof(*sqe)); > sqe->opcode = 0x12; // IORING_OP_OPENAT? > sqe->fd = -100; // AT_FDCWD? > sqe->addr = s; > strcpy(s, "./file0"); > sqe->open_flags = 0x80; // O_EXCL??? > sqe->user_data = 0x23456; // random tag? > syz_io_uring_submit(r[1], r[2], (unsigned long)p, 0); > but it's really annoying as hell, especially since syz_io_uring_submit() > comes from syzcaller and the damn thing _knows_ that the third argument > is sodding io_uring_sqe, and never passed to anything other than > memcpy() in there, at that, so the exact address can't matter. Yeah, it's terrible to get an idea of what's going on in such produces. I did the same exercise as you. I should probably have replied with all of that data sooner so you wouldn't have to do the same shite I had to. Oh well. > > Incidentally, solitary O_EXCL (without O_CREAT) is... curious. Does that > sucker still trigger without it? I.e. with store to 0x2000009c replaced > with storing 0? Yeah, I did that as well yesterday morning and it still triggers. Christian