On Wed, Sep 21, 2022 at 3:36 AM Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
>
> Hi Ilya,
>
> On Mon, Sep 19, 2022 at 12:14:06PM +0200, Ilya Dryomov wrote:
> > On Mon, Sep 19, 2022 at 9:43 AM Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
> >>> On Tue, Sep 13, 2022 at 3:44 AM Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
> >>>> What can make a "rbd unmap" fail, assuming the device is not
> >>>> mounted and not (obviously) open by any other processes?
> >>
> >> E.g. maybe there's some way of using ebpf or similar to look at the
> >> 'rbd_dev->open_count' in the live kernel?
> >>
> >> And/or maybe there's some way, again using ebpf or similar, to record
> >> sufficient info (e.g. a stack trace?) from rbd_open() and
> >> rbd_release() to try to identify something that's opening the device
> >> and not releasing it?
> >
> > Attaching kprobes to rbd_open() and rbd_release() is probably the
> > fastest option. I don't think you even need a stack trace, PID and
> > comm (process name) should do. I would start with something like:
> >
> > # bpftrace -e 'kprobe:rbd_open { printf("open pid %d comm %s\n", pid,
> > comm) } kprobe:rbd_release { printf("release pid %d comm %s\n", pid,
> > comm) }'
> >
> > Fetching the actual rbd_dev->open_count value is more involved but
> > also doable.
>
> Excellent! Thanks!
>
> tl;dr there's something other than the open_count causing the unmap
> failures - or something's elevating and decrementing open_count
> without going through rbd_open and rbd_release. Or perhaps there's
> some situation whereby bpftrace "misses" recording calls to rbd_open
> and rbd_release.
>
> FYI, the production process is:
>
> - create snapshot of rbd
> - map
> - mount with ro,norecovery,nouuid (the original live fs is still mounted)
> - export via NFS
> - mount on Windows NFS client
> - process on Windows
> - remove Windows NFS mount
> - unexport from NFS
> - unmount
> - unmap
>
> (I haven't mentioned the NFS export previously because I thought the
> issue was replicable without it - but that might simply have been due
> to the 'pvs' issue which has been resolved.)
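>
> In shell terms, each iteration boils down to roughly the following
> (a simplified sketch - the pool, image, snapshot and client names
> are placeholders):
>
>   rbd snap create "$pool/$img@$snap"
>   rbd map "$pool/$img@$snap"
>   mount -o ro,norecovery,nouuid "/dev/rbd/$pool/$img@$snap" "/mnt/$snap"
>   exportfs -o ro "$client:/mnt/$snap"
>   # ... Windows NFS client mounts, processes, unmounts ...
>   exportfs -u "$client:/mnt/$snap"
>   umount "/mnt/$snap"
>   rbd unmap "$pool/$img@$snap"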
>
> I now have a script that mimics the above production sequence in a
> loop and left it running all night. Out of 288 iterations there were
> 13 instances where the unmap failed for some time (i.e. in all cases
> it eventually succeeded, unlike the 51 rbd devices I can't seem to
> unmap at all without using --force). In the failing cases the unmap
> was retried at 1 second intervals. The shortest time taken to
> eventually unmap was 521 seconds, the longest was 793 seconds.
>
> Note, in the below I'm using "successful" for the tests where the
> first unmap succeeded, and "failed" for the tests where the first
> unmap failed, although in all cases the unmap eventually succeeded.
>
> I ended up with a bpftrace script (see below) that logs the
> timestamp, open or release (O/R), pid, device name, open_count (at
> entry to the function), and process name.
>
> A successful iteration of that process mostly looks like this:
>
> Timestamp        O/R  Pid      Device  Count  Process
> 18:21:18.235870  O    3269426  rbd29   0      mapper
> 18:21:20.088873  R    3269426  rbd29   1      mapper
> 18:21:20.089346  O    3269447  rbd29   0      systemd-udevd
> 18:21:20.105281  O    3269457  rbd29   1      blkid
> 18:21:31.858621  R    3269457  rbd29   2      blkid
> 18:21:31.861762  R    3269447  rbd29   1      systemd-udevd
> 18:21:31.882235  O    3269475  rbd29   0      mount
> 18:21:38.241808  R    3269475  rbd29   1      mount
> 18:21:38.242174  O    3269475  rbd29   0      mount
> 18:22:49.646608  O    2364320  rbd29   1      rpc.mountd
> 18:22:58.715634  R    2364320  rbd29   2      rpc.mountd
> 18:23:55.564512  R    3270060  rbd29   1      umount
>
> Or occasionally it looks like this, with the "rpc.mountd" lines
> missing:
>
> 18:35:49.539224  O    3277664  rbd29   0      mapper
> 18:35:50.515777  R    3277664  rbd29   1      mapper
> 18:35:50.516224  O    3277685  rbd29   0      systemd-udevd
> 18:35:50.531978  O    3277694  rbd29   1      blkid
> 18:35:57.361799  R    3277694  rbd29   2      blkid
> 18:35:57.365263  R    3277685  rbd29   1      systemd-udevd
> 18:35:57.384316  O    3277713  rbd29   0      mount
> 18:36:01.234337  R    3277713  rbd29   1      mount
> 18:36:01.234849  O    3277713  rbd29   0      mount
> 18:37:21.304270  R    3289527  rbd29   1      umount
>
> Of the 288 iterations, only 20 didn't include the rpc.mountd lines.
>
> An unsuccessful iteration looks like this:
>
> 18:37:31.885408  O    3294108  rbd29   0      mapper
> 18:37:33.181607  R    3294108  rbd29   1      mapper
> 18:37:33.182086  O    3294175  rbd29   0      systemd-udevd
> 18:37:33.197982  O    3294691  rbd29   1      blkid
> 18:37:42.712870  R    3294691  rbd29   2      blkid
> 18:37:42.716296  R    3294175  rbd29   1      systemd-udevd
> 18:37:42.738469  O    3298073  rbd29   0      mount
> 18:37:49.339012  R    3298073  rbd29   1      mount
> 18:37:49.339352  O    3298073  rbd29   0      mount
> 18:38:51.390166  O    2364320  rbd29   1      rpc.mountd
> 18:39:00.989050  R    2364320  rbd29   2      rpc.mountd
> 18:53:56.054685  R    3313923  rbd29   1      init
>
> According to my script log, the first unmap attempt was at 18:39:42,
> i.e. 42 seconds after rpc.mountd released the device. At that point
> the open_count was (or should have been?) 1 again allowing the unmap
> to succeed - but it didn't. The unmap was retried every second until
> it

Hi Chris,

For unmap to go through, open_count must be 0. rpc.mountd at
18:39:00.989050 just decremented it from 2 to 1, it didn't release the
device.

> eventually succeeded at 18:53:56, the same time as the mysterious
> "init" process ran - but also note there is NO "umount" process in
> there so I don't know if the name of the process recorded by bpftrace
> is simply incorrect (but how would that happen??) or what else could
> be going on.

I would suggest adding the PID and the kernel stack trace at this
point.
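
Something along these lines should do - extending the release probe
with the kstack builtin (a rough sketch, untested):

# bpftrace -e 'kprobe:rbd_release
    { printf("release pid %d comm %s\n%s\n", pid, comm, kstack) }'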

> All 13 of the failed iterations recorded that weird "init" instead of
> "umount".

Yeah, that seems to be the culprit.

> 12 of the 13 failed iterations included rpc.mountd in the trace, but
> one didn't (i.e. it went direct from mount to init/umount, like the
> 2nd successful example above) - around the same proportion as the
> successful iterations.
>
> So it seems there's something other than the open_count causing the
> unmap failures - or something's elevating and decrementing open_count
> without going through rbd_open and rbd_release. Or perhaps there's
> some situation whereby bpftrace "misses" recording calls to rbd_open
> and rbd_release.
>
> The bpftrace script looks like this:
> --------------------------------------------------------------------
> //
> // bunches of defines and structure definitions extracted from
> // drivers/block/rbd.c elided here...
> //

It would be good to attach the entire script, just in case someone
runs into a similar issue in the future and tries to debug the same
way.
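
For the archives, I'd expect the probe bodies to look roughly like
this - an untested sketch, assuming the struct rbd_device definition
(at least the fields up to and including open_count) is pasted in from
the rbd.c of the running kernel, and that the signatures are the usual
rbd_open(struct block_device *, fmode_t) and
rbd_release(struct gendisk *, fmode_t):

kprobe:rbd_open
{
    // arg0 is the struct block_device being opened
    $disk = ((struct block_device *)arg0)->bd_disk;
    $rbd_dev = (struct rbd_device *)$disk->private_data;
    // %f (microseconds) needs a recent bpftrace, otherwise drop ".%f"
    printf("%s O %d %s %lu %s\n", strftime("%H:%M:%S.%f", nsecs),
        pid, $disk->disk_name, $rbd_dev->open_count, comm);
}

kprobe:rbd_release
{
    // arg0 is the struct gendisk being released
    $disk = (struct gendisk *)arg0;
    $rbd_dev = (struct rbd_device *)$disk->private_data;
    printf("%s R %d %s %lu %s\n", strftime("%H:%M:%S.%f", nsecs),
        pid, $disk->disk_name, $rbd_dev->open_count, comm);
}

Thanks,

                Ilya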