On Tue, Dec 10, 2024 at 12:54:39PM +0100, cem@xxxxxxxxxx wrote: > From: Carlos Maiolino <cmaiolino@xxxxxxxxxx> > > I tripped over an integer overflow when using a big journal size. > > Essentially I can reliably reproduce it using: > > mkfs.xfs -f -lsize=393216b -f -b size=4096 -m crc=1,reflink=1,rmapbt=1, \ > -i sparse=1 /dev/vdb2 > /dev/null So that's a 1.2GB log, well within the log size overflow point of 2^31 - 1 bytes. What version of xfsprogs are you using here? > mount -o usrquota,grpquota,prjquota /dev/vdb2 /mnt > xfs_io -x -c 'shutdown -f' /mnt Ok, so a shutdown with a log flush to leave the log dirty. What is in the log at this point? > umount /mnt > mount -o usrquota,grpquota,prjquota /dev/vdb2 /mnt > > The last mount command get stuck on the following path: > > [<0>] xlog_grant_head_wait+0x5d/0x2a0 [xfs] > [<0>] xlog_grant_head_check+0x112/0x180 [xfs] > [<0>] xfs_log_reserve+0xe3/0x260 [xfs] > [<0>] xfs_trans_reserve+0x179/0x250 [xfs] > [<0>] xfs_trans_alloc+0x101/0x260 [xfs] > [<0>] xfs_sync_sb+0x3f/0x80 [xfs] > [<0>] xfs_qm_mount_quotas+0xe3/0x2f0 [xfs] > [<0>] xfs_mountfs+0x7ad/0xc20 [xfs] > [<0>] xfs_fs_fill_super+0x762/0xa50 [xfs] > [<0>] get_tree_bdev_flags+0x131/0x1d0 > [<0>] vfs_get_tree+0x26/0xd0 > [<0>] vfs_cmd_create+0x59/0xe0 > [<0>] __do_sys_fsconfig+0x4e3/0x6b0 > [<0>] do_syscall_64+0x82/0x160 > [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e > > By investigating it a bit, I noticed that xlog_grant_head_check (called > from xfs_log_reserve), defines free_bytes as an integer, which in turn > is used to store the value from xlog_grant_space_left(). > xlog_grant_space_left() however, does return a uint64_t, and, giving a > big enough journal size, it can overflow the free_bytes in > xlog_grant_head_check(), I can see that an overflow definitely appears to be occurring, but I cannot explain how it is occurring from the information in commit message. That is, the return value of xlog_grant_space_left() is supposed to be clamped to 0 <= space <= log->l_logsize. If the return value is out of that range, (i.e. can overflow a signed int) it means there is some other problem with the xlog_grant_space_left() calculation and the overflow of the return value is a downstream symptom and not the root cause. i.e. by definition xlog_grant_space_left() must be returning free_bytes > log->l_logsize to overflow an int. The cause of that behaviour is what we need to find and fix.... We should have enough trace points in the AIL and log head/tail accounting to see where the head, tail or space calculation is going wrong during the mount - do you have a trace from the failed mount that I can look at? i.e. run 'trace-cmd record -e xfs\* sleep 60' in one terminal, then run the reproducer in another. Then when the trace finishes, run `trace-cmd report > t.txt` and point me at the generated report... > in xlog_grant_head_check() to evaluate to true and cause xfsaild to try > to flush the log indefinitely, which seems to be causing xfs to get > stuck in xlog_grant_head_wait() indefinitely. > > I'm adding a fixes tag as a suggestion from hch, giving that after the > aforementioned patch, all xlog_grant_space_left() callers should store > the return value on a 64bit type. > > Fixes: c1220522ef40 ("xfs: grant heads track byte counts, not LSNs") I'm not sure this is actually the source of the issue, or whether it simply exposed some other underlying problem we aren't yet aware of.... -Dave. -- Dave Chinner david@xxxxxxxxxxxxx