On Friday 19 June 2020 8:06:08 PM IST Christoph Hellwig wrote:
> I'm lost in 4 layers of full quotes.  Can someone summarize the
> discussion without hundreds of lines of irrelevant quotes?
>

XFS does not check for possible overflow of the per-inode extent counter
fields when adding extents to either the data or the attr fork.

For example,

1. Insert 5 million xattrs (each having a value size of 255 bytes) and
   then delete 50% of them in an alternating manner.

   ./benchmark-xattrs -l 255 -n 5000000 -s 50 -f $mntpnt/testfile-0

   benchmark-xattrs.c and the related sources can be obtained from
   https://github.com/chandanr/xfs-xattr-benchmark/blob/master/src/

2. This causes 98511 extents to be created in the attr fork of the inode.

   xfsaild/loop0  2035 [003]  9643.390490: probe:xfs_iflush_int: (ffffffffac6225c0) if_nextents=98511 inode=131

3. The incore inode fork extent counter is a signed 32-bit quantity.
   However, the on-disk extent counter is an unsigned 16-bit quantity and
   hence cannot hold 98511 extents.

4. Hence the following incorrect value gets stored in the attr extent
   counter,

   # xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0
   core.naextents = -32561

As an aside, please note that the sequence of events causing the counter
overflow described in this patch's description no longer applies, since
the patches have been rebased on xfs-ifork-cleanup.2, which uses a signed
32-bit quantity (xfs_extnum_t) to store the extent counter in memory.
However, the "on-disk counter overflow" bug described above still exists.

To prevent the overflow bug from occurring silently, this patch now checks
for the overflow condition before incrementing the in-core value and
returns an error if a possible overflow is detected. Darrick pointed out
that returning an error code can cause the transaction to be aborted,
because the transaction is most likely dirty at that point. Hence his
suggestion (to which I agree) was to check for the possible overflow just
after starting a transaction and obtaining the inode's i_lock.
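To make the shape of that check concrete, here is a minimal standalone
sketch. It is not taken from the actual patches; the helper name and the
limit constants are placeholders chosen to match the counter widths
discussed above.

/*
 * Illustration only: decide whether an operation that may add "nr_to_add"
 * extents to an inode fork would push the extent count past what the
 * on-disk counter can represent.  The limits are placeholders matching
 * the counter widths discussed above, not values from the kernel tree.
 */
#include <stdbool.h>
#include <stdint.h>

#define MAX_ATTR_FORK_EXTENTS	((uint64_t)UINT16_MAX)	/* 16-bit on-disk counter */
#define MAX_DATA_FORK_EXTENTS	((uint64_t)INT32_MAX)	/* 32-bit on-disk counter */

static bool
extent_count_would_overflow(
	uint64_t	nextents,	/* current extent count in the fork */
	uint64_t	nr_to_add,	/* worst-case extents this op may add */
	bool		attr_fork)
{
	uint64_t	max = attr_fork ? MAX_ATTR_FORK_EXTENTS :
					  MAX_DATA_FORK_EXTENTS;

	/*
	 * Per the discussion above, a check like this would run right
	 * after the transaction is allocated and the inode is locked, so
	 * that failing it never has to abort a dirty transaction.
	 */
	return nextents + nr_to_add > max;
}

In a real implementation the limits and the worst-case "extents this op
may add" estimate would of course come from the filesystem geometry and
the specific operation; the sketch only shows where in the transaction
sequence the check would sit.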
Also, since the later patches extend the xattr and data extent counters to
32 and 47 bits respectively, the log reservation values will change,
because they are a function of the maximum height of the BMBT trees, and
the maximum BMBT heights are themselves calculated from the maximum number
of xattr and data extents (a small standalone illustration of this
dependency is appended at the end of this mail). Due to this, the "min log
size" can end up being larger than what was calculated at mkfs.xfs time.
This can cause the mount to fail, as shown by the following trace which
was generated when executing the xfs/306 test,

XFS (loop0): Mounting V5 Filesystem
XFS (loop0): Log size 8906 blocks too small, minimum size is 9075 blocks
XFS (loop0): AAIEEE! Log failed size checks. Abort!
XFS: Assertion failed: 0, file: fs/xfs/xfs_log.c, line: 711
------------[ cut here ]------------
WARNING: CPU: 0 PID: 12821 at fs/xfs/xfs_message.c:112 assfail+0x25/0x28
Modules linked in:
CPU: 0 PID: 12821 Comm: mount Tainted: G        W  5.6.0-rc6-next-20200320-chandan-00003-g071c2af3f4de #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
RIP: 0010:assfail+0x25/0x28
Code: ff ff 0f 0b c3 0f 1f 44 00 00 41 89 c8 48 89 d1 48 89 f2 48 c7 c6 40 b7 4b b3 e8 82 f9 ff ff 80 3d 83 d6 64 01 00 74 02 0f $
RSP: 0018:ffffb05b414cbd78 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff9d9d501d5000 RCX: 0000000000000000
RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffb346dc65
RBP: ffff9da444b49a80 R08: 0000000000000000 R09: 0000000000000000
R10: 000000000000000a R11: f000000000000000 R12: 00000000ffffffea
R13: 000000000000000e R14: 0000000000004594 R15: ffff9d9d501d5628
FS:  00007fd6c5d17c80(0000) GS:ffff9da44d800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000002 CR3: 00000008a48c0000 CR4: 00000000000006f0
Call Trace:
 xfs_log_mount+0xf8/0x300
 xfs_mountfs+0x46e/0x950
 xfs_fc_fill_super+0x318/0x510
 ? xfs_mount_free+0x30/0x30
 get_tree_bdev+0x15c/0x250
 vfs_get_tree+0x25/0xb0
 do_mount+0x740/0x9b0
 ? memdup_user+0x41/0x80
 __x64_sys_mount+0x8e/0xd0
 do_syscall_64+0x48/0x110
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fd6c5f2ccda
Code: 48 8b 0d b9 e1 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f $
RSP: 002b:00007ffe00dfb9f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 0000560c1aaa92c0 RCX: 00007fd6c5f2ccda
RDX: 0000560c1aaae110 RSI: 0000560c1aaad040 RDI: 0000560c1aaa94d0
RBP: 00007fd6c607d204 R08: 0000000000000000 R09: 0000560c1aaadde0
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 0000560c1aaa94d0 R15: 0000560c1aaae110
---[ end trace 6436391b468bc652 ]---
XFS (loop0): log mount failed

The corresponding filesystem was created using the mkfs options
"-m rmapbt=1,reflink=1 -b size=1k -d size=20m -n size=64k".

To prevent such incidents, we are contemplating the following approach,

1. Use the existing constants for the max extent counts (i.e. signed 2^16
   for xattr extents and signed 2^32 for data extents).

2. Compute the max bmbt heights, log reservations and hence the min log
   size at mount time.

3. Later, during the mount lifetime of the filesystem, when we are about
   to overflow an extent counter, we switch to the larger values (2^32 for
   xattr and 2^47 for data extents) as the max extent counts and then
   recompute the log reservations and the min log size. At this point, if
   the on-disk log size is smaller than the min log size, we return an
   error.

4. Otherwise, we set an RO feature flag and use the revised log
   reservations for future transactions.

Please let me know your opinion on the above approach.

--
chandan
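As promised above, here is a small standalone illustration of how the
assumed maximum extent count feeds into the maximum BMBT height, and hence
into the transaction reservations and the min log size. The record counts
per block are made-up placeholders and the function is not the kernel's
own maxlevels computation; it only demonstrates the dependency.

/*
 * Illustration only: estimate the maximum BMBT height needed to index a
 * given number of extents.  The per-block record counts are placeholders
 * and do not come from any real filesystem geometry.
 */
#include <stdint.h>
#include <stdio.h>

static unsigned int
max_bmbt_height(
	uint64_t	max_extents,	/* assumed max extent count */
	uint64_t	minleafrecs,	/* min records per leaf block */
	uint64_t	minnoderecs)	/* min keys/ptrs per node block */
{
	uint64_t	maxblocks;
	unsigned int	level;

	/* Leaf level: how many blocks to hold all extent records? */
	maxblocks = (max_extents + minleafrecs - 1) / minleafrecs;

	/* Add node levels until everything fits under a single root. */
	for (level = 1; maxblocks > 1; level++)
		maxblocks = (maxblocks + minnoderecs - 1) / minnoderecs;

	return level;
}

int main(void)
{
	/* Placeholder geometry: 60 leaf records, 120 node records per block. */
	printf("2^16 extents -> height %u\n", max_bmbt_height(1ULL << 16, 60, 120));
	printf("2^32 extents -> height %u\n", max_bmbt_height(1ULL << 32, 60, 120));
	printf("2^47 extents -> height %u\n", max_bmbt_height(1ULL << 47, 60, 120));
	return 0;
}

With this placeholder geometry, raising the max extent count from 2^16 to
2^32 and then 2^47 grows the worst-case tree height from 3 to 5 to 7
levels, which is what pushes up the per-transaction reservations and
therefore the min log size.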