https://bugzilla.kernel.org/show_bug.cgi?id=216110 --- Comment #4 from Darrick J. Wong (djwong@xxxxxxxxxx) --- On Fri, Jun 10, 2022 at 08:27:38AM +0000, bugzilla-daemon@xxxxxxxxxx wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=216110 > > Bug ID: 216110 > Summary: rmdir sub directory cause i_nlink of parent directory > down from 0 to 0xffffffff > Product: File System > Version: 2.5 > Kernel Version: linux-3.10.0-957.el7 Please contact your RHEL7 ^^^^^^^^^^^^^^ account representative for assistance in triaging this bug. --D > Hardware: Other > OS: Linux > Tree: Mainline > Status: NEW > Severity: high > Priority: P1 > Component: XFS > Assignee: filesystem_xfs@xxxxxxxxxxxxxxxxxxxxxx > Reporter: hexiaole1994@xxxxxxx > Regression: No > > 1. synptom > when user executed mkdir command under parent directory, mkdir command > prompted > "Too many links". > > > 2. basic analysis > (1)use "getconf LINK_MAX ." under parent directory, the max i_nlink of the > xfs(the filesystem that parent directory belongs) is 2147483647, but the > i_nlink of the parent directory now is 4294967109, because the mkdir command > will check if the i_nlink of the parent directory is lower than the LINK_MAX, > in our environment this check failed, so mkdir command prompt "Too many > links". > (2)we "cd" into the parent directory, and execute "ls|wc" to accounting the > total files of the parent directory, the result is 308875 > (3)the i_nlink by definition is "the number of links to the inode from > directories", a newly created directory has i_nlink of 2, and the i_nlink of > this newly created directory will plus 1 once there has a sub directory > created > under it(the sub directory's ".." points to parent directory cause the > i_nlink > of the parent directory plus 1), so the i_nlink of the parent directory can > also reflect the number of the sub directories(the number of sub directory = > i_nlink of the parent - 2). the i_nlink of the parent directory now is > 4294967109, if this i_nlink is valid, the number of the sub directoryes might > be 4294967109, but like the (2) shows, the total files(include directories) > under the parent directory is 308875. so we can assert the i_nlink metadata > of > the parent direcotry was corrupted. > (4)in the dmesg file of the sos_report, we saw an call trace that related to > this corrupted i_nlink of parent directory: > ... > [26038585.616782] ------------[ cut here ]------------ > [26038585.616794] WARNING: CPU: 22 PID: 21088 at fs/inode.c:284 > drop_nlink+0x3e/0x50 > [26038585.616796] Modules linked in: binfmt_misc tcp_diag inet_diag 8021q > garp > mrp stp llc bonding vfat fat ipmi_ssif amd64_edac_mod edac_mce_amd kvm joydev > irqbypass ses enclosure pcspkr scsi_transport_sas sg ipmi_si ipmi_devintf > ipmi_msghandler i2c_piix4 acpi_cpufreq ip_tables xfs libcrc32c sd_mod > crc_t10dif crct10dif_generic crct10dif_common ast crc32c_intel drm_kms_helper > syscopyarea sysfillrect igb ixgbe sysimgblt fb_sys_fops ttm i2c_algo_bit mdio > ptp drm pps_core megaraid_sas dca drm_panel_orientation_quirks ahci libahci > libata nfit libnvdimm dm_mirror dm_region_hash dm_log dm_mod > [26038585.616850] CPU: 22 PID: 21088 Comm: gbased Not tainted > 3.10.0-957.el7.hg.3.x86_64 #1 > [26038585.616851] Hardware name: Sugon H620-G30/65N32-US, BIOS 0QL1001207 > 03/03/2021 > [26038585.616853] Call Trace: > [26038585.616861] [<ffffffff86161de9>] dump_stack+0x19/0x1b > [26038585.616866] [<ffffffff85a976c8>] __warn+0xd8/0x100 > [26038585.616868] [<ffffffff85a9780d>] warn_slowpath_null+0x1d/0x20 > [26038585.616870] [<ffffffff85c5df5e>] drop_nlink+0x3e/0x50 > [26038585.616904] [<ffffffffc03f5d08>] xfs_droplink+0x28/0x60 [xfs] > [26038585.616927] [<ffffffffc03f922f>] xfs_remove+0x29f/0x310 [xfs] > [26038585.616930] [<ffffffff85c595a0>] ? take_dentry_name_snapshot+0xf0/0xf0 > [26038585.616951] [<ffffffffc03f3bb7>] xfs_vn_unlink+0x57/0xa0 [xfs] > [26038585.616953] [<ffffffff85c4dcac>] vfs_rmdir+0xdc/0x150 > [26038585.616956] [<ffffffff85c53151>] do_rmdir+0x1f1/0x220 > [26038585.616959] [<ffffffff85c436be>] ? ____fput+0xe/0x10 > [26038585.616964] [<ffffffff85abe820>] ? task_work_run+0xc0/0xe0 > [26038585.616966] [<ffffffff85c54386>] SyS_rmdir+0x16/0x20 > [26038585.616970] [<ffffffff86174ddb>] system_call_fastpath+0x22/0x27 > [26038585.616972] ---[ end trace 23639deaf902c67e ]--- > ... > (5)the call trace is from the "WARN_ON" function below: > void drop_nlink(struct inode *inode) > { > WARN_ON(inode->i_nlink == 0); > inode->__i_nlink--; > if (!inode->i_nlink) > atomic_long_inc(&inode->i_sb->s_remove_count); > } > (6)the call trace above shows at some time earlier, the i_nlink of the parent > direcotry substracted from 0 by 1, because the i_nlink is 32-bit unsigned > int, > it became 0xffffffff, and from then, the parent direcory can only decreasing > the i_nlink rather than increasing due to the LINK_MAX. > > > 3. the root cause of corrupted i_nlink of parent directory > (1)we saw another call trace in dmesg file of the same process that cause the > call trace of "SyS_rmdir" above: > ... > [18317578.683304] gbased invoked oom-killer: gfp_mask=0x200da, order=0, > oom_score_adj=0 > [18317578.683311] gbased cpuset=/ mems_allowed=0-7 > [18317578.683315] CPU: 11 PID: 17701 Comm: gbased Not tainted > 3.10.0-957.el7.hg.3.x86_64 #1 > [18317578.683318] Hardware name: Sugon H620-G30/65N32-US, BIOS 0QL1001207 > 03/03/2021 > [18317578.683320] Call Trace: > [18317578.683330] [<ffffffff86161de9>] dump_stack+0x19/0x1b > [18317578.683334] [<ffffffff8615c812>] dump_header+0x90/0x229 > [18317578.683339] [<ffffffff85bba2f4>] oom_kill_process+0x254/0x3d0 > [18317578.683342] [<ffffffff85bb9d63>] ? oom_unkillable_task+0x93/0x120 > [18317578.683345] [<ffffffff85bb9e46>] ? find_lock_task_mm+0x56/0xc0 > [18317578.683347] [<ffffffff85bbab36>] out_of_memory+0x4b6/0x4f0 > [18317578.683350] [<ffffffff8615d316>] __alloc_pages_slowpath+0x5d6/0x724 > [18317578.683353] [<ffffffff85bc0f15>] __alloc_pages_nodemask+0x405/0x420 > [18317578.683357] [<ffffffff85c11185>] alloc_pages_vma+0xb5/0x200 > [18317578.683361] [<ffffffff85bce3d0>] shmem_alloc_page+0x70/0xc0 > [18317578.683366] [<ffffffff85ac2dab>] ? autoremove_wake_function+0x2b/0x40 > [18317578.683369] [<ffffffff85acbb1b>] ? __wake_up_common+0x5b/0x90 > [18317578.683374] [<ffffffff85d7c6c4>] ? __radix_tree_lookup+0x84/0xf0 > [18317578.683377] [<ffffffff85da00ea>] ? __percpu_counter_compare+0x2a/0x90 > [18317578.683379] [<ffffffff85bd12e1>] shmem_getpage_gfp+0x451/0x840 > [18317578.683382] [<ffffffff85bd19a4>] shmem_write_begin+0x54/0x80 > [18317578.683384] [<ffffffff85bb5d94>] > generic_file_buffered_write+0x124/0x2c0 > [18317578.683386] [<ffffffff85bb86d2>] __generic_file_aio_write+0x1e2/0x400 > [18317578.683389] [<ffffffff85bb8949>] generic_file_aio_write+0x59/0xa0 > [18317578.683392] [<ffffffff85c40633>] do_sync_write+0x93/0xe0 > [18317578.683395] [<ffffffff85c41120>] vfs_write+0xc0/0x1f0 > [18317578.683397] [<ffffffff85c41f3f>] SyS_write+0x7f/0xf0 > [18317578.683401] [<ffffffff86174ddb>] system_call_fastpath+0x22/0x27 > [18317578.683402] Mem-Info: > [18317578.683486] active_anon:59939847 inactive_anon:3882578 isolated_anon:0 > ... > (2)the call trace shows this process was killed due to the "oom", we suspect > if > at the time this process being kill, its other threads(other than the > "SyS_write" thread that the call trace shows) was doing concurrent rmdir or > mkdir under the parent direcotry, the kill will cause the corrupted i_nlink > of > the parent directory, and we simulate this "oom" situation where multithread > do > concurrent mkdir and rmdir under parent directory, but the problem can not > reproduce at all. > (3)the dmesg file also shows an error related to "power saving mode": > ... > [23647870.874579] Uhhuh. NMI received for unknown reason 3d on CPU 56. > [23647870.874624] Do you have a strange power saving mode enabled? > [23647870.874650] Dazed and confused, but trying to continue > ... > (4)we are simulating this "power saving mode" error to determine if this can > cause the corrupted i_nlink problem, this is in progressing. > (5)the problematic environment now repaired by hand throught the xfs_db tool, > we manually modify the corrupted i_nlink of the parent directory to the > correct > value. > (6)in short, by now we still confusing why the corrupted i_nlink of the > parent > can happen. > > > 4. attachment descriptions > (1)the screenshot of the problematic environment that shows the corrupted > i_nlink of the parent directory. > (2)the dmesg file. > > > 5. other informations > (1)the similar problem that caused on ext4 filesystem: > > https://lkml.kernel.org/lkml/4febf11b-31ea-82a1-bf08-b6bebe08bc75@xxxxxxxxxx/T/ > > -- > You may reply to this email to add a comment. > > You are receiving this mail because: > You are watching the assignee of the bug. -- You may reply to this email to add a comment. You are receiving this mail because: You are watching the assignee of the bug.