Re: After unlinking a large file on ext4, the process stalls for a long time

Lukáš Czerner <lczerner@xxxxxxxxxx> · Thu, 17 Jul 2014 12:40:16 +0200 (CEST)

On Thu, 17 Jul 2014, Mason wrote:

> Date: Thu, 17 Jul 2014 12:30:34 +0200
> From: Mason <mpeg.blue@xxxxxxx>
> To: Andreas Dilger <adilger@xxxxxxxxx>
> Cc: Ext4 Developers List <linux-ext4@xxxxxxxxxxxxxxx>,
>     linux-fsdevel <linux-fsdevel@xxxxxxxxxxxxxxx>
> Subject: Re: After unlinking a large file on ext4,
>     the process stalls for a long time
> 
> Hello,
> 
> Andreas Dilger wrote:
> 
> > Mason wrote:
> > 
> >> The use case is
> >> - allocate a large file
> >> - stick a file system on it
> >> - store stuff (typically video files) inside this "private" FS
> >> - when the user decides he doesn't need it anymore, unmount and unlink
> >> (I also have a resize operation in there, but I wanted to get the
> >> basics before taking the hard stuff head on.)
> >> 
> >> So, in the limit, we don't store anything at all: just create and
> >> immediately delete. This was my test.
> > 
> > I would agree that LVM is the real solution that you want to use.
> > It is specifically designed for this, and has much less overhead than
> > a filesystem on a loopback device on a file on another filesystem.
> > The amount of space overhead is tuneable, but typically the volumes
> > are allocated in multiples of 4MB chunks.
> 
> I'll take a look at LVM. (But, at this point, it's too late to change
> the architecture of the system.)
> 
> > That said, I think you've found some kind of strange performance problem,
> > and it is worthwhile to figure this out.
> > 
> >>>> /tmp # time ./foo /mnt/hdd/xxx 5
> >>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [68 ms]
> >>>> unlink(filename): 0 [0 ms]
> >>>> 0.00user 1.86system 0:01.92elapsed 97%CPU (0avgtext+0avgdata 528maxresident)k
> >>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
> >>>>
> >>>> /tmp # time ./foo /mnt/hdd/xxx 10
> >>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [141 ms]
> >>>> unlink(filename): 0 [0 ms]
> >>>> 0.00user 3.71system 0:03.83elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k
> >>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
> >>>>
> >>>> /tmp # time ./foo /mnt/hdd/xxx 100
> >>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [1882 ms]
> >>>> unlink(filename): 0 [0 ms]
> >>>> 0.00user 37.12system 0:38.93elapsed 95%CPU (0avgtext+0avgdata 528maxresident)k
> >>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
> >>>>
> >>>> /tmp # time ./foo /mnt/hdd/xxx 300
> >>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [3883 ms]
> >>>> unlink(filename): 0 [0 ms]
> >>>> 0.00user 111.38system 1:55.04elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k
> >>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
> 
> Preliminary info:
> 
> The partition was created/mounted with
> $ mkfs.ext4 -m 0 -i 1024000 -L ZOZO -O ^has_journal,^huge_file /dev/sda1
> $ mount -t ext4 /dev/sda1 /mnt/hdd -o noexec,noatime
> (mount is busybox, in case it matters)
> 
> mke2fs 1.42.10 (18-May-2014)
> /dev/sda1 contains a ext4 file system labelled 'ZOZO'
>         last mounted on /mnt/hdd on Wed Jul 16 15:40:40 2014
> Proceed anyway? (y,n) y
> Creating filesystem with 104857600 4k blocks and 460800 inodes
> Filesystem UUID: 8c12c8fe-6ab8-4888-b9a3-6f28c86020eb
> Superblock backups stored on blocks:
>         32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
>         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
>         102400000
> 
> Allocating group tables: done
> Writing inode tables: done
> Writing superblocks and filesystem accounting information: done
> 
> /dev/sda1 on /mnt/hdd type ext4 (rw,noexec,noatime,barrier=1)
> /* No support for xattr in this kernel */
> 
> # dumpe2fs -h /dev/sda1
> dumpe2fs 1.42.10 (18-May-2014)
> Filesystem volume name:   ZOZO
> Last mounted on:          <not available>
> Filesystem UUID:          8c12c8fe-6ab8-4888-b9a3-6f28c86020eb
> Filesystem magic number:  0xEF53
> Filesystem revision #:    1 (dynamic)
> Filesystem features:      ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file uninit_bg dir_nlink extra_isize
> Filesystem flags:         signed_directory_hash
> Default mount options:    user_xattr acl
> Filesystem state:         not clean
> Errors behavior:          Continue
> Filesystem OS type:       Linux
> Inode count:              460800
> Block count:              104857600
> Reserved block count:     0
> Free blocks:              104803944
> Free inodes:              460789
> First block:              0
> Block size:               4096
> Fragment size:            4096
> Reserved GDT blocks:      999
> Blocks per group:         32768
> Fragments per group:      32768
> Inodes per group:         144
> Inode blocks per group:   9
> Flex block group size:    16
> Filesystem created:       Thu Jul 17 11:14:27 2014
> Last mount time:          Thu Jul 17 11:14:29 2014
> Last write time:          Thu Jul 17 11:14:29 2014
> Mount count:              1
> Maximum mount count:      -1
> Last checked:             Thu Jul 17 11:14:27 2014
> Check interval:           0 (<none>)
> Lifetime writes:          4883 kB
> Reserved blocks uid:      0 (user root)
> Reserved blocks gid:      0 (group unknown)
> First inode:              11
> Inode size:               256
> Required extra isize:     28
> Desired extra isize:      28
> Default directory hash:   half_md4
> Directory Hash Seed:      157f2107-76fc-417b-9a07-491951c873b7
> 
> > Firstly, have you tried using "fallocate()" directly, instead of
> > posix_fallocate()?  It may be (depending on your userspace) that
> > posix_fallocate() is writing zeroes to the file instead of using
> > the fallocate() syscall, and the kernel is busy cleaning up all
> > of the dirty pages when the file is unlinked.  You could try using
> > strace to see what system calls are actually being used.
> 
> Unfortunately, I'm using a prehistoric version of glibc (2.8)
> that doesn't support the fallocate wrapper (imported in 2.10).
> 
> I'm 70% sure that posix_fallocate() is not actually writing zeros
> to the file, because when I tested it on ext2, creating a 300-GB
> file took hours, literally (approx. 3 hours). The same operation
> on ext4 takes a few seconds. (Although, now that I think of it,
> it could be working asynchronously, or defer some operation, that
> I eventually have to pay for on deletion.)
> 
> # time strace -tt -T ./foo /mnt/hdd/xxx 300 2> strace.out
> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [414 ms]
> unlink(filename): 0 [1 ms]
> 
> 
> 12:23:27.218838 open("/mnt/hdd/xxx", O_WRONLY|O_CREAT|O_EXCL|O_LARGEFILE, 0600) = 3 <0.000486>
> 12:23:27.220121 clock_gettime(CLOCK_MONOTONIC, {79879, 926227018}) = 0 <0.000105>
> 12:23:27.221029 SYS_4320()              = 0 <0.412013>
> 12:23:27.633673 clock_gettime(CLOCK_MONOTONIC, {79880, 339646593}) = 0 <0.000104>
> 12:23:27.634657 fstat64(1, {st_mode=S_IFCHR|0755, st_rdev=makedev(4, 64), ...}) = 0 <0.000116>
> 12:23:27.636187 ioctl(1, TIOCNXCL, {B115200 opost isig icanon echo ...}) = 0 <0.000146>
> 12:23:27.637509 old_mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x77248000 <0.000143>
> 12:23:27.638306 write(1, "posix_fallocate(fd, 0, size_in_G"..., 54) = 54 <0.000237>
> 12:23:27.639496 clock_gettime(CLOCK_MONOTONIC, {79880, 345448452}) = 0 <0.000102>
> 12:23:27.640168 unlink("/mnt/hdd/xxx")  = 0 <0.000231>
> 12:23:27.641174 clock_gettime(CLOCK_MONOTONIC, {79880, 347202581}) = 0 <0.000100>
> 12:23:27.641984 write(1, "unlink(filename): 0 [1 ms]\n", 27) = 27 <0.000157>
> 12:23:27.643056 exit_group(0)           = ?
> 0.02user 111.51system 1:51.99elapsed 99%CPU (0avgtext+0avgdata 864maxresident)k
> 0inputs+0outputs (0major+459minor)pagefaults 0swaps

So it really does not seem to be stalling in fallocate, nor unlink.
Can you add close() before unlink, just to be sure what's happening
there ?

Thanks!
-Lukas

> 
> 
> AFAICT, SYS_4320() is fallocate.
> 
> /*
>  * Linux o32 style syscalls are in the range from 4000 to 4999.
>  */
> #define __NR_Linux  4000
> #define __NR_fallocate  (__NR_Linux + 320)
> 
> 
> Where is the process stalling? That is a mystery. Seems it's stuck
> in exit_group(), waiting for the kernel to clean up on its behalf?
> Maybe I need ftrace, or something to profile the kernel?
> 
> > Secondly, where is the process actually stuck?  From your output
> > above, the unlink() call takes no measurable time before returning,
> > so I don't see where it is actually stuck.  Again, running your
> > test with "strace -tt -T ./foo /mnt/hdd/xxx 300" will show which
> > syscall is actually taking so much time to complete.  I don't
> > think it is unlink().
> 
> See above, the process is stalled, but I don't know where!
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html