Re: After unlinking a large file on ext4, the process stalls for a long time

Andreas Dilger <adilger@xxxxxxxxx> · Wed, 16 Jul 2014 21:37:59 -0600

On Jul 16, 2014, at 11:16 AM, Mason <mpeg.blue@xxxxxxx> wrote:
> (I hope you'll forgive me for reformatting the quote characters
> to my taste.)

Thank you.

> On 16/07/2014 17:16, John Stoffel wrote:
>> Mason wrote:
>>> I'm using Linux (3.1.10 at the moment) on a embedded system
>>> similar in spec to a desktop PC from 15 years ago (256 MB RAM,
>>> 800-MHz CPU, USB).
>> 
>> Sounds like a Raspberry Pi...  And have you investigated using
>> something like XFS as your filesystem instead?
> 
> The system is a set-top box (DVB-S2 receiver). The system CPU is
> MIPS 74K, not ARM (not that it matters, in this case).
> 
> No, I have not investigated other file systems (yet).
> 
>>> I need to be able to create large files (50-1000 GB) "as fast
>>> as possible".  These files are created on an external hard disk
>>> drive, connected over Hi-Speed USB (typical throughput 30 MB/s).
>> 
>> Really... so you just need to create allocations of space as quickly
>> as possible,
> 
> I may not have been clear. The creation needs to be fast (in UX terms,
> so less than 5-10 seconds), but it only occurs a few times during the
> lifetime of the system.
> 
>> which will then be filled in later with actual data?
> 
> Yes. In fact, I use the loopback device to format the file as an
> ext4 partition. 
> 
> The use case is
> - allocate a large file
> - stick a file system on it
> - store stuff (typically video files) inside this "private" FS
> - when the user decides he doesn't need it anymore, unmount and unlink
> (I also have a resize operation in there, but I wanted to get the
> basics before taking the hard stuff head on.)
> 
> So, in the limit, we don't store anything at all: just create and
> immediately delete. This was my test.

I would agree that LVM is the real solution that you want to use.
It is specifically designed for this, and has much less overhead than
a filesystem on a loopback device on a file on another filesystem.
The amount of space overhead is tuneable, but typically the volumes
are allocated in multiples of 4MB chunks.

That said, I think you've found some kind of strange performance problem,
and it is worthwhile to figure this out.

>>> /tmp # time ./foo /mnt/hdd/xxx 5
>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [68 ms]
>>> unlink(filename): 0 [0 ms]
>>> 0.00user 1.86system 0:01.92elapsed 97%CPU (0avgtext+0avgdata 528maxresident)k
>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
>>> 
>>> /tmp # time ./foo /mnt/hdd/xxx 10
>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [141 ms]
>>> unlink(filename): 0 [0 ms]
>>> 0.00user 3.71system 0:03.83elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k
>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
>>> 
>>> /tmp # time ./foo /mnt/hdd/xxx 100
>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [1882 ms]
>>> unlink(filename): 0 [0 ms]
>>> 0.00user 37.12system 0:38.93elapsed 95%CPU (0avgtext+0avgdata 528maxresident)k
>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
>>> 
>>> /tmp # time ./foo /mnt/hdd/xxx 300
>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [3883 ms]
>>> unlink(filename): 0 [0 ms]
>>> 0.00user 111.38system 1:55.04elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k
>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps

Firstly, have you tried using "fallocate()" directly, instead of
posix_fallocate()?  It may be (depending on your userspace) that
posix_fallocate() is writing zeroes to the file instead of using
the fallocate() syscall, and the kernel is busy cleaning up all
of the dirty pages when the file is unlinked.  You could try using
strace to see what system calls are actually being used.

Secondly, where is the process actually stuck?  From your output
above, the unlink() call takes no measurable time before returning,
so I don't see where it is actually stuck.  Again, running your
test with "strace -tt -T ./foo /mnt/hdd/xxx 300" will show which
syscall is actually taking so much time to complete.  I don't
think it is unlink().

Cheers, Andreas

Attachment:
signature.asc

Description: Message signed with OpenPGP using GPGMail