On Mon, Apr 06, 2015 at 01:31:41PM -0600, Jun He wrote: > This patch addresses the tail allocation problem found at paper "Reducing File System Tail Latencies with Chopper". https://www.usenix.org/system/files/conference/fast15/fast15-paper-he.pdf . The paper refers the tail allocation problem as the Special End problem. Hi Jun He, Kernel patches should have the commit body line-wrapped at 72 columns (more or less). The idea is that the commit should be easily readable on an 80 column screen, with some extra space left for identation (git log will ident the commit body by 4 characters). > A tail extent is the last extent of a file. The last block of the > tail extent corresponds to the last logical block of the file. When > a file is closed and the tail extent is being allocated, ext4 marks > the allocation request with the hint EXT4_MB_HINT_NOPREALLOC. The > intention is to avoid preallocation when we know the file is final > (it won't change until the next open.). But the implementation leads > to some problems. The following program attacks the problem. I think you mean "the following program _demonstrates_ the problem". > EXT4_MB_HINT_NOPREALLOC is not the right hint for special end. If we > remove current check for tail in ext4_mb_group_or_file() , 1, 3, 4 > will be satisfied. Now, we only need to handle the tail of large > file (2) by checking tail when normalizing. If it is a tail of large > file, we simply do not normalize it. The problem with skipping the normalization is this also by passes the goal block optimizations based on pleft, pright, lleft, and lright. In the case of a small file, it's not always true that you want to allocate from a LG preallocation. If there is free space immediately adjacent to file, we want to use that in preference to the LG preallocation --- since that would result in a contiguous file. > + > + /* don't normalize tail of large files */ > + if ((size == isize) && > + !ext4_fs_is_busy(sbi) && > + (atomic_read(&ac->ac_inode->i_writecount) == 0)) { > + return; > + } The comment is a bit misleading; it's not checking for large files at all. It also looks like that you copied the ext4_fs_is_busy(sbi) from the original check. Perhaps it would be helpful to explain the design goal of the original code. As originally conceived, the problem we were trying to fix was that in the case of unpacking a tar file with a large number of small files, we wanted to make sure all of the files would be packed togehter, without extra space reserved (caused by the normalization). This was achieved by disabing the use of the locality group. The problem is that if there are a large number of processes writing into the same block group, this would cause contention, and so in the case where we had contention, we would use the locality group (since the whole point was to eliminate the need to constantly take the block group lock while allocaitng out of the block group; that's why the locality groups are per-cpu). That's what the ext4_fs_is_busy() is all about. We try to measure contention and disable the optimization if that was the case. I suspect we may want to take a step back and what's the best way of handling all of the high level goals, and refomulate a better strategy: 1) If the file doesn't have any blocks allocated yet, and it is closed by the time we need to allocate all of its blocks (thanks to delayed allocation), we should allocate exactly the number of blocks needed, and try to avoid fragmenting free space. 2) In the case where we are doing any tails at all, it's a different optimization altogether, and the main goal should be to make sure the tail is connected the previous blocks if at all possible. To the extent that we have per-CPU LG groups, this can be... problematic. 3) The primary goal of the per-group locality group is to avoid lock contention when multiple CPU's are trying to access the blocks at the same time. Depending on why we are doing block allocation, this may not be an issue. We should probably into account why it is we re doing block allocation at particular moment: *) In the case of delayed allocation w/o fsync(2), the writeback will be taking place from the writeback daemon. In that case, all of allocations will be issued from a writeback thread, so contention is much less of an issue. In addition, in the case where the entire file has been written within 30 seconds, we will be allocating the entire file at once, as described in (1). This is a fairly common case, so we should optimize for it. *) In the case of block allocations resulting from an fallocate(2), we should assume userspace knew what it was doing, and we shouldn't try to do any kind of size normalization. Also, fallocate() calls generally don't happen with high frequency, so worrying about CPU scalability is probalby not a major issue. *) In the case of nodelalloc (or ext3 compatibility mode). In that case, we will be doing lots of singleton allocation, and it's in this case where normalization and preallocation is most important. *) The Lustre file calls ext4_map_blocks() directly. We'll need to check with Andreas, but the user is writing a lot of small files over Lustre, it may be that lock contention will be an issue, and we might want to use LG preallocation groups. *) Writes caused by fsync(2). (For example, from SQLite files.) In this case, lock contention might very well be an issue, and use of the LG preallocation group some kind of file normalization may very well be a good idea. In fact, we might want to have a hueristic which notices files which are written using either a slow append or an append followed by an fsync workload, and make sure we use an inode perallocation group, and in the case of a slow append workload, we might want to hold the preallocation over a file close (with some way of releasing blocks in an ENOSPC situation). Some of the mballoc failure modes which you found in the Chopper paper was specifically we weren't taking these various different scenarios into account, and so we used an optimization that worked well when we were allocating the whole file at a time (thanks to delalloc and writebacks), and tht hueristic failed miserably in the case of large files where we happened to be writing a tail. So before we start playing with changing hueristics, it would be a good idea if we make sure the block allocator has enough context so it understands how it is being called and by whom, and make sure we that we don't use the wrong hueristic by accident. Does this make sense? - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html