在 2012年2月15日 上午5:24,Aditya Kali <adityakali@xxxxxxxxxx> 写道: > Hi Robin, > > > On Tue, Feb 14, 2012 at 12:01 AM, Robin Dong <hao.bigrat@xxxxxxxxx> wrote: >> >> 在 2012年2月10日 上午10:55,Ted Ts'o <tytso@xxxxxxx> 写道: >> > On Thu, Feb 02, 2012 at 03:48:49PM +0800, Robin Dong wrote: >> >> >> >> The reason is ext4_has_free_clusters reporting wrong >> >> result. Actually, the unit of argument "dirty_clusters" is block, so >> >> we should tranform the sbi->s_dirtyclusters_counter to block , just >> >> like "free_clusters". >> > >> > I've been looking at the the delayed allocation reservation code, and >> > I think it's a lot more confused that this. Some places we are >> > playing with fields as if they are clusters, and some times as if they >> > are blocks, and I've definitely found places where we're not handling >> > quota correctly with bigalloc (ext4_map_blocks is calling >> > ext4_da_update_reserve_space() with a value which is clearly in units >> > of blocks, but we are treating that vale as if it is clusters, etc.) >> > >> > So I think the problem is a lot larger than what you pointed out, and >> > we shouldn't fix it just by throwing in an EXT4_C2B call. If >> > s_dirtyclusters_counter should be denominated in blocks, then we need >> > to rename it. And there there's this comment which made me wince: >> > >> > * Note that in case of bigalloc, i_reserved_meta_blocks, >> > * i_reserved_data_blocks, etc. refer to number of clusters. >> > >> > Argh, no. If something is denominated in clusters, then it should be >> > appropriately named as such. >> > >> > Actually, I think the problem is a lot bigger than this, and it's a >> > conceptual one. When we try writing to a block which does not have a >> > mapping, and bigalloc is enabled, there are three cases: >> > >> > 1) This is the first time the cluster has ever been written to; hence, >> > even though we are dirtying a block, we need to reserve room for the >> > entire cluster. >> > >> > 2) While the block has not been mapped (and so the data on disk is >> > uninitialized) the cluster has indeed been allocated, so we don't need >> > to reserve any additional room for the block. >> > >> > 3) One or more blocks in the cluster have already been written to via >> > delayed allocation, but the cluster itself has not been selected >> > (allocated) yet. But since the space for the entire cluster was >> > reserved when the first block in the cluster was dirtied (case #1), we >> > don't need to reserve any more space. >> > >> > We aren't handling these cases correctly today, since we don't have >> > this logic here. These cases don't matter if you're only using >> > fallocate() and direct I/O, but if you are using buffered writes and >> > delayed allocation, they obviously matter a huge amount. >> > >> >> > And I don't think we're handling it correctly. Argh... >> > > > I thought we do handle these cases correctly after the 'ext4: attempt > to fix race in bigalloc code path' change. May be I missed something > or the code changed somewhere. If there is a testcase that fails, I > can take a look. > >> > - Ted >> >> Hi, Ted and Aditya >> >> I am trying to fix this "bigalloc quota reservation in >> delay-allocation" problem recently, and I feel confusion about these >> code below in ext4_ext_map_blocks(): >> >> if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) { >> unsigned int reserved_clusters; >> /* >> * Check how many clusters we had reserved this allocated range >> */ >> reserved_clusters = get_reserved_cluster_alloc(inode, >> map->m_lblk, allocated); >> if (map->m_flags & EXT4_MAP_FROM_CLUSTER) { >> if (reserved_clusters) { >> ...... >> ext4_da_update_reserve_space(inode, >> reserved_clusters, 0); >> } >> } else { >> BUG_ON(allocated_clusters < reserved_clusters); >> /* We will claim quota for all newly allocated blocks.*/ >> ext4_da_update_reserve_space(inode, allocated_clusters, >> 1); >> if (reserved_clusters < allocated_clusters) { >> struct ext4_inode_info *ei = EXT4_I(inode); >> int reservation = allocated_clusters - >> reserved_clusters; >> dquot_reserve_block(inode, >> EXT4_C2B(sbi, reservation)); >> spin_lock(&ei->i_block_reservation_lock); >> ei->i_reserved_data_blocks += reservation; >> spin_unlock(&ei->i_block_reservation_lock); >> } >> } >> } >> >> if we have allocated 3 clusters, why should we add "reservation" back >> to i_reservd_data_blocks ? > Because we will end up here again next time when we write blocks > [10-11] and try to claim space for them. At this time, there will be > no more reserved blocks left (unless you give them back) and you will > hit the warning in ext4_da_update_reserve_space(): > > if (unlikely(used > ei->i_reserved_data_blocks)) { > ext4_msg(inode->i_sb, KERN_NOTICE, "%s: ino %lu, used %d " > "with only %d reserved data blocks\n", > __func__, inode->i_ino, used, > ei->i_reserved_data_blocks); > WARN_ON(1); > used = ei->i_reserved_data_blocks; > } > > >> (Aditya Kali 's long comment dose not make >> sense to me) Maybe we don't need to do ext4_da_update_reserve_space >> when the new block is allocated from "implied_cluster"? Why not just >> change them to: >> >> if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) { >> if ( !(map->m_flags & EXT4_MAP_FROM_CLUSTER) ) >> /* We will claim quota for all newly allocated blocks.*/ >> ext4_da_update_reserve_space(inode, allocated_clusters, >> 1); >> } >> > > This won't work in all cases, like the one given in the long comment. > With above change, ext4_da_update_reserve_space() will get called > again the next time for some delayed allocated blocks in the same > cluster, but there isn't any reserved blocks left to claim. > The flag 'EXT4_MAP_FROM_CLUSTER' is not a sufficient check to > determine 'how many' clusters we need to claim. This is because > get_implied_cluster_alloc() only checks within already allocated > extents (and not in the page cache). So, I had to add the function > get_reserved_cluster_alloc() to check if there are any delayed > allocated blocks still in the page-cache. But I found the code in ext4_ext_map_blocks(): if ((sbi->s_cluster_ratio > 1) && ext4_find_delalloc_cluster(inode, map->m_lblk, 0)) map->m_flags |= EXT4_MAP_FROM_CLUSTER; thus, even the delayed allocated block in the page-cache (have not be allocated) already have been checked. Therefore in my opinion, the EXT4_MAP_FROM_CLUSTER is a sufficient check. Follow the case of your long comment :) , process will go like this in MY CODE ABOVE : [0-3], [4-7], [8-11] 1. delay-allocation write blocks 10&11 we reserve 1 cluster, the i_reserved_data_blocks is 1 now 2. delay-allocation write blocks 3 to 8 we reserve other 2 clusters, so 3 clusters totally, the i_reserved_data_blocks is 3 now 3. writeout the blocks 3 to 8 claim all 3 clusters, i_reserved_data_blocks is 0 now At this moment, we really have allocated 3 clusters, and the remain 10&11 block would never occupy another cluster (it would definitely go into the [8-11] cluster), so, why need we reserve one more cluster quota ? 4. writeout the 10&11 blocks the 10&11 blocks will be set EXT4_MAP_FROM_CLUSTER flag ( by get_implied_cluster_alloc as the 8~9 block have been allocated), and it will not call ext4_da_update_reserve_space any more As this, no warning, no reserve-and-release ping pong....at least in my imaging Could you please point out what I missed ? Thanks > You will also miss the other case (when EXT4_MAP_FROM_CLUSTER is set) > where we had reserved the cluster, but ended up not allocating it. > Since this cluster was not allocated, we need to update the reserved > space without claiming quota (ext4_da_update_reserve_space(inode, > reserved_clusters, 0) --- notice the last argument is '0'). > > I know its kinda messy, but I think it works. Also, since we don't > keep track of delayed extents, its harder to check in one go. > > As for your proposed fix earlier, the correct solution will be to not > convert 'free_clusters' value into blocks. > > diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c > index f9e2cd8..2c518f8 100644 > --- a/fs/ext4/balloc.c > +++ b/fs/ext4/balloc.c > @@ -426,7 +426,7 @@ static int ext4_has_free_clusters(struct ext4_sb_info *sbi, > > if (free_clusters - (nclusters + root_clusters + dirty_clusters) < > EXT4_FREECLUSTERS_WATERMARK) { > - free_clusters = EXT4_C2B(sbi, > percpu_counter_sum_positive(fcc)); > + free_clusters = percpu_counter_sum_positive(fcc); > dirty_clusters = percpu_counter_sum_positive(dcc); > } > /* Check whether we have space after accounting for current > > > We were incorrectly converting clusters to blocks and so estimating > that we have more free clusters available than what we actually do. > Hence the dmesg errors at write time. > >> >> -- >> -- >> Best Regard >> Robin Dong > > > Thanks, > Aditya -- -- Best Regard Robin Dong -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html