On 2015-05-09 06:09, Ryusuke Konishi wrote:
> On Sun, 3 May 2015 12:05:18 +0200, Andreas Rohner wrote:
>> This patch adds a cache for the SUFILE to efficiently store lots of
>> small changes to su_nlive_blks in memory and apply the accumulated
>> results later at segment construction. This improves performance of
>> these operations and reduces lock contention in the SUFILE.
>>
>> The implementation uses a radix_tree to store cache nodes, which
>> contain a certain number of values. Every value corresponds to
>> exactly one SUFILE entry. If the cache is flushed the values are
>> subtracted from the su_nlive_blks field of the corresponding SUFILE
>> entry.
>>
>> If the parameter only_mark of the function nilfs_sufile_flush_cache() is
>> set, then the blocks that would have been dirtied by the flush are
>> marked as dirty, but nothing is actually written to them. This mode is
>> useful during segment construction, when blocks need to be marked dirty
>> in advance.
>>
>> New nodes are allocated on demand. The lookup of nodes is protected by
>> rcu_read_lock() and the modification of values is protected by a block
>> group lock. This should allow for concurrent updates to the cache.
>>
>> Signed-off-by: Andreas Rohner <andreas.rohner@xxxxxxx>
>> ---
>>  fs/nilfs2/sufile.c | 369 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  fs/nilfs2/sufile.h |   5 +
>>  2 files changed, 374 insertions(+)
>>
>> diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
>> index 1cce358..80bbd87 100644
>> --- a/fs/nilfs2/sufile.c
>> +++ b/fs/nilfs2/sufile.c
>> @@ -26,6 +26,7 @@
>>  #include <linux/string.h>
>>  #include <linux/buffer_head.h>
>>  #include <linux/errno.h>
>> +#include <linux/radix-tree.h>
>>  #include <linux/nilfs2_fs.h>
>>  #include "mdt.h"
>>  #include "sufile.h"
>> @@ -42,6 +43,11 @@ struct nilfs_sufile_info {
>>  	unsigned long ncleansegs;/* number of clean segments */
>>  	__u64 allocmin;		/* lower limit of allocatable segment range */
>>  	__u64 allocmax;		/* upper limit of allocatable segment range */
>> +
>
>> +	struct blockgroup_lock nlive_blks_cache_bgl;
>> +	spinlock_t nlive_blks_cache_lock;
>> +	int nlive_blks_cache_dirty;
>> +	struct radix_tree_root nlive_blks_cache;
>
> blockgroup_lock is not needed. For the counter operations in this
> patch, using cmpxchg() or atomic_xxx() is more effective as I mention
> later.
>
> And, I prefer to address this cache as updates of segment usage
> instead of that of nlive_blks. In that sense, it's preferable
> to define the array element like:
>
> struct nilfs_segusage_update {
> 	__u32 nlive_blks_adj;
> };

Great idea!

> and define the variable names like update_cache (instead of
> nlive_blks_cache), update_cache_lock, update_cache_dirty, etc.

I really like this suggestion. I was struggling to come up with good
names for all the cache related functions.
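
Just to make sure I understood you correctly, the SUFILE info struct
would then end up looking roughly like this (only a sketch; the exact
field set is my guess):

	struct nilfs_segusage_update {
		__u32 nlive_blks_adj;	/* pending decrement of su_nlive_blks */
	};

	struct nilfs_sufile_info {
		...
		spinlock_t update_cache_lock;
		int update_cache_dirty;
		struct radix_tree_root update_cache;
	};
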
>
>>  };
>>
>>  static inline struct nilfs_sufile_info *NILFS_SUI(struct inode *sufile)
>> @@ -1194,6 +1200,362 @@ out_sem:
>>  }
>>
>>  /**
>> + * nilfs_sufile_alloc_cache_node - allocate and insert a new cache node
>> + * @sufile: inode of segment usage file
>> + * @group: group to allocate a node for
>> + *
>> + * Description: Allocates a new cache node and inserts it into the cache. If
>> + * there is an error, nothing will be allocated. If there already exists
>> + * a node for @group, no new node will be allocated.
>> + *
>> + * Return Value: On success, 0 is returned, on error, one of the following
>> + * negative error codes is returned.
>> + *
>> + * %-ENOMEM - Insufficient amount of memory available.
>> + */
>> +static int nilfs_sufile_alloc_cache_node(struct inode *sufile,
>> +					 unsigned long group)
>> +{
>> +	struct nilfs_sufile_info *sui = NILFS_SUI(sufile);
>> +	struct nilfs_sufile_cache_node *node;
>> +	int ret;
>> +
>> +	node = kmem_cache_alloc(nilfs_sufile_node_cachep, GFP_NOFS);
>> +	if (!node)
>> +		return -ENOMEM;
>> +
>> +	ret = radix_tree_preload(GFP_NOFS);
>> +	if (ret)
>> +		goto free_node;
>> +
>> +	spin_lock(&sui->nlive_blks_cache_lock);
>> +	ret = radix_tree_insert(&sui->nlive_blks_cache, group, node);
>> +	spin_unlock(&sui->nlive_blks_cache_lock);
>> +
>> +	radix_tree_preload_end();
>> +
>
>> +	if (ret == -EEXIST) {
>> +		ret = 0;
>> +		goto free_node;
>> +	} else if (ret)
>> +		goto free_node;
>> +
>> +	return 0;
>> +free_node:
>> +	kmem_cache_free(nilfs_sufile_node_cachep, node);
>> +	return ret;
>
> The above error check implies two branches in the regular path.
> Consider rewriting it as follows:
>
> 	if (!ret)
> 		return 0;
>
> 	if (ret == -EEXIST)
> 		ret = 0;
> free_node:
> 	kmem_cache_free(nilfs_sufile_node_cachep, node);
> 	return ret;

Ok.

> By the way, you should use braces in both branches if one of them
> has multiple statements in an "if else" conditional statement. This
> exception is written in Chapter 3 of Documentation/CodingStyle.
>
> e.g.
>
> 	if (condition) {
> 		do_this();
> 		do_that();
> 	} else {
> 		otherwise();
> 	}

Ok.

>> +}
>> +
>> +/**
>> + * nilfs_sufile_dec_nlive_blks - decrements nlive_blks in the cache
>> + * @sufile: inode of segment usage file
>> + * @segnum: segnum for which nlive_blks will be decremented
>> + *
>> + * Description: Decrements the number of live blocks for @segnum in the cache.
>> + * This function only affects the cache. If the cache is not flushed at a
>> + * later time the changes are lost. It tries to lookup the group node to
>> + * which the @segnum belongs in a lock free manner and uses a blockgroup lock
>> + * to do the actual modification on the node.
>> + *
>> + * Return Value: On success, 0 is returned on error, one of the following
>> + * negative error codes is returned.
>> + *
>> + * %-ENOMEM - Insufficient amount of memory available.
>> + */
>> +int nilfs_sufile_dec_nlive_blks(struct inode *sufile, __u64 segnum)
>> +{
>> +	struct nilfs_sufile_info *sui = NILFS_SUI(sufile);
>> +	struct nilfs_sufile_cache_node *node;
>> +	spinlock_t *lock;
>> +	unsigned long group;
>> +	int ret;
>> +
>> +	group = (unsigned long)(segnum >> NILFS_SUFILE_CACHE_NODE_SHIFT);
>> +
>> +try_again:
>> +	rcu_read_lock();
>> +	node = radix_tree_lookup(&sui->nlive_blks_cache, group);
>> +	if (!node) {
>> +		rcu_read_unlock();
>> +
>> +		ret = nilfs_sufile_alloc_cache_node(sufile, group);
>> +		if (ret)
>> +			return ret;
>> +
>> +		/*
>> +		 * It is important to acquire the rcu_read_lock() before using
>> +		 * the node pointer
>> +		 */
>> +		goto try_again;
>> +	}
>> +
>
>> +	lock = bgl_lock_ptr(&sui->nlive_blks_cache_bgl, (unsigned int)group);
>> +	spin_lock(lock);
>> +	node->values[segnum & ((1 << NILFS_SUFILE_CACHE_NODE_SHIFT) - 1)] += 1;
>> +	sui->nlive_blks_cache_dirty = 1;
>> +	spin_unlock(lock);
>> +	rcu_read_unlock();
>> +
>> +	return 0;
>> +}
>
> Consider using cmpxchg() or atomic_inc(), and using
> NILFS_SUFILE_CACHE_NODE_MASK to mask segnum. The following is an
> example in the case of using cmpxchg():
>
> 	__u32 old, new, *valuep;
> 	...
> 	valuep = &node->values[segnum & (NILFS_SUFILE_CACHE_NODE_COUNT - 1)];
> 	do {
> 		old = ACCESS_ONCE(*valuep);
> 		new = old + 1;
> 	} while (cmpxchg(valuep, old, new) != old);
>
> 	sui->nlive_blks_cache_dirty = 1;
>
> 	rcu_read_unlock();
> 	return 0;
> }
>
> The current atomic_xxxx() macros are actually defined in the same way
> to reduce overheads in SMP environments.
>
> Using atomic_xxxx() is preferable, but formally it requires
> initialization with "atomic_set(&counter, 0)" or "ATOMIC_INIT(0)" for
> every element. I don't know whether initialization with the memset()
> function is allowed or not for atomic_t type variables.

Ok, but I would also have to use cmpxchg() in
nilfs_sufile_flush_cache_node() if the value needs to be set to 0.
Currently the cache is only flushed during segment construction, so
there should be no concurrent calls to nilfs_sufile_dec_nlive_blks().
But I thought it would be best to design a thread-safe flush function.
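
With cmpxchg() on the increment side, the flush side could then clear
each value atomically with xchg() instead of taking a lock. Roughly
like this (only a sketch, assuming the values stay plain __u32 rather
than atomic_t):

	/* in nilfs_sufile_flush_cache_node(), inner loop */
	if (only_mark)
		value = ACCESS_ONCE(node->values[i]);
	else
		value = xchg(&node->values[i], 0); /* read and clear atomically */
	if (!value)
		continue;

A decrement that races with the flush would then make the cmpxchg()
loop retry against the cleared value, so no update would be lost.
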
>> +
>> +/**
>> + * nilfs_sufile_flush_cache_node - flushes one cache node to the SUFILE
>> + * @sufile: inode of segment usage file
>> + * @node: cache node to flush
>> + * @only_mark: do not write anything, but mark the blocks as dirty
>> + * @pndirty_blks: pointer to return number of dirtied blocks
>> + *
>> + * Description: Flushes one cache node to the SUFILE and also clears the cache
>> + * node at the same time. If @only_mark is 1, nothing is written to the
>> + * SUFILE, but the blocks are still marked as dirty. This is useful to mark
>> + * the blocks in one phase of the segment creation and write them in another.
>> + *
>> + * Return Value: On success, 0 is returned on error, one of the following
>> + * negative error codes is returned.
>> + *
>> + * %-ENOMEM - Insufficient memory available.
>> + *
>> + * %-EIO - I/O error
>> + *
>> + * %-EROFS - Read only filesystem (for create mode)
>> + */
>> +static int nilfs_sufile_flush_cache_node(struct inode *sufile,
>> +					 struct nilfs_sufile_cache_node *node,
>> +					 int only_mark,
>> +					 unsigned long *pndirty_blks)
>> +{
>> +	struct nilfs_sufile_info *sui = NILFS_SUI(sufile);
>> +	struct buffer_head *su_bh;
>> +	struct nilfs_segment_usage *su;
>> +	spinlock_t *lock;
>> +	void *kaddr;
>> +	size_t n, i, j;
>> +	size_t susz = NILFS_MDT(sufile)->mi_entry_size;
>> +	__u64 segnum, seg_start, nsegs;
>> +	__u32 nlive_blocks, value;
>> +	unsigned long secs = get_seconds(), ndirty_blks = 0;
>> +	int ret, dirty;
>> +
>> +	nsegs = nilfs_sufile_get_nsegments(sufile);
>> +	seg_start = node->index << NILFS_SUFILE_CACHE_NODE_SHIFT;
>> +	lock = bgl_lock_ptr(&sui->nlive_blks_cache_bgl, node->index);
>> +
>> +	for (i = 0; i < NILFS_SUFILE_CACHE_NODE_COUNT;) {
>> +		segnum = seg_start + i;
>> +		if (segnum >= nsegs)
>> +			break;
>> +
>> +		n = nilfs_sufile_segment_usages_in_block(sufile, segnum,
>> +				seg_start + NILFS_SUFILE_CACHE_NODE_COUNT - 1);
>> +
>> +		ret = nilfs_sufile_get_segment_usage_block(sufile, segnum,
>> +							   0, &su_bh);
>> +		if (ret < 0) {
>> +			if (ret != -ENOENT)
>> +				return ret;
>> +			/* hole */
>> +			i += n;
>> +			continue;
>> +		}
>> +
>> +		if (only_mark && buffer_dirty(su_bh)) {
>> +			/* buffer already dirty */
>> +			put_bh(su_bh);
>> +			i += n;
>> +			continue;
>> +		}
>> +
>> +		spin_lock(lock);
>> +		kaddr = kmap_atomic(su_bh->b_page);
>> +
>> +		dirty = 0;
>> +		su = nilfs_sufile_block_get_segment_usage(sufile, segnum,
>> +							  su_bh, kaddr);
>> +		for (j = 0; j < n; ++j, ++i, su = (void *)su + susz) {
>> +			value = node->values[i];
>> +			if (!value)
>> +				continue;
>> +			if (!only_mark)
>> +				node->values[i] = 0;
>> +
>> +			WARN_ON(nilfs_segment_usage_error(su));
>> +
>> +			nlive_blocks = le32_to_cpu(su->su_nlive_blks);
>> +			if (!nlive_blocks)
>> +				continue;
>> +
>> +			dirty = 1;
>> +			if (only_mark) {
>> +				i += n - j;
>> +				break;
>> +			}
>> +
>
>> +			if (nlive_blocks <= value)
>> +				nlive_blocks = 0;
>> +			else
>> +				nlive_blocks -= value;
>
> This can be simplified as below:
>
> 	nlive_blocks -= min_t(__u32, nlive_blocks, value);

Ok.

>> +
>> +			su->su_nlive_blks = cpu_to_le32(nlive_blocks);
>> +			su->su_nlive_lastmod = cpu_to_le64(secs);
>> +		}
>> +
>> +		kunmap_atomic(kaddr);
>> +		spin_unlock(lock);
>> +
>> +		if (dirty && !buffer_dirty(su_bh)) {
>> +			mark_buffer_dirty(su_bh);
>
>> +			nilfs_mdt_mark_dirty(sufile);
>
> nilfs_mdt_mark_dirty() should be called only once if ndirty_blks is
> larger than zero. We can move it to nilfs_sufile_flush_cache() side
> (to the position just before calling up_write()).

Good idea.
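
The tail of nilfs_sufile_flush_cache() would then become something like
this (sketch):

	list_for_each_entry(node, &nodes, list_head) {
		ret = nilfs_sufile_flush_cache_node(sufile, node, only_mark,
						    &ndirty_blks);
		if (ret)
			break;
	}

	if (ndirty_blks > 0)
		nilfs_mdt_mark_dirty(sufile);	/* once per flush */
	up_write(&NILFS_MDT(sufile)->mi_sem);
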
>> +			++ndirty_blks;
>> +		}
>> +
>> +		put_bh(su_bh);
>> +	}
>> +
>> +	*pndirty_blks += ndirty_blks;
>> +	return 0;
>> +}
>> +
>> +/**
>> + * nilfs_sufile_flush_cache - flushes cache to the SUFILE
>> + * @sufile: inode of segment usage file
>> + * @only_mark: do not write anything, but mark the blocks as dirty
>> + * @pndirty_blks: pointer to return number of dirtied blocks
>> + *
>> + * Description: Flushes the whole cache to the SUFILE and also clears it
>> + * at the same time. If @only_mark is 1, nothing is written to the
>> + * SUFILE, but the blocks are still marked as dirty. This is useful to mark
>> + * the blocks in one phase of the segment creation and write them in another.
>> + * If there are concurrent inserts into the cache, it cannot be guaranteed,
>> + * that everything is flushed when the function returns.
>> + *
>> + * Return Value: On success, 0 is returned on error, one of the following
>> + * negative error codes is returned.
>> + *
>> + * %-ENOMEM - Insufficient memory available.
>> + *
>> + * %-EIO - I/O error
>> + *
>> + * %-EROFS - Read only filesystem (for create mode)
>> + */
>> +int nilfs_sufile_flush_cache(struct inode *sufile, int only_mark,
>> +			     unsigned long *pndirty_blks)
>> +{
>> +	struct nilfs_sufile_info *sui = NILFS_SUI(sufile);
>> +	struct nilfs_sufile_cache_node *node;
>> +	LIST_HEAD(nodes);
>> +	struct radix_tree_iter iter;
>> +	void **slot;
>> +	unsigned long ndirty_blks = 0;
>> +	int ret = 0;
>> +
>> +	if (!sui->nlive_blks_cache_dirty)
>> +		goto out;
>> +
>> +	down_write(&NILFS_MDT(sufile)->mi_sem);
>> +
>> +	/* prevent concurrent inserts */
>> +	spin_lock(&sui->nlive_blks_cache_lock);
>> +	radix_tree_for_each_slot(slot, &sui->nlive_blks_cache, &iter, 0) {
>> +		node = radix_tree_deref_slot_protected(slot,
>> +				&sui->nlive_blks_cache_lock);
>> +		if (!node)
>> +			continue;
>> +		if (radix_tree_exception(node))
>> +			continue;
>> +
>> +		list_add(&node->list_head, &nodes);
>> +		node->index = iter.index;
>> +	}
>> +	if (!only_mark)
>> +		sui->nlive_blks_cache_dirty = 0;
>> +	spin_unlock(&sui->nlive_blks_cache_lock);
>> +
>> +	list_for_each_entry(node, &nodes, list_head) {
>> +		ret = nilfs_sufile_flush_cache_node(sufile, node, only_mark,
>> +						    &ndirty_blks);
>> +		if (ret)
>> +			goto out_sem;
>> +	}
>> +
>> +out_sem:
>> +	up_write(&NILFS_MDT(sufile)->mi_sem);
>> +out:
>> +	if (pndirty_blks)
>> +		*pndirty_blks = ndirty_blks;
>> +	return ret;
>> +}
>> +
>> +/**
>> + * nilfs_sufile_cache_dirty - is the sufile cache dirty
>> + * @sufile: inode of segment usage file
>> + *
>> + * Description: Returns whether the sufile cache is dirty. If this flag is
>> + * true, the cache contains unflushed content.
>> + *
>> + * Return Value: If the cache is not dirty, 0 is returned, otherwise
>> + * 1 is returned
>> + */
>> +int nilfs_sufile_cache_dirty(struct inode *sufile)
>> +{
>> +	struct nilfs_sufile_info *sui = NILFS_SUI(sufile);
>> +
>> +	return sui->nlive_blks_cache_dirty;
>> +}
>> +
>> +/**
>> + * nilfs_sufile_cache_node_release_rcu - rcu callback function to free nodes
>> + * @head: rcu head
>> + *
>> + * Description: Rcu callback function to free nodes.
>> + */
>> +static void nilfs_sufile_cache_node_release_rcu(struct rcu_head *head)
>> +{
>> +	struct nilfs_sufile_cache_node *node;
>> +
>> +	node = container_of(head, struct nilfs_sufile_cache_node, rcu_head);
>> +
>> +	kmem_cache_free(nilfs_sufile_node_cachep, node);
>> +}
>> +
>> +/**
>> + * nilfs_sufile_shrink_cache - free all cache nodes
>> + * @sufile: inode of segment usage file
>> + *
>> + * Description: Frees all cache nodes in the cache regardless of their
>> + * content. The content will not be flushed and may be lost. This function
>> + * is intended to free up memory after the cache was flushed by
>> + * nilfs_sufile_flush_cache().
>> + */
>> +void nilfs_sufile_shrink_cache(struct inode *sufile)
>> +{
>> +	struct nilfs_sufile_info *sui = NILFS_SUI(sufile);
>> +	struct nilfs_sufile_cache_node *node;
>> +	struct radix_tree_iter iter;
>> +	void **slot;
>> +
>
>> +	/* prevent flush form running at the same time */
>
> "flush from" ?

Yes that is a typo. It should be "flush from".

>> +	down_read(&NILFS_MDT(sufile)->mi_sem);
>
> This protection with mi_sem seems to be needless because the current
> implementation of nilfs_sufile_shrink_cache() doesn't touch buffers of
> sufile. The delete operation is protected by a spinlock and the
> counter operations are protected with rcu. What does this
> down_read()/up_read() protect ?

It is intended to protect against a concurrently running
nilfs_sufile_flush_cache() function. This should not happen the way the
functions are called currently, but I wanted to make them thread-safe.

In nilfs_sufile_flush_cache() I use references to nodes outside of the
spinlock, which I am not allowed to do if they can be deallocated at any
moment. I cannot hold the spinlock for the entire flush, because
nilfs_sufile_flush_cache_node() needs to be able to sleep. I cannot use
a mutex instead of a spinlock, because this would lead to a potential
deadlock:

nilfs_sufile_alloc_cache_node():
	1. bmap->b_sem
	2. sui->nlive_blks_cache_lock

nilfs_sufile_flush_cache_node():
	1. sui->nlive_blks_cache_lock
	2. bmap->b_sem

So I decided to "abuse" mi_sem for this purpose, since I already need to
hold mi_sem in nilfs_sufile_flush_cache().
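
To make the intended pairing explicit (a sketch of what the patch
currently relies on, not new code):

	/*
	 * nilfs_sufile_flush_cache(): node pointers are dereferenced
	 * outside the spinlock, so nodes must not be freed meanwhile.
	 */
	down_write(&NILFS_MDT(sufile)->mi_sem);
	...
	up_write(&NILFS_MDT(sufile)->mi_sem);

	/*
	 * nilfs_sufile_shrink_cache(): frees nodes; holding mi_sem here
	 * keeps any flush out while that happens.
	 */
	down_read(&NILFS_MDT(sufile)->mi_sem);
	...
	up_read(&NILFS_MDT(sufile)->mi_sem);
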
>> +	/* prevent concurrent inserts */
>> +	spin_lock(&sui->nlive_blks_cache_lock);
>> +
>> +	radix_tree_for_each_slot(slot, &sui->nlive_blks_cache, &iter, 0) {
>> +		node = radix_tree_deref_slot_protected(slot,
>> +				&sui->nlive_blks_cache_lock);
>> +		if (!node)
>> +			continue;
>> +		if (radix_tree_exception(node))
>> +			continue;
>> +
>> +		radix_tree_delete(&sui->nlive_blks_cache, iter.index);
>> +		call_rcu(&node->rcu_head, nilfs_sufile_cache_node_release_rcu);
>> +	}
>> +
>> +	spin_unlock(&sui->nlive_blks_cache_lock);
>> +	up_read(&NILFS_MDT(sufile)->mi_sem);
>> +}
>> +
>> +/**
>>   * nilfs_sufile_read - read or get sufile inode
>>   * @sb: super block instance
>>   * @susize: size of a segment usage entry
>> @@ -1253,6 +1615,13 @@ int nilfs_sufile_read(struct super_block *sb, size_t susize,
>>  	sui->allocmax = nilfs_sufile_get_nsegments(sufile) - 1;
>>  	sui->allocmin = 0;
>>
>> +	if (nilfs_feature_track_live_blks(sb->s_fs_info)) {
>> +		bgl_lock_init(&sui->nlive_blks_cache_bgl);
>> +		spin_lock_init(&sui->nlive_blks_cache_lock);
>> +		INIT_RADIX_TREE(&sui->nlive_blks_cache, GFP_ATOMIC);
>> +	}
>> +	sui->nlive_blks_cache_dirty = 0;
>> +
>>  	unlock_new_inode(sufile);
>>   out:
>>  	*inodep = sufile;
>
> I think we should introduce a destructor to metadata files to prevent
> the memory leak which is brought in by the introduction of the cache
> nodes and radix tree. nilfs_sufile_shrink_cache() should be called from
> the destructor.
>
> The destructor (e.g. mi->mi_dtor) should be called from
> nilfs_clear_inode() if it isn't set to a NULL value. Initialization
> of the destructor will be done in nilfs_xxx_read().
>
> In the current patchset, the callsite of nilfs_sufile_shrink_cache()
> is well considered, but it's not sufficient. We have to eliminate the
> possibility of a memory leak completely and clearly.

Ok good idea.
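
Something like this minimal sketch is what I have in mind (mi_dtor
being a new field that would be added to struct nilfs_mdt_info):

	/* mdt.h */
	struct nilfs_mdt_info {
		...
		void (*mi_dtor)(struct inode *);	/* metadata file destructor */
	};

	/* in nilfs_sufile_read(), after successful setup */
	NILFS_MDT(sufile)->mi_dtor = nilfs_sufile_shrink_cache;

	/* in nilfs_clear_inode() */
	if (NILFS_MDT(inode) && NILFS_MDT(inode)->mi_dtor)
		NILFS_MDT(inode)->mi_dtor(inode);
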
Regards,
Andreas Rohner

> Regards,
> Ryusuke Konishi
>
>> diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
>> index 520614f..662ab56 100644
>> --- a/fs/nilfs2/sufile.h
>> +++ b/fs/nilfs2/sufile.h
>> @@ -87,6 +87,11 @@ int nilfs_sufile_resize(struct inode *sufile, __u64 newnsegs);
>>  int nilfs_sufile_read(struct super_block *sb, size_t susize,
>>  		      struct nilfs_inode *raw_inode, struct inode **inodep);
>>  int nilfs_sufile_trim_fs(struct inode *sufile, struct fstrim_range *range);
>> +int nilfs_sufile_dec_nlive_blks(struct inode *sufile, __u64 segnum);
>> +void nilfs_sufile_shrink_cache(struct inode *sufile);
>> +int nilfs_sufile_flush_cache(struct inode *sufile, int only_mark,
>> +			     unsigned long *pndirty_blks);
>> +int nilfs_sufile_cache_dirty(struct inode *sufile);
>>
>>  /**
>>   * nilfs_sufile_scrap - make a segment garbage
>> --
>> 2.3.7
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html