Re: [RFC v4+ hot_track 09/19] vfs: add one work queue

Zhi Yong Wu <zwu.kernel@xxxxxxxxx> · Mon, 5 Nov 2012 20:20:27 +0800



On Mon, Nov 5, 2012 at 8:07 PM, Steven Whitehouse <swhiteho@xxxxxxxxxx> wrote:
> Hi,
>
> On Mon, 2012-11-05 at 19:55 +0800, Zhi Yong Wu wrote:
>> On Mon, Nov 5, 2012 at 7:21 PM, Steven Whitehouse <swhiteho@xxxxxxxxxx> wrote:
>> > Hi,
>> >
>> > On Mon, 2012-10-29 at 12:30 +0800, zwu.kernel@xxxxxxxxx wrote:
>> >> From: Zhi Yong Wu <wuzhy@xxxxxxxxxxxxxxxxxx>
>> >>
>> >>   Add a per-superblock workqueue and a delayed_work
>> >> to run periodic work to update map info on each superblock.
>> >>
>> >> Signed-off-by: Zhi Yong Wu <wuzhy@xxxxxxxxxxxxxxxxxx>
>> >> ---
>> >>  fs/hot_tracking.c            |   85 ++++++++++++++++++++++++++++++++++++++++++
>> >>  fs/hot_tracking.h            |    3 +
>> >>  include/linux/hot_tracking.h |    3 +
>> >>  3 files changed, 91 insertions(+), 0 deletions(-)
>> >>
>> >> diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
>> >> index fff0038..0ef9cad 100644
>> >> --- a/fs/hot_tracking.c
>> >> +++ b/fs/hot_tracking.c
>> >> @@ -15,9 +15,12 @@
>> >>  #include <linux/module.h>
>> >>  #include <linux/spinlock.h>
>> >>  #include <linux/hardirq.h>
>> >> +#include <linux/kthread.h>
>> >> +#include <linux/freezer.h>
>> >>  #include <linux/fs.h>
>> >>  #include <linux/blkdev.h>
>> >>  #include <linux/types.h>
>> >> +#include <linux/list_sort.h>
>> >>  #include <linux/limits.h>
>> >>  #include "hot_tracking.h"
>> >>
>> >> @@ -557,6 +560,67 @@ static void hot_map_array_exit(struct hot_info *root)
>> >>       }
>> >>  }
>> >>
>> >> +/* Temperature compare function*/
>> >> +static int hot_temp_cmp(void *priv, struct list_head *a,
>> >> +                             struct list_head *b)
>> >> +{
>> >> +     struct hot_comm_item *ap =
>> >> +                     container_of(a, struct hot_comm_item, n_list);
>> >> +     struct hot_comm_item *bp =
>> >> +                     container_of(b, struct hot_comm_item, n_list);
>> >> +
>> >> +     int diff = ap->hot_freq_data.last_temp
>> >> +                             - bp->hot_freq_data.last_temp;
>> >> +     if (diff > 0)
>> >> +             return -1;
>> >> +     if (diff < 0)
>> >> +             return 1;
>> >> +     return 0;
>> >> +}
>> >> +
>> >> +/*
>> >> + * Every sync period we update temperatures for
>> >> + * each hot inode item and hot range item for aging
>> >> + * purposes.
>> >> + */
>> >> +static void hot_update_worker(struct work_struct *work)
>> >> +{
>> >> +     struct hot_info *root = container_of(to_delayed_work(work),
>> >> +                                     struct hot_info, update_work);
>> >> +     struct hot_inode_item *hi_nodes[8];
>> >> +     u64 ino = 0;
>> >> +     int i, n;
>> >> +
>> >> +     while (1) {
>> >> +             n = radix_tree_gang_lookup(&root->hot_inode_tree,
>> >> +                                (void **)hi_nodes, ino,
>> >> +                                ARRAY_SIZE(hi_nodes));
>> >> +             if (!n)
>> >> +                     break;
>> >> +
>> >> +             ino = hi_nodes[n - 1]->i_ino + 1;
>> >> +             for (i = 0; i < n; i++) {
>> >> +                     kref_get(&hi_nodes[i]->hot_inode.refs);
>> >> +                     hot_map_array_update(
>> >> +                             &hi_nodes[i]->hot_inode.hot_freq_data, root);
>> >> +                     hot_range_update(hi_nodes[i], root);
>> >> +                     hot_inode_item_put(hi_nodes[i]);
>> >> +             }
>> >> +     }
>> >> +
>> >> +     /* Sort temperature map info */
>> >> +     for (i = 0; i < HEAT_MAP_SIZE; i++) {
>> >> +             list_sort(NULL, &root->heat_inode_map[i].node_list,
>> >> +                     hot_temp_cmp);
>> >> +             list_sort(NULL, &root->heat_range_map[i].node_list,
>> >> +                     hot_temp_cmp);
>> >> +     }
>> >> +
>> >
>> > If this list can potentially have one (or more) entries per inode, then
>> Only one hot_inode_item per inode, while maybe multiple
>> hot_range_items per inode.
>> > filesystems with a lot of inodes (millions) may potentially exceed the
>> > max size of list which list_sort() can handle. If that happens it still
>> > works, but you'll get a warning message and it won't be as efficient.
>> I haven't do so large scale test. If we want to find that issue, we
>> need to do large scale performance test, before that, i want to make
>> sure the code change is correct at first.
>> To be honest, for that issue you pointed to, i also have such
>> concern.But list_sort() performance looks good from the test result of
>> the following URL:
>> https://lkml.org/lkml/2010/1/20/485
>>
> Yes, I think it is good. Also, even when it says that it's performance
> is poor (via the warning message) it is still much better than the
> alternative (of not sorting) in the GFS2 case. So currently our
> workaround is to ignore the warning. Due to what we using it for
> (sorting the data blocks for ordered writeback) we only see it very
> occasionally when there has been lots of data write activity with little
> journal activity on a node with lots of RAM.
OK.
>
>> >
>> > It is something that we've run into with list_sort() and GFS2, but it
>> > only happens very rarely,
>> Beside list_sort(), do you have any other way to share? For this
>> concern, how does GFS2 resolve it?
>>
> That is an ongoing investigation :-)
>
> I've pondered various options... increase temp variable space in
> list_sort(), not using list_sort() and insertion sorting the blocks
> instead, flushing the ordered write data early if the list gets too
> long, figuring out how to remove blocks written back by the VM from the
> list before the sort, and various other possible solutions. So far I'm
> not sure which will be the best to choose, and since your situation is a
> bit different it might not make sense to use the same solution.
>
> I just thought it was worth mentioning though since it was something
> that we'd run across,
thanks for your experience share. anyway, thanks.

By the way, it will be appreciated if you can comment on other patches.
>
> Steve.
>
>


-- 
Regards,

Zhi Yong Wu
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html