Re: [PATCH 1/5] md: add a 'md_numa_node' module parameter


 



2017-07-11 1:28 GMT+08:00 Shaohua Li <shli@xxxxxxxxxx>:
> On Sat, Jul 01, 2017 at 03:20:54PM +0800, Zhengyuan Liu wrote:
>> This module parameter allows user to control which numa node
>> the memory for mainly structures (e.g. mddev, md_rdev,
>> request_queue, gendisk) is allocated from. The idea came from
>> commit 115485e83f49 ("dm: add 'dm_numa_node' module parameter").
>> If we don't set this parameter while loading module, it won't
>> affect the md default behavior.
>>
>> numa_node_id field is added so personality such as raid0 could
>> inherit the node from mddev, for other personality which has its
>> own handling thread we could add such a module parameter too.
>>
>> This patch includes a bug fix about parameter range check suggested
>> by Coly Li <colyli@xxxxxxx>.
>>
>> Signed-off-by: Zhengyuan Liu <liuzhengyuan@xxxxxxxxxx>
>
> Can you give an example why this is useful?

To be honest, I didn't observe a clear performance improvement on my arm64 NUMA
server with this patch set. Theoretically, though, it should help, especially for
direct IO, when all memory allocation along the IO path is bound to a local numa node.
I ran a fio test on a raid0 array created from two raw devices; I added a numa
module parameter to drivers/block/brd.c too, and the iops got a bit better (about 2%)
when md and brd were bound to the same node compared to running without the
parameter, but not always better, since numa placement is handled transparently
by memory management in most cases.


>
> If we have several raid arrays, we can't control numa node for them separately.

Yes, the current patch doesn't cover that scenario. To let different arrays be
placed on different numa nodes, we could make the module parameter writable,
just as commit 115485e83f49 does.

> And in the patch 4/5, one array could have two different parameters to control
> memory allocation. this is weird. I have no idea how users use this feature.
>
> If we really want to support numa node, would it be better to deduce the numa
> node from the combination of all underlayer disks?

I think that is feasible, since we can read the numa node of every md_rdev's
underlying device at creation time. If any two md_rdevs report different nodes,
we can stop there.

Thanks,
Zhengyuan

>
> Thanks,
> Shaohua
>
>> ---
>>  drivers/md/md.c | 18 ++++++++++++++----
>>  drivers/md/md.h |  2 ++
>>  2 files changed, 16 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 092b48f..fae93db 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -184,6 +184,8 @@ static int start_readonly;
>>   */
>>  static bool create_on_open = true;
>>
>> +static int md_numa_node = NUMA_NO_NODE;
>> +
>>  /* bio_clone_mddev
>>   * like bio_clone, but with a local bio set
>>   */
>> @@ -588,7 +590,7 @@ static struct mddev *mddev_find(dev_t unit)
>>       }
>>       spin_unlock(&all_mddevs_lock);
>>
>> -     new = kzalloc(sizeof(*new), GFP_KERNEL);
>> +     new = kzalloc_node(sizeof(*new), GFP_KERNEL, md_numa_node);
>>       if (!new)
>>               return NULL;
>>
>> @@ -598,6 +600,7 @@ static struct mddev *mddev_find(dev_t unit)
>>       else
>>               new->md_minor = MINOR(unit) >> MdpMinorShift;
>>
>> +     new->numa_node_id = md_numa_node;
>>       mddev_init(new);
>>
>>       goto retry;
>> @@ -3387,7 +3390,7 @@ static struct md_rdev *md_import_device(dev_t newdev, int super_format, int supe
>>       struct md_rdev *rdev;
>>       sector_t size;
>>
>> -     rdev = kzalloc(sizeof(*rdev), GFP_KERNEL);
>> +     rdev = kzalloc_node(sizeof(*rdev), GFP_KERNEL, md_numa_node);
>>       if (!rdev)
>>               return ERR_PTR(-ENOMEM);
>>
>> @@ -5256,7 +5259,7 @@ static int md_alloc(dev_t dev, char *name)
>>               mddev->hold_active = UNTIL_STOP;
>>
>>       error = -ENOMEM;
>> -     mddev->queue = blk_alloc_queue(GFP_KERNEL);
>> +     mddev->queue = blk_alloc_queue_node(GFP_KERNEL, md_numa_node);
>>       if (!mddev->queue)
>>               goto abort;
>>       mddev->queue->queuedata = mddev;
>> @@ -5264,7 +5267,7 @@ static int md_alloc(dev_t dev, char *name)
>>       blk_queue_make_request(mddev->queue, md_make_request);
>>       blk_set_stacking_limits(&mddev->queue->limits);
>>
>> -     disk = alloc_disk(1 << shift);
>> +     disk = alloc_disk_node(1 << shift, md_numa_node);
>>       if (!disk) {
>>               blk_cleanup_queue(mddev->queue);
>>               mddev->queue = NULL;
>> @@ -8969,6 +8972,11 @@ static int __init md_init(void)
>>       register_reboot_notifier(&md_notifier);
>>       raid_table_header = register_sysctl_table(raid_root_table);
>>
>> +     if (md_numa_node < NUMA_NO_NODE)
>> +             md_numa_node = NUMA_NO_NODE;
>> +     else if (md_numa_node > (num_online_nodes() - 1))
>> +             md_numa_node = num_online_nodes() - 1;
>> +
>>       md_geninit();
>>       return 0;
>>
>> @@ -9252,6 +9260,8 @@ module_param_call(start_ro, set_ro, get_ro, NULL, S_IRUSR|S_IWUSR);
>>  module_param(start_dirty_degraded, int, S_IRUGO|S_IWUSR);
>>  module_param_call(new_array, add_named_array, NULL, NULL, S_IWUSR);
>>  module_param(create_on_open, bool, S_IRUSR|S_IWUSR);
>> +module_param(md_numa_node, int, S_IRUGO);
>> +MODULE_PARM_DESC(md_numa_node, "NUMA node for md device memory allocations");
>>
>>  MODULE_LICENSE("GPL");
>>  MODULE_DESCRIPTION("MD RAID framework");
>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>> index 991f0fe..5c91033 100644
>> --- a/drivers/md/md.h
>> +++ b/drivers/md/md.h
>> @@ -459,6 +459,8 @@ struct mddev {
>>       void (*sync_super)(struct mddev *mddev, struct md_rdev *rdev);
>>       struct md_cluster_info          *cluster_info;
>>       unsigned int                    good_device_nr; /* good device num within cluster raid */
>> +
>> +     int numa_node_id;
>>  };
>>
>>  enum recovery_flags {
>> --
>> 2.7.4
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html



