On Wed, May 16, 2018 at 9:04 AM, Tariq Toukan <tariqt@xxxxxxxxxxxx> wrote:
>
>
> On 15/05/2018 9:53 PM, Qing Huang wrote:
>>
>>
>>
>> On 5/15/2018 2:19 AM, Tariq Toukan wrote:
>>>
>>>
>>>
>>> On 14/05/2018 7:41 PM, Qing Huang wrote:
>>>>
>>>>
>>>>
>>>> On 5/13/2018 2:00 AM, Tariq Toukan wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 11/05/2018 10:23 PM, Qing Huang wrote:
>>>>>>
>>>>>> When a system is under memory pressure (high usage with fragmentation),
>>>>>> the original 256KB ICM chunk allocations will likely trigger kernel
>>>>>> memory management to enter the slow path, doing memory compaction and
>>>>>> migration ops in order to complete high order memory allocations.
>>>>>>
>>>>>> When that happens, user processes calling uverb APIs may easily get
>>>>>> stuck for more than 120s even though there are a lot of free pages
>>>>>> in smaller chunks available in the system.
>>>>>>
>>>>>> Syslog:
>>>>>> ...
>>>>>> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
>>>>>> oracle_205573_e:205573 blocked for more than 120 seconds.
>>>>>> ...
>>>>>>
>>>>>> With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.
>>>>>>
>>>>>> However, in order to support a smaller ICM chunk size, we need to fix
>>>>>> another issue in large kcalloc allocations.
>>>>>>
>>>>>> E.g.
>>>>>> Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
>>>>>> size, each ICM chunk can only hold 512 mtt entries (8 bytes for each
>>>>>> mtt entry). So we need a 16MB allocation for the table->icm pointer
>>>>>> array to hold 2M pointers, which can easily cause kcalloc to fail.
>>>>>>
>>>>>> The solution is to use vzalloc to replace kcalloc. There is no need
>>>>>> for contiguous memory pages for a driver metadata structure (no need
>>>>>> for DMA ops).
>>>>>>
>>>>>> Signed-off-by: Qing Huang <qing.huang@xxxxxxxxxx>
>>>>>> Acked-by: Daniel Jurgens <danielj@xxxxxxxxxxxx>
>>>>>> Reviewed-by: Zhu Yanjun <yanjun.zhu@xxxxxxxxxx>
>>>>>> ---
>>>>>> v2 -> v1: adjusted chunk size to reflect different architectures.
>>>>>>
>>>>>>  drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++++++-------
>>>>>>  1 file changed, 7 insertions(+), 7 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
>>>>>> index a822f7a..ccb62b8 100644
>>>>>> --- a/drivers/net/ethernet/mellanox/mlx4/icm.c
>>>>>> +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
>>>>>> @@ -43,12 +43,12 @@
>>>>>>  #include "fw.h"
>>>>>>  /*
>>>>>> - * We allocate in as big chunks as we can, up to a maximum of 256 KB
>>>>>> - * per chunk.
>>>>>> + * We allocate in page size (default 4KB on many archs) chunks to avoid high
>>>>>> + * order memory allocations in fragmented/high usage memory situation.
>>>>>>   */
>>>>>>  enum {
>>>>>> -       MLX4_ICM_ALLOC_SIZE     = 1 << 18,
>>>>>> -       MLX4_TABLE_CHUNK_SIZE   = 1 << 18
>>>>>> +       MLX4_ICM_ALLOC_SIZE     = 1 << PAGE_SHIFT,
>>>>>> +       MLX4_TABLE_CHUNK_SIZE   = 1 << PAGE_SHIFT
>>>>>
>>>>>
>>>>> Which is actually PAGE_SIZE.
>>>>
>>>>
>>>> Yes, we wanted to avoid high order memory allocations.
>>>>
>>>
>>> Then please use PAGE_SIZE instead.
>>
>>
>> PAGE_SIZE is usually defined as 1 << PAGE_SHIFT. So I think PAGE_SHIFT is
>> actually more appropriate here.
>>
>
> The definition of PAGE_SIZE varies among different archs.
> It is not always as simple as 1 << PAGE_SHIFT.
> It might be:
> PAGE_SIZE (1UL << PAGE_SHIFT)
> PAGE_SIZE (_AC(1, UL) << PAGE_SHIFT)
> etc...
>
> Please replace 1 << PAGE_SHIFT with PAGE_SIZE.
>
>>
>>>
>>>>> Also, please add a comma at the end of the last entry.
>>>>
>>>>
>>>> Hmm..., followed the existing code style and checkpatch.pl didn't
>>>> complain about the comma.
>>>>
>>>
>>> I am in favor of having a comma also after the last element, so that when
>>> another enum element is added we do not modify this line again, which would
>>> falsely affect git blame.
>>>
>>> I know it didn't exist before your patch, but once we're here, let's do
>>> it.
>>
>>
>> I'm okay either way. If adding an extra comma is preferred by many people,
>> someone should update checkpatch.pl to enforce it. :)
>>
> I agree.
> Until then, please use an extra comma in this patch.
>
>>>
>>>>>
>>>>>>  };
>>>>>>  static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
>>>>>> @@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
>>>>>>          obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
>>>>>>          num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
>>>>>> -       table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
>>>>>> +       table->icm = vzalloc(num_icm * sizeof(*table->icm));
>>>>>
>>>>>
>>>>> Why not kvzalloc ?
>>>>
>>>>
>>>> I think the table->icm pointer array doesn't really need physically
>>>> contiguous memory. Sometimes high order memory allocation by kmalloc
>>>> variants may trigger the slow path and cause tasks to be blocked.
>>>>
>>>
>>> This is control path so it is less latency-sensitive.
>>> Let's not produce unnecessary degradation here, please call kvzalloc so
>>> we maintain a similar behavior when contiguous memory is available, and a
>>> fallback for resiliency.
>>
>>
>> Not sure what degradation exactly is caused by vzalloc here. I think it's
>> better to keep physically contiguous pages for other requests which really
>> need them. Besides, the slow path/memory compaction can be really expensive.
>>
>
> Degradation is expected when you replace contiguous memory with non-contiguous
> memory without any perf test.
> We agree that when contiguous memory is not available, we should use
> non-contiguous memory instead of simply failing, and for this you can call
> kvzalloc.

The expected degradation would be small if the data is not very
performance-sensitive. I think vmalloc would be better in the general case.
Even if the server has hundreds of gigabytes of memory, even 1MB of
contiguous memory is often rare.

For example, I attached /proc/pagetypeinfo of my server after 153 days of
uptime (the columns are free-block counts for orders 0 through 10). The
largest contiguous block available is order 7, i.e. 2^7 pages x 4KB = 512KB.

Node    0, zone   Normal, type    Unmovable   4499   9418   4817    732    747    567     18      3      0      0      0
Node    0, zone   Normal, type      Movable  38179  40839  10546   1888    491     51      1      0      0      0      0
Node    0, zone   Normal, type  Reclaimable    117     98   1279     35     21     10      8      0      0      0      0
Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0

So I think vmalloc would be fine for data that is not performance-critical.

--
GIOH KIM
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:       +49 176 2697 8962
Fax:       +49 30 577 008 299
Email:     gi-oh.kim@xxxxxxxxxxxxxxxx
URL:       https://www.profitbricks.de

Registered office: Berlin
Commercial register: Amtsgericht Charlottenburg, HRB 125506 B
Managing directors: Achim Weiss, Matthias Steinberg, Christoph Steffens
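For illustration, here is a minimal sketch of the change the reviewers are
converging on in the thread above (PAGE_SIZE chunks, a trailing comma on the
last enum entry, and kvzalloc() so a physically contiguous allocation is
tried first with a vmalloc fallback), written against
drivers/net/ethernet/mellanox/mlx4/icm.c. It is assembled only from the
suggestions in this thread and is not necessarily the patch as finally
applied; the surrounding function body is elided.

/*
 * We allocate in page size chunks to avoid high order allocations
 * in fragmented/highly utilized memory situations.
 */
enum {
	MLX4_ICM_ALLOC_SIZE	= PAGE_SIZE,
	MLX4_TABLE_CHUNK_SIZE	= PAGE_SIZE,
};

	/* In mlx4_init_icm_table(): the icm pointer array is control-path
	 * metadata, so kvzalloc() is sufficient; it attempts a kmalloc-style
	 * contiguous allocation first and falls back to vmalloc when
	 * contiguous pages are not available.
	 */
	obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
	num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;

	table->icm = kvzalloc(num_icm * sizeof(*table->icm), GFP_KERNEL);
	if (!table->icm)
		return -ENOMEM;

The matching free sites would then have to call kvfree(table->icm) rather
than kfree(), since the array may have been obtained from vmalloc.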