On Wed, May 16, 2018 at 9:04 AM, Tariq Toukan <tariqt@xxxxxxxxxxxx> wrote:
>
>
> On 15/05/2018 9:53 PM, Qing Huang wrote:
>>
>>
>>
>> On 5/15/2018 2:19 AM, Tariq Toukan wrote:
>>>
>>>
>>>
>>> On 14/05/2018 7:41 PM, Qing Huang wrote:
>>>>
>>>>
>>>>
>>>> On 5/13/2018 2:00 AM, Tariq Toukan wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 11/05/2018 10:23 PM, Qing Huang wrote:
>>>>>>
>>>>>> When a system is under memory pressure (high usage with fragmentation),
>>>>>> the original 256KB ICM chunk allocations will likely trigger kernel
>>>>>> memory management to enter the slow path, doing memory compaction and
>>>>>> migration ops in order to complete high order memory allocations.
>>>>>>
>>>>>> When that happens, user processes calling uverb APIs may easily get
>>>>>> stuck for more than 120s even though there are a lot of free pages
>>>>>> in smaller chunks available in the system.
>>>>>>
>>>>>> Syslog:
>>>>>> ...
>>>>>> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
>>>>>> oracle_205573_e:205573 blocked for more than 120 seconds.
>>>>>> ...
>>>>>>
>>>>>> With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.
>>>>>>
>>>>>> However, in order to support a smaller ICM chunk size, we need to fix
>>>>>> another issue in large kcalloc allocations.
>>>>>>
>>>>>> E.g.
>>>>>> Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
>>>>>> size, each ICM chunk can only hold 512 mtt entries (8 bytes for each
>>>>>> mtt entry). So we need a 16MB allocation for the table->icm pointer
>>>>>> array to hold 2M pointers, which can easily cause kcalloc to fail.
>>>>>>
>>>>>> The solution is to use vzalloc to replace kcalloc. There is no need
>>>>>> for contiguous memory pages for a driver metadata structure (no need
>>>>>> for DMA ops).
>>>>>>
>>>>>> Signed-off-by: Qing Huang <qing.huang@xxxxxxxxxx>
>>>>>> Acked-by: Daniel Jurgens <danielj@xxxxxxxxxxxx>
>>>>>> Reviewed-by: Zhu Yanjun <yanjun.zhu@xxxxxxxxxx>
>>>>>> ---
>>>>>> v2 -> v1: adjusted chunk size to reflect different architectures.
>>>>>>
>>>>>>  drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++++++-------
>>>>>>  1 file changed, 7 insertions(+), 7 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
>>>>>> index a822f7a..ccb62b8 100644
>>>>>> --- a/drivers/net/ethernet/mellanox/mlx4/icm.c
>>>>>> +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
>>>>>> @@ -43,12 +43,12 @@
>>>>>>  #include "fw.h"
>>>>>>  /*
>>>>>> - * We allocate in as big chunks as we can, up to a maximum of 256 KB
>>>>>> - * per chunk.
>>>>>> + * We allocate in page size (default 4KB on many archs) chunks to avoid high
>>>>>> + * order memory allocations in fragmented/high usage memory situation.
>>>>>>   */
>>>>>>  enum {
>>>>>> -       MLX4_ICM_ALLOC_SIZE     = 1 << 18,
>>>>>> -       MLX4_TABLE_CHUNK_SIZE   = 1 << 18
>>>>>> +       MLX4_ICM_ALLOC_SIZE     = 1 << PAGE_SHIFT,
>>>>>> +       MLX4_TABLE_CHUNK_SIZE   = 1 << PAGE_SHIFT
>>>>>
>>>>>
>>>>> Which is actually PAGE_SIZE.
>>>>
>>>>
>>>> Yes, we wanted to avoid high order memory allocations.
>>>>
>>>
>>> Then please use PAGE_SIZE instead.
>>
>>
>> PAGE_SIZE is usually defined as 1 << PAGE_SHIFT. So I think PAGE_SHIFT is
>> actually more appropriate here.
>>
>
> The definition of PAGE_SIZE varies among different archs.
> It is not always as simple as 1 << PAGE_SHIFT.
> It might be:
> PAGE_SIZE (1UL << PAGE_SHIFT)
> PAGE_SIZE (_AC(1, UL) << PAGE_SHIFT)
> etc...
>
> Please replace 1 << PAGE_SHIFT with PAGE_SIZE.
>
>>
>>>
>>>>> Also, please add a comma at the end of the last entry.
>>>>
>>>>
>>>> Hmm..., followed the existing code style and checkpatch.pl didn't
>>>> complain about the comma.
>>>>
>>>
>>> I am in favor of having a comma also after the last element, so that when
>>> another enum element is added we do not modify this line again, which would
>>> falsely affect git blame.
>>>
>>> I know it didn't exist before your patch, but once we're here, let's do
>>> it.
>>
>>
>> I'm okay either way. If adding an extra comma is preferred by many people,
>> someone should update checkpatch.pl to enforce it. :)
>>
> I agree.
> Until then, please use an extra comma in this patch.
>
>>>
>>>>>
>>>>>>  };
>>>>>>  static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
>>>>>> @@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
>>>>>>          obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
>>>>>>          num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
>>>>>> -       table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
>>>>>> +       table->icm = vzalloc(num_icm * sizeof(*table->icm));
>>>>>
>>>>>
>>>>> Why not kvzalloc ?
>>>>
>>>>
>>>> I think the table->icm pointer array doesn't really need physically
>>>> contiguous memory. Sometimes high order memory allocation by kmalloc
>>>> variants may trigger the slow path and cause tasks to be blocked.
>>>>
>>>
>>> This is control path so it is less latency-sensitive.
>>> Let's not produce unnecessary degradation here, please call kvzalloc so
>>> we maintain a similar behavior when contiguous memory is available, and a
>>> fallback for resiliency.
>>
>>
>> Not sure what degradation exactly is caused by vzalloc here. I think it's
>> better to keep physically contiguous pages for other requests which really
>> need them. Besides, the slow path/memory compaction can be really expensive.
>>
>
> Degradation is expected when you replace contiguous memory with non-contiguous
> memory without any perf test.
> We agree that when contiguous memory is not available, we should use
> non-contiguous memory instead of simply failing, and for this you can call
> kvzalloc.

The expected degradation would be small if the data is not very
performance-sensitive. I think vmalloc would be better in the general case.
Even if the server has hundreds of gigabytes of memory, even 1MB of
contiguous memory is often rare.

For example, I attached /proc/pagetypeinfo of my server after 153 days of
uptime (the columns are free-block counts for orders 0 through 10). The
largest contiguous block available is order 7, i.e. 2^7 pages x 4KB = 512KB.

Node    0, zone   Normal, type    Unmovable   4499   9418   4817    732    747    567     18      3      0      0      0
Node    0, zone   Normal, type      Movable  38179  40839  10546   1888    491     51      1      0      0      0      0
Node    0, zone   Normal, type  Reclaimable    117     98   1279     35     21     10      8      0      0      0      0
Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0

So I think vmalloc would be fine for data that is not performance-critical.

--
GIOH KIM
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:       +49 176 2697 8962
Fax:       +49 30 577 008 299
Email:     gi-oh.kim@xxxxxxxxxxxxxxxx
URL:       https://www.profitbricks.de

Registered office: Berlin
Commercial register: Amtsgericht Charlottenburg, HRB 125506 B
Managing directors: Achim Weiss, Matthias Steinberg, Christoph Steffens
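For illustration, here is a minimal sketch of the change the reviewers are
converging on in the thread above (PAGE_SIZE chunks, a trailing comma on the
last enum entry, and kvzalloc() so a physically contiguous allocation is
tried first with a vmalloc fallback), written against
drivers/net/ethernet/mellanox/mlx4/icm.c. It is assembled only from the
suggestions in this thread and is not necessarily the patch as finally
applied; the surrounding function body is elided.

/*
 * We allocate in page size chunks to avoid high order allocations
 * in fragmented/highly utilized memory situations.
 */
enum {
	MLX4_ICM_ALLOC_SIZE	= PAGE_SIZE,
	MLX4_TABLE_CHUNK_SIZE	= PAGE_SIZE,
};

	/* In mlx4_init_icm_table(): the icm pointer array is control-path
	 * metadata, so kvzalloc() is sufficient; it attempts a kmalloc-style
	 * contiguous allocation first and falls back to vmalloc when
	 * contiguous pages are not available.
	 */
	obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
	num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;

	table->icm = kvzalloc(num_icm * sizeof(*table->icm), GFP_KERNEL);
	if (!table->icm)
		return -ENOMEM;

The matching free sites would then have to call kvfree(table->icm) rather
than kfree(), since the array may have been obtained from vmalloc.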