Re: order-0 vs order-N driver allocation. Was: [PATCH v10 07/12] net/mlx4_en: add page recycle to prepare rx ring for tx support

On Fri, Aug 5, 2016 at 8:33 AM, David Laight <David.Laight@xxxxxxxxxx> wrote:
> From: Alexander Duyck
>> Sent: 05 August 2016 16:15
> ...
>> >
>> > interesting idea. Like dma_map 1GB region and then allocate
>> > pages from it only? but the rest of the kernel won't be able
>> > to use them? so only some smaller region then? or it will be
>> > a boot time flag to reserve this pseudo-huge page?
>>
>> Yeah, something like that.  If we were already talking about
>> allocating a pool of pages it might make sense to just setup something
>> like this where you could reserve a 1GB region for a single 10G device
>> for instance.  Then it would make the whole thing much easier to deal
>> with since you would have a block of memory that should perform very
>> well in terms of DMA accesses.
>
> ISTM that the main kernel allocator ought to be keeping a cache
> of pages that are mapped into the various IOMMU.
> This might be a per-driver cache, but could be much wider.
>
> Then if some code wants such a page it can be allocated one that is
> already mapped.
> Under memory pressure the pages could then be reused for other purposes.
>
> ...
>> In the Intel drivers for instance if the frame
>> size is less than 256 bytes we just copy the whole thing out since it
>> is cheaper to just extend the header copy rather than taking the extra
>> hit for get_page/put_page.
>
> How fast is 'rep movsb' (on cached addresses) on recent x86 cpu?
> It might actually be worth unconditionally copying the entire frame
> on those cpus.

The cost of rep movsb on modern x86 is about 1 cycle for every 16
bytes plus some fixed setup time, so copying a 256-byte frame is on
the order of 16 cycles plus setup.  The cost of an atomic operation
varies, but it is usually in the 10s of cycles once you factor in the
get_page/put_page pair.  That is why, as I recall, I settled on 256 as
the upper limit: it gave the best performance without starting to
incur any penalty.
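
For illustration only, here is a rough sketch of the copy-break
pattern (untested, with made-up names -- not the actual Intel driver
code): frames at or below the threshold are copied into a fresh skb,
so the rx page keeps its existing reference and DMA mapping.

#define RX_COPYBREAK	256	/* copy frames at or below this size */

static struct sk_buff *rx_copybreak(struct napi_struct *napi,
				    struct page *page,
				    unsigned int offset,
				    unsigned int len)
{
	struct sk_buff *skb;

	if (len > RX_COPYBREAK)
		return NULL;	/* caller attaches the page frag instead */

	skb = napi_alloc_skb(napi, len);
	if (!skb)
		return NULL;

	/* a real driver would dma_sync_single_for_cpu() the region first;
	 * the copy itself is ~1 cycle per 16 bytes, cheaper than the
	 * get_page/put_page atomics */
	memcpy(skb_put(skb, len), page_address(page) + offset, len);

	/* page refcount untouched; the rx ring reuses the page as-is */
	return skb;
}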

> A long time ago we found the breakeven point for the copy to be about
> 1kb on sparc mbus/sbus systems - and that might not have been aligning
> the copy.

I wouldn't know about other architectures.
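
Going back to the idea above of caching pages that stay mapped into
the IOMMU: a rough sketch of a trivial per-driver recycle list might
look like this (untested, hypothetical names, no locking or shrinker
hookup shown).  The point is just that the hot path reuses an
existing mapping instead of calling dma_map_page() every time.

struct mapped_page {
	struct page		*page;
	dma_addr_t		dma;
	struct mapped_page	*next;
};

struct mapped_page_cache {
	struct device		*dev;
	struct mapped_page	*free_list;
};

static struct mapped_page *cache_get(struct mapped_page_cache *c)
{
	struct mapped_page *mp = c->free_list;

	if (mp) {			/* fast path: mapping reused */
		c->free_list = mp->next;
		return mp;
	}

	/* slow path: allocate and map a new page */
	mp = kmalloc(sizeof(*mp), GFP_ATOMIC);
	if (!mp)
		return NULL;

	mp->page = alloc_page(GFP_ATOMIC);
	if (!mp->page)
		goto err_free;

	mp->dma = dma_map_page(c->dev, mp->page, 0, PAGE_SIZE,
			       DMA_FROM_DEVICE);
	if (dma_mapping_error(c->dev, mp->dma))
		goto err_page;

	return mp;

err_page:
	__free_page(mp->page);
err_free:
	kfree(mp);
	return NULL;
}

static void cache_put(struct mapped_page_cache *c, struct mapped_page *mp)
{
	mp->next = c->free_list;	/* keep the DMA mapping live */
	c->free_list = mp;
}

Under memory pressure you would walk free_list, dma_unmap_page() and
free the pages -- the reuse-for-other-purposes part David mentions.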

- Alex



