> -----Original Message-----
> From: fio-owner@xxxxxxxxxxxxxxx [mailto:fio-owner@xxxxxxxxxxxxxxx] On Behalf
> Of Jens Axboe
> Sent: Tuesday, August 29, 2017 10:33 AM
> To: Rebecca Cran <rebecca@xxxxxxxxxxxx>; fio@xxxxxxxxxxxxxxx
> Subject: Re: Optimizing mmap_queue on AVX/AVX2 CPUs
>
> On 08/25/2017 07:46 PM, Rebecca Cran wrote:
> > I'm not sure how far we want to get into optimizing fio for specific CPUs?
> >
> > I've done some testing and found that when running the mmap ioengine
> > against an NVDIMM-N on a modern Intel CPU, I can gain a few hundred MB/s
> > by optimizing the memory copy using AVX/AVX2 versus the system's memcpy
> > implementation.
> >
> > Should I proceed with submitting a patch, or do we want to avoid getting
> > into these sorts of optimizations?
>
> If we can do it cleanly, that's fine. See, for instance, how we detect
> the presence of crc32c hardware assist at init time.
>
> For memcpy(), the libc functions should really be doing this, however.

Unfortunately, the glibc memcpy() implementation changes fairly often: some versions use rep movsb, while others have used xmm, ymm, and zmm registers. So, having more control in fio would help simulate the methods (both good and bad) used by different applications and library versions.

There's even a new patch set that uses the Intel QuickData DMA engines for transfers rather than the CPU (a "blkmq" pmem driver). It would be interesting if fio could use that hardware too, with direct access by fio rather than resorting to kernel read()/write() calls.

---
Robert Elliott, HPE Persistent Memory