RE: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks

"Elliott, Robert (Persistent Memory)" <elliott@xxxxxxx> · Tue, 15 Dec 2015 19:19:58 +0000

> -----Original Message-----
> From: Linux-nvdimm [mailto:linux-nvdimm-bounces@xxxxxxxxxxxx] On Behalf
> Of Borislav Petkov
> Sent: Tuesday, December 15, 2015 12:39 PM
> To: Dan Williams <dan.j.williams@xxxxxxxxx>
> Cc: Luck, Tony <tony.luck@xxxxxxxxx>; linux-nvdimm <linux-
> nvdimm@xxxxxxxxxxx>; X86 ML <x86@xxxxxxxxxx>; linux-
> kernel@xxxxxxxxxxxxxxx; Linux MM <linux-mm@xxxxxxxxx>; Andy Lutomirski
> <luto@xxxxxxxxxx>; Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>; Ingo Molnar
> <mingo@xxxxxxxxxx>
> Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to
> recover from machine checks
> 
> On Tue, Dec 15, 2015 at 10:35:49AM -0800, Dan Williams wrote:
> > Correction we have MOVNTDQA, but that requires saving the fpu state
> > and marking the memory as WC, i.e. probably not worth it.
> 
> Not really. Last time I tried an SSE3 memcpy in the kernel like glibc
> does, it wasn't worth it. The enhanced REP; MOVSB is hands down faster.

Reading from NVDIMM, rep movsb is efficient, but it
fills the CPU caches with data from the NVDIMM
addresses. For large data moves (not uncommon for
storage) this will crowd out more important
cacheable data.

For normal block device reads made through the pmem
block device driver, this CPU cache consumption is
wasteful, since the application is unlikely to ask
pmem to read the same addresses again anytime soon.
Because storage devices have historically had long
latency, applications don't re-read from storage;
they save the results.  So, the streaming-load
instructions are beneficial:
* movntdqa (16-byte xmm registers) 
* vmovntdqa (32-byte ymm registers)
* vmovntdqa (64-byte zmm registers)
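
A rough userspace sketch of a streaming-load copy
with the 16-byte form, assuming GCC or Clang on
x86-64. The function names are hypothetical and this
is not the pmem driver's code; in the kernel the XMM
use would need to sit between kernel_fpu_begin() and
kernel_fpu_end(). It falls back to memcpy() when
SSE4.1 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>

/* Hypothetical helper: only this function is built for SSE4.1. */
__attribute__((target("sse4.1")))
static void copy_movntdqa(void *dst, const void *src, size_t len)
{
	__m128i *d = dst;
	__m128i *s = (__m128i *)src;	/* intrinsic wants a non-const pointer */
	size_t i;

	/* movntdqa load from src, ordinary aligned store to dst */
	for (i = 0; i < len / 16; i++)
		_mm_store_si128(&d[i], _mm_stream_load_si128(&s[i]));
}
#endif

/*
 * Hypothetical name.  dst and src must be 16-byte aligned and len
 * must be a multiple of 16.
 */
static void stream_load_copy(void *dst, const void *src, size_t len)
{
	assert(((uintptr_t)dst % 16) == 0);
	assert(((uintptr_t)src % 16) == 0);
	assert(len % 16 == 0);

#if defined(__x86_64__) && defined(__GNUC__)
	if (__builtin_cpu_supports("sse4.1")) {
		copy_movntdqa(dst, src, len);
		return;
	}
#endif
	memcpy(dst, src, len);	/* portable fallback */
}
```

Whether the non-temporal hint has any effect depends
on the memory type of the source and on the CPU
implementation.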

Dan Williams wrote:
> Correction we have MOVNTDQA, but that requires
> saving the fpu state and marking the memory as WC
> i.e. probably not worth it.

Although the SDM describes the WC memory type in
the most detail, it also says:
    "An implementation may also make use of the
    non-temporal hint associated with this instruction
    if the memory source is WB (write back) memory
    type. ... may optimize cache reads generated by 
    (V)MOVNTDQA on WB memory type to reduce cache 
    evictions."

For applications doing loads from mmap() DAX memory, 
the CPU cache usage could be worthwhile, because
applications expect mmap() regions to consist of
traditional writeback-cached memory and might do
lots of loads/stores.

Writing to the NVDIMM requires either:
* non-temporal stores; or
* normal stores + cache flushes + fences
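
The second option can be sketched in userspace as
below, assuming x86-64 with SSE2 and a 64-byte cache
line. copy_then_flush() and CACHELINE are
hypothetical names, and clflush is used only because
it is available everywhere; this is an illustration,
not what the pmem driver does.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
#include <emmintrin.h>	/* _mm_clflush(), _mm_sfence() */
#endif

#define CACHELINE 64	/* assumed cache line size */

/* Hypothetical helper: cached stores, then flush each line, then fence. */
static void copy_then_flush(void *dst, const void *src, size_t len)
{
#if defined(__x86_64__)
	uintptr_t p = (uintptr_t)dst & ~(uintptr_t)(CACHELINE - 1);
	uintptr_t end = (uintptr_t)dst + len;
#endif

	memcpy(dst, src, len);			/* normal stores */

#if defined(__x86_64__)
	for (; p < end; p += CACHELINE)
		_mm_clflush((const void *)p);	/* flush every line touched */
	_mm_sfence();				/* fence after the flushes */
#endif
}
```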

movnti is OK for small transfers, but these are
better for bulk moves:
* movntdq (16-byte xmm registers)
* vmovntdq (32-byte ymm registers)
* vmovntdq (64-byte zmm registers)
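
A userspace sketch of the non-temporal store option,
assuming x86-64 with SSE2: movntdq for the 16-byte
chunks, movnti for an 8-byte tail, then a fence.
nt_store_copy() is a hypothetical name; other
platforms fall back to memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
#include <emmintrin.h>	/* _mm_stream_si128(), _mm_stream_si64() */
#endif

/*
 * Hypothetical helper.  dst must be 16-byte aligned and len must be
 * a multiple of 8; src may be unaligned.
 */
static void nt_store_copy(void *dst, const void *src, size_t len)
{
	assert(((uintptr_t)dst % 16) == 0);
	assert(len % 8 == 0);

#if defined(__x86_64__)
	char *d = dst;
	const char *s = src;

	while (len >= 16) {
		__m128i v = _mm_loadu_si128((const __m128i *)s);

		_mm_stream_si128((__m128i *)d, v);	/* movntdq */
		d += 16;
		s += 16;
		len -= 16;
	}
	if (len == 8) {
		long long v;

		memcpy(&v, s, sizeof(v));
		_mm_stream_si64((long long *)d, v);	/* movnti */
	}
	_mm_sfence();	/* order the non-temporal stores */
#else
	memcpy(dst, src, len);
#endif
}
```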

---
Robert Elliott, HPE Persistent Memory

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .