> -----Original Message-----
> From: Linux-nvdimm [mailto:linux-nvdimm-bounces@xxxxxxxxxxxx] On Behalf
> Of Borislav Petkov
> Sent: Tuesday, December 15, 2015 12:39 PM
> To: Dan Williams <dan.j.williams@xxxxxxxxx>
> Cc: Luck, Tony <tony.luck@xxxxxxxxx>; linux-nvdimm <linux-
> nvdimm@xxxxxxxxxxx>; X86 ML <x86@xxxxxxxxxx>; linux-
> kernel@xxxxxxxxxxxxxxx; Linux MM <linux-mm@xxxxxxxxx>; Andy Lutomirski
> <luto@xxxxxxxxxx>; Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>; Ingo Molnar
> <mingo@xxxxxxxxxx>
> Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to
> recover from machine checks
>
> On Tue, Dec 15, 2015 at 10:35:49AM -0800, Dan Williams wrote:
> > Correction we have MOVNTDQA, but that requires saving the fpu state
> > and marking the memory as WC, i.e. probably not worth it.
>
> Not really. Last time I tried an SSE3 memcpy in the kernel like glibc
> does, it wasn't worth it. The enhanced REP; MOVSB is hands down faster.

Reading from NVDIMM, rep movsb is efficient, but it fills the
CPU caches with the NVDIMM addresses. For large data moves
(not uncommon for storage) this will crowd out more important
cacheable data.

For normal block device reads made through the pmem block device
driver, this CPU cache consumption is wasteful, since it is unlikely
the application will ask pmem to read the same addresses anytime
soon. Due to the historic long latency of storage devices,
applications don't re-read from storage again; they save the results.
So, the streaming-load instructions are beneficial:
* movntdqa (16-byte xmm registers)
* vmovntdqa (32-byte ymm registers)
* vmovntdqa (64-byte zmm registers)

Dan Williams wrote:
> Correction we have MOVNTDQA, but that requires
> saving the fpu state and marking the memory as WC
> i.e. probably not worth it.

The SDM describes MOVNTDQA in the most detail for the WC memory
type, but it also leaves room for the hint to apply to WB memory:
  "An implementation may also make use of the non-temporal hint
  associated with this instruction if the memory source is WB (write
  back) memory type. ... may optimize cache reads generated by
  (V)MOVNTDQA on WB memory type to reduce cache evictions."

For applications doing loads from mmap() DAX memory, the CPU
cache usage could be worthwhile, because applications expect
mmap() regions to consist of traditional writeback-cached memory
and might do lots of loads/stores.

Writing to the NVDIMM requires either:
* non-temporal stores; or
* normal stores + cache flushes + fences

movnti is OK for small transfers, but these are better for bulk
moves:
* movntdq (16-byte xmm registers)
* vmovntdq (32-byte ymm registers)
* vmovntdq (64-byte zmm registers)

---
Robert Elliott, HPE Persistent Memory
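
For illustration only, the two write options listed earlier (non-temporal
stores, or normal stores + cache flushes + fences) might look roughly like
this in user-space C with SSE2 intrinsics. This is a sketch, not the pmem
driver's code: the function names copy_nt_store() and copy_store_flush() and
the 64-byte CACHE_LINE value are assumptions made up for the example, and
whether the data is actually durable on a given NVDIMM platform is outside
what this shows.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_stream_si128, _mm_clflush */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64   /* assumed cache-line size, hypothetical constant */

/* Option 1 (hypothetical helper): 16-byte non-temporal stores (movntdq),
 * followed by a store fence. dst must be 16-byte aligned and len a
 * multiple of 16. */
static void copy_nt_store(void *dst, const void *src, size_t len)
{
    assert(((uintptr_t)dst % 16) == 0 && (len % 16) == 0);

    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < len / 16; i++)
        _mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));

    _mm_sfence();       /* order the non-temporal stores */
}

/* Option 2 (hypothetical helper): normal cached stores, then flush every
 * cache line touched, then a fence. */
static void copy_store_flush(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);

    uintptr_t p = (uintptr_t)dst & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)dst + len;

    for (; p < end; p += CACHE_LINE)
        _mm_clflush((const void *)p);

    _mm_sfence();
}
```

Both helpers leave the same bytes in the destination. The difference is that
option 1 bypasses the cache on the way out, while option 2 pulls the
destination lines into the cache and then has to evict them again.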