On Mon, Nov 16, 2015 at 09:28:59AM -0800, Dan Williams wrote:
> On Mon, Nov 16, 2015 at 6:05 AM, Jan Kara <jack@xxxxxxx> wrote:
> > On Mon 16-11-15 14:37:14, Jan Kara wrote:
> [..]
> > But a question: Won't it be better to do sfence + pcommit only in response
> > to REQ_FLUSH request and don't do it after each write? I'm not sure how
> > expensive these instructions are but in theory it could be a performance
> > win, couldn't it? For filesystems this is enough wrt persistency
> > guarantees...
>
> We would need to gather the performance data... The expectation is
> that the cache flushing is more expensive than the sfence + pcommit.

I think we should revisit the idea of removing wmb_pmem() from the I/O path in
both the PMEM driver and in DAX, and just relying on the REQ_FUA/REQ_FLUSH path
to do wmb_pmem() for all cases. This was brought up in the thread dealing with
the "big hammer" fsync/msync patches as well.

https://lkml.org/lkml/2015/11/3/730

I think we can all agree from the start that wmb_pmem() will have a nonzero
cost, both because of the PCOMMIT and because of the ordering caused by the
sfence. If it's possible to avoid doing it on each I/O, I think that would be
a win.
So, here would be our new flows:

PMEM I/O:
    write I/O(s) to the driver
        PMEM I/O writes the data using non-temporal stores

    REQ_FUA/REQ_FLUSH to the PMEM driver
        wmb_pmem() to order all previous writes and flushes, and to
        PCOMMIT the dirty data durably to the DIMMs

DAX I/O:
    write I/O(s) to the DAX layer
        write the data using regular stores (eventually to be replaced
        with non-temporal stores)

        flush the data with wb_cache_pmem() (removed when we use
        non-temporal stores)

    REQ_FUA/REQ_FLUSH to the PMEM driver
        wmb_pmem() to order all previous writes and flushes, and to
        PCOMMIT the dirty data durably to the DIMMs

DAX msync/fsync:
    writes happen to DAX mmaps from userspace

    DAX fsync/msync
        all dirty pages are written back using wb_cache_pmem()

    REQ_FUA/REQ_FLUSH to the PMEM driver
        wmb_pmem() to order all previous writes and flushes, and to
        PCOMMIT the dirty data durably to the DIMMs

DAX/PMEM zeroing (suggested by Dave: https://lkml.org/lkml/2015/11/2/772):
    PMEM driver receives zeroing request
        writes a bunch of zeroes using non-temporal stores

    REQ_FUA/REQ_FLUSH to the PMEM driver
        wmb_pmem() to order all previous writes and flushes, and to
        PCOMMIT the dirty data durably to the DIMMs

Having all these flows wait to do wmb_pmem() in the PMEM driver in response to
REQ_FUA/REQ_FLUSH has several advantages:

1) The work done and guarantees provided after each step closely match the
normal block I/O to disk case. This means that the existing algorithms used by
filesystems to make sure that their metadata is ordered properly and synced at
a known time should all work the same.

2) By delaying wmb_pmem() until REQ_FUA/REQ_FLUSH time we can potentially do
many I/Os at different levels, and order them all with a single wmb_pmem().
This should result in a performance win.

Is there any reason why this wouldn't work or wouldn't be a good idea?
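
To make advantage 2 a bit more concrete, here is a rough userspace model of
the two orderings. It is only a sketch, not kernel code: model_wmb_pmem() and
model_pmem_write() are hypothetical stand-ins that count calls and copy bytes,
they do not issue sfence or PCOMMIT and they are not the kernel's wmb_pmem()
or its non-temporal copy routines. The buffer sizes and function names are
made up for illustration. All it shows is how the number of barriers scales
with the number of writes in each scheme.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_CHUNK      64   /* bytes per modelled write (assumption) */
#define MODEL_MAX_WRITES 16   /* size of the fake "pmem" region, in chunks */

static char model_pmem[MODEL_MAX_WRITES * MODEL_CHUNK];
static unsigned long barrier_calls;

/* Hypothetical stand-in for wmb_pmem(): only counts how often it runs. */
static void model_wmb_pmem(void)
{
	barrier_calls++;
}

/* Hypothetical stand-in for a write done with non-temporal stores. */
static void model_pmem_write(size_t chunk, const char *src)
{
	memcpy(model_pmem + chunk * MODEL_CHUNK, src, MODEL_CHUNK);
}

/* Per-I/O scheme: every write is followed by its own barrier. */
unsigned long run_barrier_per_write(size_t nr_writes)
{
	const char buf[MODEL_CHUNK] = "data";
	size_t i;

	assert(nr_writes <= MODEL_MAX_WRITES);
	barrier_calls = 0;
	for (i = 0; i < nr_writes; i++) {
		model_pmem_write(i, buf);
		model_wmb_pmem();
	}
	return barrier_calls;
}

/* Proposed scheme: writes accumulate, one barrier when a flush arrives. */
unsigned long run_barrier_on_flush(size_t nr_writes)
{
	const char buf[MODEL_CHUNK] = "data";
	size_t i;

	assert(nr_writes <= MODEL_MAX_WRITES);
	barrier_calls = 0;
	for (i = 0; i < nr_writes; i++)
		model_pmem_write(i, buf);

	/* Models the driver handling REQ_FLUSH/REQ_FUA. */
	model_wmb_pmem();
	return barrier_calls;
}
```

With eight writes, run_barrier_per_write(8) counts eight barriers and
run_barrier_on_flush(8) counts one. Whether that translates into a measurable
win on real hardware is exactly the performance data we would still need to
gather.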