Re: [RFC][PATCH] Cross Memory Attach

Christopher Yeoh <cyeoh@xxxxxxxxxxx> · Thu, 16 Sep 2010 23:30:45 +0930

On Thu, 16 Sep 2010 11:15:10 +0200
Brice Goglin <Brice.Goglin@xxxxxxxx> wrote:

> On 16/09/2010 08:32, Brice Goglin wrote:
> > I am the guy doing KNEM so I can comment on this. The I/OAT part of
> > KNEM was mostly a research topic; it's mostly useless on current
> > machines since memcpy performance is much higher than that of the
> > I/OAT DMA engine. We also have an offload model with a kernel thread,
> > but it hasn't been used much so far. These features can be ignored
> > for the current discussion.
> 
> I've just created a knem branch where I removed all the above, and
> some other things that are not necessary for normal users. It now just
> contains the region management code and two commands: one to copy
> between regions and one to copy between a region and some local iovecs.

When I did the original hpcc runs comparing CMA against the shared
memory double copy, I also did some KNEM runs as a sanity check. The
CMA OpenMPI implementation actually uses the infrastructure KNEM put
into the OpenMPI shared memory btl (byte transfer layer). Thanks for
that, by the way; it made it much easier for me to test CMA.
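
To make the distinction concrete, here is a minimal C sketch, not the
OpenMPI btl code. In the shared memory double copy path the sender
copies into a bounce buffer mapped by both processes and the receiver
copies out of it again; with CMA the receiver copies straight out of
the sender's address space in one system call. The sketch assumes the
interface that was eventually merged in Linux 3.2 as
process_vm_readv(2) (glibc 2.15 or later); the syscall names in the RFC
under discussion may differ. The helper names and BOUNCE_SIZE are
hypothetical, and to stay self-contained it reads from its own pid
instead of a peer process.

```c
#define _GNU_SOURCE            /* process_vm_readv() is a GNU extension */
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical bounce-buffer size; a real btl fragments differently. */
#define BOUNCE_SIZE 4096

/* Double copy: sender -> shared bounce buffer -> receiver.
 * In a real job `bounce` (at least BOUNCE_SIZE bytes) would live in
 * memory mapped by both processes; here it is just a local buffer. */
static void double_copy(char *dst, const char *src, size_t len, char *bounce)
{
    size_t off = 0;
    while (off < len) {
        size_t chunk = len - off < BOUNCE_SIZE ? len - off : BOUNCE_SIZE;
        memcpy(bounce, src + off, chunk);  /* copy 1: done by the sender   */
        memcpy(dst + off, bounce, chunk);  /* copy 2: done by the receiver */
        off += chunk;
    }
    assert(off == len);
}

/* Single copy (CMA-style): the receiver pulls `len` bytes starting at
 * `remote_addr` in process `pid` directly into `dst`.
 * Returns the number of bytes copied (possibly short), or -1 with errno. */
static ssize_t cma_copy(pid_t pid, char *dst, void *remote_addr, size_t len)
{
    struct iovec local  = { .iov_base = dst,         .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };
    return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}
```

A real receiver would learn the peer's pid and buffer address from the
sender (for example through a small shared memory message), would need
ptrace-style permission over that process, and would loop if
process_vm_readv() returns a short count.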

Interestingly, although KNEM and CMA fundamentally do very similar
things, at least with hpcc I didn't see as large a gain with KNEM as
with CMA:

Naturally Ordered (MB/s)
            4      8     16     32
Base     1235    935    622    419
CMA      4741   3769   1977    703
KNEM     3362   3091   1857    681

Randomly Ordered (MB/s)
            4      8     16     32
Base     1227    947    638    412
CMA      4666   3682   1978    710
KNEM     3348   3050   1883    684

Max Ping Pong (MB/s)
            4      8     16     32
Base     2028   1938   1928   1882
CMA      7424   7510   7598   7708
KNEM     5661   5476   6050   6290
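
Read as speedups over Base, the tables give a rough sense of the size
of the gap. The snippet below only redoes that arithmetic on the
numbers above (the array names and speedup() are illustrative); the
column headings 4, 8, 16 and 32 are copied as-is.

```c
#include <assert.h>
#include <stdio.h>

enum { NCOLS = 4, NTESTS = 3 };

static const char *test_names[NTESTS] = {
    "Naturally Ordered", "Randomly Ordered", "Max Ping Pong"
};
static const int cols[NCOLS] = { 4, 8, 16, 32 };

/* MB/s figures copied from the tables above. */
static const double base[NTESTS][NCOLS] = {
    { 1235,  935,  622,  419 },
    { 1227,  947,  638,  412 },
    { 2028, 1938, 1928, 1882 },
};
static const double cma[NTESTS][NCOLS] = {
    { 4741, 3769, 1977,  703 },
    { 4666, 3682, 1978,  710 },
    { 7424, 7510, 7598, 7708 },
};
static const double knem[NTESTS][NCOLS] = {
    { 3362, 3091, 1857,  681 },
    { 3348, 3050, 1883,  684 },
    { 5661, 5476, 6050, 6290 },
};

/* Speedup of a single copy variant relative to the double copy Base run. */
static double speedup(double variant, double baseline)
{
    assert(baseline > 0);
    return variant / baseline;
}

static void print_speedups(void)
{
    for (int t = 0; t < NTESTS; t++) {
        printf("%s\n", test_names[t]);
        for (int c = 0; c < NCOLS; c++)
            printf("  %2d: CMA %.2fx  KNEM %.2fx\n", cols[c],
                   speedup(cma[t][c], base[t][c]),
                   speedup(knem[t][c], base[t][c]));
    }
}
```

In the first column of the Naturally Ordered test, for example, CMA
comes out at about 3.84x Base and KNEM at about 2.72x, while in the 32
column the two are much closer (about 1.68x vs 1.63x).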

I don't know the reason behind the difference: whether it's something
peculiar to hpcc, extra overhead in the way KNEM sets up copies, or
KNEM not being configured optimally. I haven't done any comparison
runs with IMB or NPB...

Syscall and setup overhead does have a measurable effect. Although I
don't have the numbers here, neither KNEM nor CMA does quite as well
with hpcc as a hacked version of hpcc where everything is declared
ahead of time as shared memory, so the receiver can just do a single
copy from userspace. I think that hacked version is representative of
the theoretical maximum gain from the single copy approach.
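
The "declared ahead of time as shared memory" case can be sketched
roughly as follows. This is not the hacked hpcc, only an illustration
under the assumption that the application's send buffer is allocated
in a MAP_SHARED mapping before fork(), so the receive side is a single
memcpy. The function name, and the use of waitpid() in place of MPI's
own synchronisation, are hypothetical simplifications.

```c
#define _GNU_SOURCE            /* for MAP_ANONYMOUS on strict compilers */
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Receive `len` bytes produced by a forked "sender" whose send buffer was
 * placed in shared memory ahead of time. Returns 0 on success, -1 on error. */
static int single_copy_from_shared(char *dst, size_t len)
{
    assert(len > 0);           /* mmap() rejects zero-length mappings */

    /* Mapped MAP_SHARED before fork(), so both processes see the same pages. */
    char *shared = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(shared, len);
        return -1;
    }
    if (pid == 0) {
        /* Sender: the application builds its message in place. */
        for (size_t i = 0; i < len; i++)
            shared[i] = (char)(i & 0x7f);
        _exit(0);
    }

    /* Receiver: waiting for the child stands in for MPI synchronisation. */
    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status) ||
        WEXITSTATUS(status) != 0) {
        munmap(shared, len);
        return -1;
    }
    memcpy(dst, shared, len);  /* the only copy on the receive side */
    munmap(shared, len);
    return 0;
}
```

Compared with the process_vm_readv() style of CMA there is no
per-message system call on the receive side, which is why this setup
works as an upper bound; the catch is that every buffer that might be
sent has to be placed in shared memory up front.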

Chris
-- 
cyeoh@xxxxxxxxxx
