-----Original Message-----
From: herron.philip@xxxxxxxxxxxxxx [mailto:herron.philip@xxxxxxxxxxxxxx] On Behalf Of Philip Herron
Sent: Monday, August 29, 2011 9:58 PM
To: Niphadkar, Sameer
Subject: Re: Memory Trace project Help

On 29 August 2011 12:07, Niphadkar, Sameer <Sameer.Niphadkar@xxxxxxxxxxxx> wrote:
> Hi guys,
>
> I hope to get your valuable input on this pet project of mine. Please
> feel free to share your ideas, suggestions and recommendations.
>
> I've collected a huge number of memory traces, almost 10 GB of data.
> These memory traces were gathered from a set of servers, desktops and
> laptops in a university CS department. Each trace file contains a list
> of hashes representing the contents of the machine's memory, as well
> as some meta information about the running processes and OS type.
>
> The traces have been grouped by type and date. Traces were recorded
> approximately every 30 minutes, although if a machine was turned off
> or away from an internet connection for a long period, no traces were
> acquired. Each trace file is split into two portions. The top segment
> is ASCII text containing the system metadata: the operating system
> type and a list of running processes. This is followed by binary data
> containing the list of hashes generated for each page in the system.
> Hashes are stored as consecutive 32-bit values. There is a simple tool
> called "traceReader" for extracting the hashes from a trace file. It
> takes as an argument the file to be parsed and outputs the hash list
> as a series of integer values. If you would like to compare two traces
> to estimate the amount of sharing between them, you could run:
>
> ./traceReader trace-x.dat > trace-all
> ./traceReader trace-y.dat >> trace-all
> cat trace-all | sort | uniq -c
>
> This will tell you the number of times that each hash occurs in the
> system.
>
> Now my idea is to take the trace for every interval (every 30 minutes)
> for each of the systems and find the frequency of each memory hash. I
> then plan to collect the highest frequencies (the maximally occurring
> hashes) over the entire hour (60 minutes) and then divide the memory
> into 'k' different patterns based on the counts of these frequencies.
> For instance, if hashes 14F430C8, 1550068, 15AD480A, 161384B6,
> 16985213, 17CA274B, 18E5F038 and 1A3329 have the highest frequencies,
> then I might divide the memory into 8 patterns (k=8). I plan to use
> the Approximate Nearest Neighbor (ANN) library,
> http://www.cs.umd.edu/~mount/ANN/, for this division. In ANN one needs
> to provide a set of query points, a set of data points and a
> dimension. I guess in my case the query points can be all the
> remaining hashes other than the highest-frequency ones, the data
> points are all the hashes for the hour, and the dimension can be 1. I
> can thus formulate the memory patterns for every hour; I then plan to
> formulate memory patterns for every 3 hrs, 6 hrs, 12 hrs and finally
> all 24 hrs. Armed with these statistics, I plan to compare the
> patterns based on the time of day. I hope to find a certain overlap
> among the patterns and create what I call "heat zones" for memory
> based on the time of day, and finally come up with a suitable report
> on them.
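
To make the pattern-division step above concrete, roughly what I have
in mind for one hour's worth of traces is the following (untested)
sketch in Python. It assumes traceReader prints one decimal integer
hash per line (adjust the parsing if the output is hexadecimal), treats
the k highest-frequency hashes as the pattern centres, and uses a plain
exact 1-D nearest-neighbour assignment where the real run would call
the ANN library:

#!/usr/bin/env python
# Untested sketch: count hash frequencies over one hour of traces, take the
# k most frequent hashes as "pattern centres", and assign every remaining
# hash to its nearest centre. A plain exact 1-D nearest-neighbour search
# stands in here for the ANN library.
# Assumes each input file is traceReader output: one integer hash per line.

import sys
import bisect
from collections import Counter

K = 8  # number of patterns, as in the k = 8 example above

def load_hashes(path):
    with open(path) as f:
        # use int(line, 16) instead if traceReader prints hexadecimal values
        return [int(line) for line in f if line.strip()]

def nearest_index(centres, h):
    # centres must be sorted ascending; return the index of the closest one
    i = bisect.bisect_left(centres, h)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(centres)]
    return min(candidates, key=lambda j: abs(centres[j] - h))

def main(trace_files):
    counts = Counter()
    for path in trace_files:        # e.g. the two 30-minute traces of one hour
        counts.update(load_hashes(path))

    centres = sorted(h for h, _ in counts.most_common(K))
    patterns = dict((c, []) for c in centres)

    for h in counts:
        if h in patterns:           # skip the centres themselves
            continue
        patterns[centres[nearest_index(centres, h)]].append(h)

    for c in centres:
        print("pattern %08X: centre seen %d times, %d other hashes assigned"
              % (c, counts[c], len(patterns[c])))

if __name__ == "__main__":
    main(sys.argv[1:])

The same counting would then be repeated over 3-, 6-, 12- and 24-hour
windows before comparing the patterns across times of day.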
>
> The entire objective of this project is to establish a relation
> between memory page access and the time of day, so that for specific
> intervals there are certain memory "heat zones". I understand that
> these "heat zones" might change and may not be consistent across every
> system and user. The study here intends only to establish this
> relationship and does not attempt any qualitative or quantitative
> analysis of these heat zones per system and user; that analysis can be
> considered an extension of this work.
>
> Please feel free to comment and suggest any new insights.

Maybe I missed something, but I am not sure why you posted this to
gcc-help, and I am not quite sure what the reason behind doing this
really is.

If you are looking at what pages are being accessed at different
intervals during the day, are you looking up specific addresses or
looking for specific data within them? Are you, in the end, looking to
find something like: computers x through z generally run program A at
this time interval, but computers f through i run program B at the same
interval? If so, you can just look up the process list. Or are you
looking for specific addresses in memory which are being accessed more
frequently? That will change from computer to computer, and even if,
say, with operating system A you find that these addresses are the most
frequently accessed, can you really make any assumptions about that?

I'm sorry to have posted it to the wrong list - I intended to post this
message to the gcc-developer list, as I believe it has some good
experts on memory analytics and management.

As pointed out, there are two aspects to this project:

1. Find out which processes run most frequently at a particular time
   interval on different systems (this may be the easier option).
2. Go deeper into the physical memory (PM) trace and find the
   relationship between the PM addresses and the most frequent accesses
   per universal time clock per system.

I understand that with randomized address space mappings, and with
different systems running different processes, it might be very hard to
find any suitable pattern emerging from this study. But, as most of us
know, identical systems belonging to a particular network might end up
accessing similar PM blocks during a given time frame (a block here
being a group of pages). I intend to find out whether there is any kind
of correlation between this time frame and the access. According to the
working set model of a system, there exists temporal and spatial
locality of memory page access, and hence we end up using the
appropriate page replacement algorithms. Now I intend to see whether
the same analogy can be applied to the entire memory address space:
that is, whether some sort of pattern emerges for physical memory
access based on time and space. I hope to learn whether any similar
work has been done before with memory traces, or whether there are any
other areas I need to look into before I can begin this study.
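
As a first, very simple cut at quantifying how much two machines share
within the same time slot, I would start with the pairwise overlap
(Jaccard index) of their hash sets - again only an untested sketch,
assuming the same one-integer-hash-per-line output from traceReader:

#!/usr/bin/env python
# Untested sketch: pairwise overlap between traces captured in the same time
# slot on different machines, as a first cut at spotting shared "heat zones".
# Assumes each input file is traceReader output: one integer hash per line.

import sys
from itertools import combinations

def load_hash_set(path):
    with open(path) as f:
        return set(int(line) for line in f if line.strip())

def main(paths):
    sets = dict((p, load_hash_set(p)) for p in paths)
    for a, b in combinations(paths, 2):
        shared = sets[a] & sets[b]
        union = sets[a] | sets[b]
        jaccard = float(len(shared)) / len(union) if union else 0.0
        print("%s vs %s: %d shared hashes, Jaccard %.3f"
              % (a, b, len(shared), jaccard))

if __name__ == "__main__":
    main(sys.argv[1:])

Running this over, say, all the traces taken around 09:00 on one day,
and then over traces taken at random times, should give a first
impression of whether the overlap at a fixed time of day is
consistently higher.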