I've been doing quite a lot of work with async, but even so, tracking down the recent bugs in the async code took a significant amount of my time. I had to go through the code from beginning to end one more time to get a better understanding, because I had simply forgotten about some things. After that, I realised that all this async magic is not obvious, so it could be useful to write down some information about it (so other people don't have to do the work I've already done to understand it). Hopefully this could be turned into a wiki page or be used to discuss possible changes/improvements to the async code. So here we go, let's start with a brief overview of the process:

[[[ OVERLORD ]]]

[ func/overlord/client.py ]

Our journey starts in the Overlord() class, which has a self.async variable indicating whether we are running in async or normal (non-async, i.e. sync) mode. Its run() method is invoked each time you call any method that is not directly defined in the Overlord() class (through __getattr__() magic). Because of all the delegation machinery, the actual work that runs on the final overlord is done in the run_direct() method. If self.async is set, jobs are not called directly; jobthing.batch_run() is used instead.

[ func/jobthing.py ]

The first thing done in the batch_run() method is to generate a job_id. It is a string built from the time the job was called, the module, the method, the arguments that were used, and the glob value used to call the job. It's basically created like this:

  job_id = "".join([glob,"-",module,"-",method,"-",pprint.pformat(time.time())])

At this point, the status for this job_id is set to JOB_ID_RUNNING in the *overlord* status file (currently a dbm file). The process is then forked: the parent goes back to its duties while the child takes care of the rest of the work. This shouldn't take too much time, though, since all the child does is call forkbomb.batch_run(), which returns the job_ids returned *from the minions*. These job_ids are then written to the overlord status file, the status is changed to JOB_ID_PARTIAL, and the child process exits. (A rough sketch of this overlord-side flow follows at the end of this overview.)

[ func/forkbomb.py ]

The batch_run() method in the forkbomb module takes the list of minions that we want to run the job on (the pool) and divides it into NFORKS smaller lists (buckets). There is one bucket for each fork that will be used. The __forkbomb() method then creates NFORKS worker processes (using recursion) and passes one bucket to each of them. The __with_my_bucket() method, which iterates over the bucket and runs the job on each minion assigned to it, is called in each worker process. Remote methods are called in almost the same way as in the normal (non-async) case; the only change is the "async." prefix added to the method name.

[[[ MINION ]]]

[ func/minion/server.py ]

The minion's tale starts in the _dispatch() method of the FuncSSLXMLRPCServer class. If the method name starts with "async.", the prefix is stripped and the method is called via jobthing.minion_async_run() instead of being called directly.

[ func/jobthing.py ]

The first thing done in the minion_async_run() method is job_id generation. The minion-side job_id has a different format from the overlord one; it's created like this:

  job_id = "%s-minion" % pprint.pformat(time.time())

Next, the status of the job (in the *minion* status file) is set to JOB_ID_RUNNING with a result equal to -1. The process is daemonized using the double-fork technique. The parent goes back to its normal duties while the daemonized fork runs the job independently. After the work is done, the daemonized process writes the job result and status (JOB_ID_FINISHED) to the minion status file and exits. (A sketch of this is included below as well.)
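To make this more concrete, here is a rough sketch of the overlord-side flow described above (jobthing.batch_run() plus a stand-in for forkbomb). This is NOT the real func code: the status constants and their values, the status file path, and helpers like set_status() and call_async_on_minion() are placeholders of mine, shelve stands in for the dbm file, and the buckets are walked sequentially instead of forking one worker per bucket the way forkbomb.py does. It only shows the shape of the mechanism:

    import os
    import pprint
    import shelve
    import time

    NFORKS = 4
    OVERLORD_STATUS_FILE = "/tmp/overlord_status"   # stand-in for the real dbm file

    # placeholder values; the real constants live in func/jobthing.py
    JOB_ID_RUNNING  = 0
    JOB_ID_PARTIAL  = 1
    JOB_ID_FINISHED = 2

    def set_status(job_id, status, payload=None):
        # the real code writes to a dbm file; shelve keeps the sketch short
        db = shelve.open(OVERLORD_STATUS_FILE)
        db[job_id] = (status, payload)
        db.close()

    def call_async_on_minion(minion, module, method, args):
        # stand-in for the remote XMLRPC call to "async." + method;
        # the real call returns the minion-side job_id
        return "%s-minion" % pprint.pformat(time.time())

    def forkbomb_batch_run(pool, module, method, args):
        # the real forkbomb.py splits the pool into NFORKS buckets and forks
        # one worker per bucket; here the buckets are just walked in order
        buckets = [pool[i::NFORKS] for i in range(NFORKS)]
        minion_job_ids = {}
        for bucket in buckets:
            for minion in bucket:
                minion_job_ids[minion] = call_async_on_minion(minion, module, method, args)
        return minion_job_ids

    def batch_run(pool, module, method, args, glob="*"):
        # overlord-side job_id, built as described above
        job_id = "".join([glob, "-", module, "-", method, "-", pprint.pformat(time.time())])
        set_status(job_id, JOB_ID_RUNNING)

        pid = os.fork()
        if pid != 0:
            # parent: return the job_id right away, callers poll the status file later
            return job_id

        # child: fan the call out, record the minion job_ids, flip to PARTIAL, exit
        minion_job_ids = forkbomb_batch_run(pool, module, method, args)
        set_status(job_id, JOB_ID_PARTIAL, minion_job_ids)
        os._exit(0)

And here is a similarly simplified sketch of the minion side (jobthing.minion_async_run()), mainly to show the double-fork daemonization and the writes to the *minion* status file. Again, the helper and file names are placeholders, not the real func API:

    import os
    import pprint
    import shelve
    import time

    MINION_STATUS_FILE = "/tmp/minion_status"   # stand-in for the minion dbm file

    JOB_ID_RUNNING  = 0
    JOB_ID_FINISHED = 2

    def set_minion_status(job_id, status, result):
        db = shelve.open(MINION_STATUS_FILE)
        db[job_id] = (status, result)
        db.close()

    def minion_async_run(function_ref, args):
        # minion-side job_id, built as described above
        job_id = "%s-minion" % pprint.pformat(time.time())
        set_minion_status(job_id, JOB_ID_RUNNING, -1)

        # double fork so the job keeps running detached from the request handler
        pid = os.fork()
        if pid != 0:
            os.waitpid(pid, 0)          # reap the short-lived intermediate child
            return job_id               # hand the job_id back to the overlord

        os.setsid()
        if os.fork() != 0:
            os._exit(0)                 # intermediate child exits immediately

        # daemonized grandchild: do the real work, record the result, exit
        result = function_ref(*args)
        set_minion_status(job_id, JOB_ID_FINISHED, result)
        os._exit(0)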
There are two things worth noting:

- There are two kinds of job_ids. There is one overlord job_id for each async call. This single call can result in many minion job_ids (one on each minion).

- There are two *separate* status files - one on the overlord and one on the minion. They store DIFFERENT information. The minion status file stores the status (and returned result) of the running job. The overlord status file is used to map the overlord job_id to the minion job_ids on each minion. After the job is finished, the overlord status file contains the result, since we don't need the minion job_ids anymore, and this lets us get the result without doing any remote call (it was Denis's idea and I really think it's a good one). What was wrong with Denis's patch was that he also updated the overlord status file when there were only partial results, overwriting the remote job_ids and making it impossible to ask for the rest of the results. (There's a small sketch of this lookup at the end of this mail.)

Here's how the job status should change over time in the case of a successful call:

            OVERLORD         MINION1          MINION2
  step 1:   JOB_ID_RUNNING   -                -
  step 2:   JOB_ID_RUNNING   JOB_ID_RUNNING   -
  step 3:   JOB_ID_PARTIAL   JOB_ID_RUNNING   JOB_ID_RUNNING
  step 4:   JOB_ID_PARTIAL   JOB_ID_FINISHED  JOB_ID_RUNNING
  step 5:   JOB_ID_PARTIAL   JOB_ID_FINISHED  JOB_ID_FINISHED
  step 6:   JOB_ID_FINISHED  JOB_ID_FINISHED  JOB_ID_FINISHED

I hope this e-mail will be helpful for someone. I'm waiting for corrections and questions :)
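PS: One more sketch, this time of how an overlord-side status lookup can treat the two kinds of entries in the overlord status file described above. As before, the file name, the constant values, and the helpers (poll_minion(), job_status()) are placeholders, not the real func code, and poll_minion() just fakes the remote call. The point it tries to show is the rule from the Denis patch issue: the map of minion job_ids may only be replaced by results once every minion reports JOB_ID_FINISHED.

    import shelve

    OVERLORD_STATUS_FILE = "/tmp/overlord_status"   # same placeholder file as above

    # placeholder values again, not the real constants
    JOB_ID_RUNNING  = 0
    JOB_ID_PARTIAL  = 1
    JOB_ID_FINISHED = 2

    def poll_minion(minion, minion_job_id):
        # stand-in for asking a minion about one of its own job_ids (which would
        # read that minion's status file); here it just pretends the job is done
        return JOB_ID_FINISHED, "fake result from %s" % minion

    def job_status(job_id):
        db = shelve.open(OVERLORD_STATUS_FILE)
        status, payload = db[job_id]

        if status == JOB_ID_RUNNING:
            # the overlord child has not written the minion job_ids yet
            db.close()
            return status, {}

        if status == JOB_ID_FINISHED:
            # payload already holds the collected results, no remote calls needed
            db.close()
            return status, payload

        # JOB_ID_PARTIAL: payload maps minion -> minion job_id, so ask each minion
        results = {}
        all_done = True
        for minion, minion_job_id in payload.items():
            minion_status, result = poll_minion(minion, minion_job_id)
            results[minion] = result
            if minion_status != JOB_ID_FINISHED:
                all_done = False

        if all_done:
            # only now is it safe to replace the minion job_ids with the results;
            # doing it earlier would lose the job_ids needed to fetch the rest
            db[job_id] = (JOB_ID_FINISHED, results)
        db.close()

        return (JOB_ID_FINISHED if all_done else JOB_ID_PARTIAL), results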