Yesterday ssalevan created a great script for testing the async Func API. Running this test we found several problems, and I have spent some time resolving them. Unfortunately I have an exam tomorrow and won't be able to work on the issues until then, so I'm posting all the information I have found so far:

1. Funcd, when running an async job, forks, letting the child do the work while the parent immediately returns to its normal duties. The child exits after finishing the job, but the parent never waits for SIGCHLD, which leaves one zombie process on the minion per async job. This managed to kill my remote test box yesterday after a couple of hundred test jobs, so it can be dangerous.

I created a patch yesterday which registered an empty SIGCHLD handler before the fork. No more zombie processes were created, but a huge memory leak was observed instead. I added some code counting references to each class type, and it quickly turned out that with the SIGCHLD handler registered, SSLConnection objects are not freed, and the process soon has tens of open sockets (visible in /proc/<pid of funcd>/fd). So we can't wait for every SIGCHLD, since in that case we also catch the processes forked by TCPServer, and that creates problems. I see two possible solutions:

a) Use threads instead of processes. We still leak memory, but at least we don't leave sockets open and we don't create zombies. I will then try to figure out what is leaking.

b) Untested - keep a list of the pids of our forks and call a non-blocking waitpid from time to time to reap the zombies.

2. Database state file corruption. Sometimes, after running the test script in a loop a couple of times, we were getting an error with the following traceback:

Traceback (most recent call last):
  File "test.py", line 21, in ?
    jobid = overlord.command.run(cmd)
  File "/usr/lib/python2.4/site-packages/func/overlord/client.py", line 64, in __call__
    return self.clientref.run(module,method,args,nforks=self.nforks)
  File "/usr/lib/python2.4/site-packages/func/overlord/client.py", line 311, in run
    results = jobthing.batch_run(self.minions, process_server, nforks)
  File "/usr/lib/python2.4/site-packages/func/jobthing.py", line 126, in batch_run
    __update_status(job_id, JOB_ID_PARTIAL, results)
  File "/usr/lib/python2.4/site-packages/func/jobthing.py", line 41, in __update_status
    return __access_status(jobid=jobid, status=status, results=results, write=True)
  File "/usr/lib/python2.4/site-packages/func/jobthing.py", line 84, in __access_status
    __purge_old_jobs(storage)
  File "/usr/lib/python2.4/site-packages/func/jobthing.py", line 57, in __purge_old_jobs
    for x in storage.keys():
  File "/usr/lib/python2.4/shelve.py", line 98, in keys
    return self.dict.keys()
  File "/usr/lib/python2.4/bsddb/__init__.py", line 244, in keys
    return self.db.keys()
bsddb._db.DBPageNotFoundError: (-30987, 'DB_PAGE_NOTFOUND: Requested page not found')

It seems to be fixed (at least I can't reproduce the error even after 500 calls) if we lock the file BEFORE opening the database:

    handle = open(filename, "r")
    fcntl.flock(handle.fileno(), fcntl.LOCK_EX)
    internal_db = bsddb.btopen(filename, 'c', 0644)

instead of:

    internal_db = bsddb.btopen(filename, 'c', 0644)
    handle = open(filename, "r")
    fcntl.flock(handle.fileno(), fcntl.LOCK_EX)

The only problem is that we first have to open the file in read mode (it can't be write mode) to get the lock. If the file does not exist we have to create it, which adds a couple of extra lines. I will post the patch soon.

_______________________________________________
Func-list mailing list
Func-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/func-list
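For what it's worth, option (b) from point 1 could look roughly like the sketch below. This is untested against funcd itself and written in plain Python; the names spawn_async_job, reap_children and child_pids are mine for illustration, not funcd's:

```python
import os

# List of pids of the children we forked for async jobs. Because we reap
# only these pids explicitly, we never touch the children TCPServer forks.
child_pids = []

def spawn_async_job(job):
    """Fork a worker for an async job and remember its pid."""
    pid = os.fork()
    if pid == 0:
        # child: do the work, then exit without returning to the server loop
        try:
            job()
        finally:
            os._exit(0)
    child_pids.append(pid)

def reap_children():
    """Non-blocking reap, called from time to time in the main loop:
    collect any finished children, leave the still-running ones alone."""
    still_running = []
    for pid in child_pids:
        finished, status = os.waitpid(pid, os.WNOHANG)
        if finished == 0:
            still_running.append(pid)   # child has not exited yet
    child_pids[:] = still_running
```

Calling reap_children() periodically (e.g. once per pass through the server loop) keeps the zombie count bounded without installing any SIGCHLD handler at all.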
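The create-then-lock dance from point 2 could be wrapped in a small helper like the one below. lock_state_file is my name for it, not anything in jobthing.py, and since bsddb only exists in Python 2, the btopen call is shown as a comment only:

```python
import fcntl
import os

def lock_state_file(filename):
    """Take an exclusive flock on the job-state file, creating the file
    first if it does not exist (flock needs an already-open descriptor,
    and the file has to be opened in read mode to get the lock)."""
    if not os.path.exists(filename):
        # create an empty file so there is something to lock
        open(filename, "w").close()
    handle = open(filename, "r")
    fcntl.flock(handle.fileno(), fcntl.LOCK_EX)
    return handle   # keep this handle open; closing it drops the lock

# usage sketch, following the fixed ordering from the post:
#   handle = lock_state_file(filename)
#   internal_db = bsddb.btopen(filename, 'c', 0644)
#   ... read/write job status ...
#   internal_db.close()
#   fcntl.flock(handle.fileno(), fcntl.LOCK_UN)
#   handle.close()
```

The important part is that the lock is held before btopen touches the file, so two funcd processes can never open the btree database concurrently, which is what appeared to corrupt its pages.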