Yesterday ssalevan created a great script for testing the async Func API. Running this test we found several problems, and I have spent some time resolving them. Unfortunately I have an exam tomorrow and won't be able to work on the issues until then, so I'm posting all the information I have found so far:

1. Funcd, when running an async job, forks, letting the child do the work while the parent immediately returns to its normal duties. The child exits after finishing the job, but the parent never waits for SIGCHLD, which leaves one zombie process on the minion per async job. This managed to kill my remote test box yesterday after a couple of hundred test jobs, so it can be dangerous.

I created a patch yesterday which registered an empty SIGCHLD handler before the fork. No more zombie processes were created, but a huge memory leak was observed instead. I added some code counting references to each class type, and it quickly turned out that with the SIGCHLD handler registered, SSLConnection objects are not freed, and the process soon has tens of open sockets (visible in /proc/<pid of funcd>/fd). So we can't wait for every SIGCHLD, since in that case we also catch the processes forked by TCPServer, and that creates problems. I see two possible solutions:

a) Use threads instead of processes. We still leak memory, but at least we don't leave sockets open and we don't create zombies. I will then try to figure out what is leaking.

b) Untested - keep a list of the pids of our forks and call a non-blocking waitpid from time to time to reap the zombies.

2. Database state file corruption. Sometimes, after running the test script in a loop a couple of times, we were getting an error with the following traceback:

Traceback (most recent call last):
  File "test.py", line 21, in ?
    jobid = overlord.command.run(cmd)
  File "/usr/lib/python2.4/site-packages/func/overlord/client.py", line 64, in __call__
    return self.clientref.run(module,method,args,nforks=self.nforks)
  File "/usr/lib/python2.4/site-packages/func/overlord/client.py", line 311, in run
    results = jobthing.batch_run(self.minions, process_server, nforks)
  File "/usr/lib/python2.4/site-packages/func/jobthing.py", line 126, in batch_run
    __update_status(job_id, JOB_ID_PARTIAL, results)
  File "/usr/lib/python2.4/site-packages/func/jobthing.py", line 41, in __update_status
    return __access_status(jobid=jobid, status=status, results=results, write=True)
  File "/usr/lib/python2.4/site-packages/func/jobthing.py", line 84, in __access_status
    __purge_old_jobs(storage)
  File "/usr/lib/python2.4/site-packages/func/jobthing.py", line 57, in __purge_old_jobs
    for x in storage.keys():
  File "/usr/lib/python2.4/shelve.py", line 98, in keys
    return self.dict.keys()
  File "/usr/lib/python2.4/bsddb/__init__.py", line 244, in keys
    return self.db.keys()
bsddb._db.DBPageNotFoundError: (-30987, 'DB_PAGE_NOTFOUND: Requested page not found')

It seems to be fixed (at least I can't reproduce the error even after 500 calls) if we lock the file BEFORE opening the database:

    handle = open(filename, "r")
    fcntl.flock(handle.fileno(), fcntl.LOCK_EX)
    internal_db = bsddb.btopen(filename, 'c', 0644)

instead of:

    internal_db = bsddb.btopen(filename, 'c', 0644)
    handle = open(filename, "r")
    fcntl.flock(handle.fileno(), fcntl.LOCK_EX)

The only problem is that we first have to open the file in read mode (it can't be write mode) to get the lock. If the file does not exist we have to create it, which adds a couple of extra lines. I will post the patch soon.

_______________________________________________
Func-list mailing list
Func-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/func-list
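For what it's worth, option (b) from point 1 could look roughly like the sketch below. This is untested against funcd itself and written in plain Python; the names spawn_async_job, reap_children and child_pids are mine for illustration, not funcd's:

```python
import os

# List of pids of the children we forked for async jobs. Because we reap
# only these pids explicitly, we never touch the children TCPServer forks.
child_pids = []

def spawn_async_job(job):
    """Fork a worker for an async job and remember its pid."""
    pid = os.fork()
    if pid == 0:
        # child: do the work, then exit without returning to the server loop
        try:
            job()
        finally:
            os._exit(0)
    child_pids.append(pid)

def reap_children():
    """Non-blocking reap, called from time to time in the main loop:
    collect any finished children, leave the still-running ones alone."""
    still_running = []
    for pid in child_pids:
        finished, status = os.waitpid(pid, os.WNOHANG)
        if finished == 0:
            still_running.append(pid)   # child has not exited yet
    child_pids[:] = still_running
```

Calling reap_children() periodically (e.g. once per pass through the server loop) keeps the zombie count bounded without installing any SIGCHLD handler at all.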
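The create-then-lock dance from point 2 could be wrapped in a small helper like the one below. lock_state_file is my name for it, not anything in jobthing.py, and since bsddb only exists in Python 2, the btopen call is shown as a comment only:

```python
import fcntl
import os

def lock_state_file(filename):
    """Take an exclusive flock on the job-state file, creating the file
    first if it does not exist (flock needs an already-open descriptor,
    and the file has to be opened in read mode to get the lock)."""
    if not os.path.exists(filename):
        # create an empty file so there is something to lock
        open(filename, "w").close()
    handle = open(filename, "r")
    fcntl.flock(handle.fileno(), fcntl.LOCK_EX)
    return handle   # keep this handle open; closing it drops the lock

# usage sketch, following the fixed ordering from the post:
#   handle = lock_state_file(filename)
#   internal_db = bsddb.btopen(filename, 'c', 0644)
#   ... read/write job status ...
#   internal_db.close()
#   fcntl.flock(handle.fileno(), fcntl.LOCK_UN)
#   handle.close()
```

The important part is that the lock is held before btopen touches the file, so two funcd processes can never open the btree database concurrently, which is what appeared to corrupt its pages.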