On Fri, May 12, 2017 at 12:52 PM, Larry Martell <larry.martell@xxxxxxxxx> wrote:
> On Fri, May 12, 2017 at 12:21 PM, Brian Mathis
> <brian.mathis+centos@xxxxxxxxxxxxxxx> wrote:
>> On Fri, May 12, 2017 at 11:44 AM, Larry Martell <larry.martell@xxxxxxxxx> wrote:
>>
>>> On Thu, May 11, 2017 at 7:58 PM, Alexander Dalloz <ad+lists@xxxxxxxxx> wrote:
>>> > On 11.05.2017 at 20:30, Larry Martell wrote:
>>> >>
>>> >> On Wed, May 10, 2017 at 3:19 PM, Larry Martell <larry.martell@xxxxxxxxx> wrote:
>>> >>>
>>> >>> On Wed, May 10, 2017 at 3:07 PM, Jonathan Billings <billings@xxxxxxxxxx> wrote:
>>> >>>>
>>> >>>> On Wed, May 10, 2017 at 02:40:04PM -0400, Larry Martell wrote:
>>> >>>>>
>>> >>>>> I have a CentOS 7 system that I run a home-grown Python daemon on. I
>>> >>>>> run this same daemon on many other systems without any incident. On
>>> >>>>> this one system the daemon seems to die or be killed every day around
>>> >>>>> 3:30am. There is nothing in its log or any system logs that tells me
>>> >>>>> why it dies. However, in /var/log/messages every day I see something
>>> >>>>> like this:
>>> >>>>
>>> >>>> How are you starting this daemon?
>>> >>>
>>> >>> I am using code something like this:
>>> >>> https://gist.github.com/slor/5946334.
>>> >>>
>>> >>>> Can you check the journal? Perhaps
>>> >>>> you'll see more useful information than what you see in the syslogs?
>>> >>>
>>> >>> Thanks, I will do that.
>>> >>
>>> >> Thank you for that suggestion. I was able to get someone to run
>>> >> journalctl and send me the output, and it was very interesting.
>>> >>
>>> >> First, there is logging going on continuously during the time when
>>> >> logging stops in /var/log/messages.
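(An aside for anyone searching the archives later: here is a sketch of how we narrowed the journal to the nightly crash window. The times, the `python` identifier, and the OOM grep are examples for this kind of setup, not confirmed details of our box.)

```shell
# Sketch: inspect the journal around the nightly crash window.
# Times and the SYSLOG_IDENTIFIER value are examples -- adjust to your system.
if command -v journalctl >/dev/null 2>&1; then
    journalctl --since "03:00" --until "04:30" --no-pager | tail -n 40
    # Messages logged under the 'python' identifier:
    journalctl SYSLOG_IDENTIFIER=python --no-pager | tail -n 20
    # A silent nightly death is often the OOM killer; the kernel ring
    # buffer in the journal would show it:
    oom_count=$(journalctl -k --no-pager 2>/dev/null | grep -ciE 'out of memory|killed process')
    status="journal checked, ${oom_count:-0} OOM-related kernel messages"
else
    status="journalctl not available on this host"
fi
echo "$status"
```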
>>> >>
>>> >> Second, I see messages like this periodically:
>>> >>
>>> >> May 10 03:57:46 localhost.localdomain python[40222]: detected
>>> >> unhandled Python exception in
>>> >> '/usr/local/motor/motor/core/data/importer.py'
>>> >> May 10 03:57:46 localhost.localdomain abrt-server[40277]: Only 0MiB is
>>> >> available on /var/spool/abrt
>>> >> May 10 03:57:46 localhost.localdomain python[40222]: error sending
>>> >> data to ABRT daemon:
>>> >>
>>> >> This happens at various times of the day, and I do not think it is
>>> >> related to the daemon crashing.
>>> >>
>>> >> But I did see one occurrence of this:
>>> >>
>>> >> May 09 03:49:35 localhost.localdomain python[14042]: detected
>>> >> unhandled Python exception in
>>> >> '/usr/local/motor/motor/core/data/importerd.py'
>>> >> May 09 03:49:35 localhost.localdomain abrt-server[22714]: Only 0MiB is
>>> >> available on /var/spool/abrt
>>> >> May 09 03:49:35 localhost.localdomain python[14042]: error sending
>>> >> data to ABRT daemon:
>>> >>
>>> >> And that is the daemon. But I only see that on this one day, and it
>>> >> crashes every day.
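(With hindsight, those "Only 0MiB is available on /var/spool/abrt" lines were the tell. A quick way to confirm whether space or inodes are exhausted, and where the space went, is something like the sketch below; the mount points are assumptions, adjust to the actual layout.)

```shell
# Sketch: check space and inodes, then find the biggest directories.
df -hT /        # space, with filesystem type
df -i /         # inodes -- millions of files can exhaust these before space
# Top consumers under /var (the logs and the mlocate db both live here):
du -xm --max-depth=2 /var 2>/dev/null | sort -rn | head -n 10
# Machine-readable free space on /, in MiB:
free_mb=$(df -Pm / | awk 'NR==2 {print $4}')
echo "free on /: ${free_mb} MiB"
```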
>>> >>
>>> >> And I see this type of message frequently throughout the day, every day:
>>> >>
>>> >> May 09 03:40:01 localhost.localdomain CROND[21447]: (motor) CMD
>>> >> (python /usr/local/motor/motor/scripts/image_mover.py -v1 -d
>>> >> /usr/local/motor/data > ~/last_image_move_log.txt)
>>> >> May 09 03:40:01 localhost.localdomain abrt-server[21453]: Only 0MiB is
>>> >> available on /var/spool/abrt
>>> >> May 09 03:40:01 localhost.localdomain python[21402]: error sending
>>> >> data to ABRT daemon:
>>> >> May 09 03:40:01 localhost.localdomain postfix/postdrop[21456]:
>>> >> warning: uid=0: No space left on device
>>> >> May 09 03:40:01 localhost.localdomain postfix/sendmail[21455]: fatal:
>>> >> root(0): queue file write error
>>> >> May 09 03:40:01 localhost.localdomain crond[2630]: postdrop: warning:
>>> >> uid=0: No space left on device
>>> >> May 09 03:40:01 localhost.localdomain crond[2630]: sendmail: fatal:
>>> >> root(0): queue file write error
>>> >> May 09 03:40:01 localhost.localdomain CROND[21443]: (root) MAIL
>>> >> (mailed 67 bytes of output but got status 0x004b)
>>> >>
>>> >> So it seems there is a space issue.
>>> >>
>>> >> And finally, coinciding with the time that the logging resumes in
>>> >> /var/log/messages, I see this every day at that time:
>>> >>
>>> >> May 10 03:57:57 localhost.localdomain
>>> >> run-parts(/etc/cron.daily)[40293]: finished mlocate
>>> >> May 10 03:57:57 localhost.localdomain anacron[33406]: Job `cron.daily'
>>> >> terminated (mailing output)
>>> >> May 10 03:57:57 localhost.localdomain anacron[33406]: Normal exit (1 job run)
>>> >>
>>> >> I need to get my remote hands to get me more info.
>>> >
>>> > df -hT; df -i
>>> >
>>> > There is no space left on a vital partition / logical volume.
>>> >
>>> > "Only 0MiB is available on /var/spool/abrt"
>>> >
>>> > "postdrop: warning: uid=0: No space left on device"
>>>
>>> Yes, I saw that and assumed that was the root cause of the issue.
>>> But
>>> when I had my guy over in Japan check, he found that / had 15G (of 50)
>>> free. We did some more investigating, and it seems that when mlocate
>>> runs, the disk fills up and bad things happen. Why is that happening?
>>> Is it because 15G of free space is not enough? We ran a du, and most of
>>> the space on / was used by /var/log (11G) and /var/lib/mlocate (20G).
>>> Can I disable mlocate and get rid of that large dir?
>>
>> 20GB for mlocate is absolutely (and suspiciously) huge. You must have
>> millions and millions of files on that server. If not, then there's
>> something wrong with mlocate. mlocate can be removed unless you're using
>> it; nothing else in CentOS really depends on it. You'd need to
>> evaluate whether anyone else is using it on that server.
>
> Yes, we do have millions and millions of files (give or take a million
> or so). I am going to disable mlocate and remove the db and see if
> this fixes the issues we've been having.

Since disabling mlocate and removing its db we have not had the daemon crash. Thanks to all who helped!

_______________________________________________
CentOS mailing list
CentOS@xxxxxxxxxx
https://lists.centos.org/mailman/listinfo/centos
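For the archive, the cleanup described in this thread as commands, as a sketch: the cron job path and package name are the CentOS 7 defaults as I recall them, so verify them on the target box before removing anything. The measurement part is safe to run; the destructive steps are left commented out.

```shell
# Sketch: measure the mlocate db, then (as root) disable and remove it.
db=/var/lib/mlocate/mlocate.db
if [ -e "$db" ]; then
    db_mb=$(du -m "$db" 2>/dev/null | awk '{print $1}')
else
    db_mb=0
fi
echo "mlocate db size: ${db_mb:-0} MiB"
# The destructive steps, commented so nothing runs by accident:
#   rm -f /etc/cron.daily/mlocate   # stop the nightly updatedb run
#   rm -rf /var/lib/mlocate         # reclaim the space the db was using
#   yum remove mlocate              # or keep the package if anyone still uses 'locate'
```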