Re: core dump / segfault after 48 hour run

Jens Axboe <axboe@xxxxxxxxx> · Mon, 30 Sep 2013 12:18:48 -0600

On 09/30/2013 12:13 PM, Jens Axboe wrote:
> On 09/30/2013 10:20 AM, Roger Sibert wrote:
>> On Mon, Sep 30, 2013 at 12:07 PM, Jens Axboe <axboe@xxxxxxxxx> wrote:
>>> On 09/30/2013 07:04 AM, Roger Sibert wrote:
>>>> Hello Everyone,
>>>>
>>>> I was looking to use fio to run full-disk writes to an SSD after doing
>>>> a secure erase, to see how long it takes before the performance
>>>> stabilizes.  After roughly 48 hours I see this on the
>>>> screen.
>>>>
>>>> B2-058:~/longtermruntime # ./fio.64bit.static longtermruntime-192h.fio
>>>> seqwrite-phase: (g=0): rw=write, bs=512K-512K/512K-512K/512K-512K,
>>>> ioengine=libaio, iodepth=16
>>>> fio-2.1.2-15-gd5603
>>>> Starting 1 process
>>>> fio: pid=6895, got signal=11ne] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
>>>> 06d:07h:05m:31s]
>>>>
>>>> seqwrite-phase: (groupid=0, jobs=1): err= 0: pid=6895: Sun Sep 29 03:40:38 2013
>>>>     lat (usec) : 1000=0.01%
>>>>     lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=99.15%
>>>>     lat (msec) : 100=0.56%, 250=0.28%, 500=0.01%, 750=0.01%
>>>>   cpu          : usr=0.00%, sys=0.00%, ctx=0, majf=0, minf=0
>>>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
>>>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>      issued    : total=r=0/w=67108865/d=0, short=r=0/w=0/d=0
>>>>
>>>> Run status group 0 (all jobs):
>>>>   WRITE: io=0KB, aggrb=0KB/s, minb=0KB/s, maxb=0KB/s,
>>>> mint=144006511329msec, maxt=144006511329msec
>>>>
>>>> Disk stats (read/write):
>>>>   sdb: ios=0/67108865, merge=0/0, ticks=0/2354077568,
>>>> in_queue=2353971492, util=100.00%
>>>> fio: file hash not empty on exit
>>>>
>>>> I took a look at one of the core files
>>>>
>>>> B2-057:~/longtermruntime # gdb core core
>>>> GNU gdb (GDB) SUSE (7.0-0.4.16)
>>>> Copyright (C) 2009 Free Software Foundation, Inc.
>>>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>>>> This is free software: you are free to change and redistribute it.
>>>> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
>>>> and "show warranty" for details.
>>>> This GDB was configured as "x86_64-suse-linux".
>>>> For bug reporting instructions, please see:
>>>> <http://www.gnu.org/software/gdb/bugs/>...
>>>> "/root/longtermruntime/core": not in executable format: File format
>>>> not recognized
>>>> Missing separate debuginfo for the main executable file
>>>> Try: zypper install -C
>>>> "debuginfo(build-id)=559375f8a046f376897b4923007bff5b07ecd8d4"
>>>> Core was generated by `./fio.64bit.static longtermruntime-216h.fio'.
>>>> Program terminated with signal 11, Segmentation fault.
>>>> #0  0x000000000040a6c9 in ?? ()
>>>>
>>>> Is there anything else that I can do to help pull out more debug
>>>> information using gdb prior to restarting/retasking this system?  My gdb
>>>> skills aren't that great.
>>>
>>> I know it's a pain to reproduce (especially after a 48h run), but if you
>>> could edit the Makefile and remove the -O3 from the OPTFLAGS, then make
>>> clean, make all, and then reproduce. Then the core files will be of more
>>> use.
>>>
>>> For the core files you have now, try and do a 'bt' when you open them so
>>> I can see a backtrace. That might be enough to see what is going on.
>>>
>>> --
>>> Jens Axboe
>>>
>>
>> Let me try that again...  My gdb skills may be bad, but that doesn't mean
>> I shouldn't recognize I was missing something.
>>
>> I changed how I invoked gdb on the core file, which should give what you
>> were actually asking for.
>>
>> B2-057:~/longtermruntime # gdb ./fio.64bit.static ./core
>> GNU gdb (GDB) SUSE (7.0-0.4.16)
>> Copyright (C) 2009 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>> This is free software: you are free to change and redistribute it.
>> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
>> and "show warranty" for details.
>> This GDB was configured as "x86_64-suse-linux".
>> For bug reporting instructions, please see:
>> <http://www.gnu.org/software/gdb/bugs/>...
>> Reading symbols from /root/longtermruntime/fio.64bit.static...done.
>>
>> warning: core file may not match specified executable file.
>> Core was generated by `./fio.64bit.static longtermruntime-216h.fio'.
>> Program terminated with signal 11, Segmentation fault.
>> #0  0x000000000040a6c9 in __add_log_sample (iolog=0x872510, val=62,
>> ddir=<value optimized out>, bs=<value optimized out>,
>>     t=<value optimized out>) at stat.c:1517
>> 1517    stat.c: No such file or directory.
>>         in stat.c
>> (gdb) bt
>> #0  0x000000000040a6c9 in __add_log_sample (iolog=0x872510, val=62,
>> ddir=<value optimized out>, bs=<value optimized out>,
>>     t=<value optimized out>) at stat.c:1517
>> #1  0x0000000000440b05 in fio_libaio_queued (nr=1, io_us=0x8929a0,
>> td=0x7fe5e312b000) at engines/libaio.c:199
>> #2  fio_libaio_commit (nr=1, io_us=0x8929a0, td=0x7fe5e312b000) at
>> engines/libaio.c:218
>> #3  0x0000000000405385 in td_io_commit (td=0x7fe5e312b000) at ioengines.c:379
>> #4  0x000000000040572a in td_io_queue (td=0x7fe5e312b000,
>> io_u=0x891f20) at ioengines.c:329
>> #5  0x000000000043692f in do_io (td=0x7fe5e312b000) at backend.c:701
>> #6  thread_main (td=0x7fe5e312b000) at backend.c:1314
>> #7  0x0000000000438447 in fork_main (offset=0, shmid=<value optimized
>> out>) at backend.c:1464
>> #8  run_threads (offset=0, shmid=<value optimized out>) at backend.c:1726
>> #9  0x000000000043889d in fio_backend () at backend.c:1912
>> #10 0x00000000004702a4 in __libc_start_main ()
>> #11 0x0000000000000000 in ?? ()
> 
> OK, that helps a whole lot. So my guess is that you ran out of memory.
> Currently fio does not flush out the existing log, it just keeps
> appending to it and flushes at the end. This is done to not disturb the
> actual data run, but it does mean that for long runs, you can gobble up
> a lot of memory...
> 
> I will commit something that is a little more defensive so we don't
> actually segfault, just stop logging. Then we can look into handling it
> better in the future.

I committed this:

http://git.kernel.dk/?p=fio.git;a=commit;h=3c568239a319087a965b06bc2ed94d058810100f

to handle the failure a bit more gracefully at least.
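
For illustration only, and not the contents of that commit: a minimal
sketch of the "stop logging instead of segfaulting" idea for a log that
is kept in memory and grown as samples arrive. The names struct sample,
struct sample_log and sample_log_add() are hypothetical and are not
fio's own, and the doubling growth policy and the 1024-sample starting
size are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sample layout; not fio's actual structure. */
struct sample {
	uint64_t time_ms;
	uint64_t val;
};

/* Hypothetical in-memory log that doubles its capacity when full. */
struct sample_log {
	struct sample *entries;
	size_t nr;	/* samples stored so far */
	size_t max;	/* current capacity, in samples */
	bool disabled;	/* set once growth fails; later samples are dropped */
};

/* Returns true if the sample was stored, false if it was dropped. */
static bool sample_log_add(struct sample_log *log, uint64_t time_ms,
			   uint64_t val)
{
	assert(log != NULL);

	if (log->disabled)
		return false;

	if (log->nr == log->max) {
		size_t new_max = log->max ? log->max * 2 : 1024;
		struct sample *grown;

		/* Do the size math in size_t and refuse to overflow. */
		if (new_max < log->max ||
		    new_max > SIZE_MAX / sizeof(*grown)) {
			log->disabled = true;
			return false;
		}

		grown = realloc(log->entries, new_max * sizeof(*grown));
		if (!grown) {
			/*
			 * Out of memory: the old buffer is still valid, so
			 * keep it for the final flush and stop logging.
			 */
			log->disabled = true;
			return false;
		}

		log->entries = grown;
		log->max = new_max;
	}

	log->entries[log->nr].time_ms = time_ms;
	log->entries[log->nr].val = val;
	log->nr++;
	return true;
}
```

A caller would start from a zero-initialized struct sample_log and
ignore a false return, so the I/O run itself carries on and only the
log is cut short.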

-- 
Jens Axboe
