Re: OSD crash with segfault Luminous 12.2.4

Dietmar Rieder <dietmar.rieder@xxxxxxxxxxx> · Fri, 9 Mar 2018 11:01:10 +0100

On 03/09/2018 12:49 AM, Brad Hubbard wrote:
> On Fri, Mar 9, 2018 at 3:54 AM, Subhachandra Chandra
> <schandra@xxxxxxxxxxxx> wrote:
>> I noticed a similar crash too. Unfortunately, I did not get much info in the
>> logs.
>>
>>  *** Caught signal (Segmentation fault) **
>>
>> Mar 07 17:58:26 data7 ceph-osd-run.sh[796380]:  in thread 7f63a0a97700
>> thread_name:safe_timer
>>
>> Mar 07 17:58:28 data7 ceph-osd-run.sh[796380]: docker_exec.sh: line 56:
>> 797138 Segmentation fault      (core dumped) "$@"
> 
> The log isn't very helpful AFAICT. Are these both container
> environments? If so, what are the details (OS, etc.).

In my case (reported in the OP) it is not a container. I'm running

- ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b)
- CentOS 7.4 (fully updated on 03.02.2017)
- Spectre and Meltdown workarounds disabled (kernel options: noibrs
noibpb nopti)

3x MON/MDS hosts (128GB RAM)
10x OSD hosts, each with 22 HDD + 2 SSD OSDs + 2 NVMe for wal/db (128GB RAM)

ceph is using bluestore
wal and db are separated on NVME devices (1GB wal, 64GB db)

3 pools:
  1: 3 x replicated (all SSD osds): data
  2: 3 x replicated (all SSD osds): metadata pool for EC pool
  3: 6+3 EC pool (all HDD) -> metadata on pool 2

pools are used for cephfs only
# ceph fs ls
name: cephfs, metadata pool: ssd-rep-metadata-pool, data pools:
[hdd-ec-data-pool ssd-rep-data-pool ]
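
For reference, the usable fraction of raw capacity implied by the pool layout above can be worked out as follows. This is only a rough sketch: the function names are made up for illustration, and it ignores bluestore overhead, full ratios and the wal/db devices.

```python
from fractions import Fraction

def replicated_usable(size):
    """Usable fraction of raw capacity for a replicated pool with `size` copies."""
    return Fraction(1, size)

def ec_usable(k, m):
    """Usable fraction of raw capacity for an erasure-coded pool with k data and m coding chunks."""
    return Fraction(k, k + m)

# Pools 1 and 2: 3 x replicated on the SSD OSDs
print("replicated x3:", replicated_usable(3))
# Pool 3: 6+3 erasure coded on the HDD OSDs
print("EC 6+3:", ec_usable(6, 3))
```

A 6+3 profile places 9 chunks per object, so with a per-host failure domain (an assumption, it is not stated above) at least 9 OSD hosts are needed; the 10 hosts listed would satisfy that.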

> 
> Can anyone capture a core file? Please feel free to open a tracker on this.

I have no core file available; none was dumped, and so far I've noticed
just that single segfault.
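
Without a core file, the two frame offsets in the backtrace from the original post are the only pointers left. Below is a minimal sketch that pulls them out of the log text and builds an addr2line command line for each. The binary path /usr/bin/ceph-osd and the helper names are assumptions, and the result is only meaningful if debug symbols matching exactly 12.2.4 are installed. Frame 2 sits at a 0x7f... address, which suggests a shared library rather than the ceph-osd binary, so its offset would have to be resolved against that library instead.

```python
import re

# Matches frames such as " 1: (()+0xa3c611) [0x564585904611]"
FRAME_RE = re.compile(
    r"(\d+):\s*\((.*?)\+(0x[0-9a-fA-F]+)\)\s*\[(0x[0-9a-fA-F]+)\]"
)

def parse_frames(log_text):
    """Return a list of (frame_no, symbol, offset, address) tuples."""
    frames = []
    for line in log_text.splitlines():
        # Drop mail quote markers such as ">>> " before matching
        line = line.lstrip("> ").strip()
        m = FRAME_RE.search(line)
        if m:
            frames.append(
                (int(m.group(1)), m.group(2), int(m.group(3), 16), int(m.group(4), 16))
            )
    return frames

def addr2line_cmd(offset, binary="/usr/bin/ceph-osd"):
    """Build an addr2line invocation; the binary path is an assumption."""
    return ["addr2line", "-C", "-f", "-e", binary, hex(offset)]

sample = """
>>>    1: (()+0xa3c611) [0x564585904611]
>>>     2: (()+0xf5e0) [0x7fd9b66305e0]
"""

for no, symbol, offset, address in parse_frames(sample):
    print(no, " ".join(addr2line_cmd(offset)))
```

The commands are only printed, not executed, so the sketch can be run on any machine and the output pasted onto the affected OSD host.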

Dietmar

> 
>>
>>
>> Thanks
>>
>> Subhachandra
>>
>>
>>
>> On Thu, Mar 8, 2018 at 6:00 AM, Dietmar Rieder <dietmar.rieder@xxxxxxxxxxx>
>> wrote:
>>>
>>> Hi,
>>>
>>> I noticed in my client (using cephfs) logs that an osd was unexpectedly
>>> going down.
>>> While checking the osd logs for the affected OSD I found that the osd
>>> was seg faulting:
>>>
>>> [....]
>>> 2018-03-07 06:01:28.873049 7fd9af370700 -1 *** Caught signal
>>> (Segmentation fault) **
>>>  in thread 7fd9af370700 thread_name:safe_timer
>>>
>>>   ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b)
>>> luminous (stable)
>>>    1: (()+0xa3c611) [0x564585904611]
>>>     2: (()+0xf5e0) [0x7fd9b66305e0]
>>>      NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>> needed to interpret this.
>>> [...]
>>>
>>> Should I open a ticket for this? What additional information is needed?
>>>
>>>
>>> I put the relevant log entries for download under [1], so maybe someone
>>> with more experience can find some useful information therein.
>>>
>>> Thanks
>>>   Dietmar
>>>
>>>
>>> [1] https://expirebox.com/download/6473c34c80e8142e22032469a59df555.html
>>>
>>> --
>>> _________________________________________
>>> D i e t m a r  R i e d e r, Mag.Dr.
>>> Innsbruck Medical University
>>> Biocenter - Division for Bioinformatics
>>> Email: dietmar.rieder@xxxxxxxxxxx
>>> Web:   http://www.icbi.at
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>

-- 
_________________________________________
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Email: dietmar.rieder@xxxxxxxxxxx
Web:   http://www.icbi.at
