Re: osd crash: Caught signal (Aborted) thread_name:tp_osd_tp

Milan Kupcevic <milan_kupcevic@xxxxxxxxxxx> · Tue, 24 Nov 2020 22:17:14 -0500

Hi Igor,

Thank you for the quick and useful answer. We are looking at our options.

Milan

On 2020-11-24 06:49, Igor Fedotov wrote:
> Another workaround would be to delete the object in question using
> ceph-objectstore-tool and then do a scrub on the corresponding PG to fix
> the absent object.
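
For reference, the workaround quoted above could be sketched roughly as the dry-run script below. It only builds and prints the command lines and executes nothing. The OSD data path, the PG ids, and the object spec placeholder are assumptions for illustration, not values taken from this cluster; the object name comes from the _dump_onode output further down. ceph-objectstore-tool works on an offline object store, so the OSD must be stopped first, and the exact object spec should be copied from the output of the list step.

```python
import shlex


def workaround_commands(osd_data_path, shard_pgid, pgid, object_name, object_spec):
    """Build (but do not run) the commands for the delete-then-scrub workaround.

    All arguments are supplied by the operator; the values used below are
    hypothetical placeholders.
    """
    tool = ["ceph-objectstore-tool", "--data-path", osd_data_path]
    return [
        # 1. With the OSD stopped, look up the object's spec on this OSD.
        tool + ["--op", "list", object_name],
        # 2. Remove the object from this OSD's shard of the PG.
        tool + ["--pgid", shard_pgid, object_spec, "remove"],
        # 3. Once the OSDs are back up, scrub the PG so the missing
        #    object is detected.
        ["ceph", "pg", "scrub", pgid],
    ]


if __name__ == "__main__":
    commands = workaround_commands(
        osd_data_path="/var/lib/ceph/osd/ceph-0",   # assumed default path, hypothetical OSD id
        shard_pgid="<pgid>s4",                      # placeholder: EC shard 4 of the affected PG
        pgid="<pgid>",                              # placeholder: the affected PG
        object_name="rbd_data.6.a8a8356fd674f.00000000003dce34",
        object_spec="<spec printed by --op list>",  # placeholder, copy from step 1
    )
    for argv in commands:
        print(shlex.join(argv))
```

Steps 1 and 2 would have to be repeated on each OSD that crashes on the object, each with its own data path and shard suffix. Exporting the PG with ceph-objectstore-tool before removing anything would keep a way back.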
> 
> But I would greatly appreciate it if we could dissect this case a bit...
> 
> 
> On 11/24/2020 9:55 AM, Milan Kupcevic wrote:
>> Hello,
>>
>> Three OSD daemons crash at the same time while processing the same
>> object located in an rbd ec4+2 pool, leaving a placement group in an
>> inactive, down state. Soon after I start the OSD daemons back up, they
>> crash again, choking on the same object.
>>
>> ----------------------------8<------------------------------------
>> _dump_onode 0x5605a27ca000
>> 4#7:8565da11:::rbd_data.6.a8a8356fd674f.00000000003dce34:head# nid
>> 1889617 size 0x100000 (1048576) expected_object_size 0
>> expected_write_size 0 in 8 shards, 32768 spanning blobs
>> ----------------------------8<------------------------------------
>>
>> Please take a look at the attached log file.
>>
>>
>> Ceph status reports:
>>
>> Reduced data availability: 1 pg inactive, 1 pg down
>>
>>
>> Any hints on how to get this placement group back online would be
>> greatly appreciated.
>>
>>
>> Milan
>>
>>

-- 
Milan Kupcevic
Senior Cyberinfrastructure Engineer at Project NESE
Harvard University
FAS Research Computing