Fwd: OSDs crash randomly

From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: Thursday, 17 February 2022 16:01
To: Wissem MIMOUNA <wissem.mimouna@xxxxxxxxxxxxxxxx>
Subject: Re: OSDs crash randomly

Wissem,

unfortunately there is no way to learn whether zombies have appeared other than running fsck. But I think this can be performed on a weekly or even monthly basis; in my experience, getting 32K zombies is a pretty rare case. But it's definitely more reliable if you collect those statistics from the cluster yourself...
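
As a complement to fsck (not a replacement: it cannot count zombies), the crash reports themselves carry a recognisable signature once the limit has been hit. The sketch below only tallies which OSDs have already aborted with it, using the `assert_func`, `assert_msg` and `entity_name` fields that appear in the crash report quoted later in this thread. It assumes each report was obtained as a dict, for example by parsing the JSON printed by `ceph crash info <id>`; the function names are hypothetical.

```python
from typing import Dict, Iterable, List

# Strings taken from the crash report quoted later in this thread.
ASSERT_FUNC = "allocate_spanning_blob_id"
ASSERT_TEXT = "no available blob id"


def is_blob_id_exhaustion(report: Dict) -> bool:
    """True if a crash report matches the spanning-blob-id abort."""
    return (ASSERT_FUNC in report.get("assert_func", "")
            and ASSERT_TEXT in report.get("assert_msg", ""))


def affected_osds(reports: Iterable[Dict]) -> List[str]:
    """Sorted, de-duplicated entity names of OSDs that hit the abort."""
    return sorted({r.get("entity_name", "") for r in reports
                   if is_blob_id_exhaustion(r)})


# Hand-written sample shaped like the quoted report (values abbreviated).
sample = {
    "entity_name": "osd.x",
    "assert_func": "bid_t BlueStore::ExtentMap::allocate_spanning_blob_id()",
    "assert_msg": "... ceph_abort_msg(\"no available blob id\")\n",
}
```

OSDs returned by `affected_osds` are natural first candidates for an offline fsck/repair; an empty result does not mean an OSD is free of zombies.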



Thanks,

Igor
On 2/17/2022 5:43 PM, Wissem MIMOUNA wrote:
Hi Igor,

Thank you very much, this helped us understand the root cause, and we hope to get a fix soon (with a new Ceph release).
In the meantime, do you have any idea how to periodically check for zombie spanning blobs (before running fsck/repair)? It would be nice for us to automate this.

Have a good day
Best Regards

From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: Thursday, 17 February 2022 11:59
To: Wissem MIMOUNA <wissem.mimouna@xxxxxxxxxxxxxxxx>; ceph-users@xxxxxxx
Subject: Re: OSDs crash randomly

Hi Wissem,

first of all, the bug wasn't fixed by the PR you're referring to; it just added additional log output when the problem is detected.

Unfortunately the bug isn't fixed yet, as the root cause of the zombie spanning blobs' appearance is still unclear. The relevant ticket is
https://tracker.ceph.com/issues/48216





There is a workaround though: ceph-bluestore-tool's repair command detects zombie spanning blobs and removes them, which should eliminate the assertion for a while.

I'd recommend running fsck/repair periodically, as it looks like your cluster is exposed to the problem and the zombies would likely come back. It's crucial to keep their number below 32K per PG to avoid the assertion.
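
A minimal sketch of how such a periodic run could be scripted for one OSD follows. It rests on several assumptions that are not stated in this thread: a package-based (non-containerised) deployment with `ceph-osd@<id>` systemd units, the default data path `/var/lib/ceph/osd/ceph-<id>`, and setting `noout` around the downtime. The tool needs the OSD to be stopped, so the sketch handles one OSD at a time. The function names are hypothetical, and `dry_run=True` only prints the commands.

```python
import subprocess
from typing import List, Tuple

OSD_DATA_ROOT = "/var/lib/ceph/osd"  # assumption: default package layout

Cmd = List[str]


def maintenance_commands(osd_id: int, repair: bool = False,
                         cluster: str = "ceph") -> Tuple[List[Cmd], Cmd, List[Cmd]]:
    """Return (commands before, tool command, commands after) for one OSD."""
    path = f"{OSD_DATA_ROOT}/{cluster}-{osd_id}"
    action = "repair" if repair else "fsck"
    before = [
        ["ceph", "osd", "set", "noout"],             # avoid rebalancing meanwhile
        ["systemctl", "stop", f"ceph-osd@{osd_id}"],  # the tool needs the OSD offline
    ]
    tool = ["ceph-bluestore-tool", action, "--path", path]
    after = [
        ["systemctl", "start", f"ceph-osd@{osd_id}"],
        ["ceph", "osd", "unset", "noout"],
    ]
    return before, tool, after


def run_maintenance(osd_id: int, repair: bool = False,
                    dry_run: bool = True) -> int:
    """Run fsck (or repair) on one stopped OSD; return the tool's exit code."""
    before, tool, after = maintenance_commands(osd_id, repair)
    if dry_run:
        for cmd in [*before, tool, *after]:
            print(" ".join(cmd))
        return 0
    try:
        for cmd in before:
            subprocess.run(cmd, check=True)
        return subprocess.run(tool).returncode
    finally:
        # Always try to bring the OSD back and clear the flag.
        for cmd in after:
            subprocess.run(cmd, check=False)
```

Calling `run_maintenance(3)` prints the fsck sequence for a hypothetical osd.3 without touching anything; `run_maintenance(3, repair=True, dry_run=False)` would actually execute the repair. A cron job or systemd timer could walk through the OSDs weekly or monthly.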





Thanks,



Igor



On 2/17/2022 1:41 PM, Wissem MIMOUNA wrote:

> Dear all,
>
> Some OSDs on our Ceph cluster crash with no explanation.
> Stopping and starting the crashed OSD daemon fixed the issue, but this has happened a few times and I need to understand the reason.
> For your information, the error has been fixed by the log change in the Octopus release (https://github.com/ceph/ceph/pull/27911).
> Below are the logs related to the crash:
>

>      "process_name": "ceph-osd",
>      "entity_name": "osd.x",
>      "ceph_version": "15.2.15",
>      "utsname_hostname": "",
>      "utsname_sysname": "Linux",
>      "utsname_release": "4.15.0-162-generic",
>      "utsname_version":
>      "utsname_machine": "x86_64",
>      "os_name": "Ubuntu",
>      "os_id": "ubuntu",
>      "os_version_id": "18.04",
>      "os_version": "18.04.6 LTS (Bionic Beaver)",
>      "assert_condition": "abort",
>      "assert_func": "bid_t BlueStore::ExtentMap::allocate_spanning_blob_id()",
>      "assert_file": "/build/ceph-15.2.15/src/os/bluestore/BlueStore.cc",
>      "assert_line": 2664,
>      "assert_thread_name": "tp_osd_tp",
>      "assert_msg": "/build/ceph-15.2.15/src/os/bluestore/BlueStore.cc: In function 'bid_t BlueStore::ExtentMap::allocate_spanning_blob_id()' thread 7f6d37800700 time 2022-02-17T09:41:55.108101+0100\n/build/ceph-15.2.15/src/os/bluestore/BlueStore.cc: 2664: ceph_abort_msg(\"no available blob id\")\n",
>      "backtrace": [
>          "(()+0x12980) [0x7f6d59516980]",
>          "(gsignal()+0xc7) [0x7f6d581c8fb7]",
>          "(abort()+0x141) [0x7f6d581ca921]",
>          "(ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b2) [0x55ddc61f245f]",
>          "(BlueStore::ExtentMap::allocate_spanning_blob_id()+0x104) [0x55ddc674b594]",
>          "(BlueStore::ExtentMap::reshard(KeyValueDB*, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x1408) [0x55ddc674c9c8]",
>          "(BlueStore::_record_onode(boost::intrusive_ptr<BlueStore::Onode>&, std::shared_ptr<KeyValueDB::TransactionImpl>&)+0x91c) [0x55ddc674f4ec]",
>          "(BlueStore::_txc_write_nodes(BlueStore::TransContext*, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x7e) [0x55ddc6751b4e]",
>          "(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x2fc) [0x55ddc677892c]",
>          "(non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x54) [0x55ddc63eef44]",
>          "(ECBackend::handle_sub_write(pg_shard_t, boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&)+0x9cd) [0x55ddc65cb95d]",
>          "(ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x23d) [0x55ddc65e3c2d]",
>          "(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x97) [0x55ddc643b157]",
>          "(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x6fd) [0x55ddc63ddddd]",
>          "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x17b) [0x55ddc62618bb]",
>          "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x67) [0x55ddc64bf167]",
>          "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90c) [0x55ddc627ef4c]",
>          "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x55ddc68d1d0c]",
>          "(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55ddc68d4f60]",
>          "(()+0x76db) [0x7f6d5950b6db]",
>          "(clone()+0x3f) [0x7f6d582ab71f]"
>      ]

>

>

> Best Regards


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



