Re: 16.2.7 pacific rocksdb Corruption


 



On 12/20/21 13:14, Igor Fedotov wrote:

On 12/20/2021 2:58 PM, Andrej Filipcic wrote:
On 12/20/21 12:47, Igor Fedotov wrote:

Thanks for the info.

Just in case - is write caching disabled for the disk in question? What's the output for "hdparm -W </path-to-disk-dev>" ?

no, it is enabled. Shall I disable that on all OSDs?

I can't tell you for sure whether this is the root cause. Generally, upstream recommends disabling write caching because of multiple performance issues we have observed. I don't recall any report about data corruption, though, but I can still imagine something like that. On the other hand, as far as I could see from the initial log there was no node reboot/shutdown during the upgrade, so hardware write caching is unlikely to be involved. Am I right that there was no node shutdown in your case?
yes, only the ceph services were restarted.

And it would be an interesting experiment to see whether the data corruption is indeed related. So it would be great if you could test that...
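
For reference, checking and toggling the volatile write cache with hdparm looks roughly like this; /dev/sdX is just a placeholder for the data device behind an OSD, and the setting does not survive a reboot unless reapplied (e.g. via a udev rule):

  hdparm -W /dev/sdX     # show the current write-cache setting
  hdparm -W 0 /dev/sdX   # disable the volatile write cache
  hdparm -W 1 /dev/sdX   # re-enable it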


One more question, please: is this a bare-metal deployment or a containerized (Rook?) one?
bare metal on RHEL 8.3, with a 5.11.4 elrepo kernel that needs updating at some point. It was initially deployed with ceph-ansible.

And I presume an OSD restart is a rare event in your cluster, isn't it? That's why you probably haven't faced the issue before...
OSD restarts are rare; they have been running since September. Well, I had several crashes in the meantime, and some of them also caused corruption.

Actually, when updating to 16.2.7, many OSDs had this crash but recovered OK:

[root@lcst0001 ~]# ceph crash info 2021-12-20T05:28:07.001230Z_bd286ae8-7867-4040-89c6-8d2de7794a76
{
   "assert_condition": "(sharded_in_flight_list.back())->ops_in_flight_sharded.empty()",    "assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigant
ic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/common/TrackedOp.cc",
   "assert_func": "OpTracker::~OpTracker()",
   "assert_line": 173,
   "assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/giganti c/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/common/TrackedOp.cc: In function 'OpTracker::~OpTracker()' thread 7fbd3373e080 time 2021-12-20T06:28:06.99380
7+0100\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/1
6.2.6/rpm/el8/BUILD/ceph-16.2.6/src/common/TrackedOp.cc: 173: FAILED ceph_assert((sharded_in_flight_list.back())->ops_in_flight_sharded.empty())\n",
   "assert_thread_name": "ceph-osd",
   "backtrace": [
       "/lib64/libpthread.so.0(+0x12b20) [0x7fbd316ecb20]",
       "gsignal()",
       "abort()",
       "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x5593fc36a59d]",
       "/usr/bin/ceph-osd(+0x56a766) [0x5593fc36a766]",
       "(OpTracker::~OpTracker()+0x39) [0x5593fc755559]",
       "(OSD::~OSD()+0x304) [0x5593fc4a9994]",
       "(OSD::~OSD()+0xd) [0x5593fc4a9b5d]",
       "main()",
       "__libc_start_main()",
       "_start()"
   ],
   "ceph_version": "16.2.6",
   "crash_id": "2021-12-20T05:28:07.001230Z_bd286ae8-7867-4040-89c6-8d2de7794a76",
   "entity_name": "osd.1396",
   "os_id": "rhel",
   "os_name": "Red Hat Enterprise Linux",
   "os_version": "8.3 (Ootpa)",
   "os_version_id": "8.3",
   "process_name": "ceph-osd",
   "stack_sig": "d247f79a887d3f92ed5377a4aabc407a8e8ab4392f99134800755c6450b8ce6f",
   "timestamp": "2021-12-20T05:28:07.001230Z",
   "utsname_hostname": "lcst0057",
   "utsname_machine": "x86_64",
   "utsname_release": "5.11.4-1.el8.elrepo.x86_64",
   "utsname_sysname": "Linux",
   "utsname_version": "#1 SMP Sun Mar 7 08:41:44 EST 2021"
}
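
For anyone following along: the crash IDs passed to "ceph crash info" above come from the cluster's crash module. A minimal way to review them, with <id> standing for a crash ID like the one shown:

  ceph crash ls            # list recorded crashes and their IDs
  ceph crash info <id>     # full report for a single crash
  ceph crash archive-all   # acknowledge the reports once reviewed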


Another series of crashes appeared when I disabled scrubbing, and some of the OSDs had to be reinitialized afterwards:

[root@lcst0001 ~]# ceph crash info 2021-12-04T14:19:28.102548Z_9b2606a1-a334-4a97-95ec-169cd013bf0b
{
   "assert_condition": "state_cast<const NotActive*>()",
   "assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigant
ic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/scrub_machine.cc",
   "assert_func": "void Scrub::ScrubMachine::assert_not_active() const",
   "assert_line": 55,
   "assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/giganti c/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/scrub_machine.cc: In function 'void Scrub::ScrubMachine::assert_not_active() const' thread 7fcde174e700 t ime 2021-12-04T15:19:28.092559+0100\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MA CHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/scrub_machine.cc: 55: FAILED ceph_assert(state_cast<const NotActive*>())\n",
   "assert_thread_name": "tp_osd_tp",
   "backtrace": [
       "/lib64/libpthread.so.0(+0x12b20) [0x7fce05769b20]",
       "gsignal()",
       "abort()",
       "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x563f50b6a59d]",
       "/usr/bin/ceph-osd(+0x56a766) [0x563f50b6a766]",
       "/usr/bin/ceph-osd(+0x9e4dcf) [0x563f50fe4dcf]",
       "(PgScrubber::replica_scrub_op(boost::intrusive_ptr<OpRequest>)+0x4bf) [0x563f50fd530f]",
       "(PG::replica_scrub(boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x62) [0x563f50d209e2]",        "(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x7bb) [0x563f50de5f4b]",        "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x563f50c6f1b9]",        "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x563f50ecc868]",        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa58) [0x563f50c8f1e8]",        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x563f512fa6c4]",        "(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x563f512fd364]",
       "/lib64/libpthread.so.0(+0x814a) [0x7fce0575f14a]",
       "clone()"
   ],
   "ceph_version": "16.2.6",
   "crash_id": "2021-12-04T14:19:28.102548Z_9b2606a1-a334-4a97-95ec-169cd013bf0b",
   "entity_name": "osd.209",
   "os_id": "rhel",
   "os_name": "Red Hat Enterprise Linux",
   "os_version": "8.3 (Ootpa)",
   "os_version_id": "8.3",
   "process_name": "ceph-osd",
   "stack_sig": "42f4a4f71dfb4c78153e86327c6b3213f94806652d846a9002bd7ebcb05552cf",
   "timestamp": "2021-12-04T14:19:28.102548Z",
   "utsname_hostname": "lcst0007",
   "utsname_machine": "x86_64",
   "utsname_release": "5.11.4-1.el8.elrepo.x86_64",
   "utsname_sysname": "Linux",
   "utsname_version": "#1 SMP Sun Mar 7 08:41:44 EST 2021"
}
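
For context, "disabled scrubbing" above presumably means the cluster-wide scrub flags, and "reinitialized" means redeploying the affected OSD from scratch. A rough sketch only; OSD IDs and device names are placeholders, and the exact redeploy steps on a ceph-ansible-deployed cluster may differ:

  ceph osd set noscrub           # stop regular scrubs cluster-wide
  ceph osd set nodeep-scrub      # stop deep scrubs cluster-wide

  # redeploying a corrupted OSD, roughly:
  systemctl stop ceph-osd@<id>
  ceph osd destroy <id> --yes-i-really-mean-it
  ceph-volume lvm zap --destroy --osd-id <id>
  ceph-volume lvm create --osd-id <id> --data /dev/sdX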




Best regards,
Andrej

Thanks in advance,



--
_____________________________________________________________
   prof. dr. Andrej Filipcic,   E-mail:Andrej.Filipcic@xxxxxx
   Department of Experimental High Energy Physics - F9
   Jozef Stefan Institute, Jamova 39, P.o.Box 3000
   SI-1001 Ljubljana, Slovenia
   Tel.: +386-1-477-3674    Fax: +386-1-477-3166
-------------------------------------------------------------
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



