Hi Dan,

forget what I wrote. I forgot the "-a" option for ulimit. It's still limited to 1024. I'm too tired to start a new test now. I will report back tomorrow afternoon/evening.

Thanks for your hint and sorry for the many mails.

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Tuesday, January 14, 2025 9:11 PM
To: Dan van der Ster
Cc: ceph-users@xxxxxxx
Subject: Re: MDS crashing on startup

Hi Dan,

I was celebrating too early. Applying our tuned profile results in:

# sudo -u ceph ulimit
unlimited
# sysctl fs.file-max
fs.file-max = 26234859

Still, the MDS aborts in exactly the same way:

   -88> 2025-01-14T14:57:54.511-0500 7f8a88613700  0 log_channel(cluster) log [DBG] : reconnect by client.426286062 v1:192.168.58.69:0/550867185 after 1.01109
   -87> 2025-01-14T14:57:54.511-0500 7f8a8be1a700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.15/rpm/el8/BUILD/ceph-16.2.15/src/msg/async/AsyncMessenger.cc: In function 'void Processor::accept()' thread 7f8a8be1a700 time 2025-01-14T14:57:54.510412-0500
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.15/rpm/el8/BUILD/ceph-16.2.15/src/msg/async/AsyncMessenger.cc: 214: ceph_abort_msg("abort() called")

 ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x7f8a91068904]
 2: (Processor::accept()+0x862) [0x7f8a9135b502]
 3: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x7f8a913b0b87]
 4: /usr/lib64/ceph/libceph-common.so.2(+0x5c90bc) [0x7f8a913b70bc]
 5: /lib64/libstdc++.so.6(+0xc2b23) [0x7f8a8f47ab23]
 6: /lib64/libpthread.so.0(+0x81ca) [0x7f8a900451ca]
 7: clone()

   -86> 2025-01-14T14:57:54.511-0500 7f8a88613700  0 log_channel(cluster) log [DBG] : reconnect by client.425227337 v1:192.168.57.49:0/394329910 after 1.01109
[...]
    -1> 2025-01-14T14:57:54.511-0500 7f8a88613700  0 log_channel(cluster) log [DBG] : reconnect by client.425644021 v1:192.168.58.8:0/3860392786 after 1.01109
     0> 2025-01-14T14:57:54.512-0500 7f8a8be1a700 -1 *** Caught signal (Aborted) **
 in thread 7f8a8be1a700 thread_name:msgr-worker-0

 ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12d10) [0x7f8a9004fd10]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x7f8a910689d5]
 5: (Processor::accept()+0x862) [0x7f8a9135b502]
 6: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x7f8a913b0b87]
 7: /usr/lib64/ceph/libceph-common.so.2(+0x5c90bc) [0x7f8a913b70bc]
 8: /lib64/libstdc++.so.6(+0xc2b23) [0x7f8a8f47ab23]
 9: /lib64/libpthread.so.0(+0x81ca) [0x7f8a900451ca]
 10: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Are there other places where abort is called? Could it be a signal from another process?

Thanks for helping!

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx