Re: OSD crashes create_aligned_in_mempool in 15.2.9 and 14.2.16

Hi Andrej,

I wonder if this is another manifestation of the gperftools-libs v2.8
bug, e.g. https://tracker.ceph.com/issues/49618

If so, there is a fixed (downgraded) version in epel-testing now.
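
To see quickly whether a node is running the suspect build, something
like the sketch below might help. It assumes an RPM-based host where the
package is named gperftools-libs; the helper names (is_affected,
installed_gperftools_versions) are made up for illustration, and matching
on the "2.8" prefix is an assumption based only on the version mentioned
above, so treat the result as a hint, not a diagnosis.

```python
import subprocess

# Assumption: only the 2.8 release named in this thread is treated as suspect.
AFFECTED_VERSION = ("2", "8")


def is_affected(version):
    """Return True if a version string such as '2.8' matches the suspect release."""
    return tuple(version.strip().split(".")[:2]) == AFFECTED_VERSION


def installed_gperftools_versions():
    """Ask rpm for installed gperftools-libs versions; empty list if unavailable."""
    try:
        result = subprocess.run(
            ["rpm", "-q", "--queryformat", "%{VERSION}\n", "gperftools-libs"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        # rpm is not present on this host (not an RPM-based system).
        return []
    if result.returncode != 0:
        # Package not installed, or the query failed.
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    for version in installed_gperftools_versions():
        print(version, "suspect" if is_affected(version) else "ok")
```

Run it on each OSD host; any "suspect" line points at a node worth
checking against the tracker issue.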

Cheers, Dan




On Tue, Mar 9, 2021 at 4:36 PM Andrej Filipcic <andrej.filipcic@xxxxxx> wrote:
>
>
> Hi,
>
> under heavy load our cluster is experiencing frequent OSD crashes. Is
> this a known bug or should I report it? Any workarounds? It looks to be
> highly correlated with memory tuning.
>
> It happens with both nautilus 14.2.16 and octopus 15.2.9. I have forced
> the bitmap allocator for both bluefs and bluestore.
>
> The cluster is ~60 nodes with 256GB RAM and 25Gb NICs, ~1600 OSDs on a
> 100g network. The crashes typically happen when traffic is above 100GB/s.
>
> Best regards,
> Andrej
>
>
>     -14> 2021-03-09T14:10:30.105+0100 7fc128e05700 10 monclient: tick
>     -13> 2021-03-09T14:10:30.105+0100 7fc128e05700 10 monclient:
> _check_auth_rotating have uptodate secrets (they expire after
> 2021-03-09T14:10:00.107344+0100)
>     -12> 2021-03-09T14:10:30.210+0100 7fc119412700  5 osd.209 9539
> heartbeat osd_stat(store_statfs(0xe68762a0000/0x40000000/0xe8d7fc00000,
> data 0x24b195d923/0x24c97c0000, compress 0x0/0x0/0x0, omap 0xf5dca, meta
> 0x3ff0a236), peers
> [6,8,11,21,22,23,24,25,26,27,28,29,34,35,37,38,65,86,87,90,120,128,129,135,136,140,150,153,154,160,184,188,192,193,203,208,210,217,229,233,242,248,254,256,
> 275,277,282,290,311,313,324,326,331,339,348,369,409,411,413,466,477,532,535,538,539,542,544,546,548,552,554,556,558,561,576,580,600,601,604,612,614,624,625,631,657,689,695,704,717,738,739,740,766,790,810,812,833,839,890,891,895,903,909,916,926,927,946,960,965,991,1050,1055,1062,1064,1067,1069,1072,1073,1075,1077,1078,1079,1095,1100,1117,1125,1127,1141,1148,1149,1153,1155,1195,1201,
> 1202,1215,1229,1238,1253,1260,1283,1290,1298,1303,1329,1330,1349,1350,1388,1389,1422,1423,1430,1431,1434,1448,1455,1478,1479,1485,1488,1494,1497,1506,1516,1561,1564,1573,1574,1580]
> op hist [0,0,0,1,3,4,15,24,43,64,102,117])
>     -11> 2021-03-09T14:10:30.468+0100 7fc137cc0700 10 monclient:
> handle_auth_request added challenge on 0x55c08becf800
>     -10> 2021-03-09T14:10:30.543+0100 7fc1374bf700 10 monclient:
> handle_auth_request added challenge on 0x55c08becf400
>      -9> 2021-03-09T14:10:30.712+0100 7fc1384c1700 10 monclient:
> handle_auth_request added challenge on 0x55c08becec00
>      -8> 2021-03-09T14:10:31.029+0100 7fc137cc0700 10 monclient:
> handle_auth_request added challenge on 0x55c08becfc00
>      -7> 2021-03-09T14:10:31.033+0100 7fc12ca33700  5 prioritycache
> tune_memory target: 7264711979 mapped: 1564606464 unmapped: 47874048
> heap: 1612480512 old mem: 5369698813 new mem: 5369698813
>      -6> 2021-03-09T14:10:31.105+0100 7fc128e05700 10 monclient: tick
>      -5> 2021-03-09T14:10:31.105+0100 7fc128e05700 10 monclient:
> _check_auth_rotating have uptodate secrets (they expire after
> 2021-03-09T14:10:01.107451+0100)
>      -4> 2021-03-09T14:10:31.574+0100 7fc1374bf700 10 monclient:
> handle_auth_request added challenge on 0x55c0b69a8000
>      -3> 2021-03-09T14:10:32.036+0100 7fc12ca33700  5 prioritycache
> tune_memory target: 7264711979 mapped: 1708449792 unmapped: 46637056
> heap: 1755086848 old mem: 5369698813 new mem: 5369698813
>      -2> 2021-03-09T14:10:32.106+0100 7fc128e05700 10 monclient: tick
>      -1> 2021-03-09T14:10:32.106+0100 7fc128e05700 10 monclient:
> _check_auth_rotating have uptodate secrets (they expire after
> 2021-03-09T14:10:02.107524+0100)
>       0> 2021-03-09T14:10:32.661+0100 7fc1384c1700 -1 *** Caught signal
> (Aborted) **
>   in thread 7fc1384c1700 thread_name:msgr-worker-0
>
>   ceph version 15.2.9 (357616cbf726abb779ca75a551e8d02568e15b17) octopus
> (stable)
>   1: (()+0x12b20) [0x7fc13cc20b20]
>   2: (gsignal()+0x10f) [0x7fc13b8847ff]
>   3: (abort()+0x127) [0x7fc13b86ec35]
>   4: (()+0x9009b) [0x7fc13c23a09b]
>   5: (()+0x9653c) [0x7fc13c24053c]
>   6: (()+0x96597) [0x7fc13c240597]
>   7: (()+0x967f8) [0x7fc13c2407f8]
>   8: (ceph::buffer::v15_2_0::create_aligned_in_mempool(unsigned int,
> unsigned int, int)+0x229) [0x55c0669dce49]
>   9: (ceph::buffer::v15_2_0::create_aligned(unsigned int, unsigned
> int)+0x26) [0x55c0669dcf46]
>   10: (ceph::buffer::v15_2_0::create_small_page_aligned(unsigned
> int)+0x55) [0x55c0669dd8e5]
>   11: (ProtocolV1::read_message_data_prepare()+0x368) [0x55c066b78ac8]
>   12: (ProtocolV1::read_message_middle()+0x130) [0x55c066b78c90]
>   13: (ProtocolV1::handle_message_front(char*, int)+0x2f4) [0x55c066b79624]
>   14: (()+0xf72bed) [0x55c066b72bed]
>   15: (AsyncConnection::process()+0x8a9) [0x55c066b6fa39]
>   16: (EventCenter::process_events(unsigned int,
> std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l>
>  >*)+0xcb7) [0x55c0669c41b7]
>   17: (()+0xdc979c) [0x55c0669c979c]
>   18: (()+0xc2ba3) [0x7fc13c26cba3]
>   19: (()+0x814a) [0x7fc13cc1614a]
>   20: (clone()+0x43) [0x7fc13b949f23]
>   NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
>
>
> --- logging levels ---
>     0/ 5 none
>     0/ 1 lockdep
>     0/ 1 context
>     1/ 1 crush
>     1/ 5 mds
>     1/ 5 mds_balancer
>     1/ 5 mds_locker
>     1/ 5 mds_log
>     1/ 5 mds_log_expire
>     1/ 5 mds_migrator
>     0/ 1 buffer
>     0/ 1 timer
>     0/ 1 filer
>     0/ 1 striper
>     0/ 1 objecter
>     0/ 5 rados
>     0/ 5 rbd
>     0/ 5 rbd_mirror
>     0/ 5 rbd_replay
>     0/ 5 rbd_rwl
>     0/ 5 journaler
>     0/ 5 objectcacher
>     0/ 5 immutable_obj_cache
>     0/ 5 client
>     1/ 5 osd
>     0/ 5 optracker
>     0/ 5 objclass
>     1/ 3 filestore
>     1/ 3 journal
>     0/ 0 ms
>     1/ 5 mon
>     0/10 monc
>     1/ 5 paxos
>     0/ 5 tp
>     1/ 5 auth
>     1/ 5 crypto
>     1/ 1 finisher
>     1/ 1 reserver
>     1/ 5 heartbeatmap
>     1/ 5 perfcounter
>     1/ 5 rgw
>     1/ 5 rgw_sync
>     1/10 civetweb
>     1/ 5 javaclient
>     1/ 5 asok
>     1/ 1 throttle
>     0/ 0 refs
>     1/ 5 compressor
>     1/ 5 bluestore
>     1/ 5 bluefs
>     1/ 3 bdev
>     1/ 5 kstore
>     4/ 5 rocksdb
>     4/ 5 leveldb
>     4/ 5 memdb
>     1/ 5 fuse
>     1/ 5 mgr
>     1/ 5 mgrc
>     1/ 5 dpdk
>     1/ 5 eventtrace
>     1/ 5 prioritycache
>     0/ 5 test
>    -2/-2 (syslog threshold)
>    -1/-1 (stderr threshold)
> --- pthread ID / name mapping for recent threads ---
>    7fc119412700 / osd_srv_heartbt
>    7fc119c13700 / tp_osd_tp
>    7fc11a414700 / tp_osd_tp
>    7fc11ac15700 / tp_osd_tp
>    7fc11b416700 / tp_osd_tp
>    7fc11bc17700 / tp_osd_tp
>    7fc124c29700 / ms_dispatch
>    7fc125c2b700 / rocksdb:dump_st
>    7fc12822a700 / bstore_kv_sync
>    7fc128e05700 / safe_timer
>    7fc129e07700 / ms_dispatch
>    7fc12ca33700 / bstore_mempool
>    7fc133446700 / safe_timer
>    7fc1374bf700 / msgr-worker-2
>    7fc137cc0700 / msgr-worker-1
>    7fc1384c1700 / msgr-worker-0
>    max_recent     10000
>    max_new         1000
>
> --
> _____________________________________________________________
>     prof. dr. Andrej Filipcic,   E-mail: Andrej.Filipcic@xxxxxx
>     Department of Experimental High Energy Physics - F9
>     Jozef Stefan Institute, Jamova 39, P.o.Box 3000
>     SI-1001 Ljubljana, Slovenia
>     Tel.: +386-1-477-3674    Fax: +386-1-425-7074
> -------------------------------------------------------------
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx


