Re: OSD crashes create_aligned_in_mempool in 15.2.9 and 14.2.16

Thank you for the confirmation. Hopefully the update will be approved in
Bodhi soon, so that new Docker images can be built with the older
version. You can leave feedback here to help it along:
https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2021-0eda4297eb
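
In case it helps anyone triaging other hosts, here is a minimal,
unofficial sketch for flagging the suspect build from a package string
such as the one `rpm -q gperftools-libs` prints. The helper names are
hypothetical, the 2.8 example string is made up for illustration, and
treating every 2.8 build as suspect is an assumption based only on this
thread.

```python
import re
from typing import Optional, Tuple

# Matches name-version-release strings such as
# "gperftools-libs-2.7-8.el8.x86_64" and captures major/minor version.
_NVR_RE = re.compile(r"^gperftools-libs-(\d+)\.(\d+)(?:\.\d+)*-")


def gperftools_version(nvr: str) -> Optional[Tuple[int, int]]:
    """Return (major, minor) from a package string, or None if it does not match."""
    m = _NVR_RE.match(nvr.strip())
    return (int(m.group(1)), int(m.group(2))) if m else None


def is_suspect(nvr: str) -> bool:
    """Assumption for this sketch: any 2.8 build is treated as suspect."""
    return gperftools_version(nvr) == (2, 8)


if __name__ == "__main__":
    # The first string is from this thread; the second is a made-up example.
    for nvr in ("gperftools-libs-2.7-8.el8.x86_64",
                "gperftools-libs-2.8-1.el8.x86_64"):
        print(nvr, "suspect" if is_suspect(nvr) else "ok")
```

Feeding it the real output from each OSD host (or container image) should
show quickly which ones still need the downgraded package.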

On Tue, Mar 9, 2021 at 10:57 AM Andrej Filipcic <andrej.filipcic@xxxxxx> wrote:
>
>
> just confirming, crashes are gone with gperftools-libs-2.7-8.el8.x86_64.rpm
>
> Cheers,
> Andrej
>
> On 09/03/2021 16:52, Andrej Filipcic wrote:
> >
> > Hi,
> >
> > I was checking that bug yesterday, yes, and it smells the same.
> >
> > I will give the EPEL one a try.
> >
> > Thanks
> > Andrej
> >
> > On 09/03/2021 16:44, Dan van der Ster wrote:
> >> Hi Andrej,
> >>
> >> I wonder if this is another manifestation of the gperftools-libs
> >> v2.8 bug, e.g. https://tracker.ceph.com/issues/49618
> >>
> >> If so, there is a fixed (downgraded) version in epel-testing now.
> >>
> >> Cheers, Dan
> >>
> >>
> >>
> >>
> >> On Tue, Mar 9, 2021 at 4:36 PM Andrej Filipcic
> >> <andrej.filipcic@xxxxxx> wrote:
> >>>
> >>> Hi,
> >>>
> >>> Under heavy load our cluster is experiencing frequent OSD crashes. Is
> >>> this a known bug or should I report it? Are there any workarounds? It
> >>> looks to be highly correlated with memory tuning.
> >>>
> >>> It happens with both Nautilus 14.2.16 and Octopus 15.2.9. I have forced
> >>> the bitmap allocator for both BlueFS and BlueStore.
> >>>
> >>> The cluster is ~60 nodes with 256GB RAM and 25Gb NICs, ~1600 OSDs on a
> >>> 100g network. Typically it happens when the traffic is above
> >>> 100GB/s.
> >>>
> >>> Best regards,
> >>> Andrej
> >>>
> >>>
> >>>      -14> 2021-03-09T14:10:30.105+0100 7fc128e05700 10 monclient: tick
> >>>      -13> 2021-03-09T14:10:30.105+0100 7fc128e05700 10 monclient:
> >>> _check_auth_rotating have uptodate secrets (they expire after
> >>> 2021-03-09T14:10:00.107344+0100)
> >>>      -12> 2021-03-09T14:10:30.210+0100 7fc119412700  5 osd.209 9539
> >>> heartbeat osd_stat(store_statfs(0xe68762a0000/0x40000000/0xe8d7fc00000,
> >>> data 0x24b195d923/0x24c97c0000, compress 0x0/0x0/0x0, omap 0xf5dca,
> >>> meta 0x3ff0a236), peers
> >>> [6,8,11,21,22,23,24,25,26,27,28,29,34,35,37,38,65,86,87,90,120,128,129,135,136,140,150,153,154,160,184,188,192,193,203,208,210,217,229,233,242,248,254,256,
> >>> 275,277,282,290,311,313,324,326,331,339,348,369,409,411,413,466,477,532,535,538,539,542,544,546,548,552,554,556,558,561,576,580,600,601,604,612,614,624,625,631,657,689,695,704,717,738,739,740,766,790,810,812,833,839,890,891,895,903,909,916,926,927,946,960,965,991,1050,1055,1062,1064,1067,1069,1072,1073,1075,1077,1078,1079,1095,1100,1117,1125,1127,1141,1148,1149,1153,1155,1195,1201,
> >>> 1202,1215,1229,1238,1253,1260,1283,1290,1298,1303,1329,1330,1349,1350,1388,1389,1422,1423,1430,1431,1434,1448,1455,1478,1479,1485,1488,1494,1497,1506,1516,1561,1564,1573,1574,1580]
> >>> op hist [0,0,0,1,3,4,15,24,43,64,102,117])
> >>>      -11> 2021-03-09T14:10:30.468+0100 7fc137cc0700 10 monclient:
> >>> handle_auth_request added challenge on 0x55c08becf800
> >>>      -10> 2021-03-09T14:10:30.543+0100 7fc1374bf700 10 monclient:
> >>> handle_auth_request added challenge on 0x55c08becf400
> >>>       -9> 2021-03-09T14:10:30.712+0100 7fc1384c1700 10 monclient:
> >>> handle_auth_request added challenge on 0x55c08becec00
> >>>       -8> 2021-03-09T14:10:31.029+0100 7fc137cc0700 10 monclient:
> >>> handle_auth_request added challenge on 0x55c08becfc00
> >>>       -7> 2021-03-09T14:10:31.033+0100 7fc12ca33700  5 prioritycache
> >>> tune_memory target: 7264711979 mapped: 1564606464 unmapped: 47874048
> >>> heap: 1612480512 old mem: 5369698813 new mem: 5369698813
> >>>       -6> 2021-03-09T14:10:31.105+0100 7fc128e05700 10 monclient: tick
> >>>       -5> 2021-03-09T14:10:31.105+0100 7fc128e05700 10 monclient:
> >>> _check_auth_rotating have uptodate secrets (they expire after
> >>> 2021-03-09T14:10:01.107451+0100)
> >>>       -4> 2021-03-09T14:10:31.574+0100 7fc1374bf700 10 monclient:
> >>> handle_auth_request added challenge on 0x55c0b69a8000
> >>>       -3> 2021-03-09T14:10:32.036+0100 7fc12ca33700  5 prioritycache
> >>> tune_memory target: 7264711979 mapped: 1708449792 unmapped: 46637056
> >>> heap: 1755086848 old mem: 5369698813 new mem: 5369698813
> >>>       -2> 2021-03-09T14:10:32.106+0100 7fc128e05700 10 monclient: tick
> >>>       -1> 2021-03-09T14:10:32.106+0100 7fc128e05700 10 monclient:
> >>> _check_auth_rotating have uptodate secrets (they expire after
> >>> 2021-03-09T14:10:02.107524+0100)
> >>>        0> 2021-03-09T14:10:32.661+0100 7fc1384c1700 -1 *** Caught signal (Aborted) **
> >>>    in thread 7fc1384c1700 thread_name:msgr-worker-0
> >>>
> >>>    ceph version 15.2.9 (357616cbf726abb779ca75a551e8d02568e15b17) octopus (stable)
> >>>    1: (()+0x12b20) [0x7fc13cc20b20]
> >>>    2: (gsignal()+0x10f) [0x7fc13b8847ff]
> >>>    3: (abort()+0x127) [0x7fc13b86ec35]
> >>>    4: (()+0x9009b) [0x7fc13c23a09b]
> >>>    5: (()+0x9653c) [0x7fc13c24053c]
> >>>    6: (()+0x96597) [0x7fc13c240597]
> >>>    7: (()+0x967f8) [0x7fc13c2407f8]
> >>>    8: (ceph::buffer::v15_2_0::create_aligned_in_mempool(unsigned int, unsigned int, int)+0x229) [0x55c0669dce49]
> >>>    9: (ceph::buffer::v15_2_0::create_aligned(unsigned int, unsigned int)+0x26) [0x55c0669dcf46]
> >>>    10: (ceph::buffer::v15_2_0::create_small_page_aligned(unsigned int)+0x55) [0x55c0669dd8e5]
> >>>    11: (ProtocolV1::read_message_data_prepare()+0x368) [0x55c066b78ac8]
> >>>    12: (ProtocolV1::read_message_middle()+0x130) [0x55c066b78c90]
> >>>    13: (ProtocolV1::handle_message_front(char*, int)+0x2f4) [0x55c066b79624]
> >>>    14: (()+0xf72bed) [0x55c066b72bed]
> >>>    15: (AsyncConnection::process()+0x8a9) [0x55c066b6fa39]
> >>>    16: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x55c0669c41b7]
> >>>    17: (()+0xdc979c) [0x55c0669c979c]
> >>>    18: (()+0xc2ba3) [0x7fc13c26cba3]
> >>>    19: (()+0x814a) [0x7fc13cc1614a]
> >>>    20: (clone()+0x43) [0x7fc13b949f23]
> >>>    NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> >>> needed to interpret this.
> >>>
> >>>
> >>> --- logging levels ---
> >>>      0/ 5 none
> >>>      0/ 1 lockdep
> >>>      0/ 1 context
> >>>      1/ 1 crush
> >>>      1/ 5 mds
> >>>      1/ 5 mds_balancer
> >>>      1/ 5 mds_locker
> >>>      1/ 5 mds_log
> >>>      1/ 5 mds_log_expire
> >>>      1/ 5 mds_migrator
> >>>      0/ 1 buffer
> >>>      0/ 1 timer
> >>>      0/ 1 filer
> >>>      0/ 1 striper
> >>>      0/ 1 objecter
> >>>      0/ 5 rados
> >>>      0/ 5 rbd
> >>>      0/ 5 rbd_mirror
> >>>      0/ 5 rbd_replay
> >>>      0/ 5 rbd_rwl
> >>>      0/ 5 journaler
> >>>      0/ 5 objectcacher
> >>>      0/ 5 immutable_obj_cache
> >>>      0/ 5 client
> >>>      1/ 5 osd
> >>>      0/ 5 optracker
> >>>      0/ 5 objclass
> >>>      1/ 3 filestore
> >>>      1/ 3 journal
> >>>      0/ 0 ms
> >>>      1/ 5 mon
> >>>      0/10 monc
> >>>      1/ 5 paxos
> >>>      0/ 5 tp
> >>>      1/ 5 auth
> >>>      1/ 5 crypto
> >>>      1/ 1 finisher
> >>>      1/ 1 reserver
> >>>      1/ 5 heartbeatmap
> >>>      1/ 5 perfcounter
> >>>      1/ 5 rgw
> >>>      1/ 5 rgw_sync
> >>>      1/10 civetweb
> >>>      1/ 5 javaclient
> >>>      1/ 5 asok
> >>>      1/ 1 throttle
> >>>      0/ 0 refs
> >>>      1/ 5 compressor
> >>>      1/ 5 bluestore
> >>>      1/ 5 bluefs
> >>>      1/ 3 bdev
> >>>      1/ 5 kstore
> >>>      4/ 5 rocksdb
> >>>      4/ 5 leveldb
> >>>      4/ 5 memdb
> >>>      1/ 5 fuse
> >>>      1/ 5 mgr
> >>>      1/ 5 mgrc
> >>>      1/ 5 dpdk
> >>>      1/ 5 eventtrace
> >>>      1/ 5 prioritycache
> >>>      0/ 5 test
> >>>     -2/-2 (syslog threshold)
> >>>     -1/-1 (stderr threshold)
> >>> --- pthread ID / name mapping for recent threads ---
> >>>     7fc119412700 / osd_srv_heartbt
> >>>     7fc119c13700 / tp_osd_tp
> >>>     7fc11a414700 / tp_osd_tp
> >>>     7fc11ac15700 / tp_osd_tp
> >>>     7fc11b416700 / tp_osd_tp
> >>>     7fc11bc17700 / tp_osd_tp
> >>>     7fc124c29700 / ms_dispatch
> >>>     7fc125c2b700 / rocksdb:dump_st
> >>>     7fc12822a700 / bstore_kv_sync
> >>>     7fc128e05700 / safe_timer
> >>>     7fc129e07700 / ms_dispatch
> >>>     7fc12ca33700 / bstore_mempool
> >>>     7fc133446700 / safe_timer
> >>>     7fc1374bf700 / msgr-worker-2
> >>>     7fc137cc0700 / msgr-worker-1
> >>>     7fc1384c1700 / msgr-worker-0
> >>>     max_recent     10000
> >>>     max_new         1000
> >>>
> >>> --
> >>> _____________________________________________________________
> >>>      prof. dr. Andrej Filipcic,   E-mail: Andrej.Filipcic@xxxxxx
> >>>      Department of Experimental High Energy Physics - F9
> >>>      Jozef Stefan Institute, Jamova 39, P.o.Box 3000
> >>>      SI-1001 Ljubljana, Slovenia
> >>>      Tel.: +386-1-477-3674    Fax: +386-1-425-7074
> >>> -------------------------------------------------------------
> >>> _______________________________________________
> >>> ceph-users mailing list -- ceph-users@xxxxxxx
> >>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >
> >
>
>


