Dear All,
We are having problems with a critical osd crashing on a Nautilus
(14.2.8) cluster.
This is a critical failure, as the osd is part of a pg that is otherwise
"down+remapped" due to other osds crashing. We were hoping the pg would
repair itself, as there are plenty of free osds, but for some reason it
never managed to get out of an undersized state.
The osd starts OK, runs for a few minutes, then crashes with an assert
immediately after trying to backfill the "down+remapped" pg:
-7> 2020-03-23 15:28:15.368 7f15aeea8700 5 osd.287 pg_epoch: 35531
pg[5.750s2( v 35398'3381328 (35288'3378238,35398'3381328]
local-lis/les=35530/35531 n=190408 ec=1821/1818 lis/c 35530/22903
les/c/f 35531/22917/0 35486/35530/35530)
[234,354,304,388,125,25,427,226,77,154]/[2147483647,2147483647,287,388,125,25,427,226,77,154]p287(2)
backfill=[234(0),304(2),354(1)] r=2 lpr=35530 pi=[22903,35530)/9 rops=1
crt=35398'3381328 lcod 0'0 mlcod 0'0
active+undersized+degraded+remapped+backfilling mbc={} trimq=112 ps=121]
backfill_pos is 5:0ae00653:::1000e49a8c6.000000d3:head
-6> 2020-03-23 15:28:15.381 7f15cc9ec700 10 monclient:
get_auth_request con 0x555b2f229800 auth_method 0
-5> 2020-03-23 15:28:15.381 7f15b86bb700 2 osd.287 35531
ms_handle_reset con 0x555b2fef7400 session 0x555b2f363600
-4> 2020-03-23 15:28:15.391 7f15c04c5700 5 prioritycache
tune_memory target: 4294967296 mapped: 805339136 unmapped: 1032192 heap:
806371328 old mem: 2845415832 new mem: 2845415832
-3> 2020-03-23 15:28:15.420 7f15cc9ec700 10 monclient:
get_auth_request con 0x555b2fef7800 auth_method 0
-2> 2020-03-23 15:28:15.420 7f15b86bb700 2 osd.287 35531
ms_handle_reset con 0x555b2fef7c00 session 0x555b2f363c00
-1> 2020-03-23 15:28:15.476 7f15aeea8700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.8/rpm/el7/BUILD/ceph-14.2.8/src/osd/osd_types.cc:
In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread
7f15aeea8700 time 2020-03-23 15:28:15.470166
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.8/rpm/el7/BUILD/ceph-14.2.8/src/osd/osd_types.cc:
5443: FAILED ceph_assert(clone_size.count(clone))
The full osd log (127 KB) is here:
<https://www.mrc-lmb.cam.ac.uk/scicomp/ceph-osd.287.log.gz>
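For what it's worth, my reading of the assert (I have not dug deeply
into the 14.2.8 source) is that SnapSet::get_clone_bytes() looks up the
clone's recorded size in the SnapSet's clone_size map and asserts that
an entry exists, so the crash would seem to mean the SnapSet of the
object being backfilled references a clone snap with no clone_size
entry. A rough standalone illustration, using simplified made-up types
rather than the real Ceph classes:

#include <cassert>
#include <cstdint>
#include <map>

// Simplified stand-in for Ceph's snapid_t / SnapSet, only to show what
// the failed ceph_assert(clone_size.count(clone)) is checking.
using snapid_t = uint64_t;

struct SnapSetLike {
  // per-clone object size as recorded in the SnapSet metadata
  std::map<snapid_t, uint64_t> clone_size;

  uint64_t get_clone_bytes(snapid_t clone) const {
    // assert that this clone has a recorded size before using it
    assert(clone_size.count(clone));
    return clone_size.at(clone);
  }
};

int main() {
  SnapSetLike ss;
  ss.clone_size[4] = 4194304;  // clone 4 has a size recorded
  ss.get_clone_bytes(4);       // ok
  ss.get_clone_bytes(7);       // no entry for clone 7 -> assert fires
}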
While the osd was running, the pg state was as follows:
[root@ceph7 ~]# ceph pg dump | grep ^5.750
  PG_STAT:             5.750
  OBJECTS:             190408
  MISSING_ON_PRIMARY:  0
  DEGRADED:            804
  MISPLACED:           190119
  UNFOUND:             0
  BYTES:               569643615603
  OMAP_BYTES:          0
  OMAP_KEYS:           0
  LOG:                 3090
  DISK_LOG:            3090
  STATE:               active+undersized+degraded+remapped+backfill_wait
  STATE_STAMP:         2020-03-23 14:37:57.582509
  VERSION:             35398'3381328
  REPORTED:            35491:3265627
  UP:                  [234,354,304,388,125,25,427,226,77,154]
  UP_PRIMARY:          234
  ACTING:              [NONE,NONE,287,388,125,25,427,226,77,154]
  ACTING_PRIMARY:      287
  LAST_SCRUB:          24471'3200829
  SCRUB_STAMP:         2020-01-28 15:48:35.574934
  LAST_DEEP_SCRUB:     24471'3200829
  DEEP_SCRUB_STAMP:    2020-01-28 15:48:35.574934
  SNAPTRIMQ_LEN:       112
With the osd down:
[root@ceph7 ~]# ceph pg dump | grep ^5.750
dumped all
  PG_STAT:             5.750
  OBJECTS:             190408
  MISSING_ON_PRIMARY:  0
  DEGRADED:            0
  MISPLACED:           0
  UNFOUND:             0
  BYTES:               569643615603
  OMAP_BYTES:          0
  OMAP_KEYS:           0
  LOG:                 3090
  DISK_LOG:            3090
  STATE:               down+remapped
  STATE_STAMP:         2020-03-23 15:28:28.345176
  VERSION:             35398'3381328
  REPORTED:            35532:3265613
  UP:                  [234,354,304,388,125,25,427,226,77,154]
  UP_PRIMARY:          234
  ACTING:              [NONE,NONE,NONE,388,125,25,427,226,77,154]
  ACTING_PRIMARY:      388
  LAST_SCRUB:          24471'3200829
  SCRUB_STAMP:         2020-01-28 15:48:35.574934
  LAST_DEEP_SCRUB:     24471'3200829
  DEEP_SCRUB_STAMP:    2020-01-28 15:48:35.574934
This cluster is being used to back up a live cephfs cluster and holds
1.8 PB of data, including 30 days of snapshots. We are using 8+2 EC.
Any help appreciated,
Jake
Note: I am working from home until further notice.
For help, contact unixadmin@xxxxxxxxxxxxxxxxx
--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
Phone 01223 267019
Mobile 0776 9886539