Ceph failures after 5 journals down

Hi Ceph users!

I have a Ceph cluster here: 6 nodes, 12 OSDs on HDD and SSD disks. All OSD journals are on SSDs. There are 25 HDDs of various kinds in total.

We have had several HDD failures in the past, but each time it was the HDD itself that failed, never anything journal-related. After replacing the HDD and running the recovery procedures, everything worked again.

But now we have had a double SSD failure. Two SSDs hosting journals went down, so we lost 5 journals in total (out of 12).

We then created new journals on another HDD and added them to the cluster. Ceph started recovery and everything looked good until 10 unfound objects were reported. I tried to revert them using: ceph pg <PG> mark_unfound_lost revert, but that was unsuccessful, so I deleted them. From that moment on, two OSDs started crashing frequently:

5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x5612e2e6c946]
6: (ReplicatedPG::hit_set_trim(std::unique_ptr<ReplicatedPG::OpContext, std::default_delete<ReplicatedPG::OpContext> >&, unsigned int)+0x54e) [0x5612e28e48ee]
7: (ReplicatedPG::hit_set_persist()+0xd7d) [0x5612e28ead9d]
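For anyone following along, the unfound objects mentioned above can be inspected before they are reverted or deleted. What follows is only a minimal sketch, assuming a Jewel-era ceph CLI (where the listing subcommand is list_missing) and a PG id passed as the first argument; the PG id and the run helper are illustrative and not taken from this cluster. The script only prints the commands unless APPLY=1 is set:

```shell
#!/bin/sh
# Sketch only: the PG id is a placeholder argument, not a value from this post.
PG="${1:-<PG>}"

# Print each command; execute it only when APPLY=1 is set in the environment.
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then "$@"; fi
}

# Health summary, including which PGs report unfound objects.
run ceph health detail
# Names of the missing/unfound objects in this PG.
run ceph pg "$PG" list_missing
# Peering and recovery state of the PG, including which OSDs were probed.
run ceph pg "$PG" query
```

Invoked as, for example, sh inspect-unfound.sh <PG>, this previews the commands; setting APPLY=1 runs them against the cluster.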

It is worth mentioning that a cache tier is currently enabled on a separate pool.

I'm trying to flush and evict the cache, but it takes ages because of errors like this:

2016-12-08 11:52:00.007730 7f9e816bd700  0 -- NODE.2:0/3005344741 >> NODE.8:6802/17445 pipe(0x7f9e780161e0 sd=8 :0 s=1 pgs=0 cs=0 l=1 c=0x7f9e7800ecf0).fault

Every time this error occurs, the OSD on NODE.8 goes down. I suspect there is some inconsistency between the cache and the data OSDs because of the SSD failure. My guess is that the cache can't be flushed until the data is recovered, and the data can't be recovered because the cache hasn't been flushed yet and the data is inconsistent:

log_channel(cluster) log [WRN] : pg 10.33 has invalid (post-split) stats; must scrub before tier agent can activate
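The warning above names its own prerequisite: pg 10.33 has to be scrubbed before the tier agent will act on it. A hedged sketch of that sequence follows, assuming the scrub can complete while the affected OSDs stay up, which is not guaranteed here. CACHE_POOL is a placeholder for the cache pool name (not given in this post) and the run helper is illustrative. Commands are only printed unless APPLY=1 is set:

```shell
#!/bin/sh
# PG id taken from the warning above.
PGID="10.33"
# Placeholder: the real cache pool name is not stated in this post.
CACHE_POOL="${CACHE_POOL:-<cachepool>}"

# Print each command; execute it only when APPLY=1 is set in the environment.
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then "$@"; fi
}

# Ask the primary OSD to scrub the PG named in the warning.
run ceph pg scrub "$PGID"
# Inspect the PG afterwards to check its state and stats.
run ceph pg "$PGID" query
# Once the tier agent can act on the PG, retry flushing and evicting the cache pool.
run rados -p "$CACHE_POOL" cache-flush-evict-all
```

Keeping the destructive step last and behind APPLY=1 makes it easy to preview the sequence before touching the cluster.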

Any ideas?

-- 
Wojtek