Re: OSDs in EC pool flapping

Have you seen the flapping OSD daemons consume huge amounts of memory? The restarts could be triggered by the host running out of memory (OOM killer).
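One way to check for this is to scan the kernel log for OOM-killer entries that name ceph-osd. The sketch below is only an illustration: the function name `oom_killed` and the sample log text are made up, and the message wording is assumed to follow the usual "Out of memory: Kill process" / "Out of memory: Killed process" forms, which can differ between kernel versions.

```python
import re
from typing import List

# Assumed kernel wording; older kernels say "Kill process", newer ones "Killed process".
OOM_RE = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def oom_killed(kernel_log: str, name: str = "ceph-osd") -> List[int]:
    """Return the PIDs of processes called `name` that the OOM killer terminated."""
    return [int(pid) for pid, comm in OOM_RE.findall(kernel_log) if comm == name]


# Hypothetical sample, not taken from the affected hosts.
sample = (
    "[12345.678] Out of memory: Kill process 4242 (ceph-osd) score 912 or sacrifice child\n"
    "[12399.001] Out of memory: Killed process 5151 (java) total-vm:1024kB\n"
)

print(oom_killed(sample))
```

On a real host the input would come from the kernel log (for example the output of `dmesg`); an empty result suggests the restarts have another cause.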

If so, this could be related to an OSD device error (bad blocks?). We experienced something similar, although on Jewel rather than Kraken. The solution was to find the PG causing the error, manually trigger a deep scrub on it, and restart that PG's primary OSD.
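The steps above can be sketched as a small helper that builds the commands involved. This is a rough outline, not a tested procedure: the helper names are hypothetical, the PG id 1.138 and OSD id 1290 are read off the log quoted below, and the `ceph-osd@<id>` unit name assumes a systemd-managed deployment. `run` defaults to a dry run so nothing is executed by accident.

```python
import subprocess


def pg_map_cmd(pgid):
    """Command that prints the up and acting sets of a PG (shows the primary)."""
    return ["ceph", "pg", "map", pgid]


def deep_scrub_cmd(pgid):
    """Command that asks the PG's primary OSD to start a deep scrub."""
    return ["ceph", "pg", "deep-scrub", pgid]


def restart_osd_cmd(osd_id):
    """Command that restarts one OSD daemon; must run on the host carrying that OSD."""
    return ["systemctl", "restart", "ceph-osd@{}".format(osd_id)]


def run(cmd, dry_run=True):
    """Return the command line as text; execute it only when dry_run is False."""
    line = " ".join(cmd)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return line


# PG and primary OSD as they appear in the quoted log (assumed still current).
for cmd in (pg_map_cmd("1.138"), deep_scrub_cmd("1.138"), restart_osd_cmd(1290)):
    print(run(cmd))
```

Checking `ceph pg map` first matters because the primary can change while OSDs flap, so the OSD to restart may no longer be the one seen in an older log.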

Hope that helps, or at least leads to a solution.

On 22 August 2017 18:39:47 CEST, george.vasilakakos@xxxxxxxxxx wrote:
Hey folks,


I'm staring at a problem that I have found no solution for and which is causing major issues.
We've had a PG go down with the first 3 OSDs all crashing and coming back only to crash again with the following error in their logs:

-1> 2017-08-22 17:27:50.961633 7f4af4057700 -1 osd.1290 pg_epoch: 72946 pg[1.138s0( v 72946'430011 (62760'421568,72946'430011] local-les=72945 n=22918 ec=764 les/c/f 72945/72881/0 72942/72944/72944) [1290,927,672,456,177,1094,194,1513,236,302,1326]/[1290,927,672,456,177,1094,194,2147483647,236,302,1326] r=0 lpr=72944 pi=72880-72943/24 bft=1513(7) crt=72946'430011 lcod 72889'430010 mlcod 72889'430010 active+undersized+degraded+remapped+backfilling] recover_replicas: object added to missing set for backfill, but is not in recovering, error!
0> 2017-08-22 17:27:50.965861 7f4af4057700 -1 *** Caught signal (Aborted) **
in thread 7f4af4057700 thread_name:tp_osd_tp
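For reading such a line, a short parser can pull out the PG id and the up/acting sets. This is a sketch built only against the log line above: the regular expressions and the name `parse_pg_line` are illustrative, and the line in the code is abbreviated. The value 2147483647 (0x7fffffff) is what Ceph prints for a shard with no OSD mapped; here it sits at position 7 of the acting set, the same shard that `bft=1513(7)` names as the backfill target.

```python
import re

CRUSH_ITEM_NONE = 2147483647  # 0x7fffffff: no OSD mapped to this shard

PG_RE = re.compile(r"pg\[(\d+\.[0-9a-f]+)(?:s(\d+))?\(")   # e.g. pg[1.138s0(
SETS_RE = re.compile(r"\[([\d,]+)\]/\[([\d,]+)\]")          # [up set]/[acting set]


def parse_pg_line(line):
    """Extract PG id, shard, up/acting sets and unmapped shard positions."""
    pg = PG_RE.search(line)
    sets = SETS_RE.search(line)
    if pg is None or sets is None:
        return None
    up = [int(x) for x in sets.group(1).split(",")]
    acting = [int(x) for x in sets.group(2).split(",")]
    mapped = [osd for osd in acting if osd != CRUSH_ITEM_NONE]
    return {
        "pgid": pg.group(1),
        "shard": int(pg.group(2)) if pg.group(2) is not None else None,
        "up": up,
        "acting": acting,
        # Assumption: the first mapped OSD in the acting set is the primary.
        "primary": mapped[0] if mapped else None,
        "unmapped_shards": [i for i, osd in enumerate(acting) if osd == CRUSH_ITEM_NONE],
    }


# Abbreviated copy of the log line above.
line = (
    "osd.1290 pg_epoch: 72946 pg[1.138s0( v 72946'430011 "
    "(62760'421568,72946'430011] local-les=72945 n=22918 ec=764 "
    "les/c/f 72945/72881/0 72942/72944/72944) "
    "[1290,927,672,456,177,1094,194,1513,236,302,1326]/"
    "[1290,927,672,456,177,1094,194,2147483647,236,302,1326] r=0 lpr=72944"
)

info = parse_pg_line(line)
print(info["pgid"], info["primary"], info["unmapped_shards"])
```

The acting set has 11 entries, matching the 8+3 EC profile, with one shard currently unmapped.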

This has been going on over the weekend; we saw a different error message before upgrading from 11.2.0 to 11.2.1.
The pool is running EC 8+3.

The OSDs crash with that error, only to be restarted by systemd and fail again in exactly the same way. Eventually systemd gives up, the mon_osd_down_out_interval expires, and the PG just stays down+remapped while the others recover and go active+clean.

Can anybody help with this type of problem?


Best regards,

George Vasilakakos


ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Paweł Woszuk
PCSS, Poznańskie Centrum Superkomputerowo-Sieciowe
ul. Jana Pawła II nr 10, 61-139 Poznań
Polska