Re: Power outages!!! help!

Maged Mokhtar <mmokhtar@xxxxxxxxxxx> · Mon, 28 Aug 2017 11:32:17 +0200

I would suggest either adding 1 new disk on each of the 2 machines increasing the osd_backfill_full_ratio to something like 90 or 92 from default 85.
/Maged 

On 2017-08-28 08:01, hjcho616 wrote:

Hello!

I've been using ceph for long time mostly for network CephFS storage, even before Argonaut release!  It's been working very well for me.  Yes, I had some power outtages before and asked few questions on this list before and got resolved happily!  Thank you all!

Not sure why but we've been having quite a bit of power outages lately.  Ceph appear to be running OK with those going on.. so I was pretty happy and didn't thought much of it... till yesterday, When I started to move some videos to cephfs, ceph decided that it was full although df showed only 54% utilization!  Then I looked up, some of the osds were down! (only 3 at that point!)

I am running pretty simple ceph configuration... I have one machine running MDS and mon named MDS1.  Two OSD machines with 5 2TB HDDs and 1 SSD for journal named OSD1 and OSD2.

At the time, I was running jewel 10.2.2. I looked at some of downed OSD's log file and googled some of them... they appeared to be tied to version 10.2.2.  So I just upgraded all to 10.2.9.  Well that didn't solve my problems.. =P  While looking at some of this.. there was another power outage!  D'oh!  I may need to invest in a UPS or something... Until this happened, all of the osd down were from OSD2.  But OSD1 took a hit!  Couldn't boot, because osd-0 was damaged... I tried xfs_repair -L /dev/sdb1 as suggested by command line.. I was able to mount it again, phew, reboot... then /dev/sdb1 is no longer accessible!  Noooo!!!

So this is what I have today!  I am a bit concerned as half of the osds are down!  and osd.0 doesn't look good at all...
# ceph osd tree
ID WEIGHT   TYPE NAME     UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 16.24478 root default
-2  8.12239     host OSD1
 1  1.95250         osd.1      up  1.00000          1.00000
 0  1.95250         osd.0    down        0          1.00000
 7  0.31239         osd.7      up  1.00000          1.00000
 6  1.95250         osd.6      up  1.00000          1.00000
 2  1.95250         osd.2      up  1.00000          1.00000
-3  8.12239     host OSD2
 3  1.95250         osd.3    down        0          1.00000
 4  1.95250         osd.4    down        0          1.00000
 5  1.95250         osd.5    down        0          1.00000
 8  1.95250         osd.8    down        0          1.00000
 9  0.31239         osd.9      up  1.00000          1.00000

This looked alot better before that last extra power outage... =(  Can't mount it anymore!
# ceph health
HEALTH_ERR 22 pgs are stuck inactive for more than 300 seconds; 44 pgs backfill_toofull; 80 pgs backfill_wait; 122 pgs degraded; 6 pgs down; 8 pgs inconsistent; 6 pgs peering; 2 pgs recovering; 18 pgs recovery_wait; 16 pgs stale; 122 pgs stuck degraded; 6 pgs stuck inactive; 16 pgs stuck stale; 159 pgs stuck unclean; 102 pgs stuck undersized; 102 pgs undersized; 1 requests are blocked > 32 sec; recovery 1803466/4503980 objects degraded (40.042%); recovery 692976/4503980 objects misplaced (15.386%); recovery 147/2251990 unfound (0.007%); 1 near full osd(s); 54 scrub errors; mds cluster is degraded; no legacy OSD present but 'sortbitwise' flag is not set

Each of osds are showing different failure signature. 

I've uploaded osd log with debug osd = 20, debug filestore = 20, and debug ms = 20.  You can find it in below links.  Let me know if there is preferred way to share this!
https://drive.google.com/open?id=0By7YztAJNGUWQXItNzVMR281Snc (ceph-osd.3.log)
https://drive.google.com/open?id=0By7YztAJNGUWYmJBb3RvLVdSQWc (ceph-osd.4.log)
https://drive.google.com/open?id=0By7YztAJNGUWaXhRMlFOajN6M1k (ceph-osd.5.log)
https://drive.google.com/open?id=0By7YztAJNGUWdm9BWFM5a3ExOFE (ceph-osd.8.log)

So how does this look?  Can this be fixed? =)  If so please let me know.  I used to take backups but since it grew so big, I wasn't able to do so anymore... and would like to get most of these back if I can.  Please let me know if you need more info!

Thank you!

Regards,
Hong

_______________________________________________
 ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com