Tried connecting the recovered OSD. Looks like some of the files in lost+found are superblocks. Below is the log. What can I do about this?
2017-09-01 22:27:27.634228 7f68837e5800  0 set uid:gid to 1001:1001 (ceph:ceph)
2017-09-01 22:27:27.634245 7f68837e5800  0 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0), process ceph-osd, pid 5432
2017-09-01 22:27:27.635456 7f68837e5800  0 pidfile_write: ignore empty --pid-file
2017-09-01 22:27:27.646849 7f68837e5800  0 filestore(/var/lib/ceph/osd/ceph-0) backend xfs (magic 0x58465342)
2017-09-01 22:27:27.647077 7f68837e5800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2017-09-01 22:27:27.647080 7f68837e5800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2017-09-01 22:27:27.647091 7f68837e5800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: splice is supported
2017-09-01 22:27:27.678937 7f68837e5800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2017-09-01 22:27:27.679044 7f68837e5800  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: extsize is disabled by conf
2017-09-01 22:27:27.680718 7f68837e5800  1 leveldb: Recovering log #28054
2017-09-01 22:27:27.804501 7f68837e5800  1 leveldb: Delete type=0 #28054
2017-09-01 22:27:27.804579 7f68837e5800  1 leveldb: Delete type=3 #28053
2017-09-01 22:27:35.586725 7f68837e5800  0 filestore(/var/lib/ceph/osd/ceph-0) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2017-09-01 22:27:35.587689 7f68837e5800  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 18: 9998729216 bytes, block size 4096 bytes, directio = 1, aio = 1
2017-09-01 22:27:35.589631 7f68837e5800  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 18: 9998729216 bytes, block size 4096 bytes, directio = 1, aio = 1
2017-09-01 22:27:35.590041 7f68837e5800  1 filestore(/var/lib/ceph/osd/ceph-0) upgrade
2017-09-01 22:27:35.590149 7f68837e5800 -1 filestore(/var/lib/ceph/osd/ceph-0) could not find #-1:7b3f43c4:::osd_superblock:0# in index: (2) No such file or directory
2017-09-01 22:27:35.590158 7f68837e5800 -1 osd.0 0 OSD::init() : unable to read osd superblock
2017-09-01 22:27:35.590547 7f68837e5800  1 journal close /var/lib/ceph/osd/ceph-0/journal
2017-09-01 22:27:35.611595 7f68837e5800 -1 ** ERROR: osd init failed: (22) Invalid argument
The recovered drive is mounted on /var/lib/ceph/osd/ceph-0.
# df
Filesystem     1K-blocks       Used  Available Use% Mounted on
udev               10240          0      10240   0% /dev
tmpfs            1584780       9172    1575608   1% /run
/dev/sda1       15247760    9319048    5131120  65% /
tmpfs            3961940          0    3961940   0% /dev/shm
tmpfs               5120          0       5120   0% /run/lock
tmpfs            3961940          0    3961940   0% /sys/fs/cgroup
/dev/sdb1     1952559676  634913968 1317645708  33% /var/lib/ceph/osd/ceph-0
/dev/sde1     1952559676  640365952 1312193724  33% /var/lib/ceph/osd/ceph-6
/dev/sdd1     1952559676  712018768 1240540908  37% /var/lib/ceph/osd/ceph-2
/dev/sdc1     1952559676  755827440 1196732236  39% /var/lib/ceph/osd/ceph-1
/dev/sdf1      312417560   42538060  269879500  14% /var/lib/ceph/osd/ceph-7
tmpfs             792392          0     792392   0% /run/user/0
# cd /var/lib/ceph/osd/ceph-0
# ls
activate.monmap  current  journal_uuid  magic          superblock  whoami
active           fsid     keyring       ready          sysvinit
ceph_fsid        journal  lost+found    store_version  type
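One possible salvage path when the OSD daemon itself won't start but the filestore still mounts is to pull PG contents off it with ceph-objectstore-tool and import them into another OSD. A rough sketch only, assuming the Jewel-era tool; the PG id 0.0 and the target osd ceph-8 are placeholders:

```shell
# List the PGs still present on the damaged filestore
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --journal-path /var/lib/ceph/osd/ceph-0/journal --op list-pgs

# Export one PG to a file (repeat per PG; "0.0" is a placeholder id)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --journal-path /var/lib/ceph/osd/ceph-0/journal \
    --op export --pgid 0.0 --file /root/pg-0.0.export

# Import the export into a stopped, healthy OSD (ceph-8 is hypothetical)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-8 \
    --journal-path /var/lib/ceph/osd/ceph-8/journal \
    --op import --file /root/pg-0.0.export
```

The export/import round-trip avoids ever starting the damaged OSD, which matters here since its superblock object is gone.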
Regards,
Hong
Found the partition, but wasn't able to mount it right away... Did an xfs_repair on that drive and got a bunch of messages like this.. =(
entry "100000a89fd.00000000__head_AE319A25__0" in shortform directory 845908970 references non-existent inode 605294241
junking entry "100000a89fd.00000000__head_AE319A25__0" in directory inode 845908970
Was able to mount. lost+found has lots of files there. =P Running du seems to show OK files in the current directory.

Will it be safe to attach this one back to the cluster? Is there a way to specify to use this drive if the data is missing? =) Or am I being paranoid? Just plug it? =)
Regards,
Hong
Looks like it has been rescued... Only 1 error, as we saw before in the SMART log!
# ddrescue -f /dev/sda /dev/sdc ./rescue.log
GNU ddrescue 1.21
Press Ctrl-C to interrupt
     ipos:    1508 GB, non-trimmed:        0 B,  current rate:       0 B/s
     opos:    1508 GB, non-scraped:        0 B,  average rate:  88985 kB/s
non-tried:        0 B,     errsize:     4096 B,      run time:  6h 14m 40s
  rescued:    2000 GB,      errors:          1,  remaining time:        n/a
percent rescued: 99.99%  time since last successful read:             39s
Finished
Still missing the partition on the new drive. =P I found this util called testdisk for broken partition tables. Will try that tonight. =P
Regards,
Hong
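testdisk itself is interactive, so there isn't much to script, but the non-destructive first steps look roughly like this (a sketch; /dev/sdc is the clone target from the df output above):

```shell
# See what the kernel currently thinks the partition table looks like
fdisk -l /dev/sdc

# Scan for lost partitions interactively (Analyse -> Quick Search,
# then Write only once the found partition list looks right)
testdisk /dev/sdc

# Tell the kernel to re-read the rewritten partition table
partprobe /dev/sdc
```

Quick Search is read-only; nothing is changed on disk until the Write step is confirmed, so it is safe to look first.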
On 30.08.2017 15:32, Steve Taylor wrote:

I'm not familiar with dd_rescue, but I've just been reading about it. I'm not seeing any features that would be beneficial in this scenario that aren't also available in dd. What specific features give it "really a far better chance of restoring a copy of your disk" than dd? I'm always interested in learning about new recovery tools.
I see I wrote dd_rescue from old habit, but the package one should use on Debian is gddrescue, also called GNU ddrescue. This page has some details on the differences between dd and the ddrescue variants:
http://www.toad.com/gnu/sysadmin/index.html#ddrescue

kind regards
Ronny Aasen
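For reference, the usual GNU ddrescue invocation is a fast first pass plus a retry pass, both driven by the same mapfile so no work is ever repeated (a sketch; adjust the device names to your setup):

```shell
# Pass 1: grab everything readable quickly, skip scraping bad areas for now
ddrescue -f -n /dev/sdb /dev/sdc rescue.map

# Pass 2: revisit only the bad areas recorded in rescue.map,
# with direct I/O and up to 3 retries per sector
ddrescue -f -d -r3 /dev/sdb /dev/sdc rescue.map
```

The mapfile is the real advantage over dd: the copy can be interrupted and resumed, and later passes touch only the regions that previously failed.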
If you are not the intended recipient of this message or received it erroneously, please notify the sender and delete it, together with any attachments, and be advised that any dissemination or copying of this message is prohibited.
On Tue, 2017-08-29 at 21:49 +0200, Willem Jan Withagen wrote:
On 29-8-2017 19:12, Steve Taylor wrote:
Hong,

Probably your best chance at recovering any data without special, expensive, forensic procedures is to perform a dd from /dev/sdb to somewhere else large enough to hold a full disk image and attempt to repair that. You'll want to use 'conv=noerror' with your dd command since your disk is failing. Then you could either re-attach the OSD from the new source or attempt to retrieve objects from the filestore on it.
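A minimal form of that dd, assuming the target is an image file on a filesystem with room for the full disk (the path is made up). One caveat worth knowing: conv=noerror alone drops the unreadable blocks and silently shifts every later offset, so pair it with sync to pad failed reads and keep the image aligned:

```shell
# noerror: keep going past read errors; sync: pad failed blocks with
# zeros so the image stays the same size/layout as the source disk
dd if=/dev/sdb of=/mnt/big/sdb.img bs=64K conv=noerror,sync status=progress
```

A smaller block size loses less data around each bad sector, at the cost of a slower copy; status=progress needs a reasonably recent GNU coreutils.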
Like somebody else already pointed out: in problem cases like this disk, use dd_rescue. It has really a far better chance of restoring a copy of your disk.

--WjW
I have actually done this before by creating an RBD that matches the disk size, performing the dd, running xfs_repair, and eventually adding it back to the cluster as an OSD. RBDs as OSDs is certainly a temporary arrangement for repair only, but I'm happy to report that it worked flawlessly in my case. I was able to weight the OSD to 0, offload all of its data, then remove it for a full recovery, at which point I just deleted the RBD.

The possibilities afforded by Ceph inception are endless. ☺
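That inception trick, sketched out in commands. The image name is made up, sizes are approximate, and the OSD start line assumes the sysvinit setup visible in the ls listing earlier; treat this as repair-only scaffolding, not a recipe:

```shell
# Create and map an RBD roughly matching the failed 2 TB disk (size in MB)
rbd create rescue-img --size 2000000 --pool rbd
rbd map rbd/rescue-img            # shows up as e.g. /dev/rbd0

# Image the failing disk onto it, then repair the filesystem
dd if=/dev/sdb of=/dev/rbd0 bs=4M conv=noerror,sync
xfs_repair /dev/rbd0

# Mount it where the dead OSD lived and start the daemon (sysvinit style)
mount /dev/rbd0 /var/lib/ceph/osd/ceph-0
service ceph start osd.0

# Once recovery completes, drain the OSD again before removing it
ceph osd crush reweight osd.0 0
```

Obvious caveat: the RBD must live in a pool whose PGs do not depend on the OSD being repaired, or the copy deadlocks on itself.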
Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799 |
On Mon, 2017-08-28 at 23:17 +0100, Tomasz Kusmierz wrote:
Rule of thumb with batteries is:
- the more "proper temperature" you run them at, the more life you get out of them
- the more a battery is overpowered for your application, the longer it will survive.

Get yourself an LSI 94** controller and use it as an HBA and you will be fine. But get MORE DRIVES !!!!! …
On 28 Aug 2017, at 23:10, hjcho616 <hjcho616@xxxxxxxxx> wrote:
Thank you Tomasz and Ronny. I'll have to order some hdd soon and try these out. Car battery idea is nice! I may try that.. =) Do they last longer? Ones that fit the UPS original battery spec didn't last very long... part of the reason why I gave up on them.. =P My wife probably won't like the idea of a car battery hanging out though ha!

The OSD1 (the one with mostly ok OSDs, except that smart failure) motherboard doesn't have any additional SATA connectors available. Would it be safe to add another OSD host?
Regards,
Hong
On Monday, August 28, 2017 4:43 PM, Tomasz Kusmierz <tom.kusmierz@gmail.com> wrote:
Sorry for being brutal … anyway
1. get the battery for the UPS (a car battery will do as well; I've modded a UPS in the past with a truck battery and it was working like a charm :D )
2. get spare drives and put those in, because your cluster CAN NOT get out of error due to lack of space
3. follow the advice of Ronny Aasen on how to recover data from hard drives
4. get cooling to the drives or you will lose more!
On 28 Aug 2017, at 22:39, hjcho616 <hjcho616@xxxxxxxxx> wrote:
Tomasz,

Those machines are behind a surge protector. Doesn't appear to be a good one! I do have a UPS... but it is my fault... no battery. Power was pretty reliable for a while... and the UPS was just beeping every chance it had, disrupting some sleep.. =P So running on surge protector only. I am running this in a home environment. So far, HDD failures have been very rare for this environment. =) It just doesn't get loaded as much! I am not sure what to expect; seeing that "unfound" and just the feeling of a possibility of maybe getting the OSD back made me excited about it. =) Thanks for letting me know what should be the priority. I just lack experience and knowledge in this. =) Please do continue to guide me through this.

Thank you for the decode of those smart messages! I do agree that it looks like it is on its way out. I would like to know how to get a good portion of it back if possible. =)

I think I just set the size and min_size to 1.
# ceph osd lspools
0 data,1 metadata,2 rbd,
# ceph osd pool set rbd size 1
set pool 2 size to 1
# ceph osd pool set rbd min_size 1
set pool 2 min_size to 1
Seems to be doing some backfilling work.
# ceph health
HEALTH_ERR 22 pgs are stuck inactive for more than 300 seconds; 2 pgs backfill_toofull; 74 pgs backfill_wait; 3 pgs backfilling; 108 pgs degraded; 6 pgs down; 6 pgs inconsistent; 6 pgs peering; 7 pgs recovery_wait; 16 pgs stale; 108 pgs stuck degraded; 6 pgs stuck inactive; 16 pgs stuck stale; 130 pgs stuck unclean; 101 pgs stuck undersized; 101 pgs undersized; 1 requests are blocked > 32 sec; recovery 1790657/4502340 objects degraded (39.772%); recovery 641906/4502340 objects misplaced (14.257%); recovery 147/2251990 unfound (0.007%); 50 scrub errors; mds cluster is degraded; no legacy OSD present but 'sortbitwise' flag is not set
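For digging into that output, the usual Jewel-era commands look roughly like this (the pg id 2.5 is a placeholder you would take from the health detail listing):

```shell
# Which PGs hold the 147 unfound objects?
ceph health detail | grep unfound

# Inspect one affected PG to see which OSDs it is still probing
ceph pg 2.5 query

# Last resort, only once the damaged OSDs are truly unrecoverable:
# give up on the unfound objects, reverting to older copies where any exist
ceph pg 2.5 mark_unfound_lost revert

# The health output also nags about this flag (safe once all OSDs are Jewel+)
ceph osd set sortbitwise
```

mark_unfound_lost is destructive and irreversible, which is exactly why it is worth exhausting the disk-rescue route discussed in this thread first.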
Regards,
Hong
On Monday, August 28, 2017 4:18 PM, Tomasz Kusmierz <tom.kusmierz@gmail.com> wrote:
So to decode a few things about your disk:

1 Raw_Read_Error_Rate      0x002f   100   100   051    Pre-fail  Always   -   37
37 read errors and only one sector marked as pending - fun disk :/

181 Program_Fail_Cnt_Total 0x0022   099   099   000    Old_age   Always   -   35325174
So the firmware has quite a few bugs, that's nice.

191 G-Sense_Error_Rate     0x0022   100   100   000    Old_age   Always   -   2855
Disk was thrown around while operational, even more nice.

194 Temperature_Celsius    0x0002   047   041   000    Old_age   Always   -   53 (Min/Max 15/59)
If your disk passes 50 you should not consider using it; high temperatures demagnetise the platter layer and you will see more errors in the very near future.

197 Current_Pending_Sector 0x0032   100   100   000    Old_age   Always   -   1
As mentioned before :)

200 Multi_Zone_Error_Rate  0x002a   100   100   000    Old_age   Always   -   4222
Your heads keep missing tracks … bent? I don't even know how to comment here.

Generally a fun drive you've got there … rescue as much as you can and throw it away !!!
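For completeness, the attribute table being decoded above comes from smartctl; re-checking the drive and kicking off a surface scan looks like this (smartmontools assumed installed, and the device name is from earlier in the thread):

```shell
# Dump all SMART info, including the attribute table quoted above
smartctl -a /dev/sda

# Start a long (full-surface) self-test; it runs in the drive itself
smartctl -t long /dev/sda

# Check the self-test result once it has had time to finish
smartctl -l selftest /dev/sda
```

A rising Current_Pending_Sector count between runs is the clearest sign the drive is still actively losing sectors.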
_______________________________________________
ceph-users
mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com