What can I try to do or delete to regain access? Those OSDs are going crazy, flapping up and down. I think the situation is out of control.

HEALTH_WARN 2735 pgs backfill; 13 pgs backfill_toofull; 157 pgs backfilling; 188 pgs degraded; 251 pgs peering; 13 pgs recovering; 1159 pgs recovery_wait; 159 pgs stuck inactive; 4641 pgs stuck unclean; recovery 4007916/23007073 degraded (17.420%); recovering 4 o/s, 31927KB/s; 19 near full osd(s)

2013-04-21 18:56:46.839851 mon.0 [INF] pgmap v1399007: 17280 pgs: 276 active, 12791 active+clean, 2575 active+remapped+wait_backfill, 71 active+degraded+wait_backfill, 6 active+remapped+wait_backfill+backfill_toofull, 1121 active+recovery_wait, 90 peering, 3 remapped, 1 active+remapped, 127 active+remapped+backfilling, 1 active+degraded, 5 active+remapped+backfill_toofull, 19 active+degraded+backfilling, 1 active+clean+scrubbing, 79 active+degraded+remapped+wait_backfill, 36 active+recovery_wait+remapped, 1 active+degraded+remapped+wait_backfill+backfill_toofull, 46 remapped+peering, 16 active+degraded+remapped+backfilling, 1 active+recovery_wait+degraded+remapped, 14 active+recovering; 50435 GB data, 74790 GB used, 38642 GB / 110 TB avail; 4018849/23025448 degraded (17.454%); recovering 14 o/s, 54732KB/s

# id   weight  type name          up/down  reweight
-1     130     root default
-9     65        room p1
-3     44          rack r14
-4     22            host s101
11     2               osd.11     up       1
12     2               osd.12     up       1
13     2               osd.13     up       1
14     2               osd.14     up       1
15     2               osd.15     up       1
16     2               osd.16     up       1
17     2               osd.17     up       1
18     2               osd.18     up       1
19     2               osd.19     up       1
20     2               osd.20     up       1
21     2               osd.21     up       1
-6     22            host s102
33     2               osd.33     up       1
34     2               osd.34     up       1
35     2               osd.35     up       1
36     2               osd.36     up       1
37     2               osd.37     up       1
38     2               osd.38     up       1
39     2               osd.39     up       1
40     2               osd.40     up       1
41     2               osd.41     up       1
42     2               osd.42     up       1
43     2               osd.43     up       1
-13    21          rack r10
-12    21            host s103
55     2               osd.55     up       1
56     2               osd.56     up       1
57     2               osd.57     up       1
58     2               osd.58     up       1
59     2               osd.59     down     0
60     2               osd.60     down     0
61     2               osd.61     down     0
62     2               osd.62     up       1
63     2               osd.63     up       1
64     1.5             osd.64     up       1
65     1.5             osd.65     down     0
-10    65        room p2
-7     22          rack r20
-5     22            host s202
22     2               osd.22     up       1
23     2               osd.23     up       1
24     2               osd.24     up       1
25     2               osd.25     up       1
26     2               osd.26     up       1
27     2               osd.27     up       1
28     2               osd.28     up       1
29     2               osd.29     up       1
30     2               osd.30     up       1
31     2               osd.31     up       1
32     2               osd.32     up       1
-8     22          rack r22
-2     22            host s201
0      2               osd.0      up       1
1      2               osd.1      up       1
2      2               osd.2      up       1
3      2               osd.3      up       1
4      2               osd.4      up       1
5      2               osd.5      up       1
6      2               osd.6      up       1
7      2               osd.7      up       1
8      2               osd.8      up       1
9      2               osd.9      up       1
10     2               osd.10     up       1
-14    21          rack r21
-11    21            host s203
44     2               osd.44     up       1
45     2               osd.45     up       1
46     2               osd.46     up       1
47     2               osd.47     up       1
48     2               osd.48     up       1
49     2               osd.49     up       1
50     2               osd.50     up       1
51     2               osd.51     up       1
52     1.5             osd.52     up       1
53     1.5             osd.53     up       1
54     2               osd.54     up       1
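For reference, flapping during heavy recovery can often be damped temporarily with the cluster-wide OSD flags. A minimal sketch, assuming a bobtail-era (0.56.x) CLI; the flag names are the same ones already visible in the health output above:

    # keep flapping OSDs from being marked "out", which would re-trigger backfill
    ceph osd set noout
    # optionally also keep them from being marked "down" at all
    ceph osd set nodown

    # once peering and backfill have settled, remove the flags again
    ceph osd unset nodown
    ceph osd unset noout

Note that nodown also hides real failures, so it is only a stop-gap while the cluster catches up.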
2013/4/21 Marco Aroldi <marco.aroldi@xxxxxxxxx>:
> So, I've restarted as many of the new OSDs as possible and the cluster
> started to move data to the 2 new nodes overnight.
> This morning there was no network traffic and the health was
>
> HEALTH_ERR 1323 pgs backfill; 150 pgs backfill_toofull; 100 pgs
> backfilling; 114 pgs degraded; 3374 pgs peering; 36 pgs recovering;
> 949 pgs recovery_wait; 3374 pgs stuck inactive; 6289 pgs stuck
> unclean; recovery 2130652/20890113 degraded (10.199%); 58/8914654
> unfound (0.001%); 1 full osd(s); 22 near full osd(s); full,noup,nodown
> flag(s) set
>
> So I unset the noup and nodown flags and the data started moving again.
> I've increased the full ratio to 97%, so now there is no "official" full
> OSD and the HEALTH_ERR became HEALTH_WARN.
>
> However, there is still no access to the filesystem:
>
> HEALTH_WARN 1906 pgs backfill; 21 pgs backfill_toofull; 52 pgs
> backfilling; 707 pgs degraded; 371 pgs down; 97 pgs incomplete; 3385
> pgs peering; 35 pgs recovering; 1002 pgs recovery_wait; 4 pgs stale;
> 683 pgs stuck inactive; 5898 pgs stuck unclean; recovery
> 3081499/22208859 degraded (13.875%); 487/9433642 unfound (0.005%);
> recovering 11722 o/s, 57040MB/s; 17 near full osd(s)
>
> The OSDs are flapping in/out again...
>
> I'm prepared to start deleting some portion of the data.
> What can I try to do now?
>
> 2013/4/21 Gregory Farnum <greg@xxxxxxxxxxx>:
>> It's not entirely clear from your description and the output you've
>> given us, but it looks like maybe you've managed to bring up all your
>> OSDs correctly at this point? Or are they just not reporting down
>> because you set the "nodown" flag...
>>
>> In any case, CephFS isn't going to come up while the underlying RADOS
>> cluster is this unhealthy, so you're going to need to get that going
>> again. Since your OSDs have managed to get themselves so full, it's
>> going to be trickier than normal, but if all the rebalancing that's
>> happening is only because you sort-of-didn't-really lose nodes, and
>> you can bring them all back up, you should be able to sort it out by
>> getting all the nodes back up and then changing your full percentages
>> (by a *very small* amount); since you haven't been doing any writes to
>> the cluster, it shouldn't take many writes to get everything back
>> where it was, although if this has been continuing to backfill in the
>> meantime, that will need to unwind.
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> On Sat, Apr 20, 2013 at 12:21 PM, John Wilkins <john.wilkins@xxxxxxxxxxx> wrote:
>>> I don't see anything related to lost objects in your output. I just see
>>> waiting on backfill, backfill_toofull, remapped, and so forth. You can read
>>> a bit about what is going on here:
>>> http://ceph.com/docs/next/rados/operations/monitoring-osd-pg/
>>>
>>> Keep us posted as to the recovery, and let me know what I can do to improve
>>> the docs for scenarios like this.
>>>
>>>
>>> On Sat, Apr 20, 2013 at 10:52 AM, Marco Aroldi <marco.aroldi@xxxxxxxxx>
>>> wrote:
>>>>
>>>> John,
>>>> thanks for the quick reply.
>>>> Below you can see my ceph osd tree.
>>>> The problem was caused not by the failure itself, but by the "renamed"
>>>> bunch of devices. It was like a deadly 15-puzzle.
>>>> I think the solution would have been to mount the devices in fstab by
>>>> UUID (/dev/disk/by-uuid) instead of /dev/sdX.
>>>>
>>>> However, yes, I have an entry in my ceph.conf (devs = /dev/sdX1 --
>>>> osd_journal = /dev/sdX2) *and* an entry in my fstab for each OSD.
>>>>
>>>> The node with the failed disk is s103 (osd.59).
>>>>
>>>> Now I have 5 OSDs from s203 up and in to try to let Ceph rebalance the
>>>> data... but it is still a bloody mess.
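A minimal sketch of the by-UUID approach Marco describes above; every UUID, mount point, and device path here is a placeholder (the real filesystem UUIDs come from blkid, and the data paths depend on how the OSDs were created):

    # /etc/fstab -- mount each OSD data partition by filesystem UUID, so that
    # /dev/sdX renumbering after a reboot no longer matters
    UUID=0f1e2d3c-PLACEHOLDER  /var/lib/ceph/osd/ceph-59  xfs  noatime  0  2

    # ceph.conf -- point the OSD at that mount point, and reference its journal
    # partition by a stable /dev/disk/by-id/ path instead of /dev/sdX2
    [osd.59]
        host = s103
        osd data = /var/lib/ceph/osd/ceph-59
        osd journal = /dev/disk/by-id/scsi-PLACEHOLDER-part2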
>>>> Look at the ceph -w output: it reports a total of 110 TB, which is
>>>> wrong... all drives are 2 TB and I have 49 drives up and in -- 98 TB
>>>> in total. I think 110 TB (55 OSDs) was the size before the cluster
>>>> became inaccessible.
>>>>
>>>> # id   weight  type name          up/down  reweight
>>>> -1     130     root default
>>>> -9     65        room p1
>>>> -3     44          rack r14
>>>> -4     22            host s101
>>>> 11     2               osd.11     up       1
>>>> 12     2               osd.12     up       1
>>>> 13     2               osd.13     up       1
>>>> 14     2               osd.14     up       1
>>>> 15     2               osd.15     up       1
>>>> 16     2               osd.16     up       1
>>>> 17     2               osd.17     up       1
>>>> 18     2               osd.18     up       1
>>>> 19     2               osd.19     up       1
>>>> 20     2               osd.20     up       1
>>>> 21     2               osd.21     up       1
>>>> -6     22            host s102
>>>> 33     2               osd.33     up       1
>>>> 34     2               osd.34     up       1
>>>> 35     2               osd.35     up       1
>>>> 36     2               osd.36     up       1
>>>> 37     2               osd.37     up       1
>>>> 38     2               osd.38     up       1
>>>> 39     2               osd.39     up       1
>>>> 40     2               osd.40     up       1
>>>> 41     2               osd.41     up       1
>>>> 42     2               osd.42     up       1
>>>> 43     2               osd.43     up       1
>>>> -13    21          rack r10
>>>> -12    21            host s103
>>>> 55     2               osd.55     up       0
>>>> 56     2               osd.56     up       0
>>>> 57     2               osd.57     up       0
>>>> 58     2               osd.58     up       0
>>>> 59     2               osd.59     down     0
>>>> 60     2               osd.60     down     0
>>>> 61     2               osd.61     down     0
>>>> 62     2               osd.62     up       0
>>>> 63     2               osd.63     up       0
>>>> 64     1.5             osd.64     up       0
>>>> 65     1.5             osd.65     down     0
>>>> -10    65        room p2
>>>> -7     22          rack r20
>>>> -5     22            host s202
>>>> 22     2               osd.22     up       1
>>>> 23     2               osd.23     up       1
>>>> 24     2               osd.24     up       1
>>>> 25     2               osd.25     up       1
>>>> 26     2               osd.26     up       1
>>>> 27     2               osd.27     up       1
>>>> 28     2               osd.28     up       1
>>>> 29     2               osd.29     up       1
>>>> 30     2               osd.30     up       1
>>>> 31     2               osd.31     up       1
>>>> 32     2               osd.32     up       1
>>>> -8     22          rack r22
>>>> -2     22            host s201
>>>> 0      2               osd.0      up       1
>>>> 1      2               osd.1      up       1
>>>> 2      2               osd.2      up       1
>>>> 3      2               osd.3      up       1
>>>> 4      2               osd.4      up       1
>>>> 5      2               osd.5      up       1
>>>> 6      2               osd.6      up       1
>>>> 7      2               osd.7      up       1
>>>> 8      2               osd.8      up       1
>>>> 9      2               osd.9      up       1
>>>> 10     2               osd.10     up       1
>>>> -14    21          rack r21
>>>> -11    21            host s203
>>>> 44     2               osd.44     up       1
>>>> 45     2               osd.45     up       1
>>>> 46     2               osd.46     up       1
>>>> 47     2               osd.47     up       1
>>>> 48     2               osd.48     up       1
>>>> 49     2               osd.49     up       0
>>>> 50     2               osd.50     up       0
>>>> 51     2               osd.51     up       0
>>>> 52     1.5             osd.52     up       0
>>>> 53     1.5             osd.53     up       0
>>>> 54     2               osd.54     up       0
>>>>
>>>>
>>>> ceph -w
>>>>
>>>> 2013-04-20 19:46:48.608988 mon.0 [INF] pgmap v1352767: 17280 pgs: 58
>>>> active, 12581 active+clean, 1686 active+remapped+wait_backfill, 24
>>>> active+degraded+wait_backfill, 224
>>>> active+remapped+wait_backfill+backfill_toofull, 1061
>>>> active+recovery_wait, 4
>>>> active+degraded+wait_backfill+backfill_toofull, 629 peering, 626
>>>> active+remapped, 72 active+remapped+backfilling, 89 active+degraded,
>>>> 14 active+remapped+backfill_toofull, 1 active+clean+scrubbing, 8
>>>> active+degraded+remapped+wait_backfill, 20
>>>> active+recovery_wait+remapped, 5
>>>> active+degraded+remapped+wait_backfill+backfill_toofull, 162
>>>> remapped+peering, 1 active+degraded+remapped+backfilling, 2
>>>> active+degraded+remapped+backfill_toofull, 13 active+recovering; 49777
>>>> GB data, 72863 GB used, 40568 GB / 110 TB avail; 2965687/21848501
>>>> degraded (13.574%); recovering 5 o/s, 16363B/s
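With that many backfill_toofull states, it helps to see exactly which OSDs are close to the ratios. A short sketch, assuming the default osd data locations; exact output formats vary by release:

    # lists the specific "osd.N is near full at XX%" warnings
    ceph health detail

    # the osd stats section of a pg dump includes per-OSD used/available space
    ceph pg dump | less

    # plain df on each storage node shows the same thing at the filesystem level
    df -h /var/lib/ceph/osd/*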
>>>>
>>>> 2013/4/20 John Wilkins <john.wilkins@xxxxxxxxxxx>:
>>>> > Marco,
>>>> >
>>>> > If you do a "ceph osd tree", can you see if your OSDs are all up? You
>>>> > seem to have at least one problem related to the backfill OSDs being
>>>> > too full, and some which are near full or full for the purposes of
>>>> > storage. See the following in the documentation to see if this helps:
>>>> >
>>>> > http://ceph.com/docs/master/rados/configuration/mon-config-ref/#storage-capacity
>>>> >
>>>> > http://ceph.com/docs/master/rados/configuration/osd-config-ref/#backfilling
>>>> >
>>>> > http://ceph.com/docs/master/rados/operations/troubleshooting-osd/#no-free-drive-space
>>>> >
>>>> > Before you start deleting data as a remedy, you'd want to at least try
>>>> > to get the OSDs back up and running first.
>>>> >
>>>> > If rebooting changed the drive names, you might look here:
>>>> >
>>>> > http://ceph.com/docs/master/rados/configuration/osd-config-ref/#general-settings
>>>> >
>>>> > We have default settings for the OSD data and journal paths, which you
>>>> > could override if you can locate the data and journal sources on the
>>>> > renamed drives. If you mounted them but didn't add them to the fstab,
>>>> > that might be the source of the problem. I'd rather see you use the
>>>> > default paths, as it would be easier to troubleshoot later. So did you
>>>> > mount the drives, but not add the mount points to fstab?
>>>> >
>>>> > John
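For context, these are the capacity knobs behind the links above (option names as in the linked bobtail-era docs; treat the values and the set_full_ratio command as assumptions to verify on your release):

    # ceph.conf -- cluster-wide thresholds enforced by the monitors
    [mon]
        mon osd full ratio = .95        # writes are blocked above this
        mon osd nearfull ratio = .85    # "near full" warnings start here

    # per-OSD guard that yields backfill_toofull when a backfill target is past it
    [osd]
        osd backfill full ratio = .85

    # the live full ratio can be raised slightly at runtime, which is
    # presumably how Marco got to 97%:
    ceph pg set_full_ratio 0.97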
>>>> >
>>>> >
>>>> > On Sat, Apr 20, 2013 at 8:46 AM, Marco Aroldi <marco.aroldi@xxxxxxxxx>
>>>> > wrote:
>>>> >>
>>>> >> Hi,
>>>> >> due to a hardware failure while expanding Ceph, I'm in big trouble
>>>> >> because CephFS doesn't mount anymore.
>>>> >> I was adding a couple of storage nodes, but a disk failed, and after a
>>>> >> reboot the OS (Ubuntu 12.04) renamed the remaining devices, so the
>>>> >> entire node got screwed up.
>>>> >>
>>>> >> Now, from the "sane" new node, I'm bringing some new OSDs up and in,
>>>> >> because the cluster is near full and I can't completely revert the
>>>> >> situation to how it was before.
>>>> >>
>>>> >> *I can* afford data loss, but I need to regain access to the filesystem.
>>>> >>
>>>> >> My setup:
>>>> >> 3 mon + 3 mds
>>>> >> 4 storage nodes (I was adding nodes 5 and 6)
>>>> >>
>>>> >> Ceph 0.56.4
>>>> >>
>>>> >>
>>>> >> ceph health:
>>>> >> HEALTH_ERR 2008 pgs backfill; 246 pgs backfill_toofull; 74 pgs
>>>> >> backfilling; 134 pgs degraded; 790 pgs peering; 10 pgs recovering;
>>>> >> 1116 pgs recovery_wait; 790 pgs stuck inactive; 4782 pgs stuck
>>>> >> unclean; recovery 3049459/21926624 degraded (13.908%); recovering 6
>>>> >> o/s, 16316KB/s; 4 full osd(s); 30 near full osd(s); full,noup,nodown
>>>> >> flag(s) set
>>>> >>
>>>> >>
>>>> >> ceph mds dump:
>>>> >> dumped mdsmap epoch 44
>>>> >> epoch 44
>>>> >> flags 0
>>>> >> created 2013-03-18 14:42:29.330548
>>>> >> modified 2013-04-20 17:14:32.969332
>>>> >> tableserver 0
>>>> >> root 0
>>>> >> session_timeout 60
>>>> >> session_autoclose 300
>>>> >> last_failure 43
>>>> >> last_failure_osd_epoch 18160
>>>> >> compat compat={},rocompat={},incompat={1=base v0.20,2=client
>>>> >> writeable ranges,3=default file layouts on dirs,4=dir inode in
>>>> >> separate object}
>>>> >> max_mds 1
>>>> >> in 0
>>>> >> up {0=6376}
>>>> >> failed
>>>> >> stopped
>>>> >> data_pools [0]
>>>> >> metadata_pool 1
>>>> >> 6376: 192.168.21.11:6800/13457 'm1' mds.0.9 up:replay seq 1
>>>> >> 5945: 192.168.21.13:6800/12999 'm3' mds.-1.0 up:standby seq 1
>>>> >> 5963: 192.168.21.12:6800/22454 'm2' mds.-1.0 up:standby seq 1
>>>> >>
>>>> >>
>>>> >> ceph mon dump:
>>>> >> epoch 1
>>>> >> fsid d634f7b3-8a8a-4893-bdfb-a95ccca7fddd
>>>> >> last_changed 2013-03-18 14:39:42.253923
>>>> >> created 2013-03-18 14:39:42.253923
>>>> >> 0: 192.168.21.11:6789/0 mon.m1
>>>> >> 1: 192.168.21.12:6789/0 mon.m2
>>>> >> 2: 192.168.21.13:6789/0 mon.m3
>>>> >
>>>> >
>>>> > --
>>>> > John Wilkins
>>>> > Senior Technical Writer
>>>> > Inktank
>>>> > john.wilkins@xxxxxxxxxxx
>>>> > (415) 425-9599
>>>> > http://inktank.com
>>>
>>>
>>> --
>>> John Wilkins
>>> Senior Technical Writer
>>> Inktank
>>> john.wilkins@xxxxxxxxxxx
>>> (415) 425-9599
>>> http://inktank.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com