Re: Is ceph production ready? [was: Ceph PG Incomplete = Cluster unusable]

On 2015-01-10 03:21, Gregory Farnum wrote:
On Fri, Jan 9, 2015 at 2:00 AM, Nico Schottelius
<nico-ceph-users@xxxxxxxxxxxxxxx> wrote:
Lionel, Christian,

we do have exactly the same trouble as Christian,
namely

Christian Eichelmann [Fri, Jan 09, 2015 at 10:43:20AM +0100]:
We still don't know what caused this specific error...

and

...there is currently no way to make ceph forget about the data of this pg and create it as an empty one. So the only way
to make this pool usable again is to lose all the data in there.

I wonder: what is the position of the ceph developers regarding
dropping (emptying) specific pgs?
Is that a use case that was never thought of or tested?

I've never worked directly on any of the clusters this has happened to,
but I believe every time we've seen issues like this with somebody we
have a relationship with it's either:
1) been resolved by using the existing tools to mark stuff lost, or
2) been the result of local filesystems/disks silently losing data due
to some fault or other.

The second case means the OSDs have corrupted state and trusting them
is tricky. Also, most people we've had relationships with that this
has happened to really want to not lose all the data in the PG, which
necessitates manually mucking around anyway. ;)

Mailing list issues are obviously a lot harder to categorize, but the
ones we've taken time on where people say the commands don't work have
generally fallen into the second bucket.
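
For reference, the "existing tools" I mean are, I believe, the unfound-object commands; a rough sketch only, with <pgid> standing in for the affected placement group and assuming the cluster is actually reporting unfound objects rather than just an incomplete PG:

# see what the cluster thinks is missing for this PG
ceph health detail
ceph pg <pgid> query
ceph pg <pgid> list_missing

# give up on the unfound objects, reverting to prior versions where possible
ceph pg <pgid> mark_unfound_lost revert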


If you want to experiment, I think all the manual mucking around has
been done with the objectstore tool and removing bad PGs, moving them
around, or faking journal entries, but I've not done it myself so I
could be mistaken.
-Greg


Hi Gregory,

I am facing the same problem (an incomplete pg, and force_create_pg doesn't help), and after searching the whole internet this thread is all I could find.
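
What I tried there was roughly the following (12.bb1 being the affected pg), with no visible effect:

# ask the monitors to recreate the pg as an empty one -- did not help in my case
ceph pg force_create_pg 12.bb1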

I'm trying this manual mucking around ...

If the pgid is 12.bb1,

* service ceph stop osd.xx

* ceph-objectstore-tool --op export --pgid 12.bb1 --data-path /var/lib/ceph/osd/ceph-xx/ --journal-path /var/lib/ceph/osd/ceph-xx/journal --file 12.bb1.export

The directory structure 12.bb1_head/__head_00000BB1__c is zero size (it was created by hand), but the exported file '12.bb1.export' contains some data, maybe recovered from the OSD journal.
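
To double-check what the OSD really holds for this pg (rather than guessing about the journal), I believe the read-only ops of the same tool can be used, along these lines:

# pg metadata (pg_info) as stored on this OSD
ceph-objectstore-tool --op info --pgid 12.bb1 --data-path /var/lib/ceph/osd/ceph-xx/ --journal-path /var/lib/ceph/osd/ceph-xx/journal

# list the objects this OSD has for the pg
ceph-objectstore-tool --op list --pgid 12.bb1 --data-path /var/lib/ceph/osd/ceph-xx/ --journal-path /var/lib/ceph/osd/ceph-xx/journal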

* ceph-objectstore-tool --op import --data-path /var/lib/ceph/osd/ceph-xx/ --journal-path /var/lib/ceph/osd/ceph-xx/journal --file 12.bb1.export

This succeeded.
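
For completeness: the companion op that drops a pg copy from an OSD entirely, which I have not tried here, appears to be --op remove:

# destructive: removes the pg and its objects from this OSD
ceph-objectstore-tool --op remove --pgid 12.bb1 --data-path /var/lib/ceph/osd/ceph-xx/ --journal-path /var/lib/ceph/osd/ceph-xx/journal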

* service ceph start osd.xx  -- failed

traceback:

-8> 2015-05-17 01:49:16.658789 7f528608a880 10 osd.6 17288 build_past_intervals_parallel epoch 11356
-7> 2015-05-17 01:49:16.658902 7f528608a880 10 osd.6 0 add_map_bl 11356 65577 bytes
-6> 2015-05-17 01:49:16.659697 7f528608a880 10 osd.6 17288 build_past_intervals_parallel epoch 11356 pg 12.bb1 generate_past_intervals interval(11355-11355 up [11,13,5](6) acting [5](6)): not rw, up_thru 11331 up_from 3916 last_epoch_clean 11276 generate_past_intervals interval(11355-11355 up [11,13,5](6) acting [5](6)) : primary up 3916-11331 does not include interval
-5> 2015-05-17 01:49:16.659705 7f528608a880 10 osd.6 17288 build_past_intervals_parallel epoch 11357
-4> 2015-05-17 01:49:16.659852 7f528608a880 10 osd.6 0 add_map_bl 11357 70389 bytes
-3> 2015-05-17 01:49:16.660622 7f528608a880 10 osd.6 17288 build_past_intervals_parallel epoch 11357 pg 12.bb1 generate_past_intervals interval(11356-11356 up [11,13,5](6) acting [5](6)): not rw, up_thru 11331 up_from 3916 last_epoch_clean 11276 generate_past_intervals interval(11356-11356 up [11,13,5](6) acting [5](6)) : primary up 3916-11331 does not include interval
-2> 2015-05-17 01:49:16.660630 7f528608a880 10 osd.6 17288 build_past_intervals_parallel epoch 11358
-1> 2015-05-17 01:49:16.660751 7f528608a880 10 osd.6 0 add_map_bl 11358 70389 bytes
0> 2015-05-17 01:49:16.663571 7f528608a880 -1 osd/OSDMap.h: In function 'const epoch_t& OSDMap::get_up_from(int) const' thread 7f528608a880 time 2015-05-17 01:49:16.661507
osd/OSDMap.h: 502: FAILED assert(exists(osd))

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc51f5]
 2: /usr/bin/ceph-osd() [0x63d66c]
3: (pg_interval_t::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::tr1::shared_ptr<OSDMap const>, std::tr1::shared_ptr<OSDMap const>, pg_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > >*, std::ostream*)+0x605) [0x797745]
 4: (OSD::build_past_intervals_parallel()+0x987) [0x69fb37]
 5: (OSD::load_pgs()+0x19cf) [0x6b767f]
 6: (OSD::init()+0x729) [0x6b8b99]
 7: (main()+0x27f3) [0x643b63]
 8: (__libc_start_main()+0xf5) [0x7f5283433af5]
 9: /usr/bin/ceph-osd() [0x65cdc9]
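
If I read the log right, the assert fires while replaying old osdmap epochs for interval(11355-11355 up [11,13,5] acting [5]), i.e. one of those OSD ids apparently does not exist in the map for that epoch. Assuming the monitors still have those epochs, something like this should show what the map looked like then:

# dump a historic osdmap epoch named in the log
ceph osd dump 11355

# or fetch the binary map and inspect it offline
ceph osd getmap 11355 -o /tmp/osdmap.11355
osdmaptool --print /tmp/osdmap.11355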

Could you please tell me how to bypass the check_new_interval function?

Thanks!

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




