pg becomes creating+peering state after reweight with master branch code

Hi all,

I am backporting the async recovery feature to Hammer :(   --- this is
a sad story, but we can't upgrade to Luminous or any newer release in
our production environments.
The teuthology tests caught an OSD crash after I backported
https://github.com/ceph/ceph/pull/19811
but I have reproduced the same problem with master branch code; the
details are described below:

code version: master branch with latest commit
656c8e8049c2c1acd363143c842c2edf1fe09b64

config of vstart:
[client.vstart.sh]
num mon = 1
num osd = 5
num mds = 0
num mgr = 1
num rgw = 0

config of osd/mon/mgr:
[osd]
osd_async_recovery_min_cost = 0  # the only non-default option; everything
else is the vstart default

vstart command:
MON=1 OSD=5 MDS=0 MGR=1 RGW=0 ../src/vstart.sh --debug --new -X
--localhost --bluestore --without-dashboard
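
Once the cluster is up, the non-default option can be double-checked on a
running OSD; a quick sketch (run from the build directory, osd.2 is just an
example):

./bin/ceph daemon osd.2 config get osd_async_recovery_min_cost
./bin/ceph config set osd osd_async_recovery_min_cost 0   # in case the conf was not picked up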

pool info:
pool 1 'rbd' replicated size 3 min_size 1 crush_rule 0 object_hash
rjenkins pg_num 256 pgp_num 256 autoscale_mode off last_change 31
flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd

osd log:
2019-12-02T09:32:28.779+0000 7f8ff8c82700  5 osd.2 pg_epoch: 783
pg[1.c2( v 657'1488 (0'0,657'1488] local-lis/les=0/0 n=13 ec=25/25
lis/c=0/449 les/c/f=0/450/0 sis=783) [2,1,0] r=0 lpr=783
pi=[449,783)/1 crt=657'1488 mlcod 0'0 unknown mbc={}] enter
Started/Primary
2019-12-02T09:32:28.779+0000 7f8ff8c82700  5 osd.2 pg_epoch: 783
pg[1.c2( v 657'1488 (0'0,657'1488] local-lis/les=0/0 n=13 ec=25/25
lis/c=0/449 les/c/f=0/450/0 sis=783) [2,1,0] r=0 lpr=783
pi=[449,783)/1 crt=657'1488 mlcod 0'0 creating mbc={}] enter
Started/Primary/Peering         ### pg state is creating
2019-12-02T09:32:28.779+0000 7f8ff8c82700  5 osd.2 pg_epoch: 783
pg[1.c2( v 657'1488 (0'0,657'1488] local-lis/les=0/0 n=13 ec=25/25
lis/c=0/449 les/c/f=0/450/0 sis=783) [2,1,0] r=0 lpr=783
pi=[449,783)/1 crt=657'1488 mlcod 0'0 creating+peering mbc={}] enter
Started/Primary/Peering/GetInfo             ### pg state is
creating+peering


Reproduce steps:
1. create a pool and wait for active+clean
2. keep writing to an rbd image with fio or another tool during the steps below
3. ceph osd reweight 2 0.1
4. wait several minutes and make sure the PGs have been moved to other
OSDs (grep "on_removal" osd.2.log)
5. ceph osd reweight 2 1
6. wait several minutes and make sure the PGs have been moved back to the
original osd.2 (grep "_make_pg" osd.2.log)
7. find an up_primary PG on osd.2 (such as 1.c2 in my log) that was moved
out and back during steps 3~6; it should enter async recovery after step 6
8. wait for the PG to become active+clean; you will then find in the osd
log that it went through the creating+peering state (a rough shell sketch
of these steps follows).
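
Here is that sketch; the pool/image names, the fio rbd-engine options and
vstart's out/ log directory are my assumptions, not taken verbatim from the
original test:

# steps 1-2: pool + background writes
./bin/ceph osd pool create rbd 256 256
./bin/ceph osd pool set rbd pg_autoscale_mode off
./bin/ceph osd pool application enable rbd rbd
./bin/rbd create rbd/testimg --size 1024
fio --name=bg --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testimg \
    --rw=randwrite --bs=4k --iodepth=8 --time_based --runtime=600 &

# steps 3-4: move data off osd.2
./bin/ceph osd reweight 2 0.1
grep on_removal out/osd.2.log

# steps 5-6: move it back
./bin/ceph osd reweight 2 1
grep _make_pg out/osd.2.log

# steps 7-8: pick an up_primary PG on osd.2 (e.g. 1.c2) and watch its state
./bin/ceph pg ls-by-primary osd.2
./bin/ceph pg ls | grep 1.c2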


The root cause may be:
1. after step 6, osd.2 becomes an async recovery target in pg 1.c2, and the
PG instance on osd.2 is newly created (_make_pg) after the reweight back to 1
2. after pg->init, the local last_epoch_started is 0 while history.les is 450:
2019-12-02T09:21:47.991+0000 7f8ff8c82700 10 osd.2 543 _make_pg 1.c2
2019-12-02T09:21:47.991+0000 7f8ff8c82700 10 osd.2 pg_epoch: 543
pg[1.c2( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c=0/0
les/c/f=0/0/0 sis=0) [] r=-1 lpr=0 crt=0'0 unknown mbc={}] init role 0
up [2,1,0] acting [2,1,0] history ec=25/25 lis/c=449/449
les/c/f=450/450/0 sis=543 pruub=14.367934206s past_intervals
([449,542] all_participants=0,1,3 intervals=([449,542] acting 0,1,3))
2019-12-02T09:21:47.991+0000 7f8ff8c82700 20 osd.2 pg_epoch: 543
pg[1.c2( empty local-lis/les=0/0 n=0 ec=25/25 lis/c=449/449
les/c/f=450/450/0 sis=543 pruub=14.367934206s) [2,1,0] r=0 lpr=0
pi=[449,543)/1 crt=0'0 mlcod 0'0 unknown mbc={}] on_new_interval
2019-12-02T09:21:48.063+0000 7f8ff8c82700 20 osd.2 pg_epoch: 543
pg[1.c2( empty local-lis/les=0/0 n=0 ec=25/25 lis/c=449/449
les/c/f=450/450/0 sis=543) [2,1,0] r=0 lpr=543 pi=[449,543)/1 crt=0'0
mlcod 0'0 peering mbc={}] choose_async_recovery_replicated result
want=[0,1] async_recovery=2
2019-12-02T09:21:48.983+0000 7f8ff8c82700 20 osd.2 pg_epoch: 544
pg[1.c2( empty local-lis/les=0/0 n=0 ec=25/25 lis/c=449/449
les/c/f=450/450/0 sis=543) [2,1,0] r=0 lpr=544 pi=[449,543)/1 crt=0'0
mlcod 0'0 unknown mbc={}] new interval newup [2,1,0] newacting [0,1]
 ## osd.2 is async recovery target in pg 1.c2
acting_primary osd.0 log:
2019-12-02T09:21:49.195+0000 7f2f01060700 20 osd.0 pg_epoch: 544
pg[1.c2( v 542'940 (0'0,542'940] local-lis/les=449/450 n=13 ec=25/25
lis/c=449/449 les/c/f=450/450/0 sis=544) [2,1,0]/[0,1] r=0 lpr=544
pi=[449,544)/1 crt=542'940 lcod 542'939 mlcod 0'0 remapped+peering
mbc={}] choose_async_recovery_replicated result want=[0,1]
async_recovery=2
3. when a repop for pg 1.c2 arrives at osd.2, the append_log function finds
local les (=0) != history.les (=450) and uses the local les as the new
history.les, so history.les becomes 0:

void PeeringState::append_log(
  const vector<pg_log_entry_t>& logv,
  eversion_t trim_to,
  eversion_t roll_forward_to,
  ObjectStore::Transaction &t,
  bool transaction_applied,
  bool async)
{
  /* The primary has sent an info updating the history, but it may not
   * have arrived yet.  We want to make sure that we cannot remember this
   * write without remembering that it happened in an interval which went
   * active in epoch history.last_epoch_started.
   */
  if (info.last_epoch_started != info.history.last_epoch_started) {
    info.history.last_epoch_started = info.last_epoch_started;
  }
  ...
}

4. when the async recovery of osd.2 in pg 1.c2 is over, osd.2 becomes the
acting_primary again, and because history.les is now 0 the PG state is set
to PG_STATE_CREATING:

PeeringState::Primary::Primary(my_context ctx)
  : my_base(ctx),
    NamedState(context< PeeringMachine >().state_history, "Started/Primary")
{
  context< PeeringMachine >().log_enter(state_name);
  DECLARE_LOCALS;
  ceph_assert(ps->want_acting.empty());

  // set CREATING bit until we have peered for the first time.
  if (ps->info.history.last_epoch_started == 0) {
    ps->state_set(PG_STATE_CREATING);
  ...
}
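
This can also be watched from the CLI while osd.2 re-peers after async
recovery; 1.c2 and the grep pattern are taken from my log, the rest is just
a sketch:

./bin/ceph pg 1.c2 query | grep -E '"state"|last_epoch_started'
grep 'enter Started/Primary/Peering' out/osd.2.log | tail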


So my question is: is this PG_STATE_CREATING state after async recovery
expected or not?
If it is, I guess this creating state may result in an OSD crash if the
acting_primary changes while the PG is in the creating+peering state. The
process may be:
1. osd.2 reports its pg stats to the mon
2. the mon adds this pg to creating_pgs/creating_pgs_by_osd_epoch:
void PGMap::stat_pg_add(const pg_t &pgid, const pg_stat_t &s,
                        bool sameosds)
{
  auto pool = pgid.pool();
  pg_sum.add(s);

  num_pg++;
  num_pg_by_state[s.state]++;
  num_pg_by_pool_state[pgid.pool()][s.state]++;
  num_pg_by_pool[pool]++;

  if ((s.state & PG_STATE_CREATING) &&
      s.parent_split_bits == 0) {
    creating_pgs.insert(pgid);
    if (s.acting_primary >= 0) {
      creating_pgs_by_osd_epoch[s.acting_primary][s.mapping_epoch].insert(pgid);
    }
  }
  ...
}
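
If that happens, the PG should also show up as creating in the cluster-wide
pg dump; a rough way to check:

./bin/ceph pg dump pgs_brief | grep creating
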
3. when the acting_primary changes to a new one, the new acting_primary
receives a MOSDPGCreate/MOSDPGCreate2 msg with a very old pg created epoch
(the real created epoch is 5)
4. the new acting_primary looks up the osdmap for that created epoch; since
that map was trimmed long ago, the OSD crashes at:
void OSD::build_initial_pg_history(
  spg_t pgid,
  epoch_t created,
  utime_t created_stamp,
  pg_history_t *h,
  PastIntervals *pi)
{
  dout(10) << __func__ << " " << pgid << " created " << created << dendl;
  *h = pg_history_t(created, created_stamp);

  OSDMapRef lastmap = service.get_map(created);
  ...
}
  OSDMapRef get_map(epoch_t e) {
    OSDMapRef ret(try_get_map(e));
    ceph_assert(ret);                 // crash here because map is trimmed
    return ret;
  }
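
Whether the created epoch in such a message has already been trimmed can be
checked against the OSD's retained map range, e.g. (the osd id is just an
example):

./bin/ceph daemon osd.3 status      # prints oldest_map / newest_map
./bin/ceph report | grep -E 'osdmap_first_committed|osdmap_last_committed'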

I found a possibly related issue: https://tracker.ceph.com/issues/14592
"osd crashes when handling a stale pg-create message (hammer)",
but I'm not sure whether it has the same root cause.

Thanks for your attention; feel free to ask me for more details.