RGWObjectExpirer crashing after upgrade from 14.2.0 to 14.2.3

Hi,

I recently upgraded our test cluster from 14.2.0 to 14.2.3 and am no longer able to start radosgw. The cluster itself (mon, osd, mgr) appears fine.

I'm not much of an expert at reading these, but from the errors being thrown it seems like the object expirer is choking on handling resharded buckets. There have been no recent reshard operations on this cluster, and dynamic resharding is disabled. I thought this could've been related to https://github.com/ceph/ceph/pull/27817, but that landed in v14.2.3...
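
In case it helps, this is roughly how I plan to poke at the expirer hint shards and the reshard state; the pool name below is a guess based on our default zone layout, so adjust for yours:

[root@cephproxy01 ~]# rados -p default.rgw.log ls | grep '^obj_delete_at_hint' | head
[root@cephproxy01 ~]# rados -p default.rgw.log listomapkeys obj_delete_at_hint.0000000001
[root@cephproxy01 ~]# radosgw-admin reshard list
[root@cephproxy01 ~]# radosgw-admin reshard stale-instances list

(The listomapkeys call should dump the entries of the shard the expirer was processing when it died, and the stale-instances subcommand should list bucket instances left behind by old reshard runs.)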

Logs from starting up radosgw:

   -26> 2019-09-17 16:18:45.719 7f2d93da2780  0 starting handler: civetweb
   -25> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: allow_unicode_in_urls: yes
   -24> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: canonicalize_url_path: no
   -23> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: decode_url: no
   -22> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: enable_auth_domain_check: no
   -21> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: enable_keep_alive: yes
   -20> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: listening_ports: 7480,7481s
   -19> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: num_threads: 512
   -18> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: run_as_user: ceph
   -17> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: ssl_certificate: '/etc/ceph/rgw.pem'
   -16> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: validate_http_method: no
   -15> 2019-09-17 16:18:45.720 7f2d93da2780  0 civetweb: 0x55d622628600: ssl_use_pem_file: cannot open certificate file '/etc/ceph/rgw.pem': error:02001002:system library:fopen:No such file or directory
   -14> 2019-09-17 16:18:45.720 7f2d93da2780 -1 ERROR: failed run
   -13> 2019-09-17 16:18:45.721 7f2d5c97b700  5 lifecycle: schedule life cycle next start time: Wed Sep 18 04:00:00 2019
   -12> 2019-09-17 16:18:45.721 7f2d5f180700 20 reqs_thread_entry: start
   -11> 2019-09-17 16:18:45.721 7f2d5e97f700 20 cr:s=0x55d625c94360:op=0x55d625bcd800:20MetaMasterTrimPollCR: operate()
   -10> 2019-09-17 16:18:45.721 7f2d5e97f700 20 run: stack=0x55d625c94360 is io blocked
    -9> 2019-09-17 16:18:45.721 7f2d5e97f700 20 cr:s=0x55d625c94480:op=0x55d625a68c00:17DataLogTrimPollCR: operate()
    -8> 2019-09-17 16:18:45.721 7f2d5e97f700 20 run: stack=0x55d625c94480 is io blocked
    -7> 2019-09-17 16:18:45.721 7f2d5e97f700 20 cr:s=0x55d625c945a0:op=0x55d625a69200:16BucketTrimPollCR: operate()
    -6> 2019-09-17 16:18:45.721 7f2d5e97f700 20 run: stack=0x55d625c945a0 is io blocked
    -5> 2019-09-17 16:18:45.721 7f2d5c17a700 20 BucketsSyncThread: start
    -4> 2019-09-17 16:18:45.721 7f2d5b979700 20 UserSyncThread: start
    -3> 2019-09-17 16:18:45.721 7f2d5b178700 20 process_all_logshards Resharding is disabled
    -2> 2019-09-17 16:18:45.721 7f2d5d97d700 20 reqs_thread_entry: start
    -1> 2019-09-17 16:18:45.724 7f2d731a8700 20 processing shard = obj_delete_at_hint.0000000001
     0> 2019-09-17 16:18:45.726 7f2d731a8700 -1 *** Caught signal (Aborted) ** 
 in thread 7f2d731a8700 thread_name:rgw_obj_expirer 

 ceph version 14.2.3 (0f776cf838a1ae3130b2b73dc26be9c95c6ccc39) nautilus (stable) 
 1: (()+0xf630) [0x7f2d86ff7630] 
 2: (gsignal()+0x37) [0x7f2d86431377] 
 3: (abort()+0x148) [0x7f2d86432a68] 
 4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f2d86d417d5] 
 5: (()+0x5e746) [0x7f2d86d3f746] 
 6: (()+0x5e773) [0x7f2d86d3f773] 
 7: (()+0x5e993) [0x7f2d86d3f993] 
 8: (()+0x1772b) [0x7f2d92efb72b] 
 9: (tcmalloc::allocate_full_cpp_throw_oom(unsigned long)+0xf3) [0x7f2d92f19a03] 
 10: (()+0x70a8a2) [0x55d6222508a2] 
 11: (()+0x70a8e8) [0x55d6222508e8] 
 12: (RGWObjectExpirer::process_single_shard(std::string const&, utime_t const&, utime_t const&)+0x115) [0x55d622253155]
 13: (RGWObjectExpirer::inspect_all_shards(utime_t const&, utime_t const&)+0xab) [0x55d62225382b]
 14: (RGWObjectExpirer::OEWorker::entry()+0x273) [0x55d622253c43] 
 15: (()+0x7ea5) [0x7f2d86fefea5] 
 16: (clone()+0x6d) [0x7f2d864f98cd] 
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
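
If it would help, I can install the debuginfo and produce the annotated disassembly the NOTE asks for, roughly like this (package and binary names per the CentOS packaging, which is what we run):

[root@cephproxy01 ~]# yum install -y yum-utils
[root@cephproxy01 ~]# debuginfo-install -y ceph-radosgw
[root@cephproxy01 ~]# objdump -rdS /usr/bin/radosgw > radosgw.objdump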


As another data point, getting bucket stats now fails for two buckets in the cluster:

[root@cephproxy01 ~]# radosgw-admin bucket stats --bucket=wHjAk0t
failure: (2) No such file or directory:
2019-09-19 14:21:59.483 7f54ed3fd6c0 -1 ERROR: get_bucket_instance_from_oid failed: -2

[root@cephproxy01 ~]# radosgw-admin bucket stats --bucket=bzUi3MT
failure: (2) No such file or directory:
2019-09-19 14:22:16.324 7fbd172666c0 -1 ERROR: get_bucket_instance_from_oid failed: -2
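
I haven't dug into the metadata for those two buckets yet, but I'd guess something like the following would show whether the entrypoints still point at bucket instances that no longer exist:

[root@cephproxy01 ~]# radosgw-admin metadata get bucket:wHjAk0t
[root@cephproxy01 ~]# radosgw-admin metadata list bucket.instance | grep wHjAk0t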


Has anyone seen this before? Googling didn't turn up much. Let me know if I can provide any more useful debugging information.
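
For example, I could re-run radosgw in the foreground with verbose logging; the client name below is just our local instance name:

[root@cephproxy01 ~]# radosgw -f --name client.rgw.cephproxy01 --debug-rgw=20 --debug-ms=1 2>&1 | tee rgw-debug.log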

Thanks,
Liam
---
University of Maryland
Institute for Advanced Computer Studies
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



