I didn't know there was a replacement for the radosgw role! I saw mention of a radosgw load balancer in the ceph-ansible project, but since I use haproxy, I didn't dig into that. Is that what you are referring to? Otherwise, I can't seem to find any mention of civetweb being replaced.

For the issue below, I guess the dev was using a single-threaded process that was out of control. They have done it a few times now, and it kills all four gateways. I asked them to stop and so far no repeats. For deletes, they should be using the bucket item aging anyway.
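(For reference, this is the kind of lifecycle rule I mean. It's only a rough sketch against the RGW S3 API using boto3; the endpoint, credentials, bucket name, and 30-day window are placeholders, not our actual setup:

    import boto3

    # Placeholder endpoint and credentials; point this at one of the gateways.
    s3 = boto3.client(
        "s3",
        endpoint_url="http://rgw.example.local:8080",
        aws_access_key_id="PLACEHOLDER_ACCESS_KEY",
        aws_secret_access_key="PLACEHOLDER_SECRET_KEY",
    )

    # Let RGW expire old objects itself instead of a client mass-deleting them;
    # lifecycle processing then runs inside the gateways on their own schedule.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-bucket",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "expire-old-objects",
                    "Filter": {"Prefix": ""},
                    "Status": "Enabled",
                    "Expiration": {"Days": 30},
                }
            ]
        },
    )

That keeps the delete load off a single runaway client process.)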
-Brent

-----Original Message-----
From: Eugen Block <eblock@xxxxxx>
Sent: Friday, October 23, 2020 7:00 AM
To: ceph-users@xxxxxxx
Subject: Re: Rados Crashing

Hi,

I read that civetweb and radosgw have a locking issue in combination with ssl [1], just a thought based on

> failed to acquire lock on obj_delete_at_hint.0000000079

Since Nautilus the default rgw frontend is beast, have you thought about switching?

Regards,
Eugen

[1] https://tracker.ceph.com/issues/22951


Zitat von Brent Kennedy <bkennedy@xxxxxxxxxx>:

> We are performing file maintenance (deletes, essentially), and when the
> process gets to a certain point, all four rados gateways crash with the
> following:
>
> Log output:
>
>    -5> 2020-10-20 06:09:53.996 7f15f1543700 2 req 7 0.000s s3:delete_obj verifying op params
>    -4> 2020-10-20 06:09:53.996 7f15f1543700 2 req 7 0.000s s3:delete_obj pre-executing
>    -3> 2020-10-20 06:09:53.996 7f15f1543700 2 req 7 0.000s s3:delete_obj executing
>    -2> 2020-10-20 06:09:53.997 7f161758f700 10 monclient: get_auth_request con 0x55d2c02ff800 auth_method 0
>    -1> 2020-10-20 06:09:54.009 7f1609d74700 5 process_single_shard(): failed to acquire lock on obj_delete_at_hint.0000000079
>     0> 2020-10-20 06:09:54.035 7f15f1543700 -1 *** Caught signal (Segmentation fault) **
> in thread 7f15f1543700 thread_name:civetweb-worker
>
> ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable)
> 1: (()+0xf5d0) [0x7f161d3405d0]
> 2: (()+0x2bec80) [0x55d2bcd1fc80]
> 3: (std::string::assign(std::string const&)+0x2e) [0x55d2bcd2870e]
> 4: (rgw_bucket::operator=(rgw_bucket const&)+0x11) [0x55d2bce3e551]
> 5: (RGWObjManifest::obj_iterator::update_location()+0x184) [0x55d2bced7114]
> 6: (RGWObjManifest::obj_iterator::operator++()+0x263) [0x55d2bd092793]
> 7: (RGWRados::update_gc_chain(rgw_obj&, RGWObjManifest&, cls_rgw_obj_chain*)+0x51a) [0x55d2bd0939ea]
> 8: (RGWRados::Object::complete_atomic_modification()+0x83) [0x55d2bd093c63]
> 9: (RGWRados::Object::Delete::delete_obj()+0x74d) [0x55d2bd0a87ad]
> 10: (RGWDeleteObj::execute()+0x915) [0x55d2bd04b6d5]
> 11: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, bool)+0x915) [0x55d2bcdfbb35]
> 12: (process_request(RGWRados*, RGWREST*, RGWRequest*, std::string const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, optional_yield, rgw::dmclock::Scheduler*, int*)+0x1cd8) [0x55d2bcdfdea8]
> 13: (RGWCivetWebFrontend::process(mg_connection*)+0x38e) [0x55d2bcd41a1e]
> 14: (()+0x36bace) [0x55d2bcdccace]
> 15: (()+0x36d76f) [0x55d2bcdce76f]
> 16: (()+0x36dc18) [0x55d2bcdcec18]
> 17: (()+0x7dd5) [0x7f161d338dd5]
> 18: (clone()+0x6d) [0x7f161c84302d]
>
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> My guess is that we need to add more resources to the gateways? They have 2 CPUs and 12GB of memory running as virtual machines on CentOS 7.6. Any thoughts?
>
> -Brent
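P.S. If we do end up testing the beast frontend you mentioned, my understanding is that it comes down to the rgw_frontends line in ceph.conf on each gateway, roughly like this (the instance name and port below are placeholders; we terminate SSL on haproxy, so I've left out the ssl options):

    [client.rgw.gateway1]
    # hypothetical instance name; ours would match the actual rgw daemon names
    rgw_frontends = beast port=8080

followed by a restart of the radosgw service on that node.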