Hi,

Just had an incident in a 3-node test cluster running 12.1.1 on Debian stretch. Each node had its own mon, mgr, radosgw, and OSDs. Just object store. I had s3cmd looping and uploading files via S3.

On one of the machines, the RAID controller barfed and dropped the OS disks. Or the disks failed. TBC. Anyway, / and /var went read-only. The monitor on that machine found it couldn't write its logs and died. But the OSDs stayed up - those disks didn't go read-only.

    health: HEALTH_WARN
            1/3 mons down, quorum store01,store03
    osd: 18 osds: 18 up, 18 in
    rgw: 3 daemons active

The S3 process started timing out on connections to radosgw, even when talking to one of the other two radosgw instances. (I'm round-robinning the DNS records at the moment.)

I stopped the OSDs on that box. No change. I stopped radosgw on that box. Still no change. The S3 upload process was still hanging/timing out. A manual telnet to port 80 on the good nodes still hung. "radosgw-admin bucket list" showed the buckets etc.

Then I restarted radosgw on one of the other two nodes. After about a minute, the looping S3 upload process started working again.

So my questions:

- Why did I have to manually restart radosgw on one of the other nodes? Why didn't it either keep working, or e.g. start working when radosgw was stopped on the bad node?

- Also, where are the radosgw server/access logs?

I know it's probably an unusual edge case or something, but we're aiming for HA and redundancy.

Thanks!

Sean Purdy
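
P.S. For reference, the upload loop is roughly the sketch below - bucket and file names here are just placeholders, not the real ones:

    # loop forever, uploading a small file once a second; log any failures
    while true; do
        s3cmd put testfile "s3://test-bucket/obj-$(date +%s)" \
            || echo "upload failed at $(date)" >> upload-failures.log
        sleep 1
    done

(The $(date +%s) suffix just keeps the object names unique between iterations.)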