Re: 4 incomplete PGs causing RGW to go offline?


 



Rgw.buckets (which is where the data is being sent). I am just surprised that a few incomplete PGs would grind three gateways to a halt. Granted, the incomplete PGs are part of a large hardware failure situation we had, and having a min_size setting of 1 didn’t help. We are not completely innocent, but I would hope that the system as a whole would work together to skip those incomplete PGs. Fixing them doesn’t appear to be an easy task at this point, which is why we haven’t fixed them yet (I wish that were easier, but I understand the counterargument).

 

-Brent

 

From: David Turner [mailto:drakonstein@xxxxxxxxx]
Sent: Thursday, January 11, 2018 8:22 PM
To: Brent Kennedy <bkennedy@xxxxxxxxxx>
Cc: Ceph Users <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: 4 incomplete PGs causing RGW to go offline?

 

Which pools are the incomplete PGs a part of? I would say it's very likely that the daemons wouldn't be happy if some of the RGW metadata were in incomplete PGs.
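
A PG ID has the form <pool_id>.<pg_hex>, so the pool is the number before the dot. Below is a minimal Python sketch of grouping PGs by pool. The PG IDs and the pool ID to name mapping are made-up placeholders, not values from this cluster. On a real cluster they would come from something like `ceph pg ls incomplete` and `ceph osd lspools`. group_pgs_by_pool is a hypothetical helper name.

```python
from collections import defaultdict


def group_pgs_by_pool(pg_ids, pool_names):
    """Group PG IDs of the form '<pool_id>.<pg_hex>' by pool name."""
    grouped = defaultdict(list)
    for pg_id in pg_ids:
        # The pool ID is the decimal number before the first dot.
        pool_id = int(pg_id.split(".", 1)[0])
        name = pool_names.get(pool_id, "unknown pool %d" % pool_id)
        grouped[name].append(pg_id)
    return dict(grouped)


# Hypothetical example values (assumptions, not taken from the thread).
pool_names = {11: ".rgw.buckets", 12: ".rgw.buckets.index"}
incomplete_pgs = ["11.3f", "11.a2", "12.7", "11.1c"]

result = group_pgs_by_pool(incomplete_pgs, pool_names)
for pool, pgs in sorted(result.items()):
    print(pool, pgs)
```

If any of the incomplete PGs land in an RGW metadata or index pool rather than only in the data pool, that would be the first place to look.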

 

On Thu, Jan 11, 2018, 6:17 PM Brent Kennedy <bkennedy@xxxxxxxxxx> wrote:

We have 3 RadosGW servers running behind HAProxy to enable clients to connect to the Ceph cluster like an Amazon bucket. After all the failures and upgrade issues were resolved, I cannot get the RadosGW servers to stay online. They were upgraded to Luminous, and I even upgraded the OS on them to Ubuntu 16 (before upgrading to Luminous). They used to have Apache on them, as they ran Hammer and before that Firefly. I removed Apache before upgrading to Luminous. They start up and run for about 4-6 hours before all three start to go offline. Client traffic is light right now, as we are just testing file read/write before we reactivate them (the clients switched back to Amazon while we fix them).

 

Could the 4 incomplete PGs be causing them to go offline? The last time I saw an issue like this was when recovery wasn’t working 100%, so it seems related, since they haven’t been stable since we upgraded (but that was also after the failures we had, which is why I am not trying to specifically blame the upgrade).

 

When I look at the radosgw log, this is what I see (the first 2 lines show up plenty before this and are health checks by the HAProxy server, the next two are file requests that fail with a 404, I am guessing, and the last one is me restarting the service):

 

2018-01-11 20:14:36.640577 7f5826aa3700  1 ====== req done req=0x7f5826a9d1f0 op status=0 http_status=200 ======

2018-01-11 20:14:36.640602 7f5826aa3700  1 civetweb: 0x56202c567000: 192.168.120.21 - - [11/Jan/2018:20:14:36 +0000] "HEAD / HTTP/1.0" 1 0 - -

2018-01-11 20:14:36.640835 7f5816282700  1 ====== req done req=0x7f581627c1f0 op status=0 http_status=200 ======

2018-01-11 20:14:36.640859 7f5816282700  1 civetweb: 0x56202c610000: 192.168.120.22 - - [11/Jan/2018:20:14:36 +0000] "HEAD / HTTP/1.0" 1 0 - -

2018-01-11 20:14:36.761917 7f5835ac1700  1 ====== starting new request req=0x7f5835abb1f0 =====

2018-01-11 20:14:36.763936 7f5835ac1700  1 ====== req done req=0x7f5835abb1f0 op status=0 http_status=404 ======

2018-01-11 20:14:36.763983 7f5835ac1700  1 civetweb: 0x56202c4ce000: 192.168.120.21 - - [11/Jan/2018:20:14:36 +0000] "HEAD /Jobimages/vendor05/10/3962896/3962896_cover.pdf HTTP/1.1" 1 0 - aws-sdk-dotnet-35/2.0.2.2 .NET Runtime/4.0 .NET Framework/4.0 OS/6.2.9200.0 FileIO

2018-01-11 20:14:36.772611 7f5808266700  1 ====== starting new request req=0x7f58082601f0 =====

2018-01-11 20:14:36.773733 7f5808266700  1 ====== req done req=0x7f58082601f0 op status=0 http_status=404 ======

2018-01-11 20:14:36.773769 7f5808266700  1 civetweb: 0x56202c6aa000: 192.168.120.21 - - [11/Jan/2018:20:14:36 +0000] "HEAD /Jobimages/vendor05/10/3962896/3962896_cover.pdf HTTP/1.1" 1 0 - aws-sdk-dotnet-35/2.0.2.2 .NET Runtime/4.0 .NET Framework/4.0 OS/6.2.9200.0 FileIO

2018-01-11 20:14:38.163617 7f5836ac3700  1 ====== starting new request req=0x7f5836abd1f0 =====

2018-01-11 20:14:38.165352 7f5836ac3700  1 ====== req done req=0x7f5836abd1f0 op status=0 http_status=404 ======

2018-01-11 20:14:38.165401 7f5836ac3700  1 civetweb: 0x56202c4e2000: 192.168.120.21 - - [11/Jan/2018:20:14:38 +0000] "HEAD /Jobimages/vendor05/10/3445645/3445645_cover.pdf HTTP/1.1" 1 0 - aws-sdk-dotnet-35/2.0.2.2 .NET Runtime/4.0 .NET Framework/4.0 OS/6.2.9200.0 FileIO

2018-01-11 20:14:38.170551 7f5807a65700  1 ====== starting new request req=0x7f5807a5f1f0 =====

2018-01-11 20:14:40.322236 7f58352c0700  1 ====== starting new request req=0x7f58352ba1f0 =====

2018-01-11 20:14:40.323468 7f5834abf700  1 ====== starting new request req=0x7f5834ab91f0 =====

2018-01-11 20:14:41.643365 7f58342be700  1 ====== starting new request req=0x7f58342b81f0 =====

2018-01-11 20:14:41.643358 7f58312b8700  1 ====== starting new request req=0x7f58312b21f0 =====

2018-01-11 20:14:50.324196 7f5829aa9700  1 ====== starting new request req=0x7f5829aa31f0 =====

2018-01-11 20:14:50.325622 7f58332bc700  1 ====== starting new request req=0x7f58332b61f0 =====

2018-01-11 20:14:51.645678 7f58362c2700  1 ====== starting new request req=0x7f58362bc1f0 =====

2018-01-11 20:14:51.645671 7f582e2b2700  1 ====== starting new request req=0x7f582e2ac1f0 =====

2018-01-11 20:15:00.326452 7f5815a81700  1 ====== starting new request req=0x7f5815a7b1f0 =====

2018-01-11 20:15:00.328787 7f5828aa7700  1 ====== starting new request req=0x7f5828aa11f0 =====

2018-01-11 20:15:01.648196 7f580ea73700  1 ====== starting new request req=0x7f580ea6d1f0 =====

2018-01-11 20:15:01.648698 7f5830ab7700  1 ====== starting new request req=0x7f5830ab11f0 =====

2018-01-11 20:15:10.328810 7f5832abb700  1 ====== starting new request req=0x7f5832ab51f0 =====

2018-01-11 20:15:10.329541 7f582f2b4700  1 ====== starting new request req=0x7f582f2ae1f0 =====

2018-01-11 20:15:11.650655 7f582d2b0700  1 ====== starting new request req=0x7f582d2aa1f0 =====

2018-01-11 20:15:11.651401 7f582aaab700  1 ====== starting new request req=0x7f582aaa51f0 =====

2018-01-11 20:15:20.332032 7f582c2ae700  1 ====== starting new request req=0x7f582c2a81f0 =====

2018-01-11 20:15:20.332046 7f582b2ac700  1 ====== starting new request req=0x7f582b2a61f0 =====

2018-01-11 20:15:21.653675 7f582229a700  1 ====== starting new request req=0x7f58222941f0 =====

2018-01-11 20:15:21.655867 7f5821a99700  1 ====== starting new request req=0x7f5821a931f0 =====

2018-01-11 20:15:30.334192 7f580ba6d700  1 ====== starting new request req=0x7f580ba671f0 =====

2018-01-11 20:15:30.334263 7f58252a0700  1 ====== starting new request req=0x7f582529a1f0 =====

2018-01-11 20:15:31.656023 7f582329c700  1 ====== starting new request req=0x7f58232961f0 =====

2018-01-11 20:15:31.658730 7f5825aa1700  1 ====== starting new request req=0x7f5825a9b1f0 =====

2018-01-11 20:15:40.346908 7f5827aa5700  1 ====== starting new request req=0x7f5827a9f1f0 =====

2018-01-11 20:15:40.346968 7f582429e700  1 ====== starting new request req=0x7f58242981f0 =====

2018-01-11 20:15:41.659509 7f5820296700  1 ====== starting new request req=0x7f58202901f0 =====

2018-01-11 20:15:41.661910 7f5806262700  1 ====== starting new request req=0x7f580625c1f0 =====

2018-01-11 20:15:50.339676 7f5820a97700  1 ====== starting new request req=0x7f5820a911f0 =====

2018-01-11 20:15:50.340447 7f5833abd700  1 ====== starting new request req=0x7f5833ab71f0 =====

2018-01-11 20:15:51.661637 7f581b28c700  1 ====== starting new request req=0x7f581b2861f0 =====

2018-01-11 20:15:51.665464 7f5824a9f700  1 ====== starting new request req=0x7f5824a991f0 =====

2018-01-11 20:16:00.342250 7f581fa95700  1 ====== starting new request req=0x7f581fa8f1f0 =====

2018-01-11 20:16:00.342296 7f580aa6b700  1 ====== starting new request req=0x7f580aa651f0 =====

2018-01-11 20:16:01.663620 7f581ea93700  1 ====== starting new request req=0x7f581ea8d1f0 =====

2018-01-11 20:16:01.668467 7f582a2aa700  1 ====== starting new request req=0x7f582a2a41f0 =====

2018-01-11 20:16:10.344220 7f58302b6700  1 ====== starting new request req=0x7f58302b01f0 =====

2018-01-11 20:16:10.345422 7f581ba8d700  1 ====== starting new request req=0x7f581ba871f0 =====

2018-01-11 20:16:11.664968 7f582baad700  1 ====== starting new request req=0x7f582baa71f0 =====

2018-01-11 20:16:11.671974 7f582dab1700  1 ====== starting new request req=0x7f582daab1f0 =====

2018-01-11 20:16:20.345984 7f5810276700  1 ====== starting new request req=0x7f58102701f0 =====

2018-01-11 20:16:20.346372 7f581f294700  1 ====== starting new request req=0x7f581f28e1f0 =====

2018-01-11 20:16:21.667324 7f5819a89700  1 ====== starting new request req=0x7f5819a831f0 =====

2018-01-11 20:16:21.675243 7f5823a9d700  1 ====== starting new request req=0x7f5823a971f0 =====

2018-01-11 20:16:30.347943 7f58292a8700  1 ====== starting new request req=0x7f58292a21f0 =====

2018-01-11 20:16:30.348865 7f581a28a700  1 ====== starting new request req=0x7f581a2841f0 =====

2018-01-11 20:16:31.670269 7f580f274700  1 ====== starting new request req=0x7f580f26e1f0 =====

2018-01-11 20:16:31.678598 7f5818286700  1 ====== starting new request req=0x7f58182801f0 =====

2018-01-11 20:16:40.350418 7f58272a4700  1 ====== starting new request req=0x7f582729e1f0 =====

2018-01-11 20:16:40.351565 7f582eab3700  1 ====== starting new request req=0x7f582eaad1f0 =====

2018-01-11 20:16:41.671624 7f581e292700  1 ====== starting new request req=0x7f581e28c1f0 =====

2018-01-11 20:16:41.682522 7f5819288700  1 ====== starting new request req=0x7f58192821f0 =====

2018-01-11 20:16:50.352821 7f5817a85700  1 ====== starting new request req=0x7f5817a7f1f0 =====

2018-01-11 20:16:50.357997 7f5806a63700  1 ====== starting new request req=0x7f5806a5d1f0 =====

2018-01-11 20:16:51.674867 7f581227a700  1 ====== starting new request req=0x7f58122741f0 =====

2018-01-11 20:16:51.685882 7f5811a79700  1 ====== starting new request req=0x7f5811a731f0 =====

2018-01-11 20:17:00.356027 7f5812a7b700  1 ====== starting new request req=0x7f5812a751f0 =====

2018-01-11 20:17:00.360732 7f581c28e700  1 ====== starting new request req=0x7f581c2881f0 =====

2018-01-11 20:17:01.678524 7f5815280700  1 ====== starting new request req=0x7f581527a1f0 =====

2018-01-11 20:17:01.689199 7f5816a83700  1 ====== starting new request req=0x7f5816a7d1f0 =====

2018-01-11 20:17:10.358813 7f580fa75700  1 ====== starting new request req=0x7f580fa6f1f0 =====

2018-01-11 20:17:10.363121 7f581da91700  1 ====== starting new request req=0x7f581da8b1f0 =====

2018-01-11 20:17:11.682017 7f581427e700  1 ====== starting new request req=0x7f58142781f0 =====

2018-01-11 20:17:11.693168 7f5811278700  1 ====== starting new request req=0x7f58112721f0 =====

2018-01-11 20:17:20.366413 7f5809a69700  1 ====== starting new request req=0x7f5809a631f0 =====

2018-01-11 20:17:20.366555 7f5821298700  1 ====== starting new request req=0x7f58212921f0 =====

2018-01-11 20:17:21.684856 7f580ca6f700  1 ====== starting new request req=0x7f580ca691f0 =====

2018-01-11 20:17:21.696645 7f5813a7d700  1 ====== starting new request req=0x7f5813a771f0 =====

2018-01-11 20:17:30.366328 7f580a26a700  1 ====== starting new request req=0x7f580a2641f0 =====

2018-01-11 20:17:30.366715 7f5826aa3700  1 ====== starting new request req=0x7f5826a9d1f0 =====

2018-01-11 20:17:31.687722 7f5816282700  1 ====== starting new request req=0x7f581627c1f0 =====

2018-01-11 20:17:31.700560 7f5809268700  1 ====== starting new request req=0x7f58092621f0 =====

2018-01-11 20:17:40.369569 7f5835ac1700  1 ====== starting new request req=0x7f5835abb1f0 =====

2018-01-11 20:17:40.369956 7f5808266700  1 ====== starting new request req=0x7f58082601f0 =====

2018-01-11 20:17:41.689913 7f5836ac3700  1 ====== starting new request req=0x7f5836abd1f0 =====

2018-01-11 22:17:14.888135 7f5838ac7700 -1 received  signal: Terminated from  PID: 1 task name: /sbin/init  UID: 0

2018-01-11 22:17:14.888161 7f5838ac7700  1 handle_sigterm

2018-01-11 22:17:14.888198 7f5838ac7700  1 handle_sigterm set alarm for 120

2018-01-11 22:17:14.888209 7f58698ebe80 -1 shutting down

2018-01-11 22:18:45.116476 7f5987be2e80  0 deferred set uid:gid to 64045:64045 (ceph:ceph)

2018-01-11 22:18:45.116716 7f5987be2e80  0 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable), process (unknown), pid 38132

2018-01-11 22:18:45.258934 7f5987be2e80  0 starting handler: civetweb

2018-01-11 22:18:45.266871 7f5987be2e80  1 mgrc service_daemon_register rgw.radosgw1 metadata {arch=x86_64,ceph_version=ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable),cpu=Intel(R) Xeon(R) CPU           L5520  @ 2.27GHz,distro=ubuntu,distro_description=Ubuntu 16.04.3 LTS,distro_version=16.04,frontend_config#0=civetweb port=80,frontend_type#0=civetweb,hostname=ukradosgw1,kernel_description=#127-Ubuntu SMP Mon Dec 11 12:16:42 UTC 2017,kernel_version=4.4.0-104-generic,mem_swap_kb=12580860,mem_total_kb=12286220,num_handles=1,os=Linux,pid=38132,zone_id=default,zone_name=default,zonegroup_id=default,zonegroup_name=default}

 

It’s like the service stops responding…
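
One way to check that impression from the log itself: in the excerpt above, every request from 20:14:38.170551 onward logs "starting new request" but never a matching "req done". Here is a rough Python sketch, assuming the log line layout shown above, that lists requests still waiting for completion. pending_requests is a hypothetical helper name, and the sample lines are copied from the excerpt.

```python
import re

# Patterns assume the radosgw log layout shown in the excerpt above.
START = re.compile(
    r"^(\S+ \S+) \S+\s+-?\d+ ====== starting new request req=(0x[0-9a-f]+) ====="
)
DONE = re.compile(r"====== req done req=(0x[0-9a-f]+) ")


def pending_requests(lines):
    """Return {req_id: start_timestamp} for requests with no later 'req done'."""
    pending = {}
    for line in lines:
        m = START.search(line)
        if m:
            # Request addresses get reused, so a new start replaces any old entry.
            pending[m.group(2)] = m.group(1)
            continue
        m = DONE.search(line)
        if m:
            pending.pop(m.group(1), None)
    return pending


sample = [
    "2018-01-11 20:14:36.640577 7f5826aa3700  1 ====== req done req=0x7f5826a9d1f0 op status=0 http_status=200 ======",
    "2018-01-11 20:14:36.761917 7f5835ac1700  1 ====== starting new request req=0x7f5835abb1f0 =====",
    "2018-01-11 20:14:36.763936 7f5835ac1700  1 ====== req done req=0x7f5835abb1f0 op status=0 http_status=404 ======",
    "2018-01-11 20:14:38.170551 7f5807a65700  1 ====== starting new request req=0x7f5807a5f1f0 =====",
    "2018-01-11 20:14:40.322236 7f58352c0700  1 ====== starting new request req=0x7f58352ba1f0 =====",
]

pending = pending_requests(sample)
for req_id, started in sorted(pending.items(), key=lambda item: item[1]):
    print(started, req_id)
```

To run it against a full log file, pass the open file object instead of the sample list. A steadily growing set of pending requests would suggest the worker threads are blocked on something rather than the process having crashed.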

 

-Brent

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

