Re: RBD-mirror instabilities

Dear Jason,

On 13.02.20 at 01:28, Jason Dillaman wrote:
> On Wed, Feb 12, 2020 at 2:53 PM Oliver Freyermuth
> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>>
>> Dear Jason,
>>
>> On 12.02.20 at 19:29, Jason Dillaman wrote:
>>> On Wed, Feb 12, 2020 at 11:55 AM Oliver Freyermuth
>>> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>>>>
>>>> Dear Cephalopodians,
>>>>
>>>> for those on the list also fighting rbd-mirror process instabilities: with 14.2.7 (maybe it was also present before; it does not happen often),
>>>> I very rarely encounter a case where neither of the two hacks I described works anymore, since "ceph daemon /var/run/ceph/ceph-client.rbd_mirror...." just hangs forever.
>>>>
>>>> So I am now using the following three cronjobs:
>>>>
>>>> 1) Restart the RBD mirror if any image enters an unknown or stopped state. This is probably not needed, since (2) should catch everything:
>>>> ----------------------
>>>> rbd --id=${rbd_id} mirror pool status | egrep -q 'unknown|stopped' && systemctl -q is-active ceph-rbd-mirror@${rbd_id}.service && systemctl restart ceph-rbd-mirror@${rbd_id}.service
>>>> ----------------------
>>>>
>>>> 2) Restart the RBD mirror if the RBD client is not replaying anything anymore (this happens randomly for us due to client blacklisting when many OSDs restart, and it never recovers from that situation on its own):
>>>> ----------------------
>>>> ceph daemon /var/run/ceph/ceph-client.${rbd_id}.\$(systemctl show --property MainPID ceph-rbd-mirror@${rbd_id}.service | sed 's/MainPID=//').*.asok rbd mirror status | grep -q Replaying || (systemctl -q is-active ceph-rbd-mirror@${rbd_id}.service && systemctl restart ceph-rbd-mirror@${rbd_id}.service)
>>>> ----------------------
>>>
>>> If you can repeat this situation, can you provide any supporting
>>> rbd-mirror daemon logs? The rbd-mirror daemon periodically scans (~30
>>> seconds) to see if it has been blacklisted and then starts a recovery
>>> process. If something is broken with this, we would like to fix it.
>>
>> Of course: I did not even need to repeat it manually, since one of our RBD mirrors has heard you ;-).
>> Interestingly, I did not see any OSDs go down this time; I only restarted some VMs, but this should hopefully not affect anything here...
>>
>> Here are the logs from this occasion:
>>
>> 2020-02-12 19:57:03.178 7f39519c2040  0 ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable), process rbd-mirror, pid 4003991
>> 2020-02-12 19:57:03.234 7f39519c2040  1 mgrc service_daemon_register rbd-mirror.3286479 metadata {arch=x86_64,ceph_release=nautilus,ceph_version=ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable),ceph_version_short=14.2.7,cpu=Intel(R) Xeon(R) CPU           E5410  @ 2.33GHz,distro=centos,distro_description=CentOS Linux 7 (Core),distro_version=7,hostname=rbd-mirror002.example.com,id=rbd_mirror_backup,instance_id=3286479,kernel_description=#1 SMP Fri Dec 6 15:49:49 UTC 2019,kernel_version=3.10.0-1062.9.1.el7.x86_64,mem_swap_kb=5119996,mem_total_kb=16264948,os=Linux}
>> 2020-02-12 19:57:06.533 7f39339b4700  0 rbd::mirror::ImageReplayer: 0x5614036bdf80 [2/73b65031-0770-4c72-bf74-896ad960dc8f] handle_shut_down: mirror image no longer exists
>> 2020-02-12 19:57:30.568 7f39339b4700  0 rbd::mirror::ImageReplayer: 0x561403c89200 [2/61b047c8-7069-4f6d-abbd-4515c21a1c04] handle_shut_down: mirror image no longer exists
>> 2020-02-12 19:57:30.628 7f39339b4700  0 rbd::mirror::ImageReplayer: 0x561403c8a400 [2/87ba1bbb-7afc-4604-830c-5b0abdf0c0e5] handle_shut_down: mirror image no longer exists
>> 2020-02-12 19:57:30.658 7f39339b4700  0 rbd::mirror::ImageReplayer: 0x561404c0a900 [2/f9c1e3ad-3552-4b41-b3d3-c1b0daf58122] handle_shut_down: mirror image no longer exists
>> 2020-02-12 19:57:33.584 7f39339b4700  0 rbd::mirror::ImageReplayer: 0x561403c88480 [2/61b047c8-7069-4f6d-abbd-4515c21a1c04] handle_shut_down: mirror image no longer exists
>> 2020-02-12 20:10:04.855 7f39339b4700  0 rbd::mirror::ImageReplayer: 0x56140564c880 [2/61b047c8-7069-4f6d-abbd-4515c21a1c04] handle_shut_down: mirror image no longer exists
>> 2020-02-12 20:10:04.879 7f39339b4700  0 rbd::mirror::ImageReplayer: 0x561403cd1f80 [2/73b65031-0770-4c72-bf74-896ad960dc8f] handle_shut_down: mirror image no longer exists
>> 2020-02-12 20:10:04.890 7f39339b4700  0 rbd::mirror::ImageReplayer: 0x561403cd1680 [2/87ba1bbb-7afc-4604-830c-5b0abdf0c0e5] handle_shut_down: mirror image no longer exists
>> 2020-02-12 20:10:04.971 7f39339b4700  0 rbd::mirror::ImageReplayer: 0x561403cd0480 [2/f9c1e3ad-3552-4b41-b3d3-c1b0daf58122] handle_shut_down: mirror image no longer exists
>> 2020-02-12 20:11:30.786 7f39339b4700  0 rbd::mirror::ImageReplayer: 0x561404d04900 [2/81b08854-60d9-4ac7-9169-c27db7966a06] handle_shut_down: mirror image no longer exists
>> 2020-02-12 20:11:30.793 7f39339b4700  0 rbd::mirror::ImageReplayer: 0x561404d05f80 [2/ac260aaf-b6e4-465e-b804-af612e5ad531] handle_shut_down: mirror image no longer exists
>> 2020-02-12 20:11:30.796 7f39339b4700  0 rbd::mirror::ImageReplayer: 0x561404cae000 [2/fc0bf498-db79-4acf-b23e-1a6bd8ad90c4] handle_shut_down: mirror image no longer exists
>> 2020-02-12 20:13:28.638 7f393f1cb700  0 rbd::mirror::LeaderWatcher: 0x561403615200 handle_get_locker: breaking leader lock after 3 failed attempts to acquire
>> 2020-02-12 20:13:30.061 7f392a1a1700 -1 rbd::mirror::ImageReplayer: 0x5614036bf600 [2/7fbf7e3e-22cc-4e78-b570-8938d6d13190] handle_process_entry_safe: failed to commit journal event: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:30.061 7f392a1a1700 -1 rbd::mirror::ImageReplayer: 0x5614036bf600 [2/7fbf7e3e-22cc-4e78-b570-8938d6d13190] handle_replay_complete: replay encountered an error: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:30.064 7f392a1a1700 -1 rbd::mirror::ImageReplayer: 0x5614036bed00 [2/7f2bbf53-566c-409f-926f-894a2648c175] handle_process_entry_safe: failed to commit journal event: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:30.064 7f392a1a1700 -1 rbd::mirror::ImageReplayer: 0x5614036bed00 [2/7f2bbf53-566c-409f-926f-894a2648c175] handle_replay_complete: replay encountered an error: (108) Cannot send after transport endpoint shutdown
>> ...
>> many more of these
>> ...
>> 2020-02-12 20:13:30.121 7f392a1a1700 -1 rbd::mirror::ImageReplayer: 0x5614036bdb00 [2/66f808b0-2d75-4ffa-b6a8-dc2ac934a0e5] handle_replay_complete: replay encountered an error: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:30.122 7f393f1cb700 -1 rbd::mirror::LeaderWatcher: 0x561403615200 handle_break_leader_lock: error breaking leader lock: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:30.320 7f393f1cb700 -1 rbd::mirror::image_replayer::PrepareLocalImageRequest: 0x5614051d1280 handle_get_tag_owner: failed to retrieve journal tag owner: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:30.336 7f393f1cb700 -1 rbd::mirror::ImageReplayer: 0x561403c8a880 [2/ba137ddf-f520-4e2d-b56d-b5c572ca93b5] operator(): start failed: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:33.329 7f39339b4700 -1 rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x561408125570 handle_get_image_id: failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:33.329 7f39339b4700 -1 rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x561408125500 handle_get_image_id: failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:33.329 7f39339b4700 -1 rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x561408125490 handle_get_image_id: failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
>> ...
>> many more of these
>> ...
>> 2020-02-12 20:13:33.333 7f39339b4700 -1 rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x561405dd7880 handle_get_image_id: failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:33.334 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80 start_image_replayer: global_image_id=ba137ddf-f520-4e2d-b56d-b5c572ca93b5: blacklisted detected during image replay
>> 2020-02-12 20:13:33.334 7f39339b4700 -1 rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x56140369ad90 handle_get_image_id: failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
>> ...
>> many more of the "Cannot send after..." lines
>> ...
>> 2020-02-12 20:13:33.337 7f393f1cb700 -1 rbd::mirror::ImageReplayer: 0x561403c8ba80 [2/e41c5629-a158-4a5c-a3ae-5a971082cffa] operator(): start failed: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:33.337 7f39339b4700 -1 rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x561405c75d50 handle_get_image_id: failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:33.337 7f39339b4700 -1 rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x561405c75420 handle_get_image_id: failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:33.338 7f39339b4700 -1 rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x56140451cbd0 handle_get_image_id: failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:33.338 7f39339b4700 -1 rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x561408398230 handle_get_image_id: failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:33.338 7f393f1cb700 -1 rbd::mirror::ImageReplayer: 0x561403cd0d80 [2/e64b7f9a-eb24-4b46-a55e-f821e8f63353] operator(): start failed: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:33.339 7f393f1cb700 -1 rbd::mirror::ImageReplayer: 0x561404c0a000 [2/ed096674-37da-4988-9898-ae27fa3a82b6] operator(): start failed: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:33.340 7f393f1cb700 -1 rbd::mirror::ImageReplayer: 0x56140362ff80 [2/f5711272-9fec-42ba-8a03-e65b8f3d55f0] operator(): start failed: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:33.340 7f393f1cb700 -1 rbd::mirror::ImageReplayer: 0x56140362fb00 [2/fafed4a3-802f-4860-a523-0fc8d677d9cc] operator(): start failed: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:14:03.328 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80 start_image_replayer: global_image_id=17b2a7d9-aa98-4f85-b9da-bd27afee8aff: blacklisted detected during image replay
>> 2020-02-12 20:14:03.328 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80 start_image_replayer: global_image_id=1be8a0ce-68df-42ce-8678-eafeeb007ad3: blacklisted detected during image replay
>> 2020-02-12 20:14:03.328 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80 start_image_replayer: global_image_id=20d42977-e22a-41f2-b316-a4620991aee8: blacklisted detected during image replay
>> 2020-02-12 20:14:03.328 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80 start_image_replayer: global_image_id=2fd1ae38-3be2-45b7-b221-8021732a741a: blacklisted detected during image replay
>> ...
>> more and more of those
>> ...
>> 2020-02-12 20:43:33.333 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80 start_image_replayer: global_image_id=e2d9b334-f995-442a-bd52-c7acc22d4e63: blacklisted detected during image replay
>> 2020-02-12 20:43:33.333 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80 start_image_replayer: global_image_id=e41c5629-a158-4a5c-a3ae-5a971082cffa: blacklisted detected during image replay
>> 2020-02-12 20:43:33.333 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80 start_image_replayer: global_image_id=e64b7f9a-eb24-4b46-a55e-f821e8f63353: blacklisted detected during image replay
>> 2020-02-12 20:43:33.333 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80 start_image_replayer: global_image_id=ed096674-37da-4988-9898-ae27fa3a82b6: blacklisted detected during image replay
>> 2020-02-12 20:43:33.333 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80 start_image_replayer: global_image_id=f5711272-9fec-42ba-8a03-e65b8f3d55f0: blacklisted detected during image replay
>> 2020-02-12 20:43:33.333 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80 start_image_replayer: global_image_id=fafed4a3-802f-4860-a523-0fc8d677d9cc: blacklisted detected during image replay
>> ...
>>
>> Actually, right now it is still in this state. I notice this easily because it is doing ~4 MB/s of network I/O in and out via the interface facing the "target" cluster to which the images should be replayed.
>>
>>
>>
>> Checking the mon logs, it seems that the origin of the blacklisting this time:
>>
>> 2020-02-12 20:13:28.641 7f5dfad0c700  0 mon.mon001@0(leader) e3 handle_command mon_command({"prefix": "osd blacklist", "blacklistop": "add", "addr": "10.160.19.242:0/3907577364"} v 0) v1
>> 2020-02-12 20:13:28.641 7f5dfad0c700  0 log_channel(audit) log [INF] : from='client.? 10.160.19.241:0/993249630' entity='client.rbd_mirror_backup' cmd=[{"prefix": "osd blacklist", "blacklistop": "add", "addr": "10.160.19.242:0/3907577364"}]: dispatch
>>
>> actually was another rbd-mirror asking to blacklist the "colleague". Checking the logs of rbd-mirror001 (10.160.19.241), I see:
>> ...
>> 2020-02-12 20:11:20.182 7f16d24d2040  0 ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable), process rbd-mirror, pid 674
>> 2020-02-12 20:11:20.237 7f16d24d2040  1 mgrc service_daemon_register rbd-mirror.3245041 metadata {arch=x86_64,ceph_release=nautilus,ceph_version=ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8)
>>  nautilus (stable),ceph_version_short=14.2.7,cpu=Intel(R) Xeon(R) CPU           E5410  @ 2.33GHz,distro=centos,distro_description=CentOS Linux 7 (Core),distro_version=7,hostname=rbd-mirror001.backup.example.com,id=rbd_mirror_backup,instance_id=3245041,kernel_description=#1 SMP Fri Dec 6 15:49:49 UTC 2019,kernel_version=3.10.0-1062.9.1.el7.x86_64,mem_swap_kb=5119996,mem_total_kb=16264948,os=Linux}
>> 2020-02-12 20:11:26.588 7f16b44c5700  0 rbd::mirror::ImageReplayer: 0x563e900bcd80 [2/5569d343-98fb-49c9-88cb-19251ce0e4cb] handle_shut_down: mirror image no longer exists
>> 2020-02-12 20:13:28.631 7f16bfcdc700  0 rbd::mirror::LeaderWatcher: 0x563e9002f200 handle_get_locker: breaking leader lock after 3 failed attempts to acquire
>> 2020-02-12 20:13:31.353 7f16b44c5700  0 rbd::mirror::ImageReplayer: 0x563e91510000 [2/5569d343-98fb-49c9-88cb-19251ce0e4cb] handle_shut_down: mirror image no longer exists
>> 2020-02-12 20:13:31.660 7f16bfcdc700 -1 rbd::mirror::image_replayer::PrepareLocalImageRequest: 0x563e92992d80 handle_get_tag_owner: failed to retrieve journal tag owner: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:31.660 7f16bfcdc700 -1 rbd::mirror::ImageReplayer: 0x563e916eb680 [2/0e614ece-65b1-4b4a-99bd-44dd6235eb70] operator(): start failed: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:31.660 7f16bfcdc700 -1 rbd::mirror::image_replayer::PrepareLocalImageRequest: 0x563e913af500 handle_get_tag_owner: failed to retrieve journal tag owner: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:31.661 7f16bfcdc700 -1 rbd::mirror::ImageReplayer: 0x563e916ec880 [2/12513e0f-a87e-44ee-81aa-57de5aca739a] operator(): start failed: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:31.662 7f16bfcdc700 -1 rbd::mirror::image_replayer::PrepareLocalImageRequest: 0x563e92993780 handle_get_tag_owner: failed to retrieve journal tag owner: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:31.662 7f16bfcdc700 -1 rbd::mirror::ImageReplayer: 0x563e900bd680 [2/5949f093-8aee-41b3-b50f-5902d94ae90e] operator(): start failed: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:31.686 7f16bfcdc700 -1 rbd::mirror::image_replayer::PrepareLocalImageRequest: 0x563e92992e80 handle_get_tag_owner: failed to retrieve journal tag owner: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:31.686 7f16bfcdc700 -1 rbd::mirror::ImageReplayer: 0x563e916ec400 [2/11e30f19-7975-4dfe-94ae-d73b2cff7f61] operator(): start failed: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:31.709 7f16bfcdc700 -1 rbd::mirror::image_replayer::PrepareLocalImageRequest: 0x563e92992f00 handle_get_tag_owner: failed to retrieve journal tag owner: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:31.710 7f16bfcdc700 -1 rbd::mirror::ImageReplayer: 0x563e90047680 [2/1816f3b2-e5ae-461d-94bb-07a376249922] operator(): start failed: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:31.712 7f16a44a5700 -1 rbd::mirror::image_replayer::OpenLocalImageRequest: 0x563e923ab5c0 handle_open_image : failed to open image '3225d8dacb970d': (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:31.712 7f16bfcdc700  0 rbd::mirror::PoolWatcher: 0x563e913cdb00 handle_rewatch_complete: detected client is blacklisted
>> 2020-02-12 20:13:31.713 7f16a44a5700 -1 rbd::mirror::image_replayer::BootstrapRequest: 0x563e8f3ecfc0 handle_open_local_image: failed to open local image: (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:31.713 7f16a44a5700 -1 rbd::mirror::image_replayer::OpenLocalImageRequest: 0x563e91947560 handle_open_image : failed to open image '30272383ed48b0': (108) Cannot send after transport endpoint shutdown
>> 2020-02-12 20:13:31.713 7f16bfcdc700  0 rbd::mirror::PoolWatcher: 0x563e913cdb00 handle_rewatch_complete: detected client is blacklisted
>> 2020-02-12 20:13:31.714 7f16bfcdc700  0 rbd::mirror::PoolWatcher: 0x563e913cdb00 handle_rewatch_complete: detected client is blacklisted
>> ...
>>
>> So it "feels" like our three RBD mirrors start to fight (maybe after I/O congestion, e.g. caused by restarting OSDs, high load, ...?) and sometimes blacklist each other in a frenzy.
>> The blacklisted one becomes rather unhappy and never recovers, though.
>>
>> Let me know if there is more info I can provide; I can (sometimes) provoke this by just restarting many OSDs in unison.
>>
>> Cheers,
>>         Oliver
>>
>>
>>>
>>>> 3) Restart the RBD mirror if the control socket becomes unresponsive. This catches the cases where (2) just hangs forever and (1) does not fire because at least one functional RBD mirror is left (we have 3):
>>>> ----------------------
>>>> timeout 10 ceph daemon /var/run/ceph/ceph-client.${rbd_id}.\$(systemctl show --property MainPID ceph-rbd-mirror@${rbd_id}.service | sed 's/MainPID=//').*.asok help > /dev/null || ( [ $? -eq 124 ] && systemctl -q is-active ceph-rbd-mirror@${rbd_id}.service && systemctl restart ceph-rbd-mirror@${rbd_id}.service )
>>>> ----------------------
>>>>
>>>> With this, we are running quite stably: with 3 RBD mirrors there is never a real outage (and in any case, all issues seem to correlate clearly with restarts or short "hangs" of many OSD processes).
>>>>
>>>> Cheers and hope this helps somebody with similar issues,
>>>>         Oliver
>>>>
>>>> On 27.12.19 at 02:43, Oliver Freyermuth wrote:
>>>>> Dear Cephalopodians,
>>>>>
>>>>> for those following along through the holiday season, here's my "quick hack" for now, since our rbd mirrors keep going into "blacklisted" state whenever a bunch of OSDs restart in the cluster.
>>>>>
>>>>> For those not following along, nice holidays to you and hopefully some calm days off :-).
>>>>>
>>>>> To re-summarize: once our rbd-mirrors are in that "blacklisted" state, they don't recover by themselves, so I think what is missing is an auto-restart / reconnect after blacklisting
>>>>> (and, of course, an idea of why the daemons' clients get blacklisted when OSDs restart?). Let me know if I should open a tracker issue on that one,
>>>>> or if I can provide more information (it happens every few nights for us).
>>>>>
>>>>> Since I was looking to restart them only in case of failure, I came up with some lengthy commands.
>>>>>
>>>>> I have now set up two cronjobs on the rbd-mirror daemon nodes. The first works "whatever happens" and restarts an rbd-mirror if any image sync is broken:
>>>>>
>>>>>   rbd --id=rbd_mirror_backup mirror pool status | egrep -q 'unknown|stopped' && systemctl -q is-active ceph-rbd-mirror@rbd_mirror_backup.service && systemctl restart ceph-rbd-mirror@rbd_mirror_backup.service
>>>>>
>>>>> I run this hourly. With multiple rbd mirrors, this does not catch everything, though. If we enter the failure state (blacklisted rbd-mirror clients), this part only ensures that at least one client recovers
>>>>> and takes over the full load. To get the other clients to restart only if they are also blacklisted, I do:
>>>>>
>>>>>   ceph daemon /var/run/ceph/ceph-client.rbd_mirror_backup.$(systemctl show --property MainPID ceph-rbd-mirror@rbd_mirror_backup.service | sed 's/MainPID=//').*.asok rbd mirror status | grep -q Replaying || (systemctl -q is-active ceph-rbd-mirror@rbd_mirror_backup.service && systemctl restart ceph-rbd-mirror@rbd_mirror_backup.service)
>>>>>
>>>>> This also runs hourly and queries the daemon state itself. If no image is in the "Replaying" state, something is wrong and the daemon is restarted.
>>>>> Technically, the latter cronjob should be sufficient; the first one is only there in case the daemons go completely awry (but I have not observed that so far).
>>>>>
>>>>> I made two interesting observations, though:
>>>>> - It seems the rbd-mirror log sometimes contains no errors at all. The cause seems to be that the "rbd-mirror" processes are not SIGHUPed by the logrotate rule shipped with ceph-base (presumably they keep writing to the already-rotated file).
>>>>>    I created a tracker issue here:
>>>>>     https://tracker.ceph.com/issues/43428
>>>>> - The output of the "rbd mirror status" command is not valid JSON; it is missing the trailing brace.
>>>>>    I created a tracker issue here:
>>>>>     https://tracker.ceph.com/issues/43429
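>>>>>    (Until that is fixed, a rough workaround is to append the missing brace before parsing; just a sketch, assuming the daemon's admin socket path is in $asok and that only the final brace is missing:
>>>>>      { ceph daemon $asok rbd mirror status; echo '}'; } | python -m json.tool
>>>>>    )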
>>>>>
>>>>> Cheers,
>>>>>       Oliver
>>>>>
>>>>> On 24.12.19 at 04:39, Oliver Freyermuth wrote:
>>>>>> Dear Cephalopodians,
>>>>>>
>>>>>> running 13.2.6 on the source cluster and 14.2.5 on the rbd mirror nodes and the target cluster,
>>>>>> I observe regular failures of rbd-mirror processes.
>>>>>>
>>>>>> By failures, I mean that traffic stops, but the daemons are still listed as active rbd-mirror daemons in
>>>>>> "ceph -s", and the daemons are still running. This coincides with a hefty load of the messages below in the mirror logs.
>>>>>>
>>>>>> This happens "sometimes" when some OSDs go down and come back up in the target cluster (which happens each night, since the disks in that cluster
>>>>>> briefly go offline during "online" SMART self-tests - that's a problem in itself, but it's a cluster built from hardware that would have been trashed otherwise).
>>>>>>
>>>>>> The rbd-mirror daemons keep running in any case, but synchronization stops. If not all rbd-mirror daemons have failed (we have three running, and it usually does not hit all of them),
>>>>>> the "surviving" one(s) do not seem to take over the images the other daemons had locked.
>>>>>>
>>>>>> Right now, I am eyeing a "quick solution" of regularly restarting the rbd-mirror daemons, but if there are any good ideas on which debug info I could collect
>>>>>> to get this analyzed and fixed, that would of course be appreciated :-).
> 
> The peer rbd-mirror processes will blacklist the lock owner / leader
> rbd-mirror daemon if it fails to heartbeat on time. What I don't see,
> but would expect to see, is a "restarting blacklisted pool replayer
> for XYZ" log message about 30 seconds after the first
> blacklisting was detected. Plus, after a daemon acquires the lock, I
> wouldn't expect to see it be blacklisted by another peer for at least
> another 10-15 seconds.
> 
> If you wouldn't mind, could you possibly increase the debug level for
> rbd-mirror (you can temporarily inject it via the admin socket) to
> "debug rbd_mirror = 15" level and attach the log to a new tracker
> ticket @ tracker.ceph.com?

Done, here it is:
https://tracker.ceph.com/issues/44159
As the ticket description says, I restarted OSDs and other RBD mirror daemons until the one with the high log level got blacklisted,
and then left it logging for a while.
If more is needed, just let me know :-). 
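
In case somebody wants to reproduce this, injecting the higher debug level into the running daemon should work roughly like this (just a sketch; the admin socket path, i.e. the client id, PID and nonce, has to be adapted):
----------------------
ceph daemon /var/run/ceph/ceph-client.rbd_mirror_backup.<pid>.<nonce>.asok config set debug_rbd_mirror 15
----------------------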

> You can also tweak the heartbeat timeouts for rbd-mirror daemon to
> hopefully mitigate OSD slow-downs:
> 
> rbd_mirror_leader_heartbeat_interval = 5 (default value -- sends a
> heartbeat to peers every 5 seconds)
> rbd_mirror_leader_max_missed_heartbeats = 2 (default value -- increase
> this to something like 20 for a 60 second grace period).
> rbd_mirror_leader_max_acquire_attempts_before_break = 3 (default value
> -- how many graceful attempts to transfer ownership before
> blacklisting peer)

Thanks!
Since we can "survive" with the cronjobs for now, I'll leave those settings as-is for the moment and provide debug info,
but that's good to know for the future (with the system at hand, short outages of some of the mirror daemons are tolerable).
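
In case somebody wants to try those knobs, a ceph.conf snippet on the rbd-mirror hosts along these lines should do it (a sketch, untested on our side; the section name matches our client id, the values are the ones you suggested above):
----------------------
[client.rbd_mirror_backup]
    rbd_mirror_leader_heartbeat_interval = 5
    rbd_mirror_leader_max_missed_heartbeats = 20
    rbd_mirror_leader_max_acquire_attempts_before_break = 3
----------------------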

Cheers and thanks,
	Oliver

> 
>>>>>>
>>>>>> Cheers,
>>>>>>      Oliver
>>>>>>
>>>>>> -----------------------------------------------
>>>>>> 2019-12-24 02:08:51.379 7f31c530e700 -1 rbd::mirror::ImageReplayer: 0x559dcb968d00 [2/aabba863-89fd-4ea5-bb8c-0f417225d394] handle_process_entry_safe: failed to commit journal event: (108) Cannot send after transport endpoint shutdown
>>>>>> 2019-12-24 02:08:51.379 7f31c530e700 -1 rbd::mirror::ImageReplayer: 0x559dcb968d00 [2/aabba863-89fd-4ea5-bb8c-0f417225d394] handle_replay_complete: replay encountered an error: (108) Cannot send after transport endpoint shutdown
>>>>>> ...
>>>>>> 2019-12-24 02:08:54.392 7f31c530e700 -1 rbd::mirror::ImageReplayer: 0x559dcb87bb00 [2/23699357-a611-4557-9d73-6ff5279da991] handle_process_entry_safe: failed to commit journal event: (125) Operation canceled
>>>>>> 2019-12-24 02:08:54.392 7f31c530e700 -1 rbd::mirror::ImageReplayer: 0x559dcb87bb00 [2/23699357-a611-4557-9d73-6ff5279da991] handle_replay_complete: replay encountered an error: (125) Operation canceled
>>>>>> 2019-12-24 02:08:55.707 7f31ea358700 -1 rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x559dce2e05b0 handle_get_image_id: failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
>>>>>> 2019-12-24 02:08:55.707 7f31ea358700 -1 rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x559dcf47ee70 handle_get_image_id: failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
>>>>>> ...
>>>>>> 2019-12-24 02:08:55.716 7f31f5b6f700 -1 rbd::mirror::ImageReplayer: 0x559dcb997680 [2/f8218221-6608-4a2b-8831-84ca0c2cb418] operator(): start failed: (108) Cannot send after transport endpoint shutdown
>>>>>> 2019-12-24 02:09:25.707 7f31f5b6f700 -1 rbd::mirror::InstanceReplayer: 0x559dcabd5b80 start_image_replayer: global_image_id=0577bd16-acc4-4e9a-81f0-c698a24f8771: blacklisted detected during image replay
>>>>>> 2019-12-24 02:09:25.707 7f31f5b6f700 -1 rbd::mirror::InstanceReplayer: 0x559dcabd5b80 start_image_replayer: global_image_id=05bd4cca-a561-4a5c-ad83-9905ad5ce34e: blacklisted detected during image replay
>>>>>> 2019-12-24 02:09:25.707 7f31f5b6f700 -1 rbd::mirror::InstanceReplayer: 0x559dcabd5b80 start_image_replayer: global_image_id=0e614ece-65b1-4b4a-99bd-44dd6235eb70: blacklisted detected during image replay
>>>>>> -----------------------------------------------
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>>
> 
> 


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
