Re: Remove RBD mirror?

Magnus Grönlund <magnus@xxxxxxxxxxx> · Fri, 12 Apr 2019 16:47:53 +0200

On Fri, 12 Apr 2019 at 16:37, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
On Fri, Apr 12, 2019 at 9:52 AM Magnus Grönlund <magnus@xxxxxxxxxxx> wrote:

>
> Hi Jason,
>
> I tried to follow the instructions. Setting the debug level to 15 worked, but the daemon appeared to silently ignore the restart command (nothing in the log indicated a restart).
> So I set the log level to 15 in the config file and restarted the rbd-mirror daemon. The output surprised me, though; my previous perception of the issue might be completely wrong...
> Lots of "image_replayer::BootstrapRequest:.... failed to create local image: (2) No such file or directory" and ":ImageReplayer: ....  replay encountered an error: (42) No message of desired type"

What is the result from "rbd mirror pool status --verbose nova"
against your DR cluster now? Are they in up+error now? The ENOENT
errors are most likely related to a parent image that hasn't been
mirrored. The ENOMSG error seems to indicate that there might be some
corruption in a journal and that it's missing expected records (for
example, because a production client crashed), but it should be able
to recover from that.
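
For reference, the numbers in parentheses in those log lines are ordinary errno values, and ENOENT and ENOMSG are their symbolic names. A minimal Python sketch for decoding them, assuming it runs on a host with the same errno table as the daemon (on Linux, 42 maps to ENOMSG; other platforms may number it differently). `decode_errnos` is a hypothetical helper, not part of Ceph:

```python
import errno
import re


def decode_errnos(log_line):
    """Return (code, symbolic_name) pairs for every "(N)" found in a log line."""
    pairs = []
    for match in re.finditer(r"\((\d+)\)", log_line):
        code = int(match.group(1))
        # errno.errorcode maps numeric codes to names for the local platform.
        pairs.append((code, errno.errorcode.get(code, "UNKNOWN")))
    return pairs


log_lines = [
    "failed to create local image: (2) No such file or directory",
    "replay encountered an error: (42) No message of desired type",
]

for line in log_lines:
    print(decode_errnos(line))
```

The lookup is purely local, so it only confirms which error class a line belongs to; it does not say which image or journal triggered it.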

# rbd mirror pool status --verbose nova
health: WARNING
images: 2479 total
    2479 unknown

002344ab-c324-4c01-97ff-de32868fa712_disk:
  global_id:   c02e0202-df8f-46ce-a4b6-1a50a9692804
  state:       down+unknown
  description: status not found
  last_update:

002a8fde-3a63-4e32-9c18-b0bf64393d0f_disk:
  global_id:   d412abc4-b37e-44a2-8aba-107f352dec60
  state:       down+unknown
  description: status not found
  last_update:

<Repeat 2477 times>
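
With 2479 entries, the verbose output is easier to check as a tally. A minimal Python sketch, assuming the output has been saved as text and that every image block carries a "state:" line as in the two entries above. `tally_states` is a hypothetical helper, not an rbd feature:

```python
from collections import Counter


def tally_states(status_text):
    """Count the per-image "state:" values in saved `rbd mirror pool status --verbose` text."""
    counts = Counter()
    for raw in status_text.splitlines():
        line = raw.strip()
        if line.startswith("state:"):
            # Everything after the first colon is the state, e.g. "down+unknown".
            counts[line.split(":", 1)[1].strip()] += 1
    return counts


sample = """\
health: WARNING
images: 2479 total
    2479 unknown

002344ab-c324-4c01-97ff-de32868fa712_disk:
  global_id:   c02e0202-df8f-46ce-a4b6-1a50a9692804
  state:       down+unknown
  description: status not found
  last_update:

002a8fde-3a63-4e32-9c18-b0bf64393d0f_disk:
  global_id:   d412abc4-b37e-44a2-8aba-107f352dec60
  state:       down+unknown
  description: status not found
  last_update:
"""

print(dict(tally_states(sample)))
```

Any state other than down+unknown in the tally would show that at least some image replayers have started reporting.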

> https://pastebin.com/1bTETNGs
>
> Best regards
> /Magnus
>
> On Tue, 9 Apr 2019 at 18:35, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:

>>
>> Can you pastebin the results from running the following on your backup
>> site rbd-mirror daemon node?
>>
>> ceph --admin-daemon /path/to/asok config set debug_rbd_mirror 15
>> ceph --admin-daemon /path/to/asok rbd mirror restart nova
>> .... wait a minute to let some logs accumulate ...
>> ceph --admin-daemon /path/to/asok config set debug_rbd_mirror 0/5
>>
>> ... and collect the rbd-mirror log from /var/log/ceph/ (it should have
>> lots of "rbd::mirror"-like log entries).

>>

>>

>> On Tue, Apr 9, 2019 at 12:23 PM Magnus Grönlund <magnus@xxxxxxxxxxx> wrote:
>> >
>> > On Tue, 9 Apr 2019 at 17:48, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:

>> >>
>> >> Any chance your rbd-mirror daemon has the admin sockets available
>> >> (defaults to /var/run/ceph/cephdr-client.<id>.<pid>.<random>.asok)? If
>> >> so, you can run "ceph --admin-daemon /path/to/asok rbd mirror status".

>> >

>> >

>> > {
>> >     "pool_replayers": [
>> >         {
>> >             "pool": "glance",
>> >             "peer": "uuid: df30fb21-d1de-4c3a-9c00-10eaa4b30e00 cluster: production client: client.productionbackup",
>> >             "instance_id": "869081",
>> >             "leader_instance_id": "869081",
>> >             "leader": true,
>> >             "instances": [],
>> >             "local_cluster_admin_socket": "/var/run/ceph/client.backup.1936211.backup.94225674131712.asok",
>> >             "remote_cluster_admin_socket": "/var/run/ceph/client.productionbackup.1936211.production.94225675210000.asok",
>> >             "sync_throttler": {
>> >                 "max_parallel_syncs": 5,
>> >                 "running_syncs": 0,
>> >                 "waiting_syncs": 0
>> >             },
>> >             "image_replayers": [
>> >                 {
>> >                     "name": "glance/ea5e4ad2-090a-4665-b142-5c7a095963e0",
>> >                     "state": "Replaying"
>> >                 },
>> >                 {
>> >                     "name": "glance/d7095183-45ef-40b5-80ef-f7c9d3bb1e62",
>> >                     "state": "Replaying"
>> >                 },
>> > -------------------cut----------
>> >                 {
>> >                     "name": "cinder/volume-bcb41f46-3716-4ee2-aa19-6fbc241fbf05",
>> >                     "state": "Replaying"
>> >                 }
>> >             ]
>> >         },
>> >         {
>> >             "pool": "nova",
>> >             "peer": "uuid: 1fc7fefc-9bcb-4f36-a259-66c3d8086702 cluster: production client: client.productionbackup",
>> >             "instance_id": "889074",
>> >             "leader_instance_id": "889074",
>> >             "leader": true,
>> >             "instances": [],
>> >             "local_cluster_admin_socket": "/var/run/ceph/client.backup.1936211.backup.94225678548048.asok",
>> >             "remote_cluster_admin_socket": "/var/run/ceph/client.productionbackup.1936211.production.94225679621728.asok",
>> >             "sync_throttler": {
>> >                 "max_parallel_syncs": 5,
>> >                 "running_syncs": 0,
>> >                 "waiting_syncs": 0
>> >             },
>> >             "image_replayers": []
>> >         }
>> >     ],
>> >     "image_deleter": {
>> >         "image_deleter_status": {
>> >             "delete_images_queue": [
>> >                 {
>> >                     "local_pool_id": 3,
>> >                     "global_image_id": "ff531159-de6f-4324-a022-50c079dedd45"
>> >                 }
>> >             ],
>> >             "failed_deletes_queue": []
>> >         }
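
The telling detail in that status dump is that the nova pool replayer has an empty "image_replayers" list while the other pools list images in the Replaying state. A minimal Python sketch for spotting such pools, assuming the complete JSON from "rbd mirror status" has been saved. The sample below is a trimmed, well-formed stand-in because the paste above is cut, and `pools_without_replayers` is a hypothetical helper:

```python
import json


def pools_without_replayers(status_json):
    """Return the names of pool replayers that have no image replayers running."""
    status = json.loads(status_json)
    return [
        replayer["pool"]
        for replayer in status.get("pool_replayers", [])
        if not replayer.get("image_replayers")
    ]


# Trimmed stand-in for the admin socket output; only the fields used here are kept.
sample = """
{
    "pool_replayers": [
        {
            "pool": "glance",
            "image_replayers": [
                {"name": "glance/ea5e4ad2-090a-4665-b142-5c7a095963e0", "state": "Replaying"}
            ]
        },
        {
            "pool": "nova",
            "image_replayers": []
        }
    ]
}
"""

print(pools_without_replayers(sample))
```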

>> >>

>> >>

>> >> On Tue, Apr 9, 2019 at 11:26 AM Magnus Grönlund <magnus@xxxxxxxxxxx> wrote:
>> >> >
>> >> > On Tue, 9 Apr 2019 at 17:14, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:

>> >> >>

>> >> >> On Tue, Apr 9, 2019 at 11:08 AM Magnus Grönlund <magnus@xxxxxxxxxxx> wrote:

>> >> >> >

>> >> >> > >On Tue, Apr 9, 2019 at 10:40 AM Magnus Grönlund <magnus@xxxxxxxxxxx> wrote:

>> >> >> > >>

>> >> >> > >> Hi,
>> >> >> > >> We have configured one-way replication of pools between a production cluster and a backup cluster. Unfortunately, either the rbd-mirror daemon or the backup cluster is unable to keep up with the production cluster, so the replication fails to reach the replaying state.

>> >> >> > >

>> >> >> > >Hmm, it's odd that they don't at least reach the replaying state. Are

>> >> >> > >they still performing the initial sync?

>> >> >> >

>> >> >> > There are three pools we try to mirror (glance, cinder, and nova; no points for guessing what the cluster is used for :) ).
>> >> >> > The glance and cinder pools are smaller and see limited write activity, and their mirroring works. The nova pool, which is the largest and has 90% of the write activity, never leaves the "unknown" state.
>> >> >> >
>> >> >> > # rbd mirror pool status cinder
>> >> >> > health: OK
>> >> >> > images: 892 total
>> >> >> >     890 replaying
>> >> >> >     2 stopped
>> >> >> > #
>> >> >> > # rbd mirror pool status nova
>> >> >> > health: WARNING
>> >> >> > images: 2479 total
>> >> >> >     2479 unknown
>> >> >> > #
>> >> >> > The production cluster has 5k writes/s on average and the backup cluster has 1-2k writes/s on average. The production cluster is bigger and has better specs. I thought that the backup cluster would be able to keep up, but it looks like I was wrong.

>> >> >>

>> >> >> The fact that they are in the unknown state just means that the remote
>> >> >> "rbd-mirror" daemon hasn't started any journal replayers against the
>> >> >> images. If it couldn't keep up, it would still report a status of
>> >> >> "up+replaying". What Ceph release are you running on your backup
>> >> >> cluster?

>> >> >>

>> >> > The backup cluster is running Luminous 12.2.11 (the production cluster is running 12.2.10).

>> >> >

>> >> >>

>> >> >> > >> And the journals on the rbd volumes keep growing...

>> >> >> > >>

>> >> >> > >> Is it enough to simply disable mirroring of the pool (rbd mirror pool disable <pool>)? Will that remove the lagging reader from the journals and shrink them, or is there anything else that has to be done?

>> >> >> > >

>> >> >> > >You can either disable the journaling feature on the image(s), since
>> >> >> > >there is no point in leaving it on if you aren't using mirroring, or run
>> >> >> > >"rbd mirror pool disable <pool>" to purge the journals.

>> >> >> >

>> >> >> > Thanks for the confirmation.

>> >> >> > I will stop the mirror of the nova pool and try to figure out if there is anything we can do to get the backup cluster to keep up.

>> >> >> >

>> >> >> > >> Best regards

>> >> >> > >> /Magnus

>> >> >> > >> _______________________________________________

>> >> >> > >> ceph-users mailing list

>> >> >> > >> ceph-users@xxxxxxxxxxxxxx

>> >> >> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>> >> >> > >

>> >> >> > >--

>> >> >> > >Jason

-- 

Jason
