Eugen, it worked and it didn't. I had to bootstrap on v17.2.3; with v17.2.4 this
behavior still occurs. I ran numerous tests with 3 VMs, two with disks and one
for MON only: on v17.2.4 the cluster simply crashes when one of the hosts with
disks dies, even with three MONs. I don't understand why this happens.
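
For what it's worth, this is roughly how I checked the two points Eugen raises
below (MON quorum and the client-side monitor list) while one of the hosts was
down. It is only a sketch of my lab setup; the fsid and the IP addresses are
placeholders:

  # from a surviving host: do the remaining MONs still form a quorum?
  ceph quorum_status -f json-pretty | grep -A4 quorum_names
  ceph mon stat

  # the clients' /etc/ceph/ceph.conf should list all three MONs, not just one:
  [global]
      fsid = <cluster fsid>
      mon_host = 192.168.1.11,192.168.1.12,192.168.1.13   # dcs1, dcs2, dcs3 (placeholder IPs)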

On Fri, Oct 14, 2022 at 03:53, Eugen Block <eblock@xxxxxx> wrote:

> To me this sounds more like either your MONs didn't have a quorum
> anymore or your clients didn't have all MONs in their ceph.conf, maybe
> just the failed one? Then the issue is resolved now?
>
> Quoting Murilo Morais <murilo@xxxxxxxxxxxxxx>:
>
> > Unfortunately I can't verify whether ceph reports any inactive PGs. As soon
> > as the second host disconnects, practically everything locks up; nothing
> > appears even with "ceph -w". The OSDs are only reported as offline once
> > dcs2 comes back.
> >
> > Note: apparently there was a new update recently. In the test environment
> > this behavior was not happening: dcs1 stayed UP with all services, without
> > crashing, even with dcs2 DOWN, serving reads and writes, and even before
> > dcs3 was added.
> >
> > ### COMMANDS ###
> > [ceph: root@dcs1 /]# ceph osd tree
> > ID  CLASS  WEIGHT    TYPE NAME      STATUS  REWEIGHT  PRI-AFF
> > -1         65.49570  root default
> > -3         32.74785      host dcs1
> >  0    hdd   2.72899          osd.0      up   1.00000  1.00000
> >  1    hdd   2.72899          osd.1      up   1.00000  1.00000
> >  2    hdd   2.72899          osd.2      up   1.00000  1.00000
> >  3    hdd   2.72899          osd.3      up   1.00000  1.00000
> >  4    hdd   2.72899          osd.4      up   1.00000  1.00000
> >  5    hdd   2.72899          osd.5      up   1.00000  1.00000
> >  6    hdd   2.72899          osd.6      up   1.00000  1.00000
> >  7    hdd   2.72899          osd.7      up   1.00000  1.00000
> >  8    hdd   2.72899          osd.8      up   1.00000  1.00000
> >  9    hdd   2.72899          osd.9      up   1.00000  1.00000
> > 10    hdd   2.72899          osd.10     up   1.00000  1.00000
> > 11    hdd   2.72899          osd.11     up   1.00000  1.00000
> > -5         32.74785      host dcs2
> > 12    hdd   2.72899          osd.12     up   1.00000  1.00000
> > 13    hdd   2.72899          osd.13     up   1.00000  1.00000
> > 14    hdd   2.72899          osd.14     up   1.00000  1.00000
> > 15    hdd   2.72899          osd.15     up   1.00000  1.00000
> > 16    hdd   2.72899          osd.16     up   1.00000  1.00000
> > 17    hdd   2.72899          osd.17     up   1.00000  1.00000
> > 18    hdd   2.72899          osd.18     up   1.00000  1.00000
> > 19    hdd   2.72899          osd.19     up   1.00000  1.00000
> > 20    hdd   2.72899          osd.20     up   1.00000  1.00000
> > 21    hdd   2.72899          osd.21     up   1.00000  1.00000
> > 22    hdd   2.72899          osd.22     up   1.00000  1.00000
> > 23    hdd   2.72899          osd.23     up   1.00000  1.00000
> >
> >
> > [ceph: root@dcs1 /]# ceph osd pool ls detail
> > pool 1 '.mgr' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 26 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
> > pool 2 'cephfs.ovirt_hosted_engine.meta' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 77 lfor 0/0/47 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
> > pool 3 'cephfs.ovirt_hosted_engine.data' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 179 lfor 0/0/47 flags hashpspool max_bytes 107374182400 stripe_width 0 application cephfs
> > pool 6 '.nfs' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 254 lfor 0/0/252 flags hashpspool stripe_width 0 application nfs
> > pool 7 'cephfs.ovirt_storage_sas.meta' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 322 lfor 0/0/287 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
> > pool 8 'cephfs.ovirt_storage_sas.data' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 291 lfor 0/0/289 flags hashpspool stripe_width 0 application cephfs
> > pool 9 'cephfs.ovirt_storage_iso.meta' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 356 lfor 0/0/325 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
> > pool 10 'cephfs.ovirt_storage_iso.data' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 329 lfor 0/0/327 flags hashpspool stripe_width 0 application cephfs
> >
> >
> > [ceph: root@dcs1 /]# ceph osd crush rule dump replicated_rule
> > {
> >     "rule_id": 0,
> >     "rule_name": "replicated_rule",
> >     "type": 1,
> >     "steps": [
> >         {
> >             "op": "take",
> >             "item": -1,
> >             "item_name": "default"
> >         },
> >         {
> >             "op": "chooseleaf_firstn",
> >             "num": 0,
> >             "type": "host"
> >         },
> >         {
> >             "op": "emit"
> >         }
> >     ]
> > }
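
(Side note on the rule dump above: with chooseleaf_firstn over type "host" and
only two OSD hosts, every PG of a size-2 pool should end up with one copy on
dcs1 and one on dcs2. A quick spot check for a single PG, e.g. 3.0 from the
listing below; the epoch in the sample output is just an example:)

  ceph pg map 3.0
  # expected output looks roughly like:
  #   osdmap e530 pg 3.0 (3.0) -> up [1,23] acting [1,23]
  # osd.1 lives on dcs1 and osd.23 on dcs2, i.e. one replica per host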
> >
> >
> > [ceph: root@dcs1 /]# ceph pg ls-by-pool cephfs.ovirt_hosted_engine.data
> > PG    OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES      OMAP_BYTES*  OMAP_KEYS*  LOG    STATE         SINCE  VERSION    REPORTED   UP          ACTING      SCRUB_STAMP                      DEEP_SCRUB_STAMP                 LAST_SCRUB_DURATION  SCRUB_SCHEDULING
> > 3.0   69  0  0  0  285213095  0  0  10057  active+clean  41m  530'20632  530:39461  [1,23]p1    [1,23]p1    2022-10-13T03:19:33.649837+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T07:24:46.314217+0000
> > 3.1   58  0  0  0  242319360  0  0  10026  active+clean  41m  530'11926  530:21424  [6,19]p6    [6,19]p6    2022-10-13T02:15:23.395162+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T11:42:17.682881+0000
> > 3.2   71  0  0  0  294629376  0  0  10012  active+clean  41m  530'12312  530:25506  [10,16]p10  [10,16]p10  2022-10-13T06:12:48.839013+0000  2022-10-11T21:09:49.405860+0000  1  periodic scrub scheduled @ 2022-10-14T12:35:23.917129+0000
> > 3.3   63  0  0  0  262520832  0  0  10073  active+clean  41m  530'20204  530:42834  [13,11]p13  [13,11]p13  2022-10-13T01:16:17.672947+0000  2022-10-11T16:43:27.935298+0000  1  periodic scrub scheduled @ 2022-10-14T11:48:42.643271+0000
> > 3.4   59  0  0  0  240611328  0  0  10017  active+clean  41m  530'17883  530:32537  [10,22]p10  [10,22]p10  2022-10-12T22:09:09.376552+0000  2022-10-10T15:00:52.196397+0000  1  periodic scrub scheduled @ 2022-10-14T01:16:35.682204+0000
> > 3.5   67  0  0  0  281018368  0  0  10017  active+clean  41m  530'18825  530:31531  [18,3]p18   [18,3]p18   2022-10-12T18:13:50.835870+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T02:17:12.292237+0000
> > 3.6   60  0  0  0  239497216  0  0  10079  active+clean  41m  530'22537  530:34790  [0,21]p0    [0,21]p0    2022-10-12T20:38:44.998414+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T08:12:12.106892+0000
> > 3.7   54  0  0  0  221261824  0  0  10082  active+clean  41m  530'30718  530:37349  [4,12]p4    [4,12]p4    2022-10-12T20:26:54.091307+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-13T20:51:54.792643+0000
> > 3.8   70  0  0  0  293588992  0  0  4527   active+clean  41m  530'4527   530:16905  [11,21]p11  [11,21]p11  2022-10-13T07:16:50.226814+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T13:02:27.444761+0000
> > 3.9   47  0  0  0  192938407  0  0  10065  active+clean  41m  530'11065  530:21345  [19,11]p19  [19,11]p19  2022-10-13T05:05:36.274216+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T08:17:25.165367+0000
> > 3.a   60  0  0  0  251658240  0  0  10044  active+clean  41m  530'14744  530:23145  [18,1]p18   [18,1]p18   2022-10-13T04:29:38.891055+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T11:10:38.556482+0000
> > 3.b   52  0  0  0  209567744  0  0  4949   active+clean  41m  530'4949   530:26757  [7,23]p7    [7,23]p7    2022-10-12T22:08:45.621201+0000  2022-10-10T15:00:36.799456+0000  1  periodic scrub scheduled @ 2022-10-14T02:28:08.061560+0000
> > 3.c   68  0  0  0  276607307  0  0  10003  active+clean  41m  530'18828  530:39884  [18,8]p18   [18,8]p18   2022-10-12T18:25:36.991393+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T00:43:12.804024+0000
> > 3.d   67  0  0  0  272621888  0  0  6708   active+clean  41m  530'8359   530:33988  [13,7]p13   [13,7]p13   2022-10-12T21:42:29.600145+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-13T23:30:29.341646+0000
> > 3.e   68  0  0  0  276746240  0  0  5178   active+clean  41m  530'5278   530:16051  [13,1]p13   [13,1]p13   2022-10-13T05:47:06.004714+0000  2022-10-11T21:04:57.978685+0000  1  periodic scrub scheduled @ 2022-10-14T11:45:33.438178+0000
> > 3.f   65  0  0  0  269307904  0  0  10056  active+clean  41m  530'34965  530:49963  [23,4]p23   [23,4]p23   2022-10-13T08:58:09.493284+0000  2022-10-10T15:00:36.390467+0000  1  periodic scrub scheduled @ 2022-10-14T12:18:58.610252+0000
> > 3.10  66  0  0  0  271626240  0  0  4272   active+clean  41m  530'4431   530:19010  [12,9]p12   [12,9]p12   2022-10-13T03:52:14.952046+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T07:48:12.441144+0000
> > 3.11  58  0  0  0  239075657  0  0  6466   active+clean  41m  530'8563   530:24677  [18,0]p18   [18,0]p18   2022-10-12T22:25:17.255090+0000  2022-10-10T15:00:43.412084+0000  1  periodic scrub scheduled @ 2022-10-14T03:25:34.048845+0000
> > 3.12  45  0  0  0  186254336  0  0  10084  active+clean  41m  530'16084  530:31273  [6,14]p6    [6,14]p6    2022-10-13T03:05:14.109923+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T03:35:11.159743+0000
> > 3.13  68  0  0  0  275124224  0  0  10013  active+clean  41m  530'28676  530:52278  [16,8]p16   [16,8]p16   2022-10-12T21:46:50.747741+0000  2022-10-11T16:48:56.632027+0000  1  periodic scrub scheduled @ 2022-10-14T07:03:49.125496+0000
> > 3.14  58  0  0  0  240123904  0  0  7531   active+clean  41m  530'8212   530:26075  [23,4]p23   [23,4]p23   2022-10-13T04:25:39.131070+0000  2022-10-13T04:25:39.131070+0000  4  periodic scrub scheduled @ 2022-10-14T05:36:16.428326+0000
> > 3.15  59  0  0  0  247382016  0  0  8890   active+clean  41m  530'8890   530:18892  [23,3]p23   [23,3]p23   2022-10-13T04:45:48.156899+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T14:55:14.651919+0000
> > 3.16  57  0  0  0  237285376  0  0  6900   active+clean  41m  530'8766   530:20717  [19,9]p19   [19,9]p19   2022-10-13T00:13:35.716060+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T07:08:16.779024+0000
> > 3.17  56  0  0  0  234303488  0  0  10012  active+clean  41m  530'21461  530:31490  [0,13]p0    [0,13]p0    2022-10-13T07:42:57.775955+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T14:52:30.758744+0000
> > 3.18  47  0  0  0  197132288  0  0  10001  active+clean  41m  530'14783  530:20829  [10,14]p10  [10,14]p10  2022-10-13T00:41:44.050740+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T09:30:02.438044+0000
> > 3.19  50  0  0  0  209715200  0  0  10058  active+clean  41m  499'19880  530:27891  [8,23]p8    [8,23]p8    2022-10-13T10:58:13.948274+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T19:55:12.268345+0000
> > 3.1a  58  0  0  0  240123904  0  0  10037  active+clean  41m  530'36799  530:50997  [16,9]p16   [16,9]p16   2022-10-13T02:03:18.026427+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T04:55:58.684437+0000
> > 3.1b  53  0  0  0  219996160  0  0  10051  active+clean  41m  530'18388  530:29223  [0,22]p0    [0,22]p0    2022-10-12T19:19:25.675030+0000  2022-10-12T19:19:25.675030+0000  4  periodic scrub scheduled @ 2022-10-14T00:21:49.935082+0000
> > 3.1c  66  0  0  0  276762624  0  0  10027  active+clean  41m  530'16327  530:38127  [20,5]p20   [20,5]p20   2022-10-13T00:04:49.227288+0000  2022-10-10T15:00:38.834351+0000  1  periodic scrub scheduled @ 2022-10-14T01:15:26.524544+0000
> > 3.1d  49  0  0  0  201327104  0  0  10020  active+clean  41m  530'26433  530:51593  [17,9]p17   [17,9]p17   2022-10-13T03:49:02.466987+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T09:04:39.909179+0000
> > 3.1e  61  0  0  0  249098595  0  0  8790   active+clean  41m  530'8790   530:17807  [3,21]p3    [3,21]p3    2022-10-12T22:28:19.417597+0000  2022-10-10T15:00:39.474873+0000  1  periodic scrub scheduled @ 2022-10-13T23:49:55.974786+0000
> > 3.1f  53  0  0  0  222056448  0  0  10053  active+clean  41m  530'35776  530:50234  [0,15]p0    [0,15]p0    2022-10-13T07:16:46.787818+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T16:24:45.860894+0000
> >
> > * NOTE: Omap statistics are gathered during deep scrub and may be
> > inaccurate soon afterwards depending on utilization. See
> > http://docs.ceph.com/en/latest/dev/placement-group/#omap-statistics for
> > further details.
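
(If the cluster answers at all while dcs2 is down, this is roughly how I would
look for the inactive PGs Eugen asks about below; nothing here is specific to
this cluster:)

  ceph health detail             # lists inactive/undersized PGs if there are any
  ceph pg dump_stuck inactive    # PGs that are not serving IO
  ceph pg dump_stuck undersized  # PGs running with fewer copies than "size"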
> >
> > On Thu, Oct 13, 2022 at 13:54, Eugen Block <eblock@xxxxxx> wrote:
> >
> >> Could you share more details? Does ceph report inactive PGs when one
> >> node is down? Please share:
> >> ceph osd tree
> >> ceph osd pool ls detail
> >> ceph osd crush rule dump <rule of affected pool>
> >> ceph pg ls-by-pool <affected pool>
> >> ceph -s
> >>
> >> Quoting Murilo Morais <murilo@xxxxxxxxxxxxxx>:
> >>
> >> > Thanks for answering.
> >> > Marc, but is there no mechanism to prevent the IO pause? At the moment
> >> > I'm not worried about data loss.
> >> > I understand that setting it to replica x1 can work, but I need it to be x2.
> >> >
> >> > On Thu, Oct 13, 2022 at 12:26, Marc <Marc@xxxxxxxxxxxxxxxxx> wrote:
> >> >
> >> >>
> >> >> > I'm having strange behavior on a new cluster.
> >> >>
> >> >> Not strange, by design
> >> >>
> >> >> > I have 3 machines, two of them have the disks. We can name them like
> >> >> > this: dcs1 to dcs3. The dcs1 and dcs2 machines contain the disks.
> >> >> >
> >> >> > I started bootstrapping through dcs1, added the other hosts and left
> >> >> > mgr on dcs3 only.
> >> >> >
> >> >> > What is happening is that if I take down dcs2 everything hangs and
> >> >> > becomes unresponsive, including the mount points that were pointed
> >> >> > to dcs1.
> >> >>
> >> >> You have to have disks in 3 machines. (Or set the replication to 1x)
> >> >>
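
(On Marc's point above, as far as I understand the knobs involved; the pool
name is just one of mine from the listing further up:)

  ceph osd pool get cephfs.ovirt_hosted_engine.data size       # currently 2
  ceph osd pool get cephfs.ovirt_hosted_engine.data min_size   # currently 1
  # size 2 on two OSD hosts means one copy per host, so losing a host leaves
  # each PG with a single copy. min_size 1 should still allow IO on that one
  # copy, while min_size 2 pauses IO until the second copy is back.
  # with a third OSD host the usual setup would be size 3 / min_size 2:
  #   ceph osd pool set cephfs.ovirt_hosted_engine.data size 3
  #   ceph osd pool set cephfs.ovirt_hosted_engine.data min_size 2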
> >> > _______________________________________________
> >> > ceph-users mailing list -- ceph-users@xxxxxxx
> >> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >>
> >> _______________________________________________
> >> ceph-users mailing list -- ceph-users@xxxxxxx
> >> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx