To me this sounds more like either your MONs lost quorum or your clients
didn't have all MONs in their ceph.conf, maybe only the failed one. So is
the issue resolved now?
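
If you want to rule that out, it is quick to check the next time one node
is down. Something along these lines should do (the hostnames below are
just placeholders for your environment):

   # on a surviving node: is there still a quorum, and which MONs are in it?
   ceph quorum_status --format json-pretty
   ceph mon stat

   # on a client that hangs: does ceph.conf list all MONs or only one?
   cat /etc/ceph/ceph.conf
   [global]
       fsid = <cluster fsid>
       mon_host = dcs1,dcs2,dcs3

If mon_host only contains the MON that went down, the client has no way to
reach the remaining MONs and everything appears to hang even though the
cluster itself may be fine.
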
Quoting Murilo Morais <murilo@xxxxxxxxxxxxxx>:
> Unfortunately I can't verify whether ceph reports any inactive PGs. As
> soon as the second host disconnects practically everything locks up, and
> nothing appears even with "ceph -w". The OSDs are only reported as
> offline once dcs2 returns.
>
> Note: apparently there was a recent update. In the test environment this
> behavior was not happening: dcs1 stayed UP with all services, without
> crashing, and kept reading and writing even with dcs2 DOWN, even without
> dcs3 added.
>
> ### COMMANDS ###
> [ceph: root@dcs1 /]# ceph osd tree
> ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
> -1 65.49570 root default
> -3 32.74785 host dcs1
> 0 hdd 2.72899 osd.0 up 1.00000 1.00000
> 1 hdd 2.72899 osd.1 up 1.00000 1.00000
> 2 hdd 2.72899 osd.2 up 1.00000 1.00000
> 3 hdd 2.72899 osd.3 up 1.00000 1.00000
> 4 hdd 2.72899 osd.4 up 1.00000 1.00000
> 5 hdd 2.72899 osd.5 up 1.00000 1.00000
> 6 hdd 2.72899 osd.6 up 1.00000 1.00000
> 7 hdd 2.72899 osd.7 up 1.00000 1.00000
> 8 hdd 2.72899 osd.8 up 1.00000 1.00000
> 9 hdd 2.72899 osd.9 up 1.00000 1.00000
> 10 hdd 2.72899 osd.10 up 1.00000 1.00000
> 11 hdd 2.72899 osd.11 up 1.00000 1.00000
> -5 32.74785 host dcs2
> 12 hdd 2.72899 osd.12 up 1.00000 1.00000
> 13 hdd 2.72899 osd.13 up 1.00000 1.00000
> 14 hdd 2.72899 osd.14 up 1.00000 1.00000
> 15 hdd 2.72899 osd.15 up 1.00000 1.00000
> 16 hdd 2.72899 osd.16 up 1.00000 1.00000
> 17 hdd 2.72899 osd.17 up 1.00000 1.00000
> 18 hdd 2.72899 osd.18 up 1.00000 1.00000
> 19 hdd 2.72899 osd.19 up 1.00000 1.00000
> 20 hdd 2.72899 osd.20 up 1.00000 1.00000
> 21 hdd 2.72899 osd.21 up 1.00000 1.00000
> 22 hdd 2.72899 osd.22 up 1.00000 1.00000
> 23 hdd 2.72899 osd.23 up 1.00000 1.00000
>
>
> [ceph: root@dcs1 /]# ceph osd pool ls detail
> pool 1 '.mgr' replicated size 2 min_size 1 crush_rule 0 object_hash
> rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 26 flags
> hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
> pool 2 'cephfs.ovirt_hosted_engine.meta' replicated size 2 min_size 1
> crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on
> last_change 77 lfor 0/0/47 flags hashpspool stripe_width 0
> pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
> pool 3 'cephfs.ovirt_hosted_engine.data' replicated size 2 min_size 1
> crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on
> last_change 179 lfor 0/0/47 flags hashpspool max_bytes 107374182400
> stripe_width 0 application cephfs
> pool 6 '.nfs' replicated size 2 min_size 1 crush_rule 0 object_hash
> rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 254 lfor
> 0/0/252 flags hashpspool stripe_width 0 application nfs
> pool 7 'cephfs.ovirt_storage_sas.meta' replicated size 2 min_size 1
> crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on
> last_change 322 lfor 0/0/287 flags hashpspool stripe_width 0
> pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
> pool 8 'cephfs.ovirt_storage_sas.data' replicated size 2 min_size 1
> crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on
> last_change 291 lfor 0/0/289 flags hashpspool stripe_width 0 application
> cephfs
> pool 9 'cephfs.ovirt_storage_iso.meta' replicated size 2 min_size 1
> crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on
> last_change 356 lfor 0/0/325 flags hashpspool stripe_width 0
> pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
> pool 10 'cephfs.ovirt_storage_iso.data' replicated size 2 min_size 1
> crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on
> last_change 329 lfor 0/0/327 flags hashpspool stripe_width 0 application
> cephfs
>
>
> [ceph: root@dcs1 /]# ceph osd crush rule dump replicated_rule
> {
>     "rule_id": 0,
>     "rule_name": "replicated_rule",
>     "type": 1,
>     "steps": [
>         {
>             "op": "take",
>             "item": -1,
>             "item_name": "default"
>         },
>         {
>             "op": "chooseleaf_firstn",
>             "num": 0,
>             "type": "host"
>         },
>         {
>             "op": "emit"
>         }
>     ]
> }
>
>
> [ceph: root@dcs1 /]# ceph pg ls-by-pool cephfs.ovirt_hosted_engine.data
> PG  OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  STATE  SINCE  VERSION  REPORTED  UP  ACTING  SCRUB_STAMP  DEEP_SCRUB_STAMP  LAST_SCRUB_DURATION  SCRUB_SCHEDULING
> 3.0  69  0  0  0  285213095  0  0  10057  active+clean  41m  530'20632  530:39461  [1,23]p1  [1,23]p1  2022-10-13T03:19:33.649837+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T07:24:46.314217+0000
> 3.1  58  0  0  0  242319360  0  0  10026  active+clean  41m  530'11926  530:21424  [6,19]p6  [6,19]p6  2022-10-13T02:15:23.395162+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T11:42:17.682881+0000
> 3.2  71  0  0  0  294629376  0  0  10012  active+clean  41m  530'12312  530:25506  [10,16]p10  [10,16]p10  2022-10-13T06:12:48.839013+0000  2022-10-11T21:09:49.405860+0000  1  periodic scrub scheduled @ 2022-10-14T12:35:23.917129+0000
> 3.3  63  0  0  0  262520832  0  0  10073  active+clean  41m  530'20204  530:42834  [13,11]p13  [13,11]p13  2022-10-13T01:16:17.672947+0000  2022-10-11T16:43:27.935298+0000  1  periodic scrub scheduled @ 2022-10-14T11:48:42.643271+0000
> 3.4  59  0  0  0  240611328  0  0  10017  active+clean  41m  530'17883  530:32537  [10,22]p10  [10,22]p10  2022-10-12T22:09:09.376552+0000  2022-10-10T15:00:52.196397+0000  1  periodic scrub scheduled @ 2022-10-14T01:16:35.682204+0000
> 3.5  67  0  0  0  281018368  0  0  10017  active+clean  41m  530'18825  530:31531  [18,3]p18  [18,3]p18  2022-10-12T18:13:50.835870+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T02:17:12.292237+0000
> 3.6  60  0  0  0  239497216  0  0  10079  active+clean  41m  530'22537  530:34790  [0,21]p0  [0,21]p0  2022-10-12T20:38:44.998414+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T08:12:12.106892+0000
> 3.7  54  0  0  0  221261824  0  0  10082  active+clean  41m  530'30718  530:37349  [4,12]p4  [4,12]p4  2022-10-12T20:26:54.091307+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-13T20:51:54.792643+0000
> 3.8  70  0  0  0  293588992  0  0  4527  active+clean  41m  530'4527  530:16905  [11,21]p11  [11,21]p11  2022-10-13T07:16:50.226814+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T13:02:27.444761+0000
> 3.9  47  0  0  0  192938407  0  0  10065  active+clean  41m  530'11065  530:21345  [19,11]p19  [19,11]p19  2022-10-13T05:05:36.274216+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T08:17:25.165367+0000
> 3.a  60  0  0  0  251658240  0  0  10044  active+clean  41m  530'14744  530:23145  [18,1]p18  [18,1]p18  2022-10-13T04:29:38.891055+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T11:10:38.556482+0000
> 3.b  52  0  0  0  209567744  0  0  4949  active+clean  41m  530'4949  530:26757  [7,23]p7  [7,23]p7  2022-10-12T22:08:45.621201+0000  2022-10-10T15:00:36.799456+0000  1  periodic scrub scheduled @ 2022-10-14T02:28:08.061560+0000
> 3.c  68  0  0  0  276607307  0  0  10003  active+clean  41m  530'18828  530:39884  [18,8]p18  [18,8]p18  2022-10-12T18:25:36.991393+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T00:43:12.804024+0000
> 3.d  67  0  0  0  272621888  0  0  6708  active+clean  41m  530'8359  530:33988  [13,7]p13  [13,7]p13  2022-10-12T21:42:29.600145+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-13T23:30:29.341646+0000
> 3.e  68  0  0  0  276746240  0  0  5178  active+clean  41m  530'5278  530:16051  [13,1]p13  [13,1]p13  2022-10-13T05:47:06.004714+0000  2022-10-11T21:04:57.978685+0000  1  periodic scrub scheduled @ 2022-10-14T11:45:33.438178+0000
> 3.f  65  0  0  0  269307904  0  0  10056  active+clean  41m  530'34965  530:49963  [23,4]p23  [23,4]p23  2022-10-13T08:58:09.493284+0000  2022-10-10T15:00:36.390467+0000  1  periodic scrub scheduled @ 2022-10-14T12:18:58.610252+0000
> 3.10  66  0  0  0  271626240  0  0  4272  active+clean  41m  530'4431  530:19010  [12,9]p12  [12,9]p12  2022-10-13T03:52:14.952046+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T07:48:12.441144+0000
> 3.11  58  0  0  0  239075657  0  0  6466  active+clean  41m  530'8563  530:24677  [18,0]p18  [18,0]p18  2022-10-12T22:25:17.255090+0000  2022-10-10T15:00:43.412084+0000  1  periodic scrub scheduled @ 2022-10-14T03:25:34.048845+0000
> 3.12  45  0  0  0  186254336  0  0  10084  active+clean  41m  530'16084  530:31273  [6,14]p6  [6,14]p6  2022-10-13T03:05:14.109923+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T03:35:11.159743+0000
> 3.13  68  0  0  0  275124224  0  0  10013  active+clean  41m  530'28676  530:52278  [16,8]p16  [16,8]p16  2022-10-12T21:46:50.747741+0000  2022-10-11T16:48:56.632027+0000  1  periodic scrub scheduled @ 2022-10-14T07:03:49.125496+0000
> 3.14  58  0  0  0  240123904  0  0  7531  active+clean  41m  530'8212  530:26075  [23,4]p23  [23,4]p23  2022-10-13T04:25:39.131070+0000  2022-10-13T04:25:39.131070+0000  4  periodic scrub scheduled @ 2022-10-14T05:36:16.428326+0000
> 3.15  59  0  0  0  247382016  0  0  8890  active+clean  41m  530'8890  530:18892  [23,3]p23  [23,3]p23  2022-10-13T04:45:48.156899+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T14:55:14.651919+0000
> 3.16  57  0  0  0  237285376  0  0  6900  active+clean  41m  530'8766  530:20717  [19,9]p19  [19,9]p19  2022-10-13T00:13:35.716060+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T07:08:16.779024+0000
> 3.17  56  0  0  0  234303488  0  0  10012  active+clean  41m  530'21461  530:31490  [0,13]p0  [0,13]p0  2022-10-13T07:42:57.775955+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T14:52:30.758744+0000
> 3.18  47  0  0  0  197132288  0  0  10001  active+clean  41m  530'14783  530:20829  [10,14]p10  [10,14]p10  2022-10-13T00:41:44.050740+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T09:30:02.438044+0000
> 3.19  50  0  0  0  209715200  0  0  10058  active+clean  41m  499'19880  530:27891  [8,23]p8  [8,23]p8  2022-10-13T10:58:13.948274+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T19:55:12.268345+0000
> 3.1a  58  0  0  0  240123904  0  0  10037  active+clean  41m  530'36799  530:50997  [16,9]p16  [16,9]p16  2022-10-13T02:03:18.026427+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T04:55:58.684437+0000
> 3.1b  53  0  0  0  219996160  0  0  10051  active+clean  41m  530'18388  530:29223  [0,22]p0  [0,22]p0  2022-10-12T19:19:25.675030+0000  2022-10-12T19:19:25.675030+0000  4  periodic scrub scheduled @ 2022-10-14T00:21:49.935082+0000
> 3.1c  66  0  0  0  276762624  0  0  10027  active+clean  41m  530'16327  530:38127  [20,5]p20  [20,5]p20  2022-10-13T00:04:49.227288+0000  2022-10-10T15:00:38.834351+0000  1  periodic scrub scheduled @ 2022-10-14T01:15:26.524544+0000
> 3.1d  49  0  0  0  201327104  0  0  10020  active+clean  41m  530'26433  530:51593  [17,9]p17  [17,9]p17  2022-10-13T03:49:02.466987+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T09:04:39.909179+0000
> 3.1e  61  0  0  0  249098595  0  0  8790  active+clean  41m  530'8790  530:17807  [3,21]p3  [3,21]p3  2022-10-12T22:28:19.417597+0000  2022-10-10T15:00:39.474873+0000  1  periodic scrub scheduled @ 2022-10-13T23:49:55.974786+0000
> 3.1f  53  0  0  0  222056448  0  0  10053  active+clean  41m  530'35776  530:50234  [0,15]p0  [0,15]p0  2022-10-13T07:16:46.787818+0000  2022-10-10T14:57:31.136809+0000  1  periodic scrub scheduled @ 2022-10-14T16:24:45.860894+0000
>
> * NOTE: Omap statistics are gathered during deep scrub and may be
> inaccurate soon afterwards depending on utilization. See
> http://docs.ceph.com/en/latest/dev/placement-group/#omap-statistics for
> further details.
>
> On Thu, 13 Oct 2022 at 13:54, Eugen Block <eblock@xxxxxx> wrote:
>
>> Could you share more details? Does ceph report inactive PGs when one
>> node is down? Please share:
>> ceph osd tree
>> ceph osd pool ls detail
>> ceph osd crush rule dump <rule of affected pool>
>> ceph pg ls-by-pool <affected pool>
>> ceph -s
>>
>> Quoting Murilo Morais <murilo@xxxxxxxxxxxxxx>:
>>
>> > Thanks for answering.
>> > Marc, but is there no mechanism to prevent the I/O pause? At the
>> > moment I'm not worried about data loss.
>> > I understand that setting it to replica x1 can work, but I need it to
>> > be x2.
>> >
>> > On Thu, 13 Oct 2022 at 12:26, Marc <Marc@xxxxxxxxxxxxxxxxx> wrote:
>> >
>> >>
>> >> >
>> >> > I'm seeing strange behavior on a new cluster.
>> >>
>> >> Not strange, by design
>> >>
>> >> > I have 3 machines, two of them have the disks. We can name them
>> >> > like this: dcs1 to dcs3. The dcs1 and dcs2 machines contain the
>> >> > disks.
>> >> >
>> >> > I started bootstrapping through dcs1, added the other hosts and
>> >> > left mgr on dcs3 only.
>> >> >
>> >> > What is happening is that if I take down dcs2 everything hangs and
>> >> > becomes unresponsive, including the mount points that were pointed
>> >> > to dcs1.
>> >>
>> >> You have to have disks in 3 machines. (Or set the replication to 1x)
>> >>
>> > _______________________________________________
>> > ceph-users mailing list -- ceph-users@xxxxxxx
>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
>>