Re: R: R: Re: CephFS troubleshooting

Hello Eugenio,

All "it just hangs" issues that I have seen so far came down to some
network problem. Please check that you can ping all OSDs, MDSs, and
MONs from the client. Please retest using large, unfragmented pings
(ping -M do -s 8972 192.168.12.34, assuming a 9000-byte MTU). Please
inspect firewalls. If multiple network cards are used, make sure that
the cables are not accidentally swapped.
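
If it helps, here is a rough sketch of such a check from the client
side (192.168.12.34 is only a placeholder address, and 8972 assumes a
9000-byte MTU on every link):

  # addresses the cluster expects clients to reach
  ceph mon dump | grep -E 'v1:|v2:'
  ceph osd dump | grep '^osd\.'
  ceph fs dump | grep addr

  # from the client, for each address: a normal ping, then a
  # full-size unfragmented one (8972 payload + 28 header bytes = 9000)
  ping -c 3 192.168.12.34
  ping -c 3 -M do -s 8972 192.168.12.34

  # and confirm the daemon TCP ports are reachable, e.g.
  # 3300/6789 for the MONs and 6800-7300 for OSDs and MDSs
  nc -zv 192.168.12.34 3300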

On Wed, Sep 4, 2024 at 6:50 PM Eugenio Tampieri
<eugenio.tampieri@xxxxxxxxxxxxxxx> wrote:
>
> > Has it worked before, or did it just stop working at some point? What's the exact command that fails (and the error message, if there is one)?
>
> It was working via the NFS gateway; I have never tried the Ceph FUSE mount. The command is ceph-fuse --id migration /mnt/repo. There is no error message, it just hangs.
>
> > > For the "too many PGs per OSD" warning, I suppose I have to add some
> > > more OSDs, right?
>
> > Either that or reduce the number of PGs. If you had only a few pools I'd suggest leaving it to the autoscaler, but not for 13 pools. You can paste 'ceph osd df' and 'ceph osd pool ls detail' if you need more input for that.
>
> I already have the autoscaler enabled. Here is the output you asked for
> ---
> ceph osd df
> ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
>  2    hdd  0.90970   1.00000  932 GiB  332 GiB  330 GiB  1.7 MiB  1.4 GiB  600 GiB  35.63  0.88  329      up
>  4    hdd  0.90970   1.00000  932 GiB  400 GiB  399 GiB  1.6 MiB  1.5 GiB  531 GiB  42.94  1.07  331      up
>  3    hdd  0.45479   1.00000  466 GiB  203 GiB  202 GiB  1.0 MiB  988 MiB  263 GiB  43.57  1.08  206      up
>  5    hdd  0.93149   1.00000  932 GiB  379 GiB  378 GiB  1.6 MiB  909 MiB  552 GiB  40.69  1.01  321      up
>                        TOTAL  3.2 TiB  1.3 TiB  1.3 TiB  5.9 MiB  4.8 GiB  1.9 TiB  40.30
> MIN/MAX VAR: 0.88/1.08  STDDEV: 3.15
> ---
> ceph osd pool ls detail
> pool 1 '.mgr' replicated size 3 min_size 3 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 24150 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
> pool 2 'kubernetes' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 24150 lfor 0/0/92 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
> pool 3 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 24150 lfor 0/0/123 flags hashpspool stripe_width 0 application rgw
> pool 4 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 24150 lfor 0/0/132 flags hashpspool stripe_width 0 application rgw
> pool 5 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 24150 lfor 0/0/132 flags hashpspool stripe_width 0 application rgw
> pool 6 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 24150 lfor 0/0/134 flags hashpspool stripe_width 0 pg_autoscale_bias 4 application rgw
> pool 7 'repo_data' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode on last_change 30692 lfor 0/30692/30690 flags hashpspool stripe_width 0 application cephfs
> pool 8 'repo_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 24150 lfor 0/0/150 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
> pool 9 '.nfs' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 24150 lfor 0/0/169 flags hashpspool stripe_width 0 application nfs
> pool 11 'default.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 24150 lfor 0/0/592 flags hashpspool stripe_width 0 pg_autoscale_bias 4 application rgw
> pool 12 'default.rgw.buckets.data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 24150 lfor 0/0/592 flags hashpspool stripe_width 0 application rgw
> pool 13 'default.rgw.buckets.non-ec' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 24150 lfor 0/0/644 flags hashpspool stripe_width 0 application rgw
> pool 19 'kubernetes-lan' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 24150 lfor 0/0/15682 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
> ---
> Regards
>
> Quoting Eugenio Tampieri <eugenio.tampieri@xxxxxxxxxxxxxxx>:
>
> > Hi Eugen,
> > Sorry, but I had some trouble when I signed up, and then I was away,
> > so I missed your reply.
> >
> >> ceph auth export client.migration
> >> [client.migration]
> >>         key = redacted
> >>         caps mds = "allow rw fsname=repo"
> >>         caps mon = "allow r fsname=repo"
> >>         caps osd = "allow rw tag cephfs data=repo"
> >
> > For the "too many PGs per OSD" warning, I suppose I have to add some
> > more OSDs, right?
> >
> > Thanks,
> >
> > Eugenio
> >
> > -----Original Message-----
> > From: Eugen Block <eblock@xxxxxx>
> > Sent: Wednesday, 4 September 2024 10:07
> > To: ceph-users@xxxxxxx
> > Subject: Re: CephFS troubleshooting
> >
> > Hi, I already responded to your first attempt:
> >
> > https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/GS7KJRJP7BAOF66KJM255G27TJ4KG656/
> >
> > Please provide the requested details.
> >
> >
> > Quoting Eugenio Tampieri <eugenio.tampieri@xxxxxxxxxxxxxxx>:
> >
> >> Hello,
> >> I'm writing to troubleshoot an otherwise functional Ceph quincy
> >> cluster that has issues with cephfs.
> >> I cannot mount it with ceph-fuse (it gets stuck), and if I mount it
> >> with NFS I can list the directories but I cannot read or write
> >> anything.
> >> Here's the output of ceph -s
> >>   cluster:
> >>     id:     3b92e270-1dd6-11ee-a738-000c2937f0ec
> >>     health: HEALTH_WARN
> >>             mon ceph-storage-a is low on available space
> >>             1 daemons have recently crashed
> >>             too many PGs per OSD (328 > max 250)
> >>
> >>   services:
> >>     mon:        5 daemons, quorum
> >> ceph-mon-a,ceph-storage-a,ceph-mon-b,ceph-storage-c,ceph-storage-d
> >> (age 105m)
> >>     mgr:        ceph-storage-a.ioenwq(active, since 106m), standbys:
> >> ceph-mon-a.tiosea
> >>     mds:        1/1 daemons up, 2 standby
> >>     osd:        4 osds: 4 up (since 104m), 4 in (since 24h)
> >>     rbd-mirror: 2 daemons active (2 hosts)
> >>     rgw:        2 daemons active (2 hosts, 1 zones)
> >>
> >>   data:
> >>     volumes: 1/1 healthy
> >>     pools:   13 pools, 481 pgs
> >>     objects: 231.83k objects, 648 GiB
> >>     usage:   1.3 TiB used, 1.8 TiB / 3.1 TiB avail
> >>     pgs:     481 active+clean
> >>
> >>   io:
> >>     client:   1.5 KiB/s rd, 8.6 KiB/s wr, 1 op/s rd, 0 op/s wr
> >> Best regards,
> >>
> >> Eugenio Tampieri
> >> _______________________________________________
> >> ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an
> >> email to ceph-users-leave@xxxxxxx
> >
> >
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an
> > email to ceph-users-leave@xxxxxxx
>
>
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
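
One more note on the "too many PGs per OSD (328 > max 250)" warning
quoted above: the 13 pools add up to 1187 PG replicas (481 PGs, mostly
replicated size 3) spread over only 4 OSDs, so the larger OSDs end up
with roughly 330 PGs each. Either add OSDs or shrink some pools. A
rough sketch, with example values only (pick the pool and target
pg_num yourself; while autoscale_mode is on, the autoscaler may
override a manually set pg_num):

  # see which pools the autoscaler thinks are oversized
  ceph osd pool autoscale-status

  # shrink one of the 32-PG pools; PG merging happens gradually
  ceph osd pool set default.rgw.buckets.non-ec pg_num 8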



-- 
Alexander Patrakov
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



