Re: CephFS troubleshooting

> Has it worked before or did it just stop working at some point? What's the exact command that fails (and the error message, if there is one)?

It was working through the NFS gateway; I had never tried the Ceph FUSE mount. The command is ceph-fuse --id migration /mnt/repo. There is no error message, it just hangs.
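
To narrow down where it hangs, this is what I plan to try next (just a sketch based on my setup; the filesystem name "repo" and the debug levels are my own guesses, not something that was suggested):
---
# Check the filesystem and overall cluster health first
ceph fs status repo
ceph health detail

# Run the FUSE client in the foreground with client-side debug logging,
# so the hang shows up in the output instead of blocking silently
ceph-fuse -f --id migration --client_fs repo /mnt/repo --debug-client=20 --debug-ms=1
---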

> > For the "too many PGs per OSD" I suppose I have to add some other 
> > OSDs, right?

> Either that or reduce the number of PGs. If you had only a few pools I'd suggest leaving it to the autoscaler, but not for 13 pools. You can paste 'ceph osd df' and 'ceph osd pool ls detail' if you need more input for that.

I already have the autoscaler enabled. Here is the output you asked for:
---
ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
 2    hdd  0.90970   1.00000  932 GiB  332 GiB  330 GiB  1.7 MiB  1.4 GiB  600 GiB  35.63  0.88  329      up
 4    hdd  0.90970   1.00000  932 GiB  400 GiB  399 GiB  1.6 MiB  1.5 GiB  531 GiB  42.94  1.07  331      up
 3    hdd  0.45479   1.00000  466 GiB  203 GiB  202 GiB  1.0 MiB  988 MiB  263 GiB  43.57  1.08  206      up
 5    hdd  0.93149   1.00000  932 GiB  379 GiB  378 GiB  1.6 MiB  909 MiB  552 GiB  40.69  1.01  321      up
                       TOTAL  3.2 TiB  1.3 TiB  1.3 TiB  5.9 MiB  4.8 GiB  1.9 TiB  40.30
MIN/MAX VAR: 0.88/1.08  STDDEV: 3.15
---
ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 3 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 24150 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 2 'kubernetes' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 24150 lfor 0/0/92 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 3 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 24150 lfor 0/0/123 flags hashpspool stripe_width 0 application rgw
pool 4 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 24150 lfor 0/0/132 flags hashpspool stripe_width 0 application rgw
pool 5 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 24150 lfor 0/0/132 flags hashpspool stripe_width 0 application rgw
pool 6 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 24150 lfor 0/0/134 flags hashpspool stripe_width 0 pg_autoscale_bias 4 application rgw
pool 7 'repo_data' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode on last_change 30692 lfor 0/30692/30690 flags hashpspool stripe_width 0 application cephfs
pool 8 'repo_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 24150 lfor 0/0/150 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 9 '.nfs' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 24150 lfor 0/0/169 flags hashpspool stripe_width 0 application nfs
pool 11 'default.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 24150 lfor 0/0/592 flags hashpspool stripe_width 0 pg_autoscale_bias 4 application rgw
pool 12 'default.rgw.buckets.data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 24150 lfor 0/0/592 flags hashpspool stripe_width 0 application rgw
pool 13 'default.rgw.buckets.non-ec' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 24150 lfor 0/0/644 flags hashpspool stripe_width 0 application rgw
pool 19 'kubernetes-lan' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 24150 lfor 0/0/15682 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
---
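
As for the PG warning, this is what I had in mind unless you advise otherwise (again only a sketch; the pool name and the numbers below are placeholders, not values I have settled on):
---
# See the current PG counts and what the autoscaler suggests per pool
ceph osd pool autoscale-status

# Lower pg_num on an oversized pool, for example (placeholder pool and
# value; the autoscaler may adjust it again while autoscale_mode is on)
ceph osd pool set default.rgw.buckets.data pg_num 16

# Or, since there are only 4 OSDs, raise the warning threshold instead
# of touching the pools (the default is 250)
ceph config set global mon_max_pg_per_osd 300
---
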
Regards

Quoting Eugenio Tampieri <eugenio.tampieri@xxxxxxxxxxxxxxx>:

> Hi Eugen,
> Sorry, but I had some trouble when I signed up and then I was away so 
> I missed your reply.
>
>> ceph auth export client.migration
>> [client.migration]
>>         key = redacted
>>         caps mds = "allow rw fsname=repo"
>>         caps mon = "allow r fsname=repo"
>>         caps osd = "allow rw tag cephfs data=repo"
>
> For the "too many PGs per OSD" I suppose I have to add some other 
> OSDs, right?
>
> Thanks,
>
> Eugenio
>
> -----Original message-----
> From: Eugen Block <eblock@xxxxxx>
> Sent: Wednesday, 4 September 2024 10:07
> To: ceph-users@xxxxxxx
> Subject: Re: CephFS troubleshooting
>
> Hi, I already responded to your first attempt:
>
> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/GS7KJRJP7BAOF66KJM255G27TJ4KG656/
>
> Please provide the requested details.
>
>
> Quoting Eugenio Tampieri <eugenio.tampieri@xxxxxxxxxxxxxxx>:
>
>> Hello,
>> I'm writing to troubleshoot an otherwise functional Ceph Quincy 
>> cluster that has issues with CephFS.
>> I cannot mount it with ceph-fuse (it gets stuck), and if I mount it 
>> with NFS I can list the directories but I cannot read or write 
>> anything.
>> Here's the output of ceph -s
>>   cluster:
>>     id:     3b92e270-1dd6-11ee-a738-000c2937f0ec
>>     health: HEALTH_WARN
>>             mon ceph-storage-a is low on available space
>>             1 daemons have recently crashed
>>             too many PGs per OSD (328 > max 250)
>>
>>   services:
>>     mon:        5 daemons, quorum
>> ceph-mon-a,ceph-storage-a,ceph-mon-b,ceph-storage-c,ceph-storage-d
>> (age 105m)
>>     mgr:        ceph-storage-a.ioenwq(active, since 106m), standbys:
>> ceph-mon-a.tiosea
>>     mds:        1/1 daemons up, 2 standby
>>     osd:        4 osds: 4 up (since 104m), 4 in (since 24h)
>>     rbd-mirror: 2 daemons active (2 hosts)
>>     rgw:        2 daemons active (2 hosts, 1 zones)
>>
>>   data:
>>     volumes: 1/1 healthy
>>     pools:   13 pools, 481 pgs
>>     objects: 231.83k objects, 648 GiB
>>     usage:   1.3 TiB used, 1.8 TiB / 3.1 TiB avail
>>     pgs:     481 active+clean
>>
>>   io:
>>     client:   1.5 KiB/s rd, 8.6 KiB/s wr, 1 op/s rd, 0 op/s wr
>> Best regards,
>>
>> Eugenio Tampieri



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



