Re: Ceph mds is stuck in creating status

On Mon, Oct 15, 2018 at 7:15 PM Kisik Jeong <kisik.jeong@xxxxxxxxxxxx> wrote:
>
> I attached the osd & fs dumps. There are clearly two pools (cephfs_data, cephfs_metadata) for CephFS. This system's network is 40Gbps Ethernet for both public and cluster traffic, so I don't think network speed would be a problem. Thank you.

Ah, your pools do exist; I had just been looking at the start of the
MDS log, where it hadn't seen the osdmap yet.

Looking again at your original log together with your osdmap, I notice
that your stuck operations are targeting OSDs 10,11,13,14,15, and all
these OSDs have public addresses in the 192.168.10.x range rather than
the 192.168.40.x range like the others.

So my guess would be that you are intending your OSDs to be in the
192.168.40.x range, but are missing some config settings for certain
daemons.
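
For example, something like this (assuming the jq tool is available on
the monitor host) should list each OSD's public address, so you can spot
the daemons that are still binding to 192.168.10.x:

ceph osd dump --format=json-pretty | jq -r '.osds[] | "\(.osd) \(.public_addr)"'

If that confirms it, fixing the public_addr entries for those OSDs in
ceph.conf and restarting them should clear the stuck operations.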

John


> On Tue, Oct 16, 2018 at 1:18 AM John Spray <jspray@xxxxxxxxxx> wrote:
>>
>> On Mon, Oct 15, 2018 at 4:24 PM Kisik Jeong <kisik.jeong@xxxxxxxxxxxx> wrote:
>> >
>> > Thank you for your reply, John.
>> >
>> > I restarted my Ceph cluster and captured the MDS logs.
>> >
>> > I found that the MDS shows slow requests because some OSDs are laggy.
>> >
>> > I followed the Ceph MDS troubleshooting steps for 'mds slow request', but there are no operations in flight:
>> >
>> > root@hpc1:~/iodc# ceph daemon mds.hpc1 dump_ops_in_flight
>> > {
>> >     "ops": [],
>> >     "num_ops": 0
>> > }
>> >
>> > Is there any other reason that the MDS would show slow requests? Thank you.
>>
>> Those requests seem to be stuck because they're targeting pools
>> that don't exist.  Has something strange happened in the history of
>> this cluster that might have left a filesystem referencing pools that
>> no longer exist?  Ceph is not supposed to permit removal of pools in
>> use by CephFS, but perhaps something went wrong.
>>
>> Check out the "ceph osd dump --format=json-pretty" and "ceph fs dump
>> --format=json-pretty" outputs and how the pool IDs relate.  According
>> to those logs, the data pool with ID 1 and the metadata pool with ID 2
>> do not exist.
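>>
>> For example, something like this (assuming jq is installed) should show
>> which pool IDs the filesystem references versus which pools actually exist:
>>
>> ceph fs dump --format=json-pretty | jq '.filesystems[].mdsmap | {metadata_pool, data_pools}'
>> ceph osd dump --format=json-pretty | jq '.pools[] | {pool, pool_name}'
>>
>> If the IDs from the first command don't appear in the second, the
>> filesystem is referencing pools that are gone.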
>>
>> John
>>
>> > -Kisik
>> >
>> > On Mon, Oct 15, 2018 at 11:43 PM John Spray <jspray@xxxxxxxxxx> wrote:
>> >>
>> >> On Mon, Oct 15, 2018 at 3:34 PM Kisik Jeong <kisik.jeong@xxxxxxxxxxxx> wrote:
>> >> >
>> >> > Hello,
>> >> >
>> >> > I successfully deployed a Ceph cluster with 16 OSDs and created CephFS before.
>> >> > But after rebooting because of an MDS slow request problem, when I create CephFS again, the Ceph MDS goes into the creating state and never changes.
>> >> > Looking at the Ceph status, I don't see any other problem. Here is the 'ceph -s' result:
>> >>
>> >> That's pretty strange.  Usually if an MDS is stuck in "creating", it's
>> >> because an OSD operation is stuck, but in your case all your PGs are
>> >> healthy.
>> >>
>> >> I would suggest setting "debug mds=20" and "debug objecter=10" on your
>> >> MDS, restarting it and capturing those logs so that we can see where
>> >> it got stuck.
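>> >>
>> >> A minimal sketch of that, assuming the default systemd unit name and
>> >> log location, would be to add this to the [mds] section of ceph.conf
>> >> on the MDS host and then restart the daemon:
>> >>
>> >> [mds]
>> >> debug mds = 20
>> >> debug objecter = 10
>> >>
>> >> systemctl restart ceph-mds@hpc1
>> >>
>> >> The extra detail should then show up in /var/log/ceph/ceph-mds.hpc1.log.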
>> >>
>> >> John
>> >>
>> >> > csl@hpc1:~$ ceph -s
>> >> >   cluster:
>> >> >     id:     1a32c483-cb2e-4ab3-ac60-02966a8fd327
>> >> >     health: HEALTH_OK
>> >> >
>> >> >   services:
>> >> >     mon: 1 daemons, quorum hpc1
>> >> >     mgr: hpc1(active)
>> >> >     mds: cephfs-1/1/1 up  {0=hpc1=up:creating}
>> >> >     osd: 16 osds: 16 up, 16 in
>> >> >
>> >> >   data:
>> >> >     pools:   2 pools, 640 pgs
>> >> >     objects: 7 objects, 124B
>> >> >     usage:   34.3GiB used, 116TiB / 116TiB avail
>> >> >     pgs:     640 active+clean
>> >> >
>> >> > However, CephFS still works when I use only 8 OSDs.
>> >> >
>> >> > If you have any idea about this phenomenon, please let me know. Thank you.
>> >> >
>> >> > PS. I attached my ceph.conf contents:
>> >> >
>> >> > [global]
>> >> > fsid = 1a32c483-cb2e-4ab3-ac60-02966a8fd327
>> >> > mon_initial_members = hpc1
>> >> > mon_host = 192.168.40.10
>> >> > auth_cluster_required = cephx
>> >> > auth_service_required = cephx
>> >> > auth_client_required = cephx
>> >> >
>> >> > public_network = 192.168.40.0/24
>> >> > cluster_network = 192.168.40.0/24
>> >> >
>> >> > [osd]
>> >> > osd journal size = 1024
>> >> > osd max object name len = 256
>> >> > osd max object namespace len = 64
>> >> > osd mount options f2fs = active_logs=2
>> >> >
>> >> > [osd.0]
>> >> > host = hpc9
>> >> > public_addr = 192.168.40.18
>> >> > cluster_addr = 192.168.40.18
>> >> >
>> >> > [osd.1]
>> >> > host = hpc10
>> >> > public_addr = 192.168.40.19
>> >> > cluster_addr = 192.168.40.19
>> >> >
>> >> > [osd.2]
>> >> > host = hpc9
>> >> > public_addr = 192.168.40.18
>> >> > cluster_addr = 192.168.40.18
>> >> >
>> >> > [osd.3]
>> >> > host = hpc10
>> >> > public_addr = 192.168.40.19
>> >> > cluster_addr = 192.168.40.19
>> >> >
>> >> > [osd.4]
>> >> > host = hpc9
>> >> > public_addr = 192.168.40.18
>> >> > cluster_addr = 192.168.40.18
>> >> >
>> >> > [osd.5]
>> >> > host = hpc10
>> >> > public_addr = 192.168.40.19
>> >> > cluster_addr = 192.168.40.19
>> >> >
>> >> > [osd.6]
>> >> > host = hpc9
>> >> > public_addr = 192.168.40.18
>> >> > cluster_addr = 192.168.40.18
>> >> >
>> >> > [osd.7]
>> >> > host = hpc10
>> >> > public_addr = 192.168.40.19
>> >> > cluster_addr = 192.168.40.19
>> >> >
>> >> > [osd.8]
>> >> > host = hpc9
>> >> > public_addr = 192.168.40.18
>> >> > cluster_addr = 192.168.40.18
>> >> >
>> >> > [osd.9]
>> >> > host = hpc10
>> >> > public_addr = 192.168.40.19
>> >> > cluster_addr = 192.168.40.19
>> >> >
>> >> > [osd.10]
>> >> > host = hpc9
>> >> > public_addr = 192.168.10.18
>> >> > cluster_addr = 192.168.40.18
>> >> >
>> >> > [osd.11]
>> >> > host = hpc10
>> >> > public_addr = 192.168.10.19
>> >> > cluster_addr = 192.168.40.19
>> >> >
>> >> > [osd.12]
>> >> > host = hpc9
>> >> > public_addr = 192.168.10.18
>> >> > cluster_addr = 192.168.40.18
>> >> >
>> >> > [osd.13]
>> >> > host = hpc10
>> >> > public_addr = 192.168.10.19
>> >> > cluster_addr = 192.168.40.19
>> >> >
>> >> > [osd.14]
>> >> > host = hpc9
>> >> > public_addr = 192.168.10.18
>> >> > cluster_addr = 192.168.40.18
>> >> >
>> >> > [osd.15]
>> >> > host = hpc10
>> >> > public_addr = 192.168.10.19
>> >> > cluster_addr = 192.168.40.19
>> >> >
>> >> > --
>> >> > Kisik Jeong
>> >> > Ph.D. Student
>> >> > Computer Systems Laboratory
>> >> > Sungkyunkwan University
>> >
>> >
>> >
>> > --
>> > Kisik Jeong
>> > Ph.D. Student
>> > Computer Systems Laboratory
>> > Sungkyunkwan University
>
>
>
> --
> Kisik Jeong
> Ph.D. Student
> Computer Systems Laboratory
> Sungkyunkwan University
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



