slow requests and short OSD failures in small cluster

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Dear all,

we run a small cluster [1] that is exclusively used for virtualisation
(kvm/libvirt). Recently we started to run into performance problems
(slow requests, failing OSDs) for no *obvious* reason (at least not for
us).

We do nightly snapshots of VM images and keep the snapshots for 14
days. Currently we run 8 VMs in the cluster.

At first it looked like the problem was related to snapshotting images
of VMs that were up and running (respectively deleting the snapshots
after 14 days). So we changed the procedure to first suspend the VM and
the snapshot its image(s). Snapshots are made at 4 am.

When we removed *all* the old snapshots (the ones done of running VMs)
the cluster suddenly behaved 'normal' again, but after two days of
creating snapshots (not deleting any) of suspended VMs, the slow
requests started again (although by far not as frequent as before).

This morning we experienced subsequent failures (e.g. osd.2
IPv4:6800/1621 failed (2 reporters from different host after 49.976472
>= grace 46.444312) of 4 of our 6 OSDs, resulting in HEALTH_WARN with
up to about 20% of PGs active+undersized+degraded or stale+active+clean
or remapped+peering. No OSD failure lasted longer than 4 minutes. After
15 minutes everything was back to normal again. The noise started at
6:25 am, a time when cron.daily scripts run here.

We have no clue what could have caused this behavior :( There seems to
be no shortage of resources (CPU, RAM, network) that would explain what
happened, but maybe we did not look in the right places. So any hint on
where to look/what to look for would be greatly appreciated :)

[1]  cluster setup

Three nodes: ceph1, ceph2, ceph3

ceph1 and ceph2

    1x Intel(R) Xeon(R) CPU E3-1275 v3 @ 3.50GHz
    32 GB RAM
    RAID1 for OS
    1x Intel 530 Series SSDs (120GB) for Journals
    3x WDC WD2500BUCT-63TWBY0 for OSDs (1TB)
    2x Gbit Ethernet bonded (802.3ad) on HP 2920 Stack 

ceph3

    virtual machine
    1 CPU
    4 GB RAM 

Software

    Debian GNU/Linux Jessie (8.7)
    Kernel 3.16
    ceph 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f) 

Ceph Services

3 Monitors: ceph1, ceph2, ceph3

6 OSDs: ceph1 (3), ceph2 (3) 

Regards,
-- 
J.Hofmüller

           Nisiti
           - Abie Nathan, 1927-2008

Attachment: signature.asc
Description: This is a digitally signed message part

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux