Re: OSD on XFS ENOSPC at 84% data / 5% inode and inode64?

On Thu, 2015-11-26 at 22:13 +0300, Andrey Korolyov wrote:
> On Thu, Nov 26, 2015 at 1:29 AM, Laurent GUERBY <laurent@xxxxxxxxxx> wrote:
> > Hi,
> >
> > After our trouble with ext4/xattr soft lockup kernel bug we started
> > moving some of our OSD to XFS, we're using ubuntu 14.04 3.19 kernel
> > and ceph 0.94.5.
> >
> > We have two out of 28 rotational OSD running XFS and
> > they both get restarted regularly because they're terminating with
> > "ENOSPC":
> >
> > 2015-11-25 16:51:08.015820 7f6135153700  0 filestore(/var/lib/ceph/osd/ceph-11)  error (28) No space left on device not handled on operation 0xa0f4d520 (12849173.0.4, or op 4, counting from 0)
> > 2015-11-25 16:51:08.015837 7f6135153700  0 filestore(/var/lib/ceph/osd/ceph-11) ENOSPC handling not implemented
> > 2015-11-25 16:51:08.015838 7f6135153700  0 filestore(/var/lib/ceph/osd/ceph-11)  transaction dump:
> > ...
> >         {
> >             "op_num": 4,
> >             "op_name": "write",
> >             "collection": "58.2d5_head",
> >             "oid": "53e4fed5\/rbd_data.11f20f75aac8266.00000000000a79eb\/head\/\/58",
> >             "length": 73728,
> >             "offset": 4120576,
> >             "bufferlist length": 73728
> >         },
> >
> > (Writing the last 73728 bytes = 72 kbytes of 4 Mbytes if I'm reading
> > this correctly)
> >
> > Mount options:
> >
> > /dev/sdb1 /var/lib/ceph/osd/ceph-11 xfs rw,noatime,attr2,inode64,noquota
> >
> > Space and Inodes:
> >
> > Filesystem     Type      1K-blocks       Used Available Use% Mounted on
> > /dev/sdb1      xfs      1947319356 1624460408 322858948  84% /var/lib/ceph/osd/ceph-11
> >
> > Filesystem     Type        Inodes   IUsed     IFree IUse% Mounted on
> > /dev/sdb1      xfs       48706752 1985587  46721165    5% /var/lib/ceph/osd/ceph-11
> >
> > We're only using rbd devices, so max 4 MB/object write, how
> > can we get ENOSPC for a 4MB operation with 322 GB free space?
> >
> > The most surprising thing is that after the automatic restart
> > disk usage keeps increasing and we no longer get ENOSPC for a while.
> >
> > Did we miss a needed XFS mount option? Did other ceph users
> > encounter this issue with XFS?
> >
> > We have no such issue with ~96% full ext4 OSD (after setting the right
> > value for the various ceph "fill" options).
> >
> > Thanks in advance,
> >
> > Laurent
> >
> 
> Hi, from the given numbers one can conclude that you are facing some
> kind of XFS preallocation bug: the raw space divided by the number of
> files is about four times smaller than a 4 MB object, i.e. you have
> roughly four times more files than full-size objects would account
> for. At a glance it could be avoided by specifying a relatively small
> allocsize= mount option, at the cost of some overall performance;
> appropriate benchmarks can be found in the ceph-users/ceph-devel
> archives. Also, do you plan to keep the overcommit ratio that high
> forever?

Hi,

Thanks for your answer.

On these disks we have 3 active pools, all holding rbd images: a
regular 3-replica one (4 MB files), an EC 4+1 (1 MB files) and an
EC 8+2 (512 kB files). We're currently copying rbd images from EC 4+1
to EC 8+2, so we have temporarily high disk usage on some disks until
we remove the EC 4+1 pool.
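
As a quick sanity check (just a back-of-the-envelope using the df and
df -i numbers quoted above, nothing authoritative), the average
on-disk file size is indeed well below 4 MB, consistent with most
files being EC chunks rather than full 4 MB objects:

$ echo "1624460408 / 1985587" | bc
818
# roughly 818 kB used per file on average, versus 4096 kB for a full
# rbd object

With that many sub-4MB files, per-file speculative preallocation could
plausibly tie up a significant share of the nominally free 322 GB for
a while.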

We're still using straw and not straw2, so usage ranges from 59% to
96% depending on the disk. We have new disks and nodes ready that we
plan to add while migrating to straw2, but we first need to choose
whether to use ext4 or XFS on these new nodes, hence this mail.

Having reread http://xfs.org/index.php/XFS_FAQ with your advice in
mind, I see why speculative preallocation could cause issues with
ceph, which in our case has mostly fixed-size files. These issues
should also be temporary, because the XFS scanner "to perform
background trimming of files with lingering post-EOF preallocations"
runs after 5 minutes:

$ cat /proc/sys/fs/xfs/speculative_prealloc_lifetime
300
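
If we wanted that background trimming to kick in sooner, lowering the
lifetime looks like a plain sysctl (a sketch only, not something we
have tested; the value is in seconds):

$ sudo sysctl -w fs.xfs.speculative_prealloc_lifetime=60
# or equivalently:
$ echo 60 | sudo tee /proc/sys/fs/xfs/speculative_prealloc_lifetime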

A message describing a problem similar to ours:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-April/038817.html

We'll probably go with XFS, but with allocsize=128k added.
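
Something like this (a sketch only, reusing the device and mount point
from the df output above; the OSD would need to be stopped and the
filesystem remounted for the option to take effect):

# /etc/fstab
/dev/sdb1  /var/lib/ceph/osd/ceph-11  xfs  rw,noatime,attr2,inode64,noquota,allocsize=128k  0  0

or, for OSDs mounted by the ceph init scripts, via ceph.conf (if I
read the docs right):

[osd]
osd mount options xfs = rw,noatime,inode64,allocsize=128k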

Sincerely

Laurent

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


