Re: data loss on full file system?

On Mon, 3 Feb 2020, Paul Emmerich wrote:

On Sun, Feb 2, 2020 at 9:35 PM Håkan T Johansson <f96hajo@xxxxxxxxxxx> wrote:



Changing cp (or whatever standard tool is used) to call fsync() before
each close() is not an option for a user.  Also, doing that would
generally lead to terrible performance.  Just tested: a recursive copy
of a 70k-file Linux source tree went from 15 s to 6 minutes on a local
filesystem I have at hand.

Don't do it for every file:  cp foo bar; sync

Does not help:

$ md5sum  ~/rnd100M
2e6c0b54748fa04dfcc54c1705e11a20  /home/htj/rnd100M
$ for i in `seq --format="%05.0f" 1 1000` ; do cp ~/rnd100M rnd1_$i ; done
$ sync
$ for i in `seq --format="%05.0f" 1 50 1000` ; do md5sum rnd1_$i ; done
2e6c0b54748fa04dfcc54c1705e11a20  rnd1_00001
2f282b84e7e608d5852449ed940bfc51  rnd1_00051
2f282b84e7e608d5852449ed940bfc51  rnd1_00101
2f282b84e7e608d5852449ed940bfc51  rnd1_00151
2f282b84e7e608d5852449ed940bfc51  rnd1_00201
2f282b84e7e608d5852449ed940bfc51  rnd1_00251
2f282b84e7e608d5852449ed940bfc51  rnd1_00301
2f282b84e7e608d5852449ed940bfc51  rnd1_00351
2f282b84e7e608d5852449ed940bfc51  rnd1_00401
2f282b84e7e608d5852449ed940bfc51  rnd1_00451
2f282b84e7e608d5852449ed940bfc51  rnd1_00501
2f282b84e7e608d5852449ed940bfc51  rnd1_00551
2f282b84e7e608d5852449ed940bfc51  rnd1_00601
2f282b84e7e608d5852449ed940bfc51  rnd1_00651
2f282b84e7e608d5852449ed940bfc51  rnd1_00701
2f282b84e7e608d5852449ed940bfc51  rnd1_00751
2e6c0b54748fa04dfcc54c1705e11a20  rnd1_00801
2e6c0b54748fa04dfcc54c1705e11a20  rnd1_00851
2e6c0b54748fa04dfcc54c1705e11a20  rnd1_00901
2e6c0b54748fa04dfcc54c1705e11a20  rnd1_00951
$ for i in `seq --format="%05.0f" 1 50 1000` ; do ls -l rnd1_$i ; done
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00001
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00051
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00101
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00151
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00201
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00251
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00301
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00351
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00401
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00451
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00501
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00551
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00601
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00651
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00701
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00751
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00801
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00851
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00901
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00951

(2f282... is the md5sum of a 100 MiB file of 0s)

md5sums around the transition to the filesystem becoming full:

2e6c0b54748fa04dfcc54c1705e11a20  rnd1_00018
2e6c0b54748fa04dfcc54c1705e11a20  rnd1_00019
2e6c0b54748fa04dfcc54c1705e11a20  rnd1_00020
29a396ece342d8b2bc8ca509d961bd02  rnd1_00021
ee7e0deeb6c817bddf7930c3984da83d  rnd1_00022
c07ad8b66905d90fc37183b2bc3ba3ee  rnd1_00023
42f2a55b82642632fcee8d521038e531  rnd1_00024
2f282b84e7e608d5852449ed940bfc51  rnd1_00025
2f282b84e7e608d5852449ed940bfc51  rnd1_00026
2f282b84e7e608d5852449ed940bfc51  rnd1_00027

-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00018
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00019
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00020
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00021
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00022
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00023
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00024
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00025
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00026
-rw-r--r-- 1 htj htj 104857600 feb  5 23:18 rnd1_00027

This was done on a filesystem that had about 3 GB of free space.
Writing 100 GB in total here forced much of the data out of the client
cache again, so a subsequent md5sum reads back different data than was
written.

Note that POSIX only says that sync() shall schedule the writing of buffered data to the filesystem; it need not wait for completion.  fsync(), however, shall not return until the transfer has completed.

https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html
https://pubs.opengroup.org/onlinepubs/9699919799/functions/fsync.html
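
For illustration, the per-file wait-for-completion pattern would look
roughly like the minimal sketch below (file name, buffer and error
handling are simplified; this is the kind of pattern that turned the
15 s recursive copy into 6 minutes):

/* Minimal sketch: write one file and wait for the data to reach
 * stable storage before trusting the result.  The file name and
 * buffer contents are made up for illustration. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    static char buf[1 << 20];                    /* 1 MiB of zeros */
    int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    ssize_t n = write(fd, buf, sizeof buf);
    if (n < 0) { perror("write"); return 1; }
    if ((size_t) n != sizeof buf) { fprintf(stderr, "short write\n"); return 1; }

    /* fsync() shall not return until the data has been transferred
     * to the storage device; a deferred ENOSPC/EIO shows up here
     * instead of being swallowed by the caches. */
    if (fsync(fd) != 0) { perror("fsync"); return 1; }

    if (close(fd) != 0) { perror("close"); return 1; }
    return 0;
}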

Side note: when I used a larger (1 GiB) file as the copy source, an out-of-space error was sometimes reported, but the results were still not reliable.

I do not see how to fulfill the requirement that a read() after a successful write() shall return the written data, unless the cephfs client asks the OSD that will eventually store each block for a space reservation before returning success from the write().  Clients could request such reservations speculatively as soon as they have opened a file in write mode.  An alternative would be that clients cannot drop dirty data from their caches until the out-of-space condition has been cleared.
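
To make that concrete, the client write path I have in mind would look
roughly like the hypothetical sketch below.  None of these names
(buffered_write, have_reservation, request_osd_reservation,
copy_into_page_cache) exist in the cephfs client; they are only
stand-ins for the steps, and the "OSD capacity" is a made-up constant:

/* Purely hypothetical sketch of the reservation idea -- nothing here
 * corresponds to real cephfs client code. */
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

#define FAKE_OSD_CAPACITY (3UL << 30)   /* pretend the OSD has ~3 GB free */

struct open_file { off_t pos; size_t reserved; };

/* Stand-in: has the to-be OSD already promised space for this range? */
static int have_reservation(const struct open_file *f, off_t pos, size_t len)
{
    return (size_t) pos + len <= f->reserved;
}

/* Stand-in: ask the to-be OSD for a promise of space; may fail. */
static int request_osd_reservation(struct open_file *f, off_t pos, size_t len)
{
    if ((size_t) pos + len > FAKE_OSD_CAPACITY)
        return -ENOSPC;                 /* no space promise available */
    f->reserved = (size_t) pos + len;
    return 0;
}

/* Stand-in: the actual buffering is elided in this toy. */
static void copy_into_page_cache(struct open_file *f, const void *data, size_t len)
{
    (void) f; (void) data; (void) len;
}

ssize_t buffered_write(struct open_file *f, const void *data, size_t len)
{
    /* Before acknowledging the write, make sure the OSD that will
     * eventually store this range has promised the space for it. */
    if (!have_reservation(f, f->pos, len)) {
        int ret = request_osd_reservation(f, f->pos, len);
        if (ret != 0)
            return ret;                 /* e.g. -ENOSPC, reported right away */
    }
    copy_into_page_cache(f, data, len); /* now safe to buffer and ack */
    f->pos += (off_t) len;
    return (ssize_t) len;
}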

Cheers,
Håkan






Best regards,
Håkan





Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Jan 27, 2020 at 9:11 PM Håkan T Johansson <f96hajo@xxxxxxxxxxx> wrote:


Hi,

for test purposes, I have set up two 100 GB OSDs, one holding the
data pool and the other the metadata pool for cephfs.

Am running 14.2.6-1-gffd69200ad-1 with packages from
https://mirror.croit.io/debian-nautilus

Am then running a program that creates a lot of 1 MiB files by calling
   fopen()
   fwrite()
   fclose()
for each of them.  Error codes are checked.
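
(The writer is essentially the pattern below; the actual program
differs in details like file naming and buffer contents, so take it
only as a minimal sketch:)

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    static char buf[1 << 20];                /* 1 MiB, zero-filled here */
    long count = argc > 1 ? atol(argv[1]) : 1000;
    char name[64];

    for (long i = 0; i < count; i++) {
        snprintf(name, sizeof name, "file_%08ld", i);

        FILE *fp = fopen(name, "wb");
        if (!fp) { perror("fopen"); return 1; }

        if (fwrite(buf, 1, sizeof buf, fp) != sizeof buf) {
            perror("fwrite");
            return 1;
        }
        /* Note: without fflush()+fsync() here, fclose() can succeed
         * even though the data still only sits in caches. */
        if (fclose(fp) != 0) { perror("fclose"); return 1; }
    }
    return 0;
}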

This works successfully for ~100 GB of data, and then strangely keeps
succeeding for several hundred GB more...  ??

All written files have size 1 MiB according to 'ls', and thus should
contain the data written.  However, on inspection, the files written
after the first ~100 GiB are full of just 0s (checked with hexdump -C).


To further test this, I use the standard tool 'cp' to copy a few random-content
files into the full cephfs filesystem.  cp reports no complaints, and after
the copy operations the content is seen with hexdump -C.  However, after forcing
the data out of the client cache by reading other, earlier-created files,
hexdump -C shows all-0 content for the files copied with 'cp'.  Data that was
there is suddenly gone...?


I am new to ceph.  Is there an option I have missed to avoid this behaviour?
(I could not find one in
https://docs.ceph.com/docs/master/man/8/mount.ceph/ )

Is this behaviour related to
https://docs.ceph.com/docs/mimic/cephfs/full/
?

(That page states 'sometime after a write call has already returned 0'. But if
write returns 0, then no data has been written, so the user program would not
assume any kind of success.)

Best regards,

Håkan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

