I-D Action:draft-kunze-pairtree-01.txt

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



A New Internet-Draft is available from the on-line Internet-Drafts directories.

	Title           : Pairtrees for Object Storage (V0.1) http://www.ietf.org/internet-drafts/draft-kunze-pairtree-01.txt
	Author(s)       : J. Kunze, et al.
	Filename        : draft-kunze-pairtree-01.txt
	Pages           : 16
	Date            : 2008-11-26

This document specifies Pairtree, a filesystem hierarchy for holding
objects that are located within that hierarchy by mapping identifier
strings to object directory (or folder) paths two characters at a
time.  If an object directory (folder) holds all the files, and
nothing but the files, that comprise the object, a "pairtree" can be
imported by a system that knows nothing about the nature or structure
of the objects but can still deliver any object's files by requested
identifier.  The mapping is reversible, so the importing system can
also walk the pairtree and reliably enumerate all the contained
object identifiers.  To the extent that object dependencies are
stored inside the pairtree (e.g., fast indexes stored outside contain
only derivative data), simple or complex collections built on top of
pairtrees can recover from index failures and reconstruct a
collection view simply by walking the trees.  Pairtrees have the
advantage that many object operations, including backup and restore,
can be performed with native operating system tools.1.  The basic pairtree
algorithm

The pairtree algorithm maps an arbitrary UTF-8 [RFC3629] encoded
identifier string into a filesystem directory path based on
successive pairs of characters, and also defines the reverse mapping
(from pathname to identifier).

In this document the word "directory" is used interchangeably with
the word "folder" and all examples conform to Unix-based filesystem
conventions which should tranlate easily to Windows conventions after
substituting the path separator ('\' instead of '/').  Pairtree
places no limitations on file and path lengths, so implementors
thinking about maximal interoperation may wish to consider the issues
listed in the Interoperability section of this document.

The mapping from identifier string to path has two parts.  First, the
string is cleaned by converting characters that would be illegal or
especially problemmatic in Unix or Windows filesystems.  The cleaned
string is then split into pairs of characters, each of which becomes
a directory name in a filesystem path: successive pairs map to
successive path components until there are no characters left, with
the last component being either a 1- or 2-character directory name.
The resulting path is known as a _pairpath_, or _ppath_.

abcd

-> ab/cd/
abcdefg
-> ab/cd/ef/g/
12-986xy4 -> 12/-9/86/xy/4/

Armed with specific knowledge of a given namespace's identifier
distribution, one might achieve more balanced or efficient trees by
mapping to paths from character groupings other than successive
pairs.  Pairtree assumes that this sort of optimization, however,
being tailored to individual and transient namespace conditions, is
often less important than having a single generalized and shareable
mapping.  It uses pairs of characters to achieve hierarchies that
exhibit a reasonable balance of path length and fanout (number of
probable entries in any component directory).2.  Pairpath termination and
object encapsulation

A ppath (pairpath) terminates when it reaches an object.  A little
jargon helps explain this.  A _shorty_ is a 1- or 2-character
directory name, or any file or directory name that begins with
"pairtree" (these are reserved for future use).  A ppath consists of
a sequence of "shorties" ending in a non-shorty, such as a
3-character directory name or the 2-character file name "xy".  The
pairtree below contains two objects with identifiers "abcd" and
"abcde".

ab/
|
\--- cd/

  |

  |--- foo/

  |
 |
README.txt

  |
 |
thumbnail.gif

  |
 |

  |
 |--- master_images/

  |
 |
 |
..

  |
 |
 ...

  |
 |

  |
 \--- gh/

  |

  \--- e/



 |



 \--- bar/





|
metadata





|
54321.wav





|
index.html

An object is reached when a non-shorty is detected.  An object is
_properly encapsulated_ if it is entirely contained in a non-shorty
directory that is the immediate child of a shorty directory, in other
words, if the 1- or 2-char directory name ending the object's ppath
contains exactly one non-shorty directory that holds all the object's
descendants.  The two objects "abcd" and "abcde" above are properly
encapsulated.  Any shorty directory found at the same level as the
non-shorty extends the pairtree.  So while the "foo/" directory above
does not subsume "e/" at the same level, by encapsulation, it does
subsume the "gh/" underneath it (i.e., "gh/" is invisible to the
pairtree algorithm, at least on a first pass).

Practice will vary according to local custom as to how to name the
encapsulating object directory beneath that last shorty.  Its name is
completely independent of the object identifier.  For example, every
object directory in a pairtree could have the uniform name "thingy".
It is common for the directory name to be a terminal substring of the
object identifier, as in:


id:  13030_45xqv_793842495
ppath:  13/03/0_/45/xq/v_/79/38/42/49/5/793842495

All objects should be properly encapsulated.  If an object is
detected that is _improperly encapsulated_, that is, when a ppath
ends with a shorty directory that contains more than one non-shorty,
the detecting system should take corrective action.  In this
situation, also known as a "split end", all those non-shorties
(directories and files) are considered to belong to one object (not
properly encapsulated) identified by the containing ppath.  Excluding
shorties from the object permits one identifier to be a substring of
another (e.g., "abcd" and "abcde" can co-exist in a pairtree), and
defining ppath termination in this way prevents "hidden riders", or
data residing in a pairtree that is not contained or accounted for in
any object.  Here is an example of an improperly encapsulated object
named "bent".


be/
|
\--- nt/




[ split end: two files, no encapsulation ]

  |
README.txt

  |
report.pdf

  |

  \--- ef/



 |
..

If a "split end" is encountered, an importing system is encouraged to
normalize it by creating a single object directory called "obj" and
pushing the non-shorties in question underneath it, as in:

be/
|
\--- nt/

  |

  |--- obj/

  [ split end repaired with "obj" directory ]

  |
 |
README.txt

  |
 |
report.pdf

  |

  \--- ef/



 |
..

A URL for this Internet-Draft is:
http://www.ietf.org/internet-drafts/draft-kunze-pairtree-01.txt

Internet-Drafts are also available by anonymous FTP at:
ftp://ftp.ietf.org/internet-drafts/

Below is the data which will enable a MIME compliant mail reader
implementation to automatically retrieve the ASCII version of the
Internet-Draft.
<ftp://ftp.ietf.org/internet-drafts/draft-kunze-pairtree-01.txt>
_______________________________________________

I-D-Announce@ietf.org
https://www.ietf.org/mailman/listinfo/i-d-announce
Internet-Draft directories: http://www.ietf.org/shadow.html
or ftp://ftp.ietf.org/ietf/1shadow-sites.txt

[Index of Archives]     [IETF]     [IETF Discussion]     [Linux Kernel]

  Powered by Linux