new filesystem

This document is current as of Mnet v0.7.

CVS revision: $Id: new_filesystem.html,v 1.123 2004/07/17 11:33:25 arnowa Exp $

status of this document

The design recorded in this document has been implemented, except the feature of packing the entire file into the inode block isn't implemented.

There is one known buglet in this design -- if the file were packed into the inode block, and the decryption key were incorrect, then there is a tiny chance, which I currently estimate to be around 2^-64, that the user could see a file of random gibberish (that is, whatever you get by decrypting the ciphertext with that incorrect key) rather than an error message. I haven't analyzed it more closely. We've decided to add an HMAC tag at the end of the inode in order to guarantee the file's integrity. This change has not yet been written into this document nor implemented.

desiderata

  1. minimal round-trips to download
  2. no chunking -- that is the erasure code must be applied to the entire file at once, not to separate chunks of the file (in order to allow reliable huge files)
  3. blocks fall into a small number of size classes (in order to facilitate privacy)
  4. minimal size mnet URIs (in order to facilitate human use and programmer use)
  5. no features which could be implemented just as well or better at a different layer:
    1. no metadata except for that needed for downloading and authenticating (see discussion below)
    2. no extensibility (except for unambiguous version numbers so that future code can know what version of the format it is looking at)

design

the basic idea

Choose a "user-supplied key seed". If you want the merge duplicate encrypted blocks feature (see "merge duplicate encrypted blocks", below), then choose your key seed to be the zero-length string. Generate the AES encryption key (see "encryption key generation", below).

Take a file and break it into a bunch of shares using an erasure code. Encrypt the shares, and a "shareId" is the SHA-1 hash of an encrypted share. Put all the shareIds together and call the result an "inode". Encrypt the inode, and the "inodeId" is the SHA-1 hash of the encrypted inode. An mnet URI is an inodeId concatenated with the symmetric encryption key, and mnet-base-32 encoded.

Simple!

Note that everything is encrypted with a key, and the key is present in only one place -- the mnet URI. This offers cryptographic protection of the property that if you don't have the mnet URI, then all of the blocks (whether inode blocks or data blocks) are opaque to you. You can't even tell which blocks are inode blocks and which blocks are data blocks, except with statistical analysis of usage patterns.

the merge duplicate encrypted blocks feature

It might be desirable that the same file stored by two different users should yield the same encrypted blocks. This increases the reliability and reduces the storage requirements. But it also reduces privacy, in that it allows someone who has a copy of the file to determine if certain blocks are part of that file. If the user desires this feature, they can choose the user-supplied key seed to be the string with zero length. If they do not desire this feature, they should choose a key seed that is unguessable, even to someone who has a copy of the file.

encryption key generation

The symmetric cipher encryption key is equal to the first 128 bits of the SHA-1 hash of a key generation struct. A key generation struct is a string that looks like this:

((1:M2:M)(1:K2:K)(22:user-supplied key seedlen_of_seed:seed)(18:sha-1 hash of file20:hash_of_file)(7:purpose83:Part of http://mnet.sf.net/. Generate encryption keys from user-supplied key seeds.))

M, K are each two bytes long, seed is variable length, len_of_seed is the length of seed in bytes expressed in decimal notation, and hash_of_file is 20 bytes long.

(This is Rivest-style s-expression format. That is: every item is either a parenthesized list of items, or else it is a string of bytes preceded by its length and a colon character.)

syntax

The first eight bytes of an inode look like this:

[vernum (1 byte)] [file length (7 bytes)]

In this version of ZNFF, the "vernum" field always contains a 0 byte. Then, if the file is small enough to fit into this inode (i.e., if file length <= len(inode) - 8), then the file contents are present in the inode, and any remaining space is padded with zeros. In that case, the remainder of the inode looks like this:

[file contents (variable length == file length)] [zero padding (variable length == to the end of the inode)]

Else, the remainder of the inode look like this:

[K (2 bytes)] [M (2 bytes)] [blockId (20 bytes)] * M [zero padding (variable length == to the end of the inode)]

(That means there are exactly M instances of the "blockId" field.)

erasure code

We currently use Rizzo's erasure code and (modified and optimized) C implementation.

http://info.iet.unipi.it/~luigi/research.html

(Search in text for "erasure codes".)

imperative description

Here is a pseudocode recipe for encoding a file into ZNFF.

  1. Choose parameters: blocksize B, inode blocksize BI, total shares M, required shares K, encryption key KEY.
  2. If (file length <= BI - 8), write a "file contents" field into the inode and skip to step 6.
  3. Run the erasure code on the entire file, resulting in a set of shares. Pad each share to B with zeros.
  4. Encrypt the i'th share with CTR mode encryption, starting the counter at (i+1)*2^64.
  5. Compute the blockIds and write them into the inode in the appropriate places.
  6. Encrypt the inode with CTR mode encryption, starting the counter at 0.
  7. Done!

how to choose parameters

Requirements: B, the block size, must fall into a few size classes. This facilitates privacy, and it can also ease management of lots of blocks in a local store (which is something block servers have to do). The minimum size for a block is 2^16 bytes (arbitrarily chosen), and it must be a power of two (arbitrarily chosen). K must be >= 2 (else we're just doing replication, not erasure coding) and M must be >= K. K+M must be < 2^8 (this is a constraint of the erasure code). B*K must be >= F (the file length) (in order to hold the contents of the file). BI has the same constraints as B.

Actually I don't think the erasure code requires K+M < 2^8. The fec.c source code simply checks that K <= 2^8 and M <= 2^8. I must have misunderstood the paper, or else the source code isn't bothering to check the constraint properly!

Recommendations:

  1. set BI = 2^16
  2. set R = 4. This is a "redundancy factor" which is used to determine M given K.
  3. set B = 2^16
  4. set K = ceil(F / B)
  5. set M = K * R
  6. if M > 2^8, then set B = B * 2 and go to step 4.

The result will be the smallest block size B that can hold the file, and the smallest share num K for that block size.

limits

There is only one hard limit on file size: the "file length" field is 7 bytes == 56 bits, so the largest file can only be (2^56) - 1 bytes long.

open issues

robustness of inode blocks

Note that in order for a file to be retrievable, it must be the case that any K of the shares are retrievable and the whole, unique inode itself. How to accomplish this is out of the scope of this document, which says nothing about such things. It would be possible to perform erasure coding on the inode itself, but then you would have the same problem: you would have to have any K of the shares of the file, plus any K-prime of the shares of the inode, and the whole, unique object which contains the list of the shareIds of the shares of the inode. You could solve this problem by making that object be the mnet URI itself, i.e. by putting the list of the shareIds of the inode into the mnet URI, but this would violate desideratum 4. Indeed, you can see that the problem of the inode not being erasure-coded is a direct and unavoidable consequence of desideratum 4.

If you want to increase your file's survivability and the speed with which it downloads, and you are willing to relinquish desideratum 4, you can give someone the full inode (minus the zero padding) instead of giving them the mnet URI. You can mix-and-match mnetURIs and inodes freely and dynamically, for example, you can store a file (including storing the padded, encrypted inode block) and then later you can decide whether to give someone a copy of the inode or the mnet URI.

metadata

So far the only thing we've taken care of is the actual data. If you start with the mnet URI (or the inode and the encryption key) then you can get the file contents. But in actual use, there is always some metadata or other that you also want.

An especially important piece of metadata is type. Currently this format doesn't store the filename extension or MIME type of the file. Users who download stuff from Mnet would probably like their file to open in the appropriate application, or to save it to their filesystem under a reasonable file name. One option is to use something like the file(1) utility to tell what the file is on the receiving side. Two problems with that are that file(1) is fallible, and it does not offer a suggested extension with which to construct a filename to save the file.

One problem with putting metadata in the inode is that whenever the metadata differs it causes the mnet URI to be different.

General options for metadata:

  1. Put just enough metadata in the inode to discern the type:
    1. Store the filename extention of the data in the header of inode block. Pros: Works all over. Cons: my .mpeg file is the same as your .mpg file (extentions have a lot of overlap)
    2. Store the mimetype of the data in the header of inode block. Pros: mime types are fairly unambiguous. Cons: there isn't a mime type for every filetype out there
  2. Put lots of metadata in the inode, but only immutable metadata. There are two types of metadata, mutable (filename, artist name, album name), and immutable (sha1 of file, bitprint of file, type of file (arguably immutable)).
  3. Leave metadata out of the inode, but put it in the mnet URI. For example: <a type="text/plain" href="mnet:38ppp56jbb8b64zrh8reoadzgn1zpdxc76enkmqduwtf4tug">a short story I wrote</a>. (Yes, that type attrib is valid HTML 4.0 acording to indexdot. No idea if it's supported in any browsers.)
  4. Leave metadata out of the inode, store the data file, yielding an mnet URI, then write a text file that contains the metadata and the mnet URI, like this: <mnet:38ppp56jbb8b64zrh8reoadzgn1zpdxc76enkmqduwtf4tug> <http://purl.org/dc/elements/1.1/format> "text/plain", <mnet:38ppp56jbb8b64zrh8reoadzgn1zpdxc76enkmqduwtf4tug> <http://purl.org/dc/elements/1.1/title> "my body is a simulation". Then store that file, and give out its mnet URI instead of the data file's mnet URI.
  5. Leave metadata out of the inode, store the data file, yielding an mnet URI, then write a text file that contains the metadata, like this: type=text/plain\ntitle=my body is a simulation. Then store that file, and give out a combination of the data file mnet URI plus the metadata file mnet URI, like this: <a href="mnet:38ppp56jbb8b64zrh8reoadzgn1zpdxc76enkmqduwtf4tug|z8nbpsio1nisuap1f6946ome437chdxosyznes6koetaz6yy>a short story I wrote</a>.

We've currently agreed to leave all file metadata (even type metadata and immutable metadata) out of the inode for now because (a) there are several good possibilities for how to put this metadata in a higher layer that we want to explore and (b) we can change our minds and add it into the inode later if none of those possibilities turns out well enough. --Zooko 2003-03-11

acknowledgements

Thanks to Myers Carpenter, Hauke Johannknecht, Jim McCoy, Luke Nelson for help with this design. Extra thanks to Myers for proposing a fix to the problem of the encryption system falling apart in the presence of re-used keys.


Zooko
Last modified: Sat Feb 28 14:35:32 EST 2004