Enchilada is a research project that takes one central idea to the extreme:
Information cannot be destroyed.
Of course, Enchilada can only deal with indestructible digital information, or information that can be represented with bits and bytes.
What makes Enchilada extreme is that it not only fixes 'plain' data to be immutable, but also partially evaluated (source) code, and all the states of the Enchilada system.
Immutable information brings many advantages to the table, ranging from the practical to the theoretical.
For the sake of argument, let's replace 'information' with 'documents' and tackle the more practical problem of document management.
It is obvious that even in the case of document management, some extreme compression techniques need to be applied to make the central idea practical.
So let us first define compression, or rather, the compression of (a set of) documents:
A compressed set of documents is an alternative representation of the 'same' set, but with fewer duplicate words, sentences, paragraphs, or (sub)documents.
We can achieve compression when we:
1. Physically store references to documents, rather than occurrences of the documents themselves.
2. Recursively build 'bigger' documents from 'smaller' documents through references.
3. Reach global compression when (sub)documents maximally share (references to) other (sub)documents.
To ease the compression of documents, we define an abstract service that can store and retrieve documents, given the following API:
put(document) -> reference
get(reference) -> document
A reference is created when a document is put into storage. Later, the stored document can be retrieved from storage via its reference.
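As a minimal in-memory sketch (the class and reference scheme are illustrative assumptions, not Enchilada's actual implementation), such a storage service could hand out opaque references of the form r1, r2, ... and return the same reference for duplicate documents, which is exactly compression step 1:

```python
# A minimal in-memory sketch of the storage service; names are illustrative.
class Storage:
    def __init__(self):
        self._docs = {}   # reference -> document
        self._refs = {}   # document -> reference, so duplicates share one reference
        self._next = 1

    def put(self, document):
        if document in self._refs:   # already stored: return the existing reference
            return self._refs[document]
        ref = f"r{self._next}"
        self._next += 1
        self._docs[ref] = document
        self._refs[document] = ref
        return ref

    def get(self, ref):
        return self._docs[ref]
```

Putting the same document twice yields the same reference, so a document physically occurs at most once in the store.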
The following example shows how we can use the storage service to compress a 'document':
the quick brown fox jumps over the lazy dog
First, we chop the document into three small (atomic) documents that are potentially reusable (following compression steps 2 and 3).
put(the quick brown fox) => r1
put(jumps over) => r2
put(the lazy dog) => r3
Then we store a special 'node' document that contains the first two references (with a space in between) to get reference r4.
put(r1 r2) => r4
After that we store another node document that contains references to the previous node and the third atomic document:
put(r4 r3) => r5
We can now say that reference r5 represents the complete, original document.
With all those additional references, we have so far used up more space instead of less. However, we gain storage when we slightly 'modify' the document to:
the quick brown fox jumps over the lazy don
Now we only need to compress the atomic sentence that contains the edit:
put(the lazy don) => r6
.. and compress the revised document by reusing parts of the unrevised one:
put(r4 r6) => r7
Note that nothing is modified: a new document is created by sharing parts of another document through references.
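The whole example can be replayed with a plain dictionary standing in for the store (a sketch: the reference scheme and the unfold helper are illustrative, not Enchilada's actual machinery):

```python
# Replaying the compression example with a plain dict as the store.
store = {}

def put(doc):
    ref = f"r{len(store) + 1}"   # illustrative references r1, r2, ...
    store[ref] = doc
    return ref

def get(ref):
    return store[ref]

def unfold(doc):
    # Recursively replace every reference in a node document by its content.
    return " ".join(unfold(get(w)) if w in store else w for w in doc.split())

r1 = put("the quick brown fox")
r2 = put("jumps over")
r3 = put("the lazy dog")
r4 = put(f"{r1} {r2}")
r5 = put(f"{r4} {r3}")   # the original document
r6 = put("the lazy don")
r7 = put(f"{r4} {r6}")   # the revised document, sharing r4 with the original
```

Unfolding r5 and r7 reproduces the two full sentences, while the shared node r4 is stored only once.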
For a single storage service to remain internally consistent, it must serialize all storage requests.
But when thousands of concurrent users want to store and share documents via the same central service, such a service would quickly become a bottleneck.
So if we want storage to work on a global scale, we need to solve the serialization problem.
One scalable approach is to let users store their documents in local storage and then to connect all local storage via the network to create one global store.
In case a local store cannot service a request (get or put), it may decide to piggyback the request to other local stores.
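That forwarding behaviour can be sketched as follows, with hypothetical names and a flat list of peers standing in for the real network topology:

```python
# A sketch of local stores that forward missed lookups to their peers.
class LocalStore:
    def __init__(self):
        self.docs = {}
        self.peers = []   # other LocalStore instances reachable over the network

    def put(self, ref, document):
        self.docs[ref] = document

    def get(self, ref):
        if ref in self.docs:        # serviced locally
            return self.docs[ref]
        for peer in self.peers:     # piggyback the request to other stores
            document = peer.get(ref)
            if document is not None:
                return document
        return None                 # not found anywhere
```

Note that put takes the reference as an argument here; how a local store creates such references without coordination is exactly the problem discussed next.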
TODO: Picture of a DHT
For such a scheme to work, however, another problem needs to be solved: how does a local store create globally unique and non-forgeable references without (global) coordination?
Surprisingly, one can create a reference (to a document) that is trustworthy and reproducibly unique - anywhere. This can be done by compressing all of a document's content into a small, fixed-size string, called the hash or fingerprint of the document.
Compared to our earlier definition of compression, hashing is a lossy compression method: it doesn't preserve document structure but completely destroys it.
But for hashes to serve as trustworthy references, they need to have some more (cryptographic) properties:
1. Preimage resistance: given a hash, it is infeasible to find a document that produces it.
2. Second preimage resistance: given a document, it is infeasible to find a different document with the same hash.
3. Collision resistance: it is infeasible to find any two documents with the same hash.
Not surprisingly, there are numerous cryptographic hash algorithms (MD5, SHA1, etc.) available that have all these properties.
For now we choose SHA1 as our hash function, because SHA1 (encoded in base64) generates relatively short hashes, and is cryptographically strong.
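Using Python's standard library as a sketch (the helper name is an assumption, not Enchilada's API), a base64-encoded SHA1 reference is 28 characters long and is reproducible anywhere from the document's content alone:

```python
import base64
import hashlib

def reference(document: str) -> str:
    # A SHA1 digest is 160 bits (20 bytes); base64 encodes it in 28 characters.
    digest = hashlib.sha1(document.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")

ref = reference("the quick brown fox")
```

Two stores hashing the same document independently produce the same reference, which is what makes coordination-free global references possible.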
However, using a SHA1 hash for every reference incurs a hefty storage penalty: compared to 64-bit/32-bit pointers, we pay a factor of 2.5/5 extra storage per reference.
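The 2.5/5 factor follows directly from the digest size:

```python
# A SHA1 digest is 160 bits; compare with common pointer sizes.
sha1_bits = 160
print(sha1_bits / 64)   # overhead versus a 64-bit pointer
print(sha1_bits / 32)   # overhead versus a 32-bit pointer
```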
TODO: explain this isn't a problem
Because each 'edit' creates a new document, it effectively spawns a fresh branch in the multi-versioned world of documents.
However, versions do need to be merged in a timely fashion; otherwise, no consensus will ever be reached.