For questions that are not addressed here, please write to the mailing list.
Bitcache is a distributed content-addressable storage (CAS) system. It provides repository storage for bitstreams (colloquially known as blobs) of any length, each uniquely identified and addressed by a digital fingerprint derived through a secure cryptographic hash algorithm.
Bitcache provides a command-line client (called bit) for managing and querying repositories, a standalone HTTP server (bitcached) for serving or proxying repositories using a simple web-native REST API, and a synchronization utility (bitsync) for replicating bitstreams between repositories.
Bitcache is presently used on several hundred websites or more as part of the File Framework media management solution for the popular Drupal content management system.
The Bitcache module requires Drupal 6.x and PHP 5.2.x (or newer).
Place the module in sites/all/modules/bitcache/ under your Drupal installation directory.
File Framework for Drupal builds upon Bitcache to provide a comprehensive document and media management system for Drupal.
The Bitcache tools require Ruby 1.8 or newer, as well as RubyGems.
To install the latest development version from the GitHub repository:
$ sudo gem install bendiken-bitcache --source http://gems.github.com
At present, the best option is to run the Drupal version of Bitcache.
Using the Ruby standalone server is, for the moment, undocumented.
In computer science, a digital fingerprinting algorithm is a procedure that maps an arbitrarily large bitstream (such as a computer file) to a much shorter bit string, its digital fingerprint, that uniquely identifies the original data for all practical purposes.
To serve its intended purpose, a fingerprinting algorithm must capture the identity of a bitstream with virtual certainty. In other words, the probability of a collision (two different bitstreams yielding the same fingerprint) must be so small that it can be ignored for practical purposes.
Cryptographic hash functions such as the Secure Hash Algorithm (SHA) set of algorithms generally serve as good fingerprint functions.
You are dealing with digital fingerprints every time you invoke md5sum or sha1sum to verify the integrity of a file:
$ echo "Hello, world" > hello.txt
$ sha1sum hello.txt
SHA1(hello.txt)= 7b4758d4baa20873585b9597c7cb9ace2d690ab8
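The same kind of fingerprint can be computed programmatically. The following is a minimal sketch using Python's standard hashlib module; the helper names `fingerprint` and `fingerprint_file` are ours, chosen for illustration, and are not part of Bitcache.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-1 fingerprint of a bitstream as 40 hexadecimal digits."""
    return hashlib.sha1(data).hexdigest()


def fingerprint_file(path: str, chunk_size: int = 65536) -> str:
    """Fingerprint a file in chunks, so large files need not fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# echo appends a newline, so hello.txt would hold b"Hello, world\n"
hello_digest = fingerprint(b"Hello, world\n")
empty_digest = fingerprint(b"")
```

Note that the fingerprint depends only on the bytes of the content, never on the file name or location.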
Content-addressable storage relies on digital fingerprints serving as unique content identifiers. Bitcache by default uses the widely deployed SHA-1 algorithm for digital fingerprints, with future plans to also support the SHA-2 family of algorithms.
Here's what Wikipedia has to say:
Content-addressable storage, also referred to as associative storage or abbreviated CAS, is a mechanism for storing information that can be retrieved based on its content, not its storage location. [...] Roughly speaking, content-addressable storage is the permanent-storage analogue to content-addressable memory.
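To make the idea concrete, here is a toy in-memory store in Python in which the key of each blob is derived from its content. This is only a sketch of the general technique, not Bitcache's implementation; the class name `MemoryStore` and its methods are hypothetical.

```python
import hashlib


class MemoryStore:
    """Toy content-addressable store: blobs are keyed by their SHA-1 fingerprint."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        # Identical content always maps to the same key, so it is stored once.
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self) -> int:
        return len(self._blobs)


store = MemoryStore()
first = store.put(b"Hello, world\n")
second = store.put(b"Hello, world\n")  # duplicate content, same key
empty = store.put(b"")
```

The caller never chooses where a blob lives: the content alone determines the key, which is why duplicates collapse automatically.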
Bitcache by default uses the SHA-1 algorithm for fingerprinting data.
Here follows a brief analysis of hash collision probabilities by the designers of the Plan 9 Venti file system, which also relies on SHA-1 for content addressability:
[Our design] requires a hash function that generates a unique fingerprint for every data block that a client may want to store. Obviously, if the size of the fingerprint is smaller than the size of the data blocks, such a hash function cannot exist since there are fewer possible fingerprints than blocks. If the fingerprint is large enough and randomly distributed, this problem does not arise in practice. For a server of a given capacity, the likelihood that two different blocks will have the same hash value, also known as a collision, can be determined. If the probability of a collision is vanishingly small, we can be confident that each fingerprint is unique.
It is desirable that Venti employ a cryptographic hash function. For such a function, it is computationally infeasible to find two distinct inputs that hash to the same value. This property is important because it prevents a malicious client from intentionally creating blocks that violate the assumption that each block has a unique fingerprint. As an additional benefit, using a cryptographic hash function strengthens a client's integrity check, preventing a malicious server from fulfilling a read request with fraudulent data. If the fingerprint of the returned block matches the requested fingerprint, the client can be confident the server returned the original data.
Venti uses the [SHA-1] hash function developed by the US National Institute for Standards and Technology (NIST). SHA-1 is a popular hash algorithm for many security systems and, to date, there are no known collisions. The output of SHA-1 is a 160 bit (20 byte) hash value. Software implementations of SHA-1 are relatively efficient; for example, a 700Mhz Pentium 3 can compute the SHA-1 hash of 8 Kbyte data blocks in about 130 microseconds, a rate of 60 Mbytes per second.
Are the 160 bit hash values generated by SHA-1 large enough to ensure the fingerprint of every block is unique? Assuming random hash values with a uniform distribution, a collection of n different data blocks and a hash function that generates b bits, the probability p that there will be one or more collisions is bounded by the number of pairs of blocks multiplied by the probability that a given pair will collide, i.e.

$$p \le \frac{n(n-1)}{2} \times \frac{1}{2^{b}}$$
Today, a large storage system may contain a petabyte ($10^{15}$ bytes) of data. Consider an even larger system that contains an exabyte ($10^{18}$ bytes) stored as 8 Kbyte blocks (~$10^{14}$ blocks). Using the SHA-1 hash function, the probability of a collision is less than $10^{-20}$. Such a scenario seems sufficiently unlikely that we ignore it and use the SHA-1 hash as a unique identifier for a block. Obviously, as storage technology advances, it may become feasible to store much more than an exabyte, at which point it may be necessary to move to a larger hash function. NIST has already proposed variants of SHA-1 that produce 256, 384, and 512 bit results. For the immediate future, however, SHA-1 is a suitable choice for generating the fingerprint of a block.
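The arithmetic in the passage above can be checked directly. The sketch below evaluates the bound described in the quotation (number of block pairs times the per-pair collision probability) with exact rational arithmetic. It assumes that "8 Kbyte" means 8192 bytes; the function name `collision_bound` is ours.

```python
from fractions import Fraction


def collision_bound(n: int, b: int) -> Fraction:
    """Upper bound on the probability of any collision among n blocks
    when fingerprints are uniformly random b-bit values."""
    pairs = Fraction(n * (n - 1), 2)   # number of distinct pairs of blocks
    return pairs / 2**b                # each pair collides with probability 2^-b


n = 10**18 // (8 * 1024)               # an exabyte stored as 8 Kbyte blocks
p = collision_bound(n, 160)            # SHA-1 produces 160-bit fingerprints
print(f"blocks: {n}, bound on p: {float(p):.2e}")
```

With these inputs the bound comes out at roughly 5 × 10^-21, consistent with the quoted claim that the probability is less than 10^-20.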
Any octet-aligned bitstream (that is, one whose length is a whole number of bytes) can be stored.
The maximum length of a bitstream that can be stored in any given Bitcache repository is determined by the available storage and other constraints of the specific storage adapter.
Repository backends currently include:
| Backend | Standalone | Drupal |
|---|---|---|
| Unix file system | yes | yes |
| HTTP | yes | coming soon |
| Amazon S3 | yes | coming soon |
| MySQL | - | yes |
| PostgreSQL | - | yes |
| SQLite | - | yes |
| GDBM | yes | yes |
| Memcached | yes | - |
| SFTP | yes | - |
| FTP | - | - |
| TFTP | - | - |
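As an illustration of what a storage backend has to provide, here is a sketch of a file-system adapter in Python that keeps one file per bitstream, named by its fingerprint. The interface (`put`, `get`, `has`) and the flat directory layout are assumptions made for this example; they are not taken from the Bitcache source.

```python
import hashlib
import os
import tempfile


class FileSystemAdapter:
    """Hypothetical adapter: one file per blob, named by its SHA-1 fingerprint."""

    def __init__(self, root: str):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key: str) -> str:
        return os.path.join(self.root, key)

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        path = self._path(key)
        if not os.path.exists(path):   # identical content is written only once
            with open(path, "wb") as f:
                f.write(data)
        return key

    def get(self, key: str) -> bytes:
        with open(self._path(key), "rb") as f:
            return f.read()

    def has(self, key: str) -> bool:
        return os.path.exists(self._path(key))


repo = FileSystemAdapter(tempfile.mkdtemp())
key = repo.put(b"Hello, world\n")
```

Any backend that can map a fingerprint to a sequence of bytes could sit behind the same three operations, which is what makes the list of backends above interchangeable in principle.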
The Bitcache project defines an unofficial URI scheme wherein URIs that start with bitcache:// denote a specific bitstream (also known as a blob) without tying down the location where that bitstream is stored.
The format of this URI scheme is simply bitcache://<fingerprint>, where <fingerprint> is the unique SHA-1 fingerprint of the bitstream in question.
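A small sketch of building and validating such URIs in Python follows. The helper names `make_uri` and `parse_uri` are hypothetical, and the check assumes fingerprints are written as 40 lowercase hexadecimal digits, as in the examples in this document; the scheme itself may be more permissive.

```python
import re

SCHEME = "bitcache://"
# A SHA-1 fingerprint is 160 bits, i.e. 40 hexadecimal digits.
# Assumption: lowercase digits only, matching the examples in this FAQ.
FINGERPRINT_RE = re.compile(r"[0-9a-f]{40}")


def make_uri(fingerprint: str) -> str:
    """Build a bitcache:// URI from a SHA-1 fingerprint."""
    if not FINGERPRINT_RE.fullmatch(fingerprint):
        raise ValueError(f"not a SHA-1 fingerprint: {fingerprint!r}")
    return SCHEME + fingerprint


def parse_uri(uri: str) -> str:
    """Extract the fingerprint from a bitcache:// URI."""
    if not uri.startswith(SCHEME):
        raise ValueError(f"not a bitcache URI: {uri!r}")
    fingerprint = uri[len(SCHEME):]
    if not FINGERPRINT_RE.fullmatch(fingerprint):
        raise ValueError(f"malformed fingerprint in {uri!r}")
    return fingerprint


empty_uri = make_uri("da39a3ee5e6b4b0d3255bfef95601890afd80709")
```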
Dereferencing a bitcache:// URI should yield the actual location of the bitstream, that is, a URL (typically using the http:// or file:// scheme) whose content can be retrieved.
An example using the Drupal API:
<?php
// "da39a3ee5e6b4b0d3255bfef95601890afd80709" is the SHA-1
// fingerprint for the empty string (that is, the string "")
$uri = "bitcache://da39a3ee5e6b4b0d3255bfef95601890afd80709";
$url = bitcache_resolve_uri($uri);
// $url now contains "file:///dev/null"
$data = file_get_contents($url);
// $data now contains ""
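Because the URI carries the fingerprint, a client can also verify whatever it retrieves. The Python sketch below mimics the resolve-then-fetch flow of the PHP example and adds that check. Here `resolve_uri` is a hypothetical stand-in for a resolver (it is not the Drupal `bitcache_resolve_uri` function), and it assumes local repositories that keep one file per bitstream, named by its fingerprint.

```python
import hashlib
import tempfile
import urllib.request
from pathlib import Path

PREFIX = "bitcache://"


def resolve_uri(uri, repositories):
    """Return a file:// URL for the bitstream, or None if no repository has it."""
    if not uri.startswith(PREFIX):
        raise ValueError(f"not a bitcache URI: {uri!r}")
    fingerprint = uri[len(PREFIX):]
    for repo in repositories:
        candidate = Path(repo) / fingerprint
        if candidate.is_file():
            return candidate.resolve().as_uri()
    return None


def fetch_verified(uri, repositories):
    """Retrieve a bitstream and check it against the fingerprint in its URI."""
    url = resolve_uri(uri, repositories)
    if url is None:
        raise LookupError(f"no repository holds {uri}")
    with urllib.request.urlopen(url) as response:
        data = response.read()
    if hashlib.sha1(data).hexdigest() != uri[len(PREFIX):]:
        raise ValueError(f"fingerprint mismatch for {uri}")
    return data


# Demo: a throwaway repository directory holding a single blob.
repo = Path(tempfile.mkdtemp())
blob = b"Hello, world\n"
blob_uri = PREFIX + hashlib.sha1(blob).hexdigest()
(repo / blob_uri[len(PREFIX):]).write_bytes(blob)
data = fetch_verified(blob_uri, [repo])
```

If the stored file were altered after the fact, `fetch_verified` would raise instead of returning the wrong bytes, since the recomputed fingerprint would no longer match the one in the URI.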
The bitcache:// URIs are particularly important for use with RDF, as they allow describing metadata about a particular bitstream (which means, usually, a particular file) without having to know or keep track of the location of that bitstream.