Frequently Asked Questions

For questions that are not addressed here, please write to the mailing list.

What is Bitcache?

Bitcache is a distributed content-addressable storage (CAS) system. It provides repository storage for bitstreams (colloquially known as blobs) of any length, each uniquely identified and addressed by a digital fingerprint derived through a secure cryptographic hash algorithm.

Bitcache provides a command-line client (called bit) for managing and querying repositories, a standalone HTTP server (bitcached) for serving or proxying repositories using a simple web-native REST API, and a synchronization utility (bitsync) for replicating bitstreams between repositories.

Who is using Bitcache?

Bitcache is presently used on at least several hundred websites as part of the File Framework media management solution for the popular Drupal content management system.

How do I install the Bitcache module for Drupal?

Requirements

The Bitcache module requires Drupal 6.x and PHP 5.2.x (or newer).

Installation

  1. Copy all the module files into a subdirectory called sites/all/modules/bitcache/ under your Drupal installation directory.
  2. Go to [Administer >> Site building >> Modules] and enable the Bitcache module. You will find it in the section labelled "Other".
  3. Go to [Administer >> Site configuration >> Data storage] to review and change the configuration options to your liking.

See also

File Framework for Drupal builds upon Bitcache to provide a comprehensive document and media management system for Drupal.

How do I install the Bitcache command-line tools?

Requirements

The Bitcache tools require Ruby 1.8 or newer, as well as RubyGems.

Installation

To install the latest development version from the GitHub repository:

$ sudo gem install bendiken-bitcache --source http://gems.github.com

How do I run a Bitcache server?

At present, the best option is to run the Drupal version of Bitcache.

The standalone Ruby server is, for the moment, undocumented.

What is a digital fingerprint?

In computer science, a digital fingerprinting algorithm is a procedure that maps an arbitrarily large bitstream (such as a computer file) to a much shorter bit string, its digital fingerprint, that uniquely identifies the original data for all practical purposes.

To serve its intended purposes, a fingerprinting algorithm must be able to capture the identity of a bitstream with virtual certainty. In other words, the probability of a collision (two different bitstreams yielding the same fingerprint) must be so small that it can be ignored for all practical purposes.

Cryptographic hash functions such as the Secure Hash Algorithm (SHA) set of algorithms generally serve as good fingerprint functions.

You are dealing with digital fingerprints every time you invoke md5sum or sha1sum to verify the integrity of a file:

$ echo "Hello, world" > hello.txt
$ sha1sum hello.txt
7b4758d4baa20873585b9597c7cb9ace2d690ab8  hello.txt

Content-addressable storage relies on digital fingerprints serving as unique content identifiers. By default, Bitcache uses the widely deployed SHA-1 algorithm for digital fingerprints, with plans to also support the SHA-2 family of algorithms.
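
The same kind of fingerprint can be computed programmatically. The following is a minimal Python sketch using the standard hashlib module; the helper name fingerprint is an illustrative assumption and not part of Bitcache:

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Return the SHA-1 fingerprint of a bitstream as 40 hexadecimal digits."""
    return hashlib.sha1(data).hexdigest()

# The empty bitstream has a well-known SHA-1 fingerprint.
print(fingerprint(b""))  # da39a3ee5e6b4b0d3255bfef95601890afd80709
```

Any change to the input, however small, yields a different fingerprint, which is what makes the value usable as a content identifier.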

What is content-addressable storage?

Here's what Wikipedia has to say:

Content-addressable storage, also referred to as associative storage or abbreviated CAS, is a mechanism for storing information that can be retrieved based on its content, not its storage location. [...] Roughly speaking, content-addressable storage is the permanent-storage analogue to content-addressable memory.
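
To make the idea concrete, here is a toy in-memory store written in Python. It is only a sketch of the general technique, not of Bitcache's actual implementation; the class name MemoryStore and its put and get methods are hypothetical:

```python
import hashlib

class MemoryStore:
    """Toy content-addressable store: blobs are keyed by their SHA-1 fingerprint."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        """Store a blob and return the fingerprint that addresses it."""
        key = hashlib.sha1(data).hexdigest()
        self._blobs[key] = data  # storing identical content again changes nothing
        return key

    def get(self, key: str) -> bytes:
        """Fetch a blob by fingerprint, verifying its integrity on the way out."""
        data = self._blobs[key]
        # The content must hash back to its own address.
        if hashlib.sha1(data).hexdigest() != key:
            raise ValueError("stored data does not match its fingerprint")
        return data

store = MemoryStore()
key = store.put(b"Hello, world\n")
assert store.put(b"Hello, world\n") == key   # same content, same address
assert store.get(key) == b"Hello, world\n"
```

Because the address is derived from the content, the caller never chooses a name or a location, and identical content is stored only once.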

What are the benefits of content-addressable storage?

How likely are hash collisions?

Bitcache by default uses the SHA-1 algorithm for fingerprinting data.

Here follows a brief analysis of hash collision probabilities by the designers of the Plan 9 Venti file system, which also relies on SHA-1 for content addressability:

[Our design] requires a hash function that generates a unique fingerprint for every data block that a client may want to store. Obviously, if the size of the fingerprint is smaller than the size of the data blocks, such a hash function cannot exist since there are fewer possible fingerprints than blocks. If the fingerprint is large enough and randomly distributed, this problem does not arise in practice. For a server of a given capacity, the likelihood that two different blocks will have the same hash value, also known as a collision, can be determined. If the probability of a collision is vanishingly small, we can be confident that each fingerprint is unique.

It is desirable that Venti employ a cryptographic hash function. For such a function, it is computationally infeasible to find two distinct inputs that hash to the same value. This property is important because it prevents a malicious client from intentionally creating blocks that violate the assumption that each block has a unique fingerprint. As an additional benefit, using a cryptographic hash function strengthens a client's integrity check, preventing a malicious server from fulfilling a read request with fraudulent data. If the fingerprint of the returned block matches the requested fingerprint, the client can be confident the server returned the original data.

Venti uses the [SHA-1] hash function developed by the US National Institute for Standards and Technology (NIST). SHA-1 is a popular hash algorithm for many security systems and, to date, there are no known collisions. The output of SHA-1 is a 160 bit (20 byte) hash value. Software implementations of SHA-1 are relatively efficient; for example, a 700Mhz Pentium 3 can compute the SHA-1 hash of 8 Kbyte data blocks in about 130 microseconds, a rate of 60 Mbytes per second.

Are the 160 bit hash values generated by SHA-1 large enough to ensure the fingerprint of every block is unique? Assuming random hash values with a uniform distribution, a collection of n different data blocks and a hash function that generates b bits, the probability p that there will be one or more collisions is bounded by the number of pairs of blocks multiplied by the probability that a given pair will collide, i.e.

$$ p \le \frac{n(n-1)}{2} \times \frac{1}{2^{b}} $$

Today, a large storage system may contain a petabyte (10^15 bytes) of data. Consider an even larger system that contains an exabyte (10^18 bytes) stored as 8 Kbyte blocks (~10^14 blocks). Using the SHA-1 hash function, the probability of a collision is less than 10^-20. Such a scenario seems sufficiently unlikely that we ignore it and use the SHA-1 hash as a unique identifier for a block. Obviously, as storage technology advances, it may become feasible to store much more than an exabyte, at which point it may be necessary to move to a larger hash function. NIST has already proposed variants of SHA-1 that produce 256, 384, and 512 bit results. For the immediate future, however, SHA-1 is a suitable choice for generating the fingerprint of a block.
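
The estimate quoted above can be checked numerically by multiplying the number of pairs of blocks by the probability that a given pair collides. A short Python sketch, assuming the figures from the quotation (one exabyte of data, 8 Kbyte blocks, 160-bit hashes); the variable names are illustrative:

```python
BLOCK_SIZE = 8 * 1024           # 8 Kbyte blocks
TOTAL_BYTES = 10**18            # one exabyte
HASH_BITS = 160                 # SHA-1 output size in bits

n = TOTAL_BYTES // BLOCK_SIZE   # number of blocks, on the order of 10^14
pairs = n * (n - 1) // 2        # number of distinct pairs of blocks
p_bound = pairs / 2**HASH_BITS  # upper bound on the collision probability

print(f"blocks: {n:.3e}")
print(f"collision probability bound: {p_bound:.3e}")
assert p_bound < 1e-20          # consistent with the figure quoted above
```

Note that this bound assumes uniformly random hash values; it says nothing about deliberately constructed collisions.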

What kind of data can be stored?

Any octet-aligned bitstream (that is, one whose length is a whole number of bytes) can be stored.

The maximum length of a bitstream that can be stored in any given Bitcache repository is determined by the available storage and other constraints of the specific storage adapter.

What storage adapters are available?

Repository backends currently include:

Backend            Standalone   Drupal
Unix file system   yes          yes
HTTP               yes          coming soon
Amazon S3          yes          coming soon
MySQL              -            yes
PostgreSQL         -            yes
SQLite             -            yes
GDBM               yes          yes
Memcached          yes          -
SFTP               yes          -
FTP                -            -
TFTP               -            -

What are these bitcache:// URLs?

The Bitcache project defines an unofficial URI scheme wherein URIs that start with bitcache:// denote a specific bitstream (also known as a blob) without tying down the location where that bitstream is stored.

The format of this URI scheme is simply bitcache://<fingerprint>, where <fingerprint> is the unique SHA-1 fingerprint of the bitstream in question.
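
As an illustration of this format, the following Python sketch builds and validates such URIs. The function names make_uri and parse_uri are hypothetical, and the assumption that the fingerprint is written as 40 lowercase hexadecimal digits is ours rather than something stated by the scheme:

```python
import hashlib
import re

# Assumed shape: "bitcache://" followed by 40 lowercase hex digits (a SHA-1 digest).
BITCACHE_URI = re.compile(r"bitcache://([0-9a-f]{40})")

def make_uri(data: bytes) -> str:
    """Build the bitcache:// URI that denotes the given bitstream."""
    return "bitcache://" + hashlib.sha1(data).hexdigest()

def parse_uri(uri: str) -> str:
    """Extract the fingerprint from a bitcache:// URI, or raise ValueError."""
    match = BITCACHE_URI.fullmatch(uri)
    if match is None:
        raise ValueError(f"not a bitcache:// URI: {uri!r}")
    return match.group(1)

uri = make_uri(b"")
print(uri)  # bitcache://da39a3ee5e6b4b0d3255bfef95601890afd80709
```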

Dereferencing a bitcache:// URI should yield the actual location of the bitstream, that is, a URL (typically using the http:// or file:// scheme) whose content can be retrieved.

An example using the Drupal API:

<?php
// "da39a3ee5e6b4b0d3255bfef95601890afd80709" is the SHA-1
// fingerprint for the empty string (that is, the string "")
$uri  = "bitcache://da39a3ee5e6b4b0d3255bfef95601890afd80709";
$url  = bitcache_resolve_uri($uri); // $url now contains "file:///dev/null"
$data = file_get_contents($url);    // $data now contains ""

The bitcache:// URIs are particularly important for use with RDF, as they allow describing metadata about a particular bitstream (which means, usually, a particular file) without having to know or keep track of the location of that bitstream.