(this is just the "old" blastidx.txt
file, revamped a bit)
The goal of BlastIdx
was initially to index my movies and music collection by MD5 checksum, to:
- find duplicate files easily (files with the same checksum are generally identical)
- search my collection easily (more easily than with locate, anyway)
- compare my collection with those of my friends, to check for missing files
- access the index thru a web-based interface (for easy remote access)
BlastIdx will eventually evolve into a full-featured backup tool, but for now
it is strictly aimed at indexing.
How does it work
BlastIdx indexes a collection of files in a SQL database.
The database knows the name, path, size, md5 hash, modification time, and host
of each file. This allows "locate-like" queries (using a part of the file
name), but also detection of duplicate files (using the md5 hash), or listing
of the most recent files (using modification time). The host attribute makes
it possible to use a single database to index contents spanning multiple
computers, and to detect (for instance) missing files in a replication scheme.
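The schema described above could be sketched as follows with the stdlib
sqlite3 module; the column names here are assumptions for illustration, not
the actual BlastIdx schema.

```python
import sqlite3

# Hypothetical sketch of the "files" table; actual column names in
# BlastIdx may differ.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE files (
        name  TEXT,     -- file name
        path  TEXT,     -- directory containing the file
        size  INTEGER,  -- size in bytes
        md5   TEXT,     -- hex digest of the file contents
        mtime INTEGER,  -- modification time (unix timestamp)
        host  TEXT      -- machine the file was indexed on
    )
""")

# A "locate-like" query: match on a part of the file name.
rows = db.execute(
    "SELECT path, name FROM files WHERE name LIKE ?", ("%holiday%",)
).fetchall()
```

The same table then serves duplicate detection (GROUP BY md5) and
"most recent files" listings (ORDER BY mtime) without extra structures.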
There are also some other useful fields: the time when the file was
indexed, and a timestamp indicating when the file was last checked by the
indexer. The "indexation time" is useful to replicate the index to another
database (to fetch only newer records), or to check which contents were
recently added. The "last checked time" is just a heartbeat confirming that
the file was properly checked by the indexer.
A main table, "files", holds file information. There is one row per file.
A file which exists on many hosts will have one row per host. Also, if a
file is modified, the old entries remain (until some cleanup job purges
them), allowing searches for changed names or old files.
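Duplicate detection from the "files" table reduces to grouping rows by
hash; a minimal sketch, assuming hypothetical column names:

```python
import sqlite3

# Illustration of duplicate detection: files sharing an md5 hash on the
# same host are considered duplicates. Table layout is an assumption.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE files (name TEXT, path TEXT, size INTEGER,"
           " md5 TEXT, host TEXT)")
db.executemany("INSERT INTO files VALUES (?,?,?,?,?)", [
    ("a.mp3", "/music",  100, "d41d8c", "hosta"),
    ("b.mp3", "/backup", 100, "d41d8c", "hosta"),
    ("c.mp3", "/music",  200, "ffffff", "hosta"),
])
# md5 values appearing more than once on a host are duplicates.
dupes = db.execute("""
    SELECT md5, COUNT(*) AS n FROM files
    GROUP BY md5, host HAVING n > 1
""").fetchall()
# dupes == [("d41d8c", 2)]
```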
Another table, "conf", holds the configuration of the program. Storing the
configuration in the DB allows querying and customizing the index thru a
single interface. This configuration table works like a hash-table, where one
key yields multiple values. For example, the configuration stores the paths
that the indexer has to scan. Each entry in the "conf" table has a
"hostname" field, so that the configuration entries of all hosts can be
stored in a single table of a centralized database.
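The one-key-to-many-values behavior could look like this; the table and
key names ("scan_path" etc.) are made up for the example:

```python
import sqlite3

# Sketch of the "conf" table: a hash-table where one key can yield
# multiple values, scoped by hostname. Names are assumptions.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE conf (hostname TEXT, key TEXT, value TEXT)")
db.executemany("INSERT INTO conf VALUES (?,?,?)", [
    ("hosta", "scan_path", "/movies"),
    ("hosta", "scan_path", "/music"),
    ("hostb", "scan_path", "/data"),
])
# Fetch every path the indexer on "hosta" has to scan.
paths = [r[0] for r in db.execute(
    "SELECT value FROM conf WHERE hostname = ? AND key = ?",
    ("hosta", "scan_path"))]
```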
A third table, "logs", keeps track of the runs of the indexer job.
It holds the begin and end time of each run of the indexer process. This is
useful to know the time range of "valid" entries (an entry is valid if its
"last checked time" is in the time span of the last index run).
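The validity check could be expressed as a query against the latest "logs"
entry; again the column names are hypothetical:

```python
import sqlite3

# Sketch of the validity check: an entry is valid if its last-checked
# timestamp falls inside the time span of the most recent indexer run.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE logs (host TEXT, begin_time INTEGER,"
           " end_time INTEGER)")
db.execute("CREATE TABLE files (name TEXT, host TEXT,"
           " last_checked INTEGER)")
db.executemany("INSERT INTO logs VALUES (?,?,?)",
               [("hosta", 100, 200), ("hosta", 300, 400)])
db.executemany("INSERT INTO files VALUES (?,?,?)",
               [("old.avi", "hosta", 150), ("fresh.avi", "hosta", 350)])
# Keep only files checked during the most recent run of their host.
valid = db.execute("""
    SELECT f.name FROM files f
    WHERE f.last_checked BETWEEN
          (SELECT MAX(begin_time) FROM logs WHERE host = f.host)
      AND (SELECT MAX(end_time)   FROM logs WHERE host = f.host)
""").fetchall()
# valid == [("fresh.avi",)]
```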
In its current version, BlastIdx is a single Python script, with the
configuration embedded in its first lines. The script is self-documented,
includes an extensive help and a tutorial, and should require no further
documentation as long as the reader knows the purpose of the program.
A cron job is responsible for updating the database. It gets a number of
paths from the configuration database, and walks them recursively. Each file
is checked according to its full path, name and size; if a matching record
is found, the file is considered "unchanged", and only the "last checked
time" is updated. Otherwise, the md5 hash is computed (this can take a long
time on big files!) and a new record is inserted.
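The core of such a pass can be sketched in a few lines; here a plain dict
stands in for the SQL database, and the function names are invented for
the example:

```python
import hashlib
import os

def md5_of(path, chunk=1 << 20):
    """Hash a file incrementally, so big files don't fill memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def scan(root, index, now):
    """One indexer pass. index maps (path, name, size) -> record dict."""
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            full = os.path.join(dirpath, name)
            size = os.path.getsize(full)
            key = (dirpath, name, size)
            if key in index:
                # Matching path, name and size: "unchanged", heartbeat only.
                index[key]["last_checked"] = now
            else:
                # New or changed file: compute the md5 (can be slow!)
                # and insert a fresh record.
                index[key] = {"md5": md5_of(full), "last_checked": now}
```

On the second and later runs, unchanged files skip the expensive hashing
step entirely, which is what makes nightly cron runs cheap.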
Note that there can be multiple records for the same file path and name,
but with different sizes; also, a file can be indexed in the database but
no longer exist: this makes it possible to keep track of moved, deleted,
or changed files.
One goal of this project is to allow centralized indexation, to offer a
single query interface to multiple hosts. This is achieved thru the use of
the "host" field in each table.
Another goal is to allow disconnected operation and replication, that is,
merging of indexes. This is possible thanks to the "index time" field, which
makes it possible to fetch entries that are newer than a given timestamp. A
database can merge records from another database using this mechanism, and
duplicate its "logs" records to keep track of the status of the merged
records.
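The merge step could be sketched like this: pull every record whose index
time is newer than the last sync point, and insert it locally. The schema
and the merge function are assumptions for illustration:

```python
import sqlite3

def merge(src, dst, since):
    """Copy records newer than `since` from src into dst; return count."""
    rows = src.execute(
        "SELECT name, host, index_time FROM files WHERE index_time > ?",
        (since,)).fetchall()
    dst.executemany("INSERT INTO files VALUES (?,?,?)", rows)
    return len(rows)

src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
for db in (src, dst):
    db.execute("CREATE TABLE files (name TEXT, host TEXT,"
               " index_time INTEGER)")
src.executemany("INSERT INTO files VALUES (?,?,?)",
                [("a", "hosta", 10), ("b", "hosta", 20)])
# Only records newer than the sync point (here, 15) are fetched.
merged = merge(src, dst, since=15)
```

A real implementation would also copy the matching "logs" rows, so the
receiving database knows the time spans its merged records came from.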