(this is just the "old" blastidx.txt
file, revamped a bit)
The goal of BlastIdx
was initially to index my movies and music collection by MD5 checksum, to:
- find duplicate files easily (files with the same checksum are generally identical)
- search my collection easily (more easily than with locate, anyway)
- compare my collection with those of my friends, to check for missing files
- access the index thru a web-based interface (for easy remote access)
BlastIdx will eventually evolve into a full-featured backup tool, but for now
it is strictly aimed at indexing.
How does it work
BlastIdx indexes a collection of files in a SQL database.
The database knows the name, path, size, md5 hash, modification time, and host
of each file. This allows "locate-like" queries (using a part of the file
name), but also detection of duplicate files (using the md5 hash), or listing
of the most recent files (using modification time). The host attribute makes
it possible to use a single database to index contents spanning multiple
computers, and to detect (for instance) missing files in a replication scheme.
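The schema described above could be sketched as follows with the stdlib
sqlite3 module; the column names here are assumptions for illustration, not
the actual BlastIdx schema.

```python
import sqlite3

# Hypothetical sketch of the "files" table; actual column names in
# BlastIdx may differ.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE files (
        name  TEXT,     -- file name
        path  TEXT,     -- directory containing the file
        size  INTEGER,  -- size in bytes
        md5   TEXT,     -- hex digest of the file contents
        mtime INTEGER,  -- modification time (unix timestamp)
        host  TEXT      -- machine the file was indexed on
    )
""")

# A "locate-like" query: match on a part of the file name.
rows = db.execute(
    "SELECT path, name FROM files WHERE name LIKE ?", ("%holiday%",)
).fetchall()
```

The same table then serves duplicate detection (GROUP BY md5) and
"most recent files" listings (ORDER BY mtime) without extra structures.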
There are also some other useful fields: the time when the file was
indexed, and a timestamp indicating when the file was last checked by the
indexer. The "indexation time" is useful to replicate the index to another
database (to fetch only newer records), or to check which contents were
recently added. The "last checked time" is just a heartbeat confirming that
the file was properly checked by the indexer.
A main table, "files", holds file information. There is one row per file.
A file which exists on many hosts will have one row per host. Also, if a
file is modified, the old entries remain (until some cleanup job purges
them), allowing searches for changed names or old files.
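Duplicate detection from the "files" table reduces to grouping rows by
hash; a minimal sketch, assuming hypothetical column names:

```python
import sqlite3

# Illustration of duplicate detection: files sharing an md5 hash on the
# same host are considered duplicates. Table layout is an assumption.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE files (name TEXT, path TEXT, size INTEGER,"
           " md5 TEXT, host TEXT)")
db.executemany("INSERT INTO files VALUES (?,?,?,?,?)", [
    ("a.mp3", "/music",  100, "d41d8c", "hosta"),
    ("b.mp3", "/backup", 100, "d41d8c", "hosta"),
    ("c.mp3", "/music",  200, "ffffff", "hosta"),
])
# md5 values appearing more than once on a host are duplicates.
dupes = db.execute("""
    SELECT md5, COUNT(*) AS n FROM files
    GROUP BY md5, host HAVING n > 1
""").fetchall()
# dupes == [("d41d8c", 2)]
```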
Another table, "conf", holds the configuration of the program. Storing the
configuration in the DB allows querying and customizing the index thru a
single interface. This configuration table works like a hash-table, where one
key yields multiple values. For example, the configuration stores the paths
that the indexer has to scan. Each entry in the "conf" table has a
"hostname" field, so that the configuration entries of all hosts can be
stored in a single table of a centralized database.
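The one-key-to-many-values behavior could look like this; the table and
key names ("scan_path" etc.) are made up for the example:

```python
import sqlite3

# Sketch of the "conf" table: a hash-table where one key can yield
# multiple values, scoped by hostname. Names are assumptions.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE conf (hostname TEXT, key TEXT, value TEXT)")
db.executemany("INSERT INTO conf VALUES (?,?,?)", [
    ("hosta", "scan_path", "/movies"),
    ("hosta", "scan_path", "/music"),
    ("hostb", "scan_path", "/data"),
])
# Fetch every path the indexer on "hosta" has to scan.
paths = [r[0] for r in db.execute(
    "SELECT value FROM conf WHERE hostname = ? AND key = ?",
    ("hosta", "scan_path"))]
```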
A third table, "logs", keeps track of the runs of the indexer job.
It holds the begin and end time of each run of the indexer process. This is
useful to know the time range of "valid" entries (an entry is valid if its
"last checked time" is in the time span of the last index run).
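The validity check could be expressed as a query against the latest "logs"
entry; again the column names are hypothetical:

```python
import sqlite3

# Sketch of the validity check: an entry is valid if its last-checked
# timestamp falls inside the time span of the most recent indexer run.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE logs (host TEXT, begin_time INTEGER,"
           " end_time INTEGER)")
db.execute("CREATE TABLE files (name TEXT, host TEXT,"
           " last_checked INTEGER)")
db.executemany("INSERT INTO logs VALUES (?,?,?)",
               [("hosta", 100, 200), ("hosta", 300, 400)])
db.executemany("INSERT INTO files VALUES (?,?,?)",
               [("old.avi", "hosta", 150), ("fresh.avi", "hosta", 350)])
# Keep only files checked during the most recent run of their host.
valid = db.execute("""
    SELECT f.name FROM files f
    WHERE f.last_checked BETWEEN
          (SELECT MAX(begin_time) FROM logs WHERE host = f.host)
      AND (SELECT MAX(end_time)   FROM logs WHERE host = f.host)
""").fetchall()
# valid == [("fresh.avi",)]
```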
In its current version, BlastIdx is a single Python script, with the
configuration embedded in its first lines. The script is self-documented,
includes an extensive help and a tutorial, and should require no further
documentation as long as the reader knows the purpose of the program.
A cron job is responsible for updating the database. It gets a number of
paths from the configuration database, and walks them recursively. Each file
is checked according to its full path, name and size; if a matching record
is found, the file is considered "unchanged", and only the "last checked
time" is updated. Otherwise, the md5 hash is computed (this can take a long
time on big files!) and a new record is inserted.
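The core of such a pass can be sketched in a few lines; here a plain dict
stands in for the SQL database, and the function names are invented for
the example:

```python
import hashlib
import os

def md5_of(path, chunk=1 << 20):
    """Hash a file incrementally, so big files don't fill memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def scan(root, index, now):
    """One indexer pass. index maps (path, name, size) -> record dict."""
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            full = os.path.join(dirpath, name)
            size = os.path.getsize(full)
            key = (dirpath, name, size)
            if key in index:
                # Matching path, name and size: "unchanged", heartbeat only.
                index[key]["last_checked"] = now
            else:
                # New or changed file: compute the md5 (can be slow!)
                # and insert a fresh record.
                index[key] = {"md5": md5_of(full), "last_checked": now}
```

On the second and later runs, unchanged files skip the expensive hashing
step entirely, which is what makes nightly cron runs cheap.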
Note that there can be multiple records for the same file path and name,
but with different sizes; also, a file can be indexed in the database but
no longer exist: this makes it possible to keep track of moved, deleted,
or changed files.
One goal of this project is to allow centralized indexation, to offer a
single query interface to multiple hosts. This is achieved thru the use of
the "host" field in each table.
Another goal is to allow disconnected operation and replication, that is,
merging of indexes. This is possible thanks to the "index time" field, which
makes it possible to fetch entries that are newer than a given timestamp. A
database can merge records from another database using this mechanism, and
duplicate its "logs" records to keep track of the status of the merged
records.
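The merge step could be sketched like this: pull every record whose index
time is newer than the last sync point, and insert it locally. The schema
and the merge function are assumptions for illustration:

```python
import sqlite3

def merge(src, dst, since):
    """Copy records newer than `since` from src into dst; return count."""
    rows = src.execute(
        "SELECT name, host, index_time FROM files WHERE index_time > ?",
        (since,)).fetchall()
    dst.executemany("INSERT INTO files VALUES (?,?,?)", rows)
    return len(rows)

src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
for db in (src, dst):
    db.execute("CREATE TABLE files (name TEXT, host TEXT,"
               " index_time INTEGER)")
src.executemany("INSERT INTO files VALUES (?,?,?)",
                [("a", "hosta", 10), ("b", "hosta", 20)])
# Only records newer than the sync point (here, 15) are fetched.
merged = merge(src, dst, since=15)
```

A real implementation would also copy the matching "logs" rows, so the
receiving database knows the time spans its merged records came from.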