FileFerret

FileFerret is (well... will be) an application for managing large quantities of files, especially multimedia, including files stored offline.

Details

FileFerret design

Features

This is just a very brief description typed while eating supper.

Manual Categorization

This is much like categorization in MediaWiki. Any file may belong to zero or more categories. Any category may belong to zero or more other categories. Categories are assigned by the user, not by algorithms – but the user may use algorithms to help locate items for categorization (see Search Management).
I should also be able to assign arbitrary fields and values to given files or groups of files, e.g. I should be able to label each music file with the song's artist, song title, album, etc. and search, sort, or group on those values. (This idea seems like it leads to a more complicated set of ideas which I will have to go into at a later time.)
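As a rough illustration of the model described above, here is a minimal in-memory sketch in Python. The `Catalog` class and all of its method names are hypothetical (nothing here is an existing FileFerret interface), and a real implementation would presumably sit on a database rather than on dictionaries.

```python
from collections import defaultdict


class Catalog:
    """Toy catalog: files in categories, categories in categories, free-form fields."""

    def __init__(self):
        self.file_categories = defaultdict(set)   # file path -> categories assigned directly
        self.category_parents = defaultdict(set)  # category -> categories it belongs to
        self.file_fields = defaultdict(dict)      # file path -> {field name: value}

    def categorize(self, path, category):
        self.file_categories[path].add(category)

    def nest(self, category, parent):
        self.category_parents[category].add(parent)

    def set_field(self, path, name, value):
        self.file_fields[path][name] = value

    def all_categories(self, path):
        """Direct categories plus every ancestor category (safe against cycles)."""
        seen = set()
        stack = list(self.file_categories.get(path, ()))
        while stack:
            cat = stack.pop()
            if cat not in seen:
                seen.add(cat)
                stack.extend(self.category_parents.get(cat, ()))
        return seen

    def files_in(self, category):
        """All files that belong to the category directly or through nesting."""
        return sorted(p for p in self.file_categories if category in self.all_categories(p))


catalog = Catalog()
catalog.categorize("scans/receipt-0412.png", "gas receipts")
catalog.nest("gas receipts", "business expenditures")
catalog.set_field("music/track01.ogg", "artist", "Example Artist")
print(catalog.files_in("business expenditures"))
```

The file is only assigned to "gas receipts", but it also turns up under "business expenditures" because the first category belongs to the second.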

Example

I keep scans of all my invoices, receipts, and bank deposits (well... all those I have time to scan). I also keep scans of various other paperwork, such as application forms. At present, I put these in folders largely based on where I have stored the paper originals. Usually the originals are tightly packed into envelopes, since I am unlikely to ever need to access the physical piece of paper (99.9+% of the time)... but just in case I need to locate a physical document (as happened in Staddon vs. Griever), it is useful to know where it is; hence the filing system. However, this prevents me from filing them in more useful groups, such as "gas receipts", or "business expenditures" or "things I did on August 23rd, 1997". Ideally, I should be able to look up documents by any or all of these categories -- "business-related gas receipts from the week of August 23rd-30th, 1997", for example.
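The lookup at the end of this example could be sketched roughly as follows, in Python. The record layout, the file names, the "envelope" labels and the `find` function are all invented for illustration; the point is only that physical location becomes one attribute among several rather than the organizing principle.

```python
from datetime import date

# Hypothetical records: each scan carries its categories, a date, and where the paper lives.
documents = [
    {"name": "receipt-0412.png", "categories": {"gas receipts", "business expenditures"},
     "date": date(1997, 8, 25), "location": "envelope 12"},
    {"name": "receipt-0413.png", "categories": {"gas receipts"},
     "date": date(1997, 8, 26), "location": "envelope 12"},
    {"name": "invoice-0098.png", "categories": {"business expenditures"},
     "date": date(1997, 8, 24), "location": "envelope 9"},
    {"name": "receipt-0501.png", "categories": {"gas receipts", "business expenditures"},
     "date": date(1997, 9, 14), "location": "envelope 13"},
]


def find(docs, required_categories, start, end):
    """Return documents that have ALL the required categories and fall within the date range."""
    required = set(required_categories)
    return [d for d in docs if required <= d["categories"] and start <= d["date"] <= end]


matches = find(documents, ["gas receipts", "business expenditures"],
               date(1997, 8, 23), date(1997, 8, 30))
for doc in matches:
    print(doc["name"], "is stored in", doc["location"])
```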

Versioning

Understands that different files may be representing the same "work", e.g. a digital photo in lossless TIFF, cleaned up a bit and saved to PNG, then reduced in size and saved to JPEG for use on the web -- three files, same basic image. Another example: audio digitized at 88.2 kHz and 32 bits, then cleaned up and normalized, then converted to 44.1 kHz and 16 bits (CD quality), and finally encoded as MP3 and OGG. Four different files, but the same work. We might need to access either the "cleaned up" high-res version or even the "raw" version later, but most of the time we only need the MP3 or OGG. For any given work, FileFerret should be able to quickly identify and locate all known files.
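One possible way to record this, sketched in Python: each "work" keeps a table of its known files, and each file optionally names the file it was derived from. The `Work` and `FileVersion` classes and the file names are assumptions made for this sketch, not a settled design.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FileVersion:
    path: str
    description: str
    derived_from: Optional[str] = None  # path of the version this one was made from


@dataclass
class Work:
    title: str
    versions: dict = field(default_factory=dict)  # path -> FileVersion

    def add(self, path, description, derived_from=None):
        if derived_from is not None and derived_from not in self.versions:
            raise KeyError(f"unknown source version: {derived_from}")
        self.versions[path] = FileVersion(path, description, derived_from)

    def lineage(self, path):
        """Walk back from one version to the original it ultimately came from."""
        chain = []
        while path is not None:
            chain.append(path)
            path = self.versions[path].derived_from
        return chain


photo = Work("Harbour photo")
photo.add("harbour.tiff", "lossless original")
photo.add("harbour-clean.png", "cleaned up", derived_from="harbour.tiff")
photo.add("harbour-web.jpg", "reduced for the web", derived_from="harbour-clean.png")

print(sorted(photo.versions))            # every known file for this work
print(photo.lineage("harbour-web.jpg"))  # web copy back to the original
```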

Search Management

This would be based on algorithmic searches where the criteria can be saved as a desktop object and updated at will or automatically – very similar to Apple's "smart folders" and Microsoft's "virtual folders" (see [1]), but with some strategic support for versioning and manual categorization. I want to be able to use an algorithm to find likely candidates for manual categorization, and to be able to see how/if files in the resulting set may already have been categorized. I would also like to be able to mark files as "categorized"; by default, a file with categories assigned would be marked only "partly categorized", and I would want to be able to indicate whether I was done assigning categories (and for this to show up in some obvious place). I think the best solution would be for the standard file display to be configurable as far as showing a file's membership (or non-membership) in user-selected categories with either a checkbox, a change of text color, change of icon, or whatever other clear indicators might be available in the UI. Column view should also allow data values to be displayed, e.g. if a music file was assigned a field called "artist", I should be able to put the "artist" column in my file view so I can see the artist for each file that had one assigned.
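A saved search with the three categorization states could look something like the Python sketch below. `Entry`, `SavedSearch` and the `done` flag are hypothetical names; the only behaviour taken from the description above is that a file with categories is "partly categorized" until the user says otherwise, and that a saved search can be re-run at will.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Entry:
    path: str
    categories: set = field(default_factory=set)
    done: bool = False  # set by the user when they have finished assigning categories

    @property
    def status(self):
        if self.done:
            return "categorized"
        return "partly categorized" if self.categories else "uncategorized"


@dataclass
class SavedSearch:
    name: str
    criterion: Callable[[Entry], bool]  # the saved criteria, re-evaluated on every run

    def run(self, entries):
        return [e for e in entries if self.criterion(e)]


entries = [
    Entry("scan-001.png"),
    Entry("scan-002.png", {"gas receipts"}),
    Entry("notes.txt"),
]
search = SavedSearch(
    "PNG scans still to categorize",
    lambda e: e.path.endswith(".png") and e.status != "categorized",
)

before = [(e.path, e.status) for e in search.run(entries)]
entries[1].done = True  # user marks scan-002.png as finished
after = [e.path for e in search.run(entries)]
print(before)
print(after)
```

Re-running the same saved search after the user marks a file as done drops that file from the result set, which is the "updated at will" behaviour in miniature.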

Offline Storage Management

Ability to maintain record of physical media on which copies of each file are stored

Ability to bring up thumbnails, lossy-compressed, low-resolution, or other low storage versions of file when available on local media
Retains "fingerprint" of file (probably a checksum of some kind) so as to detect data corruption and identical copies under different names
Maintains categorizations of offline files
Assists with determining when online files are no longer needed and should be moved offline; assists with producing compressed versions of files for online storage when appropriate (e.g. user may want lossless images in offline archive, but keep JPEGs online for quick reference)
(Low priority) Assists with determining when individual media may be discarded, e.g. if all files on a write-once disc have been saved to another disc, or superseded somehow
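The fingerprinting and media-discard ideas in this list can be sketched in a few lines of Python. SHA-256 is an assumption here (the list above only says "a checksum of some kind"), and `find_duplicates` and `is_discardable` are hypothetical helper names. File contents are passed in as bytes to keep the sketch self-contained; real files would be read and hashed in chunks.

```python
import hashlib
from collections import defaultdict


def fingerprint(data: bytes) -> str:
    """Content fingerprint; a change in this value for the same file signals corruption."""
    return hashlib.sha256(data).hexdigest()


def find_duplicates(files):
    """files: mapping of name -> bytes. Returns groups of names whose content is identical."""
    groups = defaultdict(list)
    for name, data in files.items():
        groups[fingerprint(data)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def is_discardable(medium, holdings):
    """holdings: mapping of media label -> set of fingerprints stored on that medium.
    A medium may be discarded if everything on it also exists on some other medium."""
    elsewhere = set()
    for label, prints in holdings.items():
        if label != medium:
            elsewhere |= prints
    return holdings[medium] <= elsewhere


files = {"a.png": b"same pixels", "copy of a.png": b"same pixels", "b.png": b"other pixels"}
duplicates = find_duplicates(files)

fa, fb = fingerprint(files["a.png"]), fingerprint(files["b.png"])
holdings = {"disc 1": {fa}, "disc 2": {fa, fb}}
print(duplicates)
print(is_discardable("disc 1", holdings), is_discardable("disc 2", holdings))
```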

Applications Programming Interface (API)

Other programs may wish to interact with the database of files and categories; some examples:

A jukebox player could let me see all online audio files which have been categorized as music, broken down by artist, year, album, genre, songwriter, source (CD, vinyl, tape, download) or any combination. It could also fetch archived higher-quality lossless copies of those same files for burning to a CD.
A financial management program (of the Quicken / Microsoft Money genre) could show me images of paperwork relating to a transaction (checks, invoices, deposit slips, sales receipts, etc.)
A music-CD-burning program could locate the best source files for a selection of songs (stored locally as MP3s but stored losslessly on backup media), give me a list of media to load, copy the needed files from each one as I load it, and then burn the CD from the lossless files and remove them when done.
The program which manages images for vbz.net currently performs a lot of these tasks, but not very well. It maintains a list of available images for each item of merchandise in the store. Images can be any of several different standard sizes. Some image sizes are in two different formats, e.g. PNG and JPEG, where I was trying to see which version had the best trade-off of file size and image quality; the alternate versions are not used, but might be needed for comparison (if only so I'm not tempted to try again). Some items are available in different colors or styles, too, and the vbz image manager keeps track of which images go with which style. FileFerret would help in managing these images; the current vbz-specific program is limited to indexing files within a certain folder (and its sub-folders), is very slow to re-index the collection (i.e. adding any new image takes about half an hour, because the entire folder has to be re-scanned), and doesn't have any way of maintaining images offline or keeping track of originals.
After scanning a lot of photographs into lossless form (PNGs), FileFerret could automatically generate lossy (JPEG) versions of the files for online storage and ensure that the original lossless versions were archived offline and removed from local storage. Images marked "critical" could be automatically backup-archived on discs stored off-site in case of disaster.
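To make the CD-burning example above a little more concrete, here is a Python sketch of the one query such a program would need: given a list of songs, which media must be loaded, and which files should be copied from each? The function name, the data layout, and the use of FLAC for the lossless copies are all assumptions for illustration; no such API exists yet.

```python
from collections import defaultdict


def plan_media_loads(songs, lossless_locations):
    """songs: list of work titles.
    lossless_locations: mapping of title -> (media label, path on that medium).
    Returns (files to copy grouped by medium, titles with no lossless copy on record)."""
    plan = defaultdict(list)
    missing = []
    for title in songs:
        if title in lossless_locations:
            label, path = lossless_locations[title]
            plan[label].append(path)
        else:
            missing.append(title)
    return dict(plan), missing


songs = ["Song A", "Song B", "Song C", "Song D"]
lossless_locations = {
    "Song A": ("Archive disc 7", "flac/song-a.flac"),
    "Song B": ("Archive disc 3", "flac/song-b.flac"),
    "Song C": ("Archive disc 7", "flac/song-c.flac"),
}

plan, missing = plan_media_loads(songs, lossless_locations)
for label, paths in plan.items():
    print("Load", label, "and copy", paths)
print("No lossless copy on record for:", missing)
```

Grouping by medium means each disc only has to be loaded once, however many of the selected songs it holds.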

Links

2008-12-16 Defeating Bedlam: blog entry about a couple of programs (Zotero and Papers) which attempt to solve some of the problems addressed by FileFerret. They fall into the trap of trying too hard to index content for you; it seems to me all that is really needed is the ability to automatically recognize duplicate files (which would solve the multiple-copies-downloaded problem) and a really good topic-organizer tool, both of which are core features of FF.
Virtual Volumes View implements the central idea behind FileFerret, but has some shortcomings:
- (major) "Virtual Folders" can only track entire volumes, not individual files (or maybe it can at least file folders, but still... sometimes stuff in a single folder needs to go in different places) – this made the Virtual View nearly useless to me, and made VVV useless as anything other than a searchable catalog of offline files
- Won't let you block particular folders (making it impractical to catalog volumes containing large numbers of irrelevant files)
- Infrequent screen updates
- Can't do anything while scan is in progress, including cancel or pause (without running another instance)
- No attempt to "fingerprint" files in order to identify duplicate data
- No attempt to help manage archiving of files, consolidation of redundant files, etc.
FileFerret directly addresses the problem described here (first paragraph)
It may be useful to adopt the freedesktop.org shared-filemetadata-spec for naming keys in the database
This project may be similar; can't tell from the description
libferris also seems related, though it doesn't seem to incorporate the idea that from the user's point of view, the physical location of the file usually doesn't matter, and is only one of many possible ways to organize files (many possible "views" of the listing of all files)
Hierarchical storage filesystem
Offline media content database
Wayback: User-level Versioning File System for Linux
cvsFS: mount a CVS tree as a local filesystem