The Optimal Archive Format for Backtory

Posted by Jesco Freund at Feb. 13, 2010 9:11 p.m.

During the last few days, I've been thinking about which archive format to use with Backtory. At first, everything seemed clear: I intended to use tar (or rather USTar), a widespread standard, which would have made backup data accessible with any standard-compliant tar implementation, as found on every Unix system I know. However, a closer look at the specification of the tar file format reveals some weaknesses that make it a bad choice for a differential backup tool. The worst of them are:

  • Tar archives have no index or table of contents. This means the whole file has to be scanned to find out what it contains, and the same applies to extracting a single file.
  • Tar only reserves 100 bytes for file names. Working with longer path names is possible, yet painful.
  • Tar header information is encoded in ASCII, which makes it difficult to work with international character sets.
  • Tar archives cannot store arbitrary meta data, meaning
    • it is not possible to encode files in any way (encryption, compression, …) before adding them to an archive, since there is no place to record which encoding was used
    • it is impossible to store extended information like ACLs for a file without resorting to ugly workarounds (like creating .meta files for each file)
  • When it comes to compression, tar files can only be post-processed with a compressor, meaning the entire archive has to be decompressed before it can be scanned.
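To make the first two points concrete, here is a minimal sketch of what "scanning" an uncompressed tar file involves. It relies only on the classic header layout (512-byte blocks, the name in bytes 0–99, the size as an octal string in bytes 124–135); the function name `scan_tar` and the demo archive are made up for illustration, and extensions such as the ustar prefix field or GNU long names are ignored.

```python
import io
import tarfile

BLOCK = 512  # tar works in 512-byte blocks


def scan_tar(f):
    """Walk a tar stream header by header; return [(name, data_offset, size)]."""
    entries = []
    while True:
        header = f.read(BLOCK)
        if len(header) < BLOCK or header == bytes(BLOCK):
            break  # zero block (end-of-archive marker) or truncated file
        # The name field is only 100 bytes wide and NUL-padded.
        name = header[0:100].rstrip(b"\0").decode("ascii", "replace")
        # The size is stored as an octal ASCII string.
        size = int(header[124:136].rstrip(b" \0").decode("ascii") or "0", 8)
        entries.append((name, f.tell(), size))
        # Data is padded to a full block; skip it to reach the next header.
        f.seek((size + BLOCK - 1) // BLOCK * BLOCK, io.SEEK_CUR)
    return entries


# Demo: build a small in-memory archive, then list it the hard way.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    for name, data in [("a.txt", b"hello"), ("b.txt", b"x" * 600)]:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)
entries = scan_tar(buf)
```

Even in this tiny example, finding `b.txt` means reading the header of `a.txt` first, because only that header says where the next one starts. On a remote archive, every one of those hops is another read at a new offset.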

This rant may give you some more reasons why tar is really a bad choice, but to sum it up: I consider it too painful to download a complete backup archive to the local hard disk, decompress it there and then scan through the whole file just to restore one single file. Wouldn't it be much smarter if Backtory only had to download the archive header and then work out which part of the archive actually has to be downloaded and post-processed (e. g. decompressed)? Even with stupid old FTP this would be possible: abort RETR after x received bytes, then use REST followed by RETR, and ABOR again, to fetch a specific part of a file.
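The REST/RETR/ABOR idea can be sketched with Python's standard `ftplib`. This is only an illustration under assumptions: the helper name `fetch_range`, the host and the file name are hypothetical, and servers differ in how they reply to ABOR, so real code would need more careful reply handling than shown here.

```python
from ftplib import FTP


def fetch_range(ftp, path, offset, length):
    """Fetch `length` bytes of `path`, starting at byte `offset`."""
    ftp.voidcmd("TYPE I")  # binary mode, so REST offsets are byte offsets
    # With rest=offset, ftplib sends "REST <offset>" before the RETR command.
    conn = ftp.transfercmd(f"RETR {path}", rest=offset)
    chunks, remaining = [], length
    try:
        while remaining > 0:
            chunk = conn.recv(min(remaining, 8192))
            if not chunk:
                break  # the file ended before `length` bytes arrived
            chunks.append(chunk)
            remaining -= len(chunk)
    finally:
        conn.close()  # stop receiving once we have what we need
    ftp.abort()  # send ABOR so the server gives up on the transfer
    return b"".join(chunks)


# Hypothetical usage (host, credentials and file name are placeholders):
# ftp = FTP("backup.example.org")
# ftp.login("user", "secret")
# part = fetch_range(ftp, "backup.arc", 4096, 512)
```

With such a helper, a restore could first fetch only the region holding the index and then exactly the byte ranges of the files it needs.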

So what would the optimal archive format for Backtory look like? I think this is best described by a list of requirements:

  • The archive must be indexed. The minimum requirement for an index would be a table providing (relative) path names and offsets to the meta data and data sections of each file. The index table must be located at a predictable or easily determinable position within the archive file.
  • For each file, a set of standard meta data must be stored (essentially the information provided by lstat).
  • The archive format must allow arbitrary meta data for each file. Some of them could be standardized (e. g. encoding or compression method), others may vary from application to application (e. g. ACL data, MAC labels, encryption method, …).
  • It must be easy and cheap to extend an archive. To be more precise, I would consider it unacceptable if something had to be inserted at the beginning of an archive file, entailing a shift of all subsequent bytes, which means all offset data would have to be recalculated and nearly all data in the file reorganized on the file system. “Easy” and “cheap” would be perfectly achieved if an archive could be extended even via FTP (but I consider this very unlikely to be achievable without breaking the other requirements).
  • Recoverability in case of a damaged index: by doing a single-pass scan over the meta data and content data blocks, it should be possible to regain all information necessary to rebuild the archive index.
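One way such requirements could fit together is sketched below. This is purely a hypothetical toy layout, not the format Backtory will use: self-describing records are written back to back, followed by an index table and a fixed-size trailer at the very end that points to the index. All names, magic values and field widths are invented, and meta data is treated as an opaque byte string.

```python
import io
import struct

REC_MAGIC = b"BREC"                 # hypothetical record marker
END_MAGIC = b"BEND"                 # hypothetical trailer marker
REC_HDR = struct.Struct("<4sHIQ")   # magic, path len, meta len, data len
TRAILER = struct.Struct("<Q4s")     # index offset, magic (always the last 12 bytes)


def load_index(f):
    """Find the index via the trailer; return ({path: offset}, index_offset)."""
    f.seek(-TRAILER.size, io.SEEK_END)
    index_offset, magic = TRAILER.unpack(f.read(TRAILER.size))
    if magic != END_MAGIC:
        raise ValueError("trailer missing or damaged")
    f.seek(index_offset)
    (count,) = struct.unpack("<I", f.read(4))
    index = {}
    for _ in range(count):
        path_len, offset = struct.unpack("<HQ", f.read(10))
        index[f.read(path_len).decode("utf-8")] = offset
    return index, index_offset


def append_file(f, path, meta, data):
    """Extend the archive; existing records are never moved or rewritten."""
    if f.seek(0, io.SEEK_END) == 0:
        index, rec_offset = {}, 0
    else:
        index, rec_offset = load_index(f)
    f.seek(rec_offset)
    f.truncate()                    # drop the old index; the new record replaces it
    raw = path.encode("utf-8")
    f.write(REC_HDR.pack(REC_MAGIC, len(raw), len(meta), len(data)))
    f.write(raw + meta + data)
    index[path] = rec_offset
    index_offset = f.tell()
    f.write(struct.pack("<I", len(index)))
    for name, offset in index.items():
        raw = name.encode("utf-8")
        f.write(struct.pack("<HQ", len(raw), offset) + raw)
    f.write(TRAILER.pack(index_offset, END_MAGIC))


def read_file(f, path):
    """Use the index to jump straight to one record; return (meta, data)."""
    index, _ = load_index(f)
    f.seek(index[path])
    _, path_len, meta_len, data_len = REC_HDR.unpack(f.read(REC_HDR.size))
    f.seek(path_len, io.SEEK_CUR)
    return f.read(meta_len), f.read(data_len)


def rebuild_index(f):
    """Recover the index with a single pass over the records."""
    index, pos = {}, 0
    f.seek(0)
    while True:
        hdr = f.read(REC_HDR.size)
        if len(hdr) < REC_HDR.size or hdr[:4] != REC_MAGIC:
            break                   # reached the index or the end of the file
        _, path_len, meta_len, data_len = REC_HDR.unpack(hdr)
        index[f.read(path_len).decode("utf-8")] = pos
        f.seek(meta_len + data_len, io.SEEK_CUR)
        pos = f.tell()
    return index


archive = io.BytesIO()
append_file(archive, "a.txt", b"mode=0644", b"hello")
append_file(archive, "dir/b.txt", b"", b"world!")
```

In this toy layout, extending the archive only touches its tail (the old index is overwritten by the new record and a fresh index), which is at least conceivable over FTP with REST followed by STOR. Because every record carries its own path and lengths, `rebuild_index` can reconstruct the table if the index or trailer is lost.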

Well, that's it. However, I have not found any existing archive format that covers these requirements (at least none that is patent-free and has at least one open source implementation). And before anyone starts rhapsodizing about xar, I'd like to state why I consider xar unfit for Backtory:

  • The meta data block is in XML, which means you need a fat parser to process it.
  • The heap (all file content data) is useless without the meta data ⇒ recovering a damaged toc is anything but easy (I doubt it's possible at all).
  • The toc (XML meta data) is located at the beginning of a xar container. When extending a xar archive, all data inside must be relocated since its toc has to be extended.

I guess I have to design my own container format for Backtory. However, if I really do so, I will implement it as a library independent of Backtory and give it its own CLI tool. I already have some idea of what the archive container could look like, but there are still some details I have to work out. Stay tuned, I'll keep you informed about what I'll do and how I will implement it…

Defined tags for this entry: backtory, code, programming
