Getting Started with BagIt in 2018
Take two!
In December, I hastily wrote an update to an old post about BagIt, the Library of Congress' open-source specification for hierarchical packaging of files to support safe data storage and transfer. The primary motivation for the update was some issues that the Video Preservation course I work with encountered with my instructions for installing the bagit-python command-line tool, so I wanted to double-check my process there and make sure I was guiding readers correctly. I also figured that it had been a couple years and I could write about new implementations while I was at it. A cursory search turned up a BagIt-for-Ruby library, so I threw that in there, posted, *then* opened up a call for anything I'd missed.
It was at this point, as I sifted through the various scripts, apps, tools and libraries that create bags in some way, that I realized I had lost the thread of what I was even trying to summarize or explain.
Every piece of software using the BagIt spec ever? That, happily, is a fool's errand - the whole point of the spec is that it's super easy and flexible to implement, no matter how short the script. So...there's a lot of implementations.
Every programming language with an available port/module for creating bags according to the BagIt spec? Mildly interesting for hybrid archivist/developers, but probably of less practical use for preservation students, or the average user/creator just trying to take care of their own files, or archivists that are less programming-inclined. A Ruby module for BagIt is objectively cool and useful - for those working/writing apps and scripts in Ruby. Given that setting up a Ruby dev environment requires some other command-line setup that I didn't even get into, someone's likely not heading straight to that module right out of the gate.
"Using BagIt" was/is the wrong framework. Too broad, too undefined, and as Ed Summers pointed out, antithetical to the spirit in which a simple, open source specification is made in the first place: to allow anyone to use it, anywhere, however they can - not according to one of four or five methods prescribed in a blog post.
So I am rewriting this post from the mindset, not of "here's all the forms and tools in which BagIt exists", but rather, "ok, so I'm learning what a bag is and why it's useful - how can I make one to get started?"
Because the contents of a specification are terrific and informative, but in my experience nothing reinforces understanding of a spec like a concrete example. And not only that, but one step further - *making* an example. Technical concepts without hands-on labwork or activities to solidify them get lost - and budding digital preservationists told to use the BagIt spec need somewhere to start.
So whether you're just trying to securely back up your personal files to a cloud service, or trying to get a GLAM institution's digital repository to be OAIS-compliant, validation and fixity start at square one. Let me do that as well.
What's a bag?
Just for refresher's sake, I'm going to re-post here what I wrote back in 2016 - so that this post can stand alone as a primer:
One of the big challenges in digital archiving is file fixity – a fancy term for checking that the contents of a file have not been changed or altered (that the file has remained “fixed”). There’s all sorts of reasons to regularly verify file fixity, even if a file has done nothing but sit on a computer or server or external hard drive: to make sure that a file hasn’t corrupted over time, that its metadata (file name, technical specs, etc.) hasn’t been accidentally changed by software or an operating system, etc.
But one of the biggest threats to file fixity is when you move a file – from a computer to a hard drive, or over a server. Think of it kind of like putting something in the mail: there are a lot of points in the mailing process where a computer or USPS employee has to read the labeling and sort your mail into the proper bin or truck or plane so that it ends up getting to the correct destination. And there’s a LOT of opportunity for external forces to batter and jostle and otherwise get in your mail’s personal space. If you just slap a stamp on that beautiful glass vase you bought for your mother’s birthday and shove it in the mailbox, it’s not going to get to your mom in one piece.
So a “bag” is a kind of special digital container – a way of packaging files together to make sure what we get on the receiving end of a transfer is the same thing that started the journey (like putting that nice glass vase in a heavily padded box with “fragile” stamped all over it).
Sounds great! How do I make a bag?
At its core, all you need to make a bag out of a digital file or group of files is an editor capable of making plain text files (.txt) and an ability to generate MD5 checksums. An MD5 generator takes *any* string of digital information - including an entire file - and encodes it into a 128-bit fingerprint; that is, a 32-character string of seemingly "random" letters and numbers. Running an MD5 generator on the same file will always produce the same 32-character string. If the file changes in some way (even some change or edit invisible to the user), the MD5 string will change as well. So this process of generating and checking strings allows you to know whether a file is exactly the same on the receiving end of a transfer as it was at the beginning.
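To make that concrete, here's a minimal sketch in Python (my pick for illustration; any MD5 tool will do) that uses only the standard library's hashlib module. The function name md5_of_file and the sample file name are made up for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 checksum of a file as a 32-character hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so a large video file never has to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "letter.txt"  # hypothetical file name
        sample.write_bytes(b"hello")
        before = md5_of_file(sample)

        sample.write_bytes(b"hello ")  # the same text plus one trailing space
        after = md5_of_file(sample)

        print(before)
        print(after)
        print("unchanged?", before == after)
```

Run as a script, this prints two different 32-character strings: a single trailing space, invisible in most editors, is enough to change the checksum.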
BagIt bags facilitate this process via a "manifest" - a text file listing all the digital files contained in the bag (the "data" in question) and their corresponding MD5 checksums. (An optional "tag manifest" does the same job for the bag's own text files.) Packaged together (along with some meta information on the BagIt spec and the bag itself), this all allows for convenient fixity checking.
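Here's roughly what that fixity check looks like on the receiving end, as a sketch rather than a full validator. It assumes the manifest is named manifest-md5.txt and sits at the top of the bag, with one "checksum, whitespace, path" entry per line; the function names are my own, and it skips details a real tool handles (other hash algorithms, specially-encoded file names, files present but not listed).

```python
import hashlib
from pathlib import Path


def read_manifest(bag_dir):
    """Parse manifest-md5.txt into {relative_path: expected_checksum}."""
    expected = {}
    manifest = Path(bag_dir) / "manifest-md5.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if line.strip():
            # Each line is "<checksum> <path>", separated by whitespace.
            checksum, relative_path = line.split(None, 1)
            expected[relative_path.strip()] = checksum.lower()
    return expected


def failed_files(bag_dir):
    """Return listed paths that are missing or whose MD5 no longer matches."""
    failures = []
    for relative_path, checksum in read_manifest(bag_dir).items():
        target = Path(bag_dir) / relative_path
        if not target.is_file():
            failures.append(relative_path)
            continue
        # Reads the whole file at once: fine for a sketch, not for huge files.
        actual = hashlib.md5(target.read_bytes()).hexdigest()
        if actual != checksum:
            failures.append(relative_path)
    return failures
```

An empty list back from failed_files means every file named in the manifest arrived intact.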
Convenient, though, in the sense of easing automation. While you *can* put together a bag by hand - generating checksums for each file, copying them into text files to create the manifests, structuring the data and manifests together in BagIt's dictated hierarchy - that is a copy/paste nightmare, and not exactly going to encourage the computer-shy into healthier digipres practice.
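For a sense of how little is actually involved once a script does the copy/pasting, here is a bare-bones sketch of those by-hand steps in Python. It's an illustration under assumptions, not a replacement for the tools below: the function name make_bag_by_hand is mine, I've declared version 0.97 of the spec (swap in whichever version you're targeting), and it writes only the bag declaration and an MD5 manifest, with no extra metadata.

```python
import hashlib
import shutil
from pathlib import Path


def make_bag_by_hand(source_dir, bag_dir):
    """Copy source_dir into bag_dir/data and write the two required text files."""
    source_dir, bag_dir = Path(source_dir), Path(bag_dir)
    payload = bag_dir / "data"
    # Creates bag_dir and data/ as needed; raises if data/ already exists.
    shutil.copytree(source_dir, payload)

    # The bag declaration: names the spec version and the tag-file encoding.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # The manifest: one "<checksum>  <path relative to the bag>" line per file.
    lines = []
    for path in sorted(payload.rglob("*")):
        if path.is_file():
            checksum = hashlib.md5(path.read_bytes()).hexdigest()
            lines.append(f"{checksum}  {path.relative_to(bag_dir).as_posix()}\n")
    (bag_dir / "manifest-md5.txt").write_text("".join(lines), encoding="utf-8")
    return bag_dir
```

Calling make_bag_by_hand("/path/to/directory", "/path/to/bag") leaves the original folder untouched and produces a new folder holding bagit.txt, manifest-md5.txt and a data folder with the copied files.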
This is why simple scripts and tools and apps are handy. Down the line, when you're creating your own archival workflow, you may want to find or tweak or make your own process for creating bags - but for your first bag, there's no need to reinvent the wheel.
I'm going to cover intro tools here, for either the command line or GUI user.
Command Line Tools
- this Bash script
A simple shell script by Ed that requires just two arguments: the directory you want to bag, and an output directory (in which to put the bag).
Hit the green "Download" button in the corner of the GitHub page, select the ZIP file, then unzip the result. Move the "bagit.sh" file inside to a convenient/accessible location on your computer.
Once in Terminal, you can run this Bash script by navigating to wherever you put it, then executing it with:
$ ./bagit.sh /path/to/directory /path/to/bag
or
$ bash bagit.sh /path/to/directory /path/to/bag
(the "./" prefix and the "bash" command do the same thing here - telling the terminal to execute the bagit.sh script)
The "/path/to/directory" should be a folder containing all the files you want to be in the bag. Then you will specify the output path for the bag with "/path/to/bag". Both can be filled in by dragging and dropping folders from the Finder.
- bagit-python
Bagit-python is the Library of Congress's officially-supported command line utility for making and working with bags. It requires a working Python interpreter on your computer, plus Python's package manager, "pip". By default, macOS comes with a Python interpreter (2.7.10), but not pip. So we go to the popular command-line Mac package manager Homebrew to put this all together.
Sigh. OK. So one of the reasons this post didn't come out last week is that, literally in that same time frame, Homebrew went through....something with regards to their Python packages and how they behaved with Python 2.x vs Python 3.x vs the Python installation that comes with your Mac. (They've locked/deleted a lot of the conversations and issues now, but it was really the dark side of FOSS projects in there for a bit.) I kept trying to check my instructions were correct, and meanwhile, every "$ brew update" was sending my python installs haywire.
It seems like they've finally settled, but I'd still now generally recommend giving this page a once-over before working with python-via-homebrew.
But to summarize: if you want to work with Python 3.x, you install a *package* called "python" and then invoke it with python3 and pip3 commands. If you want to use Python 2.x, you install a package called "python@2" and then invoke with either python and pip or python2 and pip2 commands.
...got it?
For the purposes of just using the bagit-python command-line tool, at least, it doesn't matter whether you choose Python 2.x or 3.x. It'll work with both. But stick with one or the other through this installation process. So either:
$ brew install python
$ sudo pip3 install bagit
or:
$ brew install python@2
$ sudo pip install bagit
That's it! It's just making sure you have a version of Python installed through Homebrew, then using the Python package/module installer "pip" to install the bagit-python tool. I highly recommend using admin privileges with "sudo" to globally install and avoid some weird permissions issues that may arise from trying to run python scripts and tools like bagit-python otherwise.
Once installed, look over the help page with $ bagit.py --help
to see the command syntax - and all the features that you can cover! Including using different hash generators (rather than MD5), adding metadata, validating existing bags rather than creating new ones, etc.
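The same package can also be imported into your own Python scripts, if you end up there down the line. A small sketch, assuming bagit-python was installed via pip as above: the wrapper function bag_and_check and the contact name are hypothetical, while make_bag and is_valid come from the bagit module itself. Note that make_bag works in place, moving your files into a "data" folder inside the directory you hand it.

```python
def bag_and_check(directory, contact_name):
    """Turn `directory` into a bag in place, then report whether it validates."""
    import bagit  # third-party module, installed by `pip install bagit`

    bag = bagit.make_bag(
        directory,
        {"Contact-Name": contact_name},  # metadata, written to bag-info.txt
        checksums=["sha256"],            # a different hash than the MD5 discussed earlier
    )
    return bag.is_valid()


if __name__ == "__main__":
    # "/path/to/directory" is a placeholder; point it at a real folder to try this.
    print(bag_and_check("/path/to/directory", "Your Name"))
```

Keeping the import inside the function is just a convenience for this sketch, so the file can be read and loaded even on a machine where bagit isn't installed yet.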
*** a note about bagit-java*** If you are using Homebrew and just run $ brew install bagit
it will install the bagit-java 4.12.3 library and command-line tool. The LOC no longer supports and doesn't recommend this tool for command line use, and the --help instructions that come with it don't even actually reflect the command syntax you have to use to make it work. So! This isn't a recommendation but just a note for Homebrew users who might get confused about what's happening here.
GUIs
1. Bagger
Again, the LOC's official graphical utility program for creating and validating bags. Following the instructions from their GitHub repository linked above, you're going to download a release and then run it on macOS by finding and clicking on the "bagger.jar" file (you'll need a working Java install as well).
Inside Bagger, once you choose the "Create a Bag" option, Bagger will ask you to choose a "profile" - these just refer to the metadata fields available for inserting more administrative information about your bag and the files therein, within the bag itself. These are really useful for keeping metadata consistent if you're creating a whole bunch of bags, but choosing "<no profile>" is also totally acceptable to get started (you can always re-open bags and insert more metadata later!).
"Create Bag in Place" is also a useful option if you don't want (or your digital storage limitations even *prevent* you from having) two copies of your files (the original + the copy inside the "data" folder in your bag). Rather than copying and creating the bag in a new directory elsewhere, it'll just move around/checksum/restructure the files according to the BagIt spec within the original directory.
2. Exactly
A GUI developed by AVP and the University of Kentucky that combines the bagging process with file transfer - which is the presumed end-goal of bagging in any case. To that end, Exactly doesn't "bag in place" - you always have to pick a source file/directory (or sources - Exactly will bundle them all together into one bag) and a destination for the resulting, created bag. Like Bagger, you can also add metadata via custom-designed fields or import templates/profiles. Added support for FTP or SFTP transfers to remote servers (in addition to locally-attached network storage units like a Samba or Windows share) makes it a simple starter option for file delivery.
***************************
If you're getting started with the BagIt spec, these are the places I'd begin. But as to what implementation *you* can come up with from there, based on your personal/institutional needs...that's up to you!