Getting Started with BagIt in 2018

Take two!

In December, I hastily wrote an update to an old post about BagIt, the Library of Congress’ open-source specification for hierarchical packaging of files to support safe data storage and transfer. The primary motivation for the update was some issues that the Video Preservation course I work with encountered with my instructions for installing the bagit-python command-line tool, so I wanted to double-check my process there and make sure I was guiding readers correctly. I also figured that it had been a couple years and I could write about new implementations while I was at it. A cursory search turned up a BagIt-for-Ruby library, so I threw that in there, posted, *then* opened up a call for anything I’d missed.

Uhhhh – I missed a lot.

It was at this point, as I sifted through the various scripts, apps, tools and libraries that create bags in some way, that I realized I had lost the thread of what I was even trying to summarize or explain.

Every piece of software using the BagIt spec ever? That, happily, is a fool’s errand – the whole point of the spec is that it’s super easy and flexible to implement, no matter how short the script. So…there’s a lot of implementations.

Every programming language with an available port/module for creating bags according to the BagIt spec? Mildly interesting for hybrid archivist/developers, but probably of less practical use for preservation students, or the average user/creator just trying to take care of their own files, or archivists who are less programming-inclined. A Ruby module for BagIt is objectively cool and useful – for those working/writing apps and scripts in Ruby. Given that setting up a Ruby dev environment requires some other command-line setup that I didn’t even get into, someone’s likely not heading straight to that module right out of the gate.

“Using BagIt” was/is the wrong framework. Too broad, too undefined, and as Ed Summers pointed out, antithetical to the spirit in which a simple, open-source specification is made in the first place: to allow anyone to use it, anywhere, however they can – not according to one of four or five methods prescribed in a blog post.

So I am rewriting this post from the mindset, not of “here’s all the forms and tools in which BagIt exists”, but rather, “ok, so I’m learning what a bag is and why it’s useful – how can I make one to get started?”

Because the contents of a specification are terrific and informative, but in my experience nothing reinforces understanding of a spec like a concrete example. And not only that, but one step further – *making* an example. Technical concepts without hands-on labwork or activities to solidify them get lost – and budding digital preservationists told to use the BagIt spec need somewhere to start.

So whether you’re just trying to securely back up your personal files to a cloud service, or trying to get a GLAM institution’s digital repository to be OAIS-compliant, validation and fixity start at square one. Let me start there as well.

What’s a bag?

Just for refresher’s sake, I’m going to re-post here what I wrote back in 2016 – so that this post can stand alone as a primer:

One of the big challenges in digital archiving is file fixity – a fancy term for checking that the contents of a file have not been changed or altered (that the file has remained “fixed”). There’s all sorts of reasons to regularly verify file fixity, even if a file has done nothing but sit on a computer or server or external hard drive: to make sure that a file hasn’t corrupted over time, that its metadata (file name, technical specs, etc.) hasn’t been accidentally changed by software or an operating system, etc.

But one of the biggest threats to file fixity is when you move a file – from a computer to a hard drive, or over a server. Think of it kind of like putting something in the mail: there are a lot of points in the mailing process where a computer or USPS employee has to read the labeling and sort your mail into the proper bin or truck or plane so that it ends up getting to the correct destination. And there’s a LOT of opportunity for external forces to batter and jostle and otherwise get in your mail’s personal space. If you just slap a stamp on that beautiful glass vase you bought for your mother’s birthday and shove it in the mailbox, it’s not going to get to your mom in one piece.

So a “bag” is a kind of special digital container – a way of packaging files together to make sure what we get on the receiving end of a transfer is the same thing that started the journey (like putting that nice glass vase in a heavily padded box with “fragile” stamped all over it).

Sounds great! How do I make a bag?

At its core, all you need to make a bag out of a digital file or group of files is an editor capable of making plain text files (.txt) and an ability to generate MD5 checksums. An MD5 generator takes *any* string of digital information – including an entire file – and encodes it into a 128-bit fingerprint; that is, a 32-character string of seemingly “random” letters and numbers. Running an MD5 generator on the same file will always produce the same 32-character string. If the file changes in some way (even some change or edit invisible to the user), the MD5 string will change as well. So this process of generating and checking strings allows you to know whether a file is exactly the same on the receiving end of a transfer as it was at the beginning.
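You can watch this behavior for yourself on the command line (the file name here is just an example; the tool is “md5sum” on most Linux systems, and “md5” on macOS, with slightly different output formatting):

```shell
# Write a known string to a file (printf avoids adding a trailing newline)
printf 'hello' > sample.txt

# Generate the MD5 checksum -- identical contents always yield this same string
md5sum sample.txt
# -> 5d41402abc4b2a64296b27b3cd4e46fb  sample.txt

# Change even a single character and the fingerprint changes completely
printf 'Hello' > sample.txt
md5sum sample.txt
```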

BagIt bags facilitate this process via a “payload manifest” – a text file listing all the digital files contained in the bag (the “data” in question) and their corresponding MD5 checksums. Packaged together (along with some meta information on the BagIt spec and the bag itself), this all allows for convenient fixity checking.

Convenient, though, in the sense of easing automation. While you *can* put together a bag by hand – generating checksums for each file, copying them into text files to create the manifests, structuring the data and manifests together in BagIt’s dictated hierarchy – that is a copy/paste nightmare, and not exactly going to encourage the computer-shy into healthier digipres practice.
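To make that concrete, here’s the whole by-hand process scripted out – a bare-minimum bag, with made-up example file names, skipping optional extras like bag-info.txt (this uses GNU md5sum; macOS users would swap in “md5”):

```shell
# The BagIt hierarchy: a top-level folder with all payload files under data/
mkdir -p mybag/data
printf 'my precious file contents' > mybag/data/file1.txt

# The bag declaration: spec version and text encoding
printf 'BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n' > mybag/bagit.txt

# The payload manifest: one "<checksum>  <path>" line per file under data/
( cd mybag && md5sum data/file1.txt > manifest-md5.txt )

# On the receiving end of a transfer, re-run the checksums against the manifest
( cd mybag && md5sum -c manifest-md5.txt )
```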

This is why simple scripts and tools and apps are handy. Down the line, when you’re creating your own archival workflow, you may want to find or tweak or make your own process for creating bags – but for your first bag, there’s no need to reinvent the wheel.

I’m going to cover intro tools here, for either the command line or GUI user.

Command Line Tools

  1. this Bash script

    A simple shell script by Ed that requires just two arguments: the directory you want to bag, and an output directory (in which to put the bag).

    Hit the green “Download” button in the corner of the GitHub page, select the ZIP file, then unzip the result. Move the “bagit.sh” file inside to a convenient/accessible location on your computer.

    Once in Terminal, you can run this bash script by navigating to wherever you put it, then executing it with:

    [cc lang=”Bash”]$ ./bagit.sh /path/to/directory /path/to/bag[/cc]
    or
    [cc lang=”Bash”]$ bash bagit.sh /path/to/directory /path/to/bag[/cc]

    (the “./” and “bash” invocations do the same thing – both tell the Bash shell to execute the bagit.sh script)

    The “/path/to/directory” should be a folder containing all the files you want to be in the bag. Then you will specify the output path for the bag with “/path/to/bag”. Both can be accomplished with drag-and-dropping folders from the Finder.

  2. bagit-python

    Bagit-python is the Library of Congress’s officially-supported command-line utility for making and working with bags. It requires a working Python interpreter on your computer, plus Python’s package manager, “pip”. By default, macOS comes with a Python interpreter (2.7.10), but not pip. So we go to the popular command-line Mac package manager Homebrew to put this all together.

    Sigh. OK. So one of the reasons this post didn’t come out last week is that, literally in that same time frame, Homebrew went through…something with regards to their Python packages and how they behaved with Python 2.x vs Python 3.x vs the Python installation that comes with your Mac. (They’ve locked/deleted a lot of the conversations and issues now, but it was really the dark side of FOSS projects in there for a bit.) I kept trying to check that my instructions were correct, and meanwhile, every “$ brew update” was sending my Python installs haywire. It seems like they’ve finally settled, but I’d still now generally recommend giving this page a once-over before working with python-via-homebrew.

But to summarize: if you want to work with Python 3.x, you install a *package* called “python” and then invoke it with python3 and pip3 commands. If you want to use Python 2.x, you install a package called “python@2” and then invoke with either python and pip or python2 and pip2 commands.

…got it?

For the purposes of just using the bagit-python command-line tool, at least, it doesn’t matter whether you choose Python 2.x or 3.x. It’ll work with both. But stick with one or the other through this installation process. So either:

[cc lang=”Bash”]$ brew install python[/cc]
+
[cc lang=”Bash”]$ sudo pip3 install bagit[/cc]

or:
[cc lang=”Bash”]$ brew install python@2[/cc]
+
[cc lang=”Bash”]$ sudo pip install bagit[/cc]

That’s it! It’s just making sure you have a version of python installed through Homebrew, then using the python package/module installer “pip” to install the bagit-python tool. I highly recommend using admin privileges with “sudo” to globally install and avoid some weird permissions issues that may arise from trying to run python scripts and tools like bagit-python otherwise.

Once installed, look over the help page with [cc lang=”Bash”]$ bagit.py --help[/cc] to see the command syntax – and all the features it covers! Including using different hash algorithms (rather than MD5), adding metadata, validating existing bags rather than creating new ones, etc.

*** a note about bagit-java***
If you are using Homebrew and just run [cc lang=”Bash”]$ brew install bagit[/cc] it will install the bagit-java 4.12.3 library and command-line tool. The LOC no longer supports this tool and doesn’t recommend it for command-line use, and the --help instructions that come with it don’t even actually reflect the command syntax you have to use to make it work. So! This isn’t a recommendation, but just a note for Homebrew users who might get confused about what’s happening here.

GUIs

1. Bagger

Again, the LOC’s official graphical utility program for creating and validating bags. Following the instructions from their GitHub repository linked above, you’re going to download a release and then run it on macOS by finding and clicking on the “bagger.jar” file (you’ll need a working Java install as well).

Inside Bagger, once you choose the “Create a Bag” option, Bagger will ask you to choose a “profile” – these just refer to the metadata fields available for inserting more administrative information about your bag and the files therein, within the bag itself. These are really useful for keeping metadata consistent if you’re creating a whole bunch of bags, but choosing “<no profile>” is also totally acceptable to get started (you can always re-open bags and insert more metadata later!)

“Create Bag in Place” is also a useful option if you don’t want to have two copies of your files (or digital storage limitations even *prevent* you from having them) – the original + the copy inside the “data” folder in your bag. Rather than copying and creating the bag in a new directory elsewhere, it’ll just move around/checksum/restructure the files according to the BagIt spec within the original directory.

2. Exactly

A GUI developed by AVP and the University of Kentucky that combines the bagging process with file transfer – which is the presumed end-goal of bagging in any case. To that end, Exactly doesn’t “bag in place” – you always have to pick a source file/directory (or sources – Exactly will bundle them all together into one bag) and a destination for the resulting, created bag. Like Bagger, you can also add metadata via custom-designed fields or import templates/profiles. Added support for FTP or SFTP transfers to remote servers (in addition to locally-attached network storage units like a Samba or Windows share) makes it a simple starter option for file delivery.

***************************

If you’re getting started with the BagIt spec, these are the places I’d begin. But as to what implementation *you* can come up with from there, based on your personal/institutional needs…that’s up to you!

Classroom Access to Interactive DVDs

Normally my focus as MIAP Technician is on classroom support for courses in the MIAP M.A. curriculum – but, as a staff member of the wider NYU Cinema Studies department, there are occasionally cases where I can assist non-MIAP Cinema Studies courses with a need for archival or legacy equipment.

That was the case recently with a Fall 2017 course called “Interactive Cinema & New Media”, which challenged the skills I learned in MIAP regarding disk imaging, emulation, and legacy computing, and provides, I think, an interesting case study regarding ongoing access to multimedia software-based works from the ’90s and early 2000s.

In this project I worked closely with Marina Hassapopoulou, the Visiting Assistant Professor teaching the course; Ina Cajulis, recently hired as the department’s Special Events/Study Center Coordinator (also a Cinema Studies M.A. graduate who took several MIAP classes, including Handling Complex Media, the course most focused on interactive moving image works); and Cathy Holter, Cinema Studies Technical Coordinator.

Last fall when Marina was teaching “Interactive Cinema”, I worked briefly with her request to give students access to a multimedia work by Toni Dove, called “Sally or the Bubble Burst”. “Sally” is an interactive DVD-ROM in which users can navigate various menus, watch videos, and interact (sometimes via the keyboard, sometimes using audio input and speech recognition software) with a number of characters, primarily Sally Rand, a burlesque dancer from the mid-20th century. Because it was created/released in 2003, “Sally” has some unique technical requirements: namely, a PowerPC Mac running either OS 9.1-9.2 or OSX 10.2-10.6. At the time, we had to move quickly to make the DVD available for the class – after testing the disc on a couple of legacy OSX laptops from the Old Media Lab, we decided to temporarily keep an old PowerPC iBook running OSX 10.5 in the department’s Study Center lounge, where students from the “Interactive Cinema” course could book time to view “Sally”. This overall worked fine, although there was some amount of lag (some futzy and not-great sounds coming from the laptop’s internal disc drive made me prefer to run the disc off of a USB external drive – better for the disc’s physical safety, worse for its data rate), and the disc’s speech recognition components were not responsive, likely an issue with the laptop’s sound card.

Fast-forward to August 2017. Submitting her screening list for the semester, Marina let us know that not only would she be needing students to have access to “Sally or the Bubble Burst” again, but she was also expanding the course syllabus to include a number of similar interactive software-based works (by which, I’ll define, I mean CD- or DVD-ROMs with moving image material that require specific computer hardware or software components; not just an interactive DVD that will still play back in any common DVD player, which Marina also includes in her course but provide much less of a technical challenge). With more time to plan, I was interested in both more extended testing, to make sure “Sally” and all these works ran more as intended; and to have a discussion with Marina, Cathy, and Ina so we could strategize longer-term plans for access to these works. Quite simply, we are lucky that the department has (largely, I think, thanks to the presence of the MIAP program) over the years maintained a varied collection of legacy computers that could now run/test these works – we may not continue to be so lucky as the years wear on.

The alternative is pretty straightforward: migrate the content on these DVD-ROMs to file-based disk images, and run them through emulators or virtual machines on contemporary computer hardware rather than worn-down, glitchy, eventually-going-to-break legacy machines. But the questions with these kinds of access projects are always, A) has the content really been properly migrated/recreated, and B) does the experience of using the work on contemporary hardware acceptably recreate the experience of the work on its originally-intended hardware. The latter in particular was a question I could not answer on my own – without having seen, interacted with or studied these works in any detail, I did not consider myself in a position to judge whether emulated versions of these works were running as intended, in a manner acceptable for intense, classroom study. Marina and Ina, as scholars of interactive cinema and digital humanities, were in a better position to make an informed decision.

So, my initial goals were:

  • prepare a demo of emulated/virtualized works
  • match each interactive DVD with a legacy computer on which it ran best, for comparison’s sake, or, failing the emulation route, providing access to students

I set aside “Sally or the Bubble Burst”, as its processor/OS requirements put it squarely in the awkward PowerPC + Mac OSX zone that has proven difficult for emulation software and plagued my nightmares in the past. That left three discs to work with, listed here along with the technical requirements outlined in their documentation:

I wasn’t looking to perform bit-for-bit preservation/migration with this project. We still have the discs and their long-term shelf life will be a concern for another day – today I wanted acceptable emulation of the media contained on them. So by Occam’s Razor, I considered Mac’s Disk Utility app to be the quickest and best solution in this case to make disk images for demo and testing.

After selecting a disc in Disk Utility’s side menu, I browsed to Disk Utility’s File menu, selected “New Image” and then “Image from [name_of_disc]”.

Screen Shot 2017-09-29 at 1.43.11 PM.png

I selected the “CD/DVD master” option with no encryption, which, after a few minutes, created a .cdr file. I repeated this three times, once for each disc.

Screen Shot 2017-09-29 at 1.44.01 PM.png
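For the command-line inclined, macOS’s built-in hdiutil tool can make the same kind of image – the device node and output name below are just examples (run “diskutil list” to find your disc’s actual identifier):

```shell
# UDTO is hdiutil's "DVD/CD master" format; the output gets a .cdr extension
# /dev/disk2 is a placeholder -- substitute your disc's device node
hdiutil create -srcdevice /dev/disk2 -format UDTO immemory_disc

# The resulting immemory_disc.cdr can then be mounted like the original disc:
hdiutil attach immemory_disc.cdr
```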

With a .cdr disk image ready for each work, now it was time to set up an emulated legacy OS environment to test them in. I decided to start with Mac OS 9 – an environment I was already familiar with and which matched at least the OS requirements of all three works.

For emulating Mac OS 8.0 through 9.0.4, I’ve had a lot of success with a program called SheepShaver. Going through all the steps to set up SheepShaver is its own walk-through – so I’m not even going to attempt to recreate it here, and instead just direct you to the thorough guide on the Emaculation forums, which is what I use anyway. (the only question generally is, where to get installation discs or disk images for legacy operating systems – we have a number still floating around the department, but I also have WinWorld bookmarked for all my abandoned software needs).

Once I got a working Mac OS 9 computer running in SheepShaver, I could go into SheepShaver’s preferences and mount the disk images I made earlier of “Immemory”, “Bleeding Through” and “Artintact” as Volumes, so that on rebooting SheepShaver, these discs will now appear on the emulated desktop, just as if we had inserted the original physical discs into an OS 9 desktop.

Screen Shot 2017-09-29 at 12.52.23 PM.png

First off I tried “Immemory”, the oldest work which also only required QuickTime v. 4.0 – which is the default version that comes packaged with Mac OS 9. I couldn’t be sure it was running exactly as planned, but the sound and moving images on the menu played smoothly, and I could navigate through the program with ease (well, relative ease – spotting the location of your cursor is often difficult in “Immemory”, but from reading through the instructions that seemed likely to be part of the point).

Screen Shot 2017-09-29 at 12.43.54 PM.png

Screen Shot 2017-09-29 at 12.44.29 PM.png

The next challenge was that “Bleeding Through” and “Artintact” required higher versions of QuickTime than 4.0. How do you update an obsolete piece of software in a virtual machine? First, scour the googles and the duckduckgos some more until you find another site offering abandonware (WinWorld, unfortunately, only offered up the QuickTime 4.0 installer). Yes, you need to be careful about this – plenty of trolls and far more malevolent actors are out there offering “useful” downloads that turn out to be malware. Generally I’m going to be a little more trusting of a site offering me QuickTime 5.0 than one offering QuickTime X – ancient software that only runs on obsolete or emulated equipment isn’t exactly a very tempting lure, if you’re out phishing. But, still something to watch out for. Intriguingly, I found a site called OldApps.com, similar to WinWorld, in that it has a stable, robust interface, a very active community board, and at least offers checksum information for (semi-)secure downloads. Lo and behold, a (I’m pretty sure) safe QuickTime 6.0.4 installer!

Screen Shot 2017-09-29 at 12.59.34 PM.png

With that program downloaded, now I had to get that into the virtual Mac OS 9 environment. Luckily, SheepShaver offers up some simple instructions for creating a “Shared” folder to shuttle files back and forth between your emulated desktop and your real one.

Screen Shot 2017-09-29 at 12.56.19 PM

Screen Shot 2017-09-29 at 12.57.52 PM.png

With the QuickTime 6 installer moved into my virtual environment, I could run it and ta-da: now the SheepShaver VM has QuickTime 6 in Mac OS 9!

Screen Shot 2017-09-29 at 12.53.51 PM.png

This is the point where I admit – everything had gone so swimmingly that I got a bit cocky. With the tech requirements fulfilled and the OS 9 environment set up, I went into the demo session with Marina, Ina and Cathy without having fully tested all three discs myself beforehand on the hardware they were going to run on. And the results were…unideal. The color scheme on the menu for “The Complete Artintact”, supposed to be rendered in bright primary colors, was clearly off:

Screen Shot 2017-09-29 at 12.46.29 PM.png

Audio on “Bleeding Through” played correctly, but there was no video, and the resolution on the menus was all off and difficult to control:

Screen Shot 2017-09-29 at 12.48.58 PM.png

Screen Shot 2017-09-29 at 12.49.27 PM.png

And even “Immemory”, which had run so smoothly at the start, now had clear interruptions in the audio, broken videos, and transitions between slides/pages were clunky and stuttered.

Though Marina came away impressed with the virtual OS 9 environment and the general idea of using emulators rather than the original media to provide access, the specific results were clearly not acceptable for scrutinous class use. Running some more tests and troubleshooting, I came to two conclusions: first, the iMac we were trying to install SheepShaver on in the Study Center was several years old, and probably not funneling enough processing power to the emulated computer to run everything smoothly. But also, I suspect that the OS 9 virtual machine was missing some system components or plugins for the later works (“The Complete Artintact” and “Bleeding Through”), and that the competing requirements (different versions of QuickTime in particular) were causing confusion when crammed together in one virtual environment – in other words, using QuickTime 6 was actually *too advanced* to run “Immemory”, designed for QuickTime 4.

So, solutions:

  • keep “Immemory” isolated in its own SheepShaver/OS 9 virtual machine with QuickTime 4
  • test “The Complete Artintact” and “Bleeding Through” in a virtual Windows machine, for comparison against different default OS components
  • install everything on a brand-new, more souped-up iMac

Success! Kept alone in its own virtual Mac OS 9 machine with QuickTime 4, “Immemory” went back to running smoothly. Using a different piece of emulation/virtualization software called VirtualBox (maintained by Oracle, and designed primarily to run Windows and Linux VMs), and going back to WinWorld and OldApps for legacy installers, I created a Windows 2000 virtual machine running QuickTime 6 for Windows for “The Complete Artintact” and “Bleeding Through” (settings in screenshot):

Screen Shot 2017-09-29 at 12.16.08 PM.png

Installed on new, powerful hardware (2016 iMac running macOS 10.12) that could correctly/quickly funnel plenty of CPU power and RAM to virtual machines, the works now looked “right” to me, and a second demo with Marina and Ina confirmed:

Screen Shot 2017-09-29 at 12.37.41 PM.pngScreen Shot 2017-09-29 at 12.39.34 PM.png

Screen Shot 2017-09-29 at 12.37.57 PM.png

Screen Shot 2017-09-29 at 12.38.35 PM

(The one last hitch: in “The Complete Artintact”, which is really an anthology collection of a number of interactive software works, some of the pieces had glitchy audio. Luckily, this was solved using VirtualBox’s sound settings, switching to a different virtualized audio controller, from “ICH AC97” to “Soundblaster 16”):

Screen Shot 2017-09-29 at 12.19.32 PM.png
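(For what it’s worth, if you ever need to script this rather than click through VirtualBox’s GUI, the same setting can be flipped with the VBoxManage command-line tool – the VM name here is just an example, and the VM has to be powered off first:)

```shell
# Switch the VM's emulated audio controller from Intel AC'97 to SoundBlaster 16
# "Windows 2000" is a placeholder -- use your own VM's name
VBoxManage modifyvm "Windows 2000" --audiocontroller sb16
```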

There were a few more setup steps to make accessing the works easier for the students: creating desktop shortcuts for the virtual machines on the iMac desktop AND the disk images inside the virtual machines (so that students could just click through straight to the work, rather than navigating file systems on older, unfamiliar operating systems); adding an extra virtual optical drive to the Windows 2000 VM so that the VM could be booted up with both “Artintact” and “Bleeding Through” loaded at the same time; and creating a set of instructions and tips for the students to follow regarding navigating these emulators and legacy operating systems (for troubleshooting purposes).

Screen Shot 2017-09-29 at 12.36.35 PM.png

That left legacy testing for backup, as well as the question of “Sally or the Bubble Burst”. At this point time was running short, and emulating “Sally” seemed likely to be a more difficult and prolonged process. Luckily, we had an iMac running Mac OSX 10.6 (Snow Leopard), which includes Mac’s Rosetta software for running PowerPC applications (like “Sally”) on Intel machines. A disk image of Toni Dove’s work runs smoothly on that machine, including speech recognition input via the iMac’s built-in mic.

I did also run “Immemory”, “Bleeding Through” and “The Complete Artintact” on an Apple G4 desktop running OS 9 and QuickTime 6 – for whatever reason, running this combination of discs and software on the original hardware, as opposed to in the SheepShaver VM, did work acceptably. Though at this point we’ve accepted the emulation solution for class access, if at any point anything goes wrong, we can move that G4 from the Old Media Lab to the Study Center and run all the discs (or disk images) on that legacy machine, rather than the squiffy laptop solution that we used for “Sally or the Bubble Burst” a year ago.

So, there is the saga of “Interactive Cinema”. Aside from all this is the concern that the disk images I made for this process don’t really constitute bit-for-bit preservation, and though Marina thought they were all running as intended, these are incredibly broad works and exploring and testing every detail manually was basically impossible. Ultimately, we may want to create forensic disk images off of the CD- and DVD-ROMs to ensure that we’re really capturing all the data and can ensure access to them in the future. But for now…it’s time for me to take a break!

DigiPres Machines

Early this summer, the department learned that a development proposal we had submitted to improve MIAP’s computer hardware had been successfully funded. This was excellent news – our video digitization stations were not the only pieces of equipment that had started lagging a bit and creating choke points around archival projects. But besides breathing new life into some of our existing workflows, this project also presented an exciting new opportunity.

For the past couple of years, we’d had three contemporary (~2013) MacBook Pro laptops available in our digital forensics lab. These were intended less for forensics, really, and more as a general-purpose resource – for MIAP students to use in any course, say, if they forgot their own laptop at home, or their own laptop was having difficulty during a hands-on lab exercise. These were, I think, quite a success, and the laptops continue to be used on the regular. So I ultimately viewed these three computers as a sort of pilot program, justifying a larger fleet of similar laptops that could potentially serve an entire class of 10 or 11 students at the same time (our usual cohort size) rather than just three people.

IMG-3177.JPG

So, in addition to the MacBook Pros already there, we had the chance to add eight more laptops to this supply. Perhaps even more excitingly (for me, anyway), I had the chance to design, from the ground-up, what a “MIAP laptop” should look like – a laptop that, from startup, would be of use to our students in even vaguely digital-related courses… which, yes, is all of them, but especially: Metadata, Digital Literacy, Handling Complex Media, or Digital Preservation. They might also be used for workshops, like my own Talking Tech series, or for other invited guest speakers. What would the foundation for all this look like? What programs would you install to educate and demonstrate a wide variety of digital preservation and administration tasks: disk imaging, metadata extraction, metadata manipulation and transformation, digital forensics, file transfer and packaging, fixity management, on and on… while minimizing the time spent installing and managing software later on?

What follows is a detailed break-down of each component of these machines: hardware, operating system, and software (both graphical and command line), along with, if prudent, what each program does and why it was selected. When applicable I’ll link out so you can download or investigate each component for yourself and your own needs (probably less broad than mine).

Thanks to all those who offered feedback to that initial request, or have inspired these choices via many other professional tasks and avenues.

The Hardware

gallery1_2256

  • 13” MacBook Air, mid-2016
  • 2.2 GHz i7 Intel Core processor
  • 8 GB RAM
  • 512 GB SSD storage

Why:

Apple hardware has a proven durability – the MIAP media lab houses a fleet of legacy/leftover Mac laptops and desktops stretching back to PowerBooks and a Macintosh SE that are (knock on wood) largely still functional. Anecdotally, the majority of our students have come into the program as Mac users, and are therefore already familiar with using Apple hardware.

Also, MIAP’s digital preservation/literacy instruction (as with the fields of digital preservation and open source software development at large) currently favors Unix environments (i.e. command line interfacing via the Bash shell, GNU/Linux and BSD utilities). While the rapid development of the Windows Subsystem for Linux in Windows 10 offers tantalizing opportunities for (cheaper) crossover Unix/Windows education in the near future, lingering quirks in the WSL give Apple hardware the “it just works” advantage when it comes to teaching new students digipres basics. (Pure Linux machines were not considered, assuming students would be unfamiliar and thus uncomfortable with such platforms, and that our graduates, not necessarily going into systems administration, are more likely to encounter Mac and Windows machines in their archive/library/museum work environments.)

The MacBook Air line was chosen over the vanilla MacBook or MacBook Pro flavors of Apple laptops because of connectivity considerations – that is, available data ports. Given that the 2016 (and onward) MacBook and MacBook Pro lines have only USB Type-C Thunderbolt 3 ports, connecting to peripherals, external drives, and video projectors already available in the department would have required a host of new cables and adapters for data transfers and video output  – as well as the need to track the location and use of said adapters. Thus, given its USB (Type-A) and Thunderbolt (Mini Displayport) options, the MacBook Air was considered most flexible in terms of backwards compatibility and most efficient in terms of maintaining current department equipment.

The lack of a built-in optical drive in the MacBook Air line is something of a drawback in terms of potentially demonstrating optical disc imaging. However, built-in optical drives were discontinued on all readily-available Apple laptops several years back, in any case. A handful of USB-attached Apple SuperDrives (CD/DVD capable) were also purchased to balance this potential shortcoming.

Intel i7 processors and at least 8 GB RAM guarantee the ability to run virtual machines or emulation software at a comfortable processing speed (e.g. the minimum recommended requirements for demonstrating a BitCurator VM).

The higher cost of 512 GB of SSD storage (as opposed to the smaller 256 GB model) was considered worth the investment for the ability to move/manage/analyze larger collections and files locally without fear of running out of storage space. In either case, SSDs will require close management to avoid clutter – storage should be inspected and evaluated at the end of each semester (files to be kept moved to the department’s backup storage, excess/test files removed).
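That end-of-semester inspection can start with something as simple as totalling bytes per top-level folder. A rough sketch in Python (the demo directory below is fabricated purely for illustration – point it at a real home directory in practice):

```python
import os

def folder_sizes(root):
    """Return {top-level entry: total bytes} for everything under root."""
    sizes = {}
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry)
        total = 0
        if os.path.isfile(path):
            total = os.path.getsize(path)
        else:
            # Walk the subtree and sum every file's size.
            for dirpath, _dirs, files in os.walk(path):
                for name in files:
                    total += os.path.getsize(os.path.join(dirpath, name))
        sizes[entry] = total
    return sizes

# Self-contained demo directory (a stand-in for e.g. a student home folder):
os.makedirs("home/Downloads", exist_ok=True)
with open("home/Downloads/big.bin", "wb") as f:
    f.write(b"\0" * 1024)

print(folder_sizes("home"))  # prints {'Downloads': 1024}
```

From there it’s a short step to sorting the result and flagging folders over some threshold for review.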

 

The Operating System


  • Mac OSX 10.11 (El Capitan)

Why:

Choosing a Mac OS was a trickier consideration. As of summer 2017, Apple only continues to roll out security updates for OSX 10.10.5 (Yosemite) and above, immediately removing any earlier OS version from consideration – exposing students who may be using these machines for tasks as mundane as web browsing to known OS security flaws is unacceptable. And generally speaking, it is the position of this writer that operating systems on professional workstations should be maintained as up-to-date as possible – beyond the security hazards of using older operating systems on a network (laid bare by the global WannaCry and NotPetya ransomware crises of summer 2017 alone), it anecdotally seems to me that archivists and librarians tend to maintain old systems in order to continue using older, favored software that may or may not encounter issues with newer OS versions. I feel this approach simply ignores the underlying issue of software rot rather than facing it head-on. Rather than allowing workflows to become increasingly unsustainable to the point that they completely break and require the increased effort of redesign around entirely new systems, regular updating identifies problems in software/environment communication *before* they become full-stop crises, and allows for workflows to be regularly tweaked and improved in an ongoing process (particularly when dealing with open-source software, where the issue can be communally raised and addressed) rather than totally reinvented.

That said – such regular updating and testing has identified some persistent issues with macOS 10.12 (Sierra) that make OSX 10.11 (El Capitan) more appealing for the moment. The OS updates in Sierra were largely based around consumer-driven features: the introduction of Siri and Apple Pay to macOS, increased picture-in-picture mode and tabbing in third-party software, improved iCloud sync, etc. These features have made macOS Sierra a much more demanding OS, in terms of processing power, at no perceived benefit to digital archivists or our students. Other new “features” – such as “improved” download validation that can quarantine archived applications (downloaded in .zip, .tar files, etc.) without notifying the user – have proven actively problematic.

Continued rollout of Apple’s new file system (APFS), compatible only with Sierra and above, will be worth keeping an eye on, especially when 10.13 (High Sierra) arrives in fall 2017. But as long as security updates are maintained, the stability of El Capitan (plus increased functionality over Yosemite) makes it, for my money, the preferable option right now for a digipres Mac machine.

Since all new Apple machines come pre-installed with the latest OS (Sierra), this decision required actively downgrading the laptops. Doing this on a brand-new machine, where there is no need to backup user data, is luckily a relatively simple process (and, with solid-state drives, fast), with plenty of thorough instructions available on the internet.

The Software

Graphical User Interfaces (GUIs)

General Operation:


Open source web browser preferred. Given that these laptops were made available to the department of Cinema Studies and the MIAP program specifically under the idea that students should not face any technological barriers to their education in this program, I believe we have a responsibility not to ask students to trade their privacy for that privilege. While we do not control the privacy policies for use or administration of NYU’s network or NYU’s Google Apps for Education, we control this hardware, and as is the case in public libraries, have a responsibility to at least maintain privacy via physical access (i.e. multiple students potentially using the same device) and limit outside tracking of our students to the best of our ability and advocacy.

Thus, our Firefox browsers (set as the OS default browser) come pre-installed with several open-source extensions and preferences designed to enhance web security, limit third-party tracking, and automatically clear personal data like browsing history and passwords.

Hidden scripts activated at regular intervals (after any overnight/project checkout, or once a month, whichever comes first) by an admin account (AppleScripts borrowed/tweaked from the staff at Bobst Library’s Computer Center! Thanks, guys!) also automatically securely delete files from most common user directories: e.g. Downloads, Documents, Desktop, etc.


Open source software preferred. A flexible but very light source code editor; a good introduction for new users to simple text/Markdown/HTML/XML editing.


Open source, flexible hex code viewing/editing.


Robust, proprietary software suite for the creation and manipulation of XML-based data. Used for instruction and homework in Metadata course. Made available via academic licensing.


Open source software preferred. Maintains local compatibility with Google Docs, Sheets, Slides (used by many students and courses via NYU’s Google Apps for Education) for basic word processing, spreadsheets, presentations, etc.


Powerful and flexible program for database design, data management, and cataloging. Used for instruction and a major assignment in Metadata course. Proprietary, licensed; Filemaker’s move to cloud-based download and licensing structure, rather than via physical media (install disc) makes our usual model of temporarily lending students licensed copies of the software for their own computers increasingly untenable to manage. Investigation into a suitable, long-term open-source alternative for the Filemaker assignment is ongoing.


Software for designing, deploying and manipulating MySQL databases. Standard Editions of MySQL-branded software (Server, Workbench, etc.) have been closed/proprietary since being bought by Sun Microsystems (now Oracle), but the foundational software (lacking some features/modules) is currently made available via open-source “community editions” buried on Oracle’s site. Potential alternative to use for the Filemaker assignment in Metadata course, but for now at least offered for students to explore.


Open source software preferred. Flexible media playback software that maintains compatibility with most consumer and even many preservation-grade video formats and codecs for QC and access.


Open source alternative to built-in Mac “Archive Utility” app for opening zipped/archived/compressed files – and more powerful (can extract from a much wider variety of archive formats, e.g. .rar or .7z, and can better handle non-English characters).


Open source GUI for the rsync Unix command-line utility. The Mac port of Grsync hasn’t been actively developed since 2013, but Grsync is a stable and intuitive cross-platform interface that makes crafting rsync scripts easier for beginner or command-line shy users.


FUSE for OSX is an open source file system compatibility layer that allows for read/write mounting of third-party file systems in OSX – combined with the NTFS-3G driver (installed via Homebrew), it allows NTFS (Windows)-formatted disks to be mounted with both read AND write capability on a Mac (out of the box, OSX will usually mount and read/display files on NTFS disks but cannot write/change data on them).


Necessary Java libraries for running some of the Java-based GUI tools listed below.

 

Metadata Characterization/Extraction:

Open source software that extracts and displays technical metadata and text from a huge range of file types (critically, including non-AV material like PDFs, PPTs, DOCs, etc. etc.)

Open source tool for extracting metadata and creating reports on the files contained within disk images. This is a GUI built on top of the brunnhilde.py command line tool; it may also be installed within the BitCurator virtual machine (see below). [Requires some command line components to be installed first – read through the instructions in the link.]

Free and open source software for batch identification of file formats, using the PRONOM technical registry.


Open source software that displays basic technical metadata embedded in a wide variety of video and audio codecs.

 

Transcoding:


Open source video transcoding software. Perhaps most commonly used for ripping DVDs but also useful for creating web or disc-ready access copies of video files.


Alternative free (but not open) video conversion and editing software. Transcodes and demuxes a variety of formats (not, despite its name, just MPEGs), with some more sophisticated clipping/editing options than Handbrake.

 

Virtualization + Forensics:


VirtualBox, by Oracle, is a free and open source virtualization platform, allowing for a variety of contemporary and legacy guest operating systems to be run safely on a modern host. In this instance, VirtualBox is used to run the BitCurator Linux distribution (derived from Ubuntu) – in essence, a comprehensive suite of forensics and data analysis software aimed at processing born-digital materials.

The full list of tools found in the BitCurator environment can be found on their wiki. In addition to the tools on this list, two additional pieces of Linux-only command-line software were installed (via Ubuntu’s CLI package manager, apt-get): dvdisaster, an optical disc data recovery tool; and dvgrab, designed to capture raw DV streams from tape.

Edit (8/9/17): dvgrab is a great tool, but the FireWire connections necessary to hook up to DV tape decks will not pass through VirtualBox into a virtualized OS. So installing dvgrab in this particular setup was essentially moot. But, leaving the mention of it here because it would still be useful for any digipres-oriented machine running a dedicated Linux/BitCurator installation rather than a virtualization.

 

Packaging and Fixity:

The Library of Congress’ free and open source GUI for packaging and transferring files according to the BagIt specification.

Simple GUI tool developed for quick accessioning of digital media (including assigning unique identifiers, basic Dublin Core metadata, and checksum validation) into a repository prior to more detailed appraisal and description.

Utility for regular review and checksum validation of files in a given directory. Can email regular reports to the user on status of files in long-term storage.

 

Validation and Quality Control:

Free and open source tool for adding, extracting and validating metadata within Broadcast Wave format audio files.

Quality control and reporting tool for extracting and examining metadata from DV streams, either during reformatting from tape media or post-capture analysis.

Open source framework for file format identification and validation. (Identifies what format a file purports to be, and confirms whether or not it validly conforms to that format specification)


Extensible, open source policy checker and reporter, currently targeted at validating audiovisual files conforming to preservation-level specs of Matroska (wrapper), LPCM (audio codec) and FFV1 (video codec).


Now a volunteer, open source project (formerly funded by Google and known as Google Refine) for cleaning and transforming metadata.


Free and open source option for in-depth inspection of digital video signals; intended to assist in identification and correction of common errors during digitization of analog video. GUI is aimed more at in-depth analysis of single files (see command line tool for batch processing).

 

Command Line

 


The “missing” package manager for Macs. Allows easy install and use of a huge variety of open source command line software. Unless otherwise noted, all programs below were installed via

[cc lang="Bash"]$ brew install [packagename][/cc]

But first you should

[cc lang="Bash"]$ brew tap amiaopensource/amiaos[/cc]

to allow easy download of a number of useful programs and libraries made available from the Association of Moving Image Archivists’ Open Source Committee!

The Mac OS Terminal application, which is how most users access and use command line software, by default uses version 3.2 of the Bash shell, whereas the most recent stable version of Bash is all the way up to 4.4. The difference won’t be noticeable to the vast majority of users, particularly regarding common digipres tasks, but users or students interested in more advanced bash scripting may want to take advantage of the shell’s newer features. Once Homebrew is installed, updating the Bash shell is a very easy process, outlined in the link above.

 

File Conversion and Manipulation:

  • ffmpeg

Extremely powerful and flexible media transcoding and editing software. Use of ffmprovisr as a guide to just *some* of the possibilities is highly recommended!

  • imagemagick

Similar to ffmpeg, but targeted specifically at transformation and transcoding of still image files.

  • sox

Sound processing tool for transcoding, editing, analyzing, and adding many effects to audio files.

  • jq

Utility specifically for manipulating and transforming metadata in the JSON format.
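To illustrate the kind of reshaping jq does – sketched here with Python’s json module rather than jq itself, and using a made-up report structure (not any real tool’s schema):

```python
import json

# A fabricated JSON report standing in for the kind of output a
# metadata tool might emit (NOT a real schema).
report = json.loads("""
{
  "file": "tape_017.mov",
  "tracks": [
    {"type": "video", "codec": "FFV1", "width": 720, "height": 486},
    {"type": "audio", "codec": "PCM", "channels": 2}
  ]
}
""")

# Equivalent in spirit to a jq filter like:
#   jq '{file: .file, codecs: [.tracks[].codec]}' report.json
summary = {
    "file": report["file"],
    "codecs": [t["codec"] for t in report["tracks"]],
}

print(json.dumps(summary))
# prints {"file": "tape_017.mov", "codecs": ["FFV1", "PCM"]}
```

jq does this kind of extraction and restructuring in a one-line filter at the command line, which is what makes it so handy in batch scripts.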

 

Metadata Extraction/Characterization:

  • exiftool

Reads embedded and technical metadata of media files, especially powerful/helpful with regard to still image formats.

  • mediainfo

Command line tool for displaying and creating reports on basic technical metadata for a wide variety of video and audio formats.

  • siegfried

Signature-based file format identification tool that draws on several established registries (PRONOM, MIME-info). [special installation instructions required; see link for details]

  • tree

Basic utility (ported from Windows) that recursively displays all files and subfolders within a directory in a human-readable format – a nice companion to the basic “ls” command.
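For a sense of what tree automates, the same recursive, indented listing can be approximated in a few lines of Python (the demo directory here is created on the spot so the example is self-contained):

```python
import os

# Build a tiny directory tree for demonstration purposes.
os.makedirs("demo/sub", exist_ok=True)
open("demo/a.txt", "w").close()
open("demo/sub/b.txt", "w").close()

def tree(root, indent=""):
    """Return a tree-style indented listing of root's contents."""
    lines = []
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        lines.append(indent + name)
        if os.path.isdir(path):
            # Recurse into subdirectories with deeper indentation.
            lines.extend(tree(path, indent + "    "))
    return lines

listing = tree("demo")
print("\n".join(listing))
# prints:
# a.txt
# sub
#     b.txt
```

The real tree adds the familiar branch-drawing characters, file counts, and depth limits (`-L`), but the recursive walk is the whole idea.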

 

File transfer:

  • rsync

Fast, versatile tool for securely transferring or syncing files, either remotely (to a server) or locally (between drives).
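rsync’s core trick – skipping files that haven’t changed – can be sketched in Python as a toy illustration (the real tool adds delta transfer, remote hosts, permission handling, deletion, and much more):

```python
import os
import shutil

def sync(src, dst):
    """Copy files from src to dst, skipping any destination file that
    already matches on size and (whole-second) modification time.
    A toy sketch of rsync's quick-check behavior, not a replacement."""
    copied = []
    for dirpath, _dirs, files in os.walk(src):
        rel = os.path.relpath(dirpath, src)
        out_dir = os.path.join(dst, rel)
        os.makedirs(out_dir, exist_ok=True)
        for name in files:
            s, d = os.path.join(dirpath, name), os.path.join(out_dir, name)
            s_stat = os.stat(s)
            if os.path.exists(d):
                d_stat = os.stat(d)
                if (s_stat.st_size == d_stat.st_size
                        and int(s_stat.st_mtime) == int(d_stat.st_mtime)):
                    continue  # unchanged; skip, as rsync would
            shutil.copy2(s, d)  # copy2 preserves the modification time
            copied.append(os.path.join(rel, name))
    return copied

os.makedirs("srcdir", exist_ok=True)
with open("srcdir/f.txt", "w") as f:
    f.write("x")

print(sync("srcdir", "dstdir"))  # first run copies the file
print(sync("srcdir", "dstdir"))  # second run prints [] (nothing changed)
```

That second, empty run is why rsync is so well suited to repeated transfers to backup storage: reruns cost almost nothing.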

  • wget

Allows for downloading files from the internet (via HTTP, HTTPS, or FTP) from the command line.

  • youtube-dl

Command line program for downloading media off of YouTube and other popular media hosting web platforms.

 

Validation and Quality Control:

  • md5deep

Allows for computing, comparing and validating hashes for any file, according to a number of major checksum algorithms (MD5, SHA-1, SHA-256, etc.)
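The hashing itself is nothing exotic – here is a minimal fixity check using Python’s hashlib, with a sample file generated on the spot so the example is self-contained:

```python
import hashlib

# Write a small sample file so the example is self-contained.
with open("sample.txt", "wb") as f:
    f.write(b"digital preservation\n")

def file_hashes(path):
    """Compute md5 and sha256 hex digests for a file, reading in
    chunks so large files never have to fit in memory."""
    md5, sha256 = hashlib.md5(), hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return md5.hexdigest(), sha256.hexdigest()

md5_hex, sha256_hex = file_hashes("sample.txt")

# Fixity check: recompute later and compare against the stored value.
assert file_hashes("sample.txt")[0] == md5_hex
print(md5_hex, sha256_hex)
```

What md5deep adds on top of this is recursion over whole directory trees, comparison against existing manifests, and matching modes – the bookkeeping that makes fixity checking practical at scale.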

  • mediaconch

Command line policy checker and reporter for validating Matroska, FFV1 and LPCM files.

  • qcli

Command line tool for generating QCTools reports. Aimed at batch processing reports for a number of video files at once.

 

Disk imaging:

  • ddrescue

Smart data recovery tool, can be life-saving for dealing with “dead” and failing hard drives.

More advanced versions of AccessData’s Forensic Tool Kit software are available via commercial licensing, but the command line version of their FTK Imager utility for making forensic disk images (raw or E01 format) can be downloaded for free. [Follow the link to the appropriate product and system spec – you will have to provide some basic information and will be emailed a download link.]

  • libewf

Library necessary for handling forensic disk images in the Expert Witness Format (EWF).

 

Miscellaneous:

A package of scripts developed at CUNY TV for many useful tasks and batch processing related to audiovisual files. Utilizes several of the other command line programs mentioned here, including ffmpeg. Follow the link for installation instructions and more details on all the scripts available and what they do!

  • ntfs-3g

A read/write NTFS file system driver. When FUSE for OSX is installed, you can use the ntfs-3g command to manipulate files on NTFS (Windows)-formatted disks on Mac OS.

Python packages:

Python is a very common programming language, and version 2.7.10 comes pre-installed on Mac OS. You can install some Python-specific tools and packages that are not available via Homebrew, but you will need Python’s own package manager – a tool called “pip”. Follow the link above for instructions on adding pip to the Python installation on your computer; once that is successful, you can install each of the following three packages by typing:

[cc lang="Bash"]$ pip install [packagename][/cc]

  • bagit

The currently-favored command line implementation of the Library of Congress’ bagit software for packaging files according to the BagIt specification. [See this post for some more context on what that is all about.] Invoke in the command line using [cc lang="Bash"]$ bagit.py [flags+arguments][/cc]
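For context on what bagit.py actually writes to disk, here is the bag layout sketched with only the Python standard library – a deliberate simplification (the real tool also writes bag-info.txt, tag manifests, fetch files, and so on; use bagit.py for actual work):

```python
import hashlib
import os

def make_bag(src_files, bag_dir):
    """Create a minimal BagIt-style bag: the payload under data/,
    a bagit.txt declaration, and an md5 payload manifest.
    (Simplified sketch of the spec, not a replacement for bagit.py.)"""
    data_dir = os.path.join(bag_dir, "data")
    os.makedirs(data_dir, exist_ok=True)
    manifest_lines = []
    for name, payload in src_files.items():
        # Write each payload file into data/ and record its checksum.
        with open(os.path.join(data_dir, name), "wb") as f:
            f.write(payload)
        digest = hashlib.md5(payload).hexdigest()
        manifest_lines.append("%s  data/%s" % (digest, name))
    # The bag declaration identifies this directory as a bag.
    with open(os.path.join(bag_dir, "bagit.txt"), "w") as f:
        f.write("BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
    # The manifest lets a receiver re-verify every payload file.
    with open(os.path.join(bag_dir, "manifest-md5.txt"), "w") as f:
        f.write("\n".join(manifest_lines) + "\n")

make_bag({"report.txt": b"hello bag\n"}, "mybag")
print(sorted(os.listdir("mybag")))
# prints ['bagit.txt', 'data', 'manifest-md5.txt']
```

The whole point of the spec is visible here: a bag is just a directory with a declaration and checksums, so any receiving system (or a person with a checksum tool) can verify the payload arrived intact.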

  • brunnhilde

Command line tool for extracting metadata and creating reports on the files within disk images. Requires the “siegfried” tool to have already been installed via Homebrew. This utility must be installed to run the Brunnhilde GUI app. Invoke in the command line using [cc lang="Bash"]$ brunnhilde.py [flags+arguments][/cc]

  • opf-fido

Package name for FIDO (Format Identification for Digital Objects), a command line tool for…uh… identifying file formats of digital objects. Uses the PRONOM registry. Invoke by just using [cc lang="Bash"]$ fido [flags+arguments][/cc]

 

For fun!

Just to add a little more flavor to your command line education. Try ’em and find out.

  • cowsay
  • ponysay
  • gti
  • sl
  • cats [requires some additional installation – follow instructions in the link!]

 

 

* It’s a practical reality that basically every grad student coming into the program already owns their own laptop, and would generally prefer to use their own machine during class – especially since in-class work frequently ties into take-home assignments, group work, etc. I therefore have not yet designed any sort of official loan policy for these laptops to leave the classroom. But it has also proven true, especially in our Digital Literacy and Handling Complex Media classes, that for certain lab exercises it really, really helps classroom prep to have a guarantee that everyone in the room will be on the same page.