Metadata madness

Yesterday I woke up sucking a lemon.

It seems that UTIs are in the news again. It all started with a change in application binding in Snow Leopard. In a scant few weeks it’s degenerated into a sometimes-angry bout of cross-blog debate. I have an opinion about the changes in Snow Leopard, and I’ll get to that eventually, but my main goal is to clarify the issue. It’s really not that complicated, and seeing all the confusion on the web has been disheartening.

Type and creator

At issue are two pieces of file metadata: type and creator. (If you don’t know what “metadata” means in this context, you can read all about it and come back when you’re done.) I was going to explain these using an analogy, but screw it, this should be simple enough to understand directly.

Let’s start with the simplest one: creator. This is the application that created a file. Presumably, all files were created by some application, but if not, or if the creating application is unknown, then creator metadata may not be present. But if creator information is present, this is what it’s expressing: “This file was created by application XYZ.”

A file’s type describes the structure of the content. Is it a JPEG image? A plain text file? An MP3 audio file? It’s possible to get confused about this piece of metadata because some file types are proprietary and are named after their creating application, e.g., “Adobe Photoshop document” and “Microsoft Word document.” Those are file types because they’re names for particular arrangements of data within a file. “File format” is a synonymous term that emphasizes the “internal structure” aspect of file type.

So that’s the abstract. Now for the concrete. How are type and creator expressed? In classic Mac OS, each was represented by a four-byte value usually expressed as four characters. These are the fabled “type and creator codes.” For example, TEXT is a type code for text files and ttxt is the creator code for the TeachText application. Type and creator codes are stored in the file system metadata structures of HFS/HFS+ volumes alongside things like creation and modification dates. (Mac OS X emulates their storage on non-HFS/HFS+ file systems.)
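
To make “a four-byte value usually expressed as four characters” concrete, here is a minimal Python sketch that packs a code into a 32-bit integer and back again. The helper names are made up for illustration, and the choice of MacRoman encoding and big-endian byte order is an assumption about how these codes were conventionally handled, not something taken from Apple’s headers.

```python
def fourcc_to_int(code: str) -> int:
    """Pack a four-character code such as 'TEXT' into a 32-bit integer."""
    raw = code.encode("mac_roman")  # assumption: classic codes are MacRoman text
    if len(raw) != 4:
        raise ValueError(f"expected exactly four bytes, got {len(raw)}")
    return int.from_bytes(raw, "big")  # assumption: big-endian byte order


def int_to_fourcc(value: int) -> str:
    """Unpack a 32-bit integer back into its four-character form."""
    return value.to_bytes(4, "big").decode("mac_roman")


if __name__ == "__main__":
    # The type and creator codes mentioned in the text above.
    for code in ("TEXT", "ttxt"):
        packed = fourcc_to_int(code)
        print(f"{code!r} -> 0x{packed:08X} -> {int_to_fourcc(packed)!r}")
```

Note that the codes are case-sensitive: TEXT and text pack to different integers, which is part of why four characters went further than you might expect.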

With the advent of Mac OS X, Apple introduced a new way to identify applications, and therefore a new way to express a file’s creator. Each Mac OS X application has a “bundle identifier” (also sometimes called “application id”) in the form of a Java-style reverse-DNS string. For example, the bundle identifier for iTunes is com.apple.iTunes. This is a lot more expressive than classic Mac OS creator codes and has a much lower chance of accidental name collisions.

In Mac OS X 10.4 Tiger, Apple introduced a new way to identify file types: Uniform Type Identifiers, or UTIs. Since file types are all about the structure of the data itself, UTIs are also used to identify data in memory (e.g., on the clipboard). I wrote extensively about UTIs back in 2005.

To review, we now have two ways of expressing a file’s type, type codes and UTIs, and two ways of identifying a file’s creator, creator codes and bundle identifiers.

Application binding

Now we come to the crux of this debate: application binding. That is, what happens when I double-click on a file in the Finder? How does the OS decide which application to use to open the file? The part of Mac OS X that makes this decision is called Launch Services.

Launch Services decides which application to use to open a given file based on an application binding policy. This policy can technically be based on anything: time of day, phase of the moon, the first letter of the file name, etc. But let’s say we want the application binding policy to take into account a file’s type and creator. Given a file on disk, how do we get that information?

We’ve already established that Mac OS X has at least two different ways of expressing a file’s type and creator, so we’re immediately faced with the task of prioritizing those representations. If Launch Services can get both a four-byte classic Mac OS type code and a UTI for a given file, which should it use? The values themselves will be different (e.g., TEXT versus public.plain-text) so it’s not even clear if they agree, let alone if one is more precise than the other. The situation is similar with creator codes and bundle identifiers.

There is no one “right” answer. It comes down to an old-fashioned policy decision. A well-chosen policy will do what most people expect, and a poorly chosen policy may be confusing and even frustrating. But in the end, Apple is faced with the daunting task of matching a single policy with thousands of different mental models of “how computers should work.”

For the sake of argument, and since we’re all “technology enthusiasts” here, let’s decide that Launch Services is going to ignore classic Mac OS type and creator codes and instead only deal with file type and creator information in the form of UTIs and bundle identifiers, respectively.

Given a file on disk, Mac OS X provides a way to get a UTI representation of that file’s type. This may lead you to believe that, like the classic Mac OS type codes, there is a storage location somewhere in the file system for the UTI of each file. This is not the case. Instead, Mac OS X derives the UTI from other information, primarily—and tragically—from the file name extension.
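
Here is a rough Python sketch of what “deriving” a type from a file name amounts to. The lookup table is hand-written and hypothetical (the real system builds its mapping from type declarations, not from a hard-coded dictionary), and the function name and the public.data fallback are assumptions made for illustration.

```python
import os

# Hypothetical, hand-written table for illustration only. The UTI strings are
# real identifiers, but the actual mapping is maintained by the system.
EXTENSION_TO_UTI = {
    "txt": "public.plain-text",
    "jpg": "public.jpeg",
    "jpeg": "public.jpeg",
    "mp3": "public.mp3",
}

FALLBACK_UTI = "public.data"  # assumption: generic type when nothing matches


def derive_uti(filename: str) -> str:
    """Guess a UTI from nothing but the file name extension."""
    _, ext = os.path.splitext(filename)
    return EXTENSION_TO_UTI.get(ext.lstrip(".").lower(), FALLBACK_UTI)


if __name__ == "__main__":
    for name in ("notes.TXT", "holiday.jpeg", "holiday.txt", "README"):
        print(f"{name} -> {derive_uti(name)}")
```

The weakness is visible in the example names: rename holiday.jpeg to holiday.txt and the derived type changes, even though not a single byte of the file’s content has.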

But let’s move on, because it only gets worse. How do we get the bundle identifier of the application that created a file? Like UTIs, there is no dedicated per-file storage location in the file system for this information. Unlike UTIs, there is also no way to derive this information from a file on disk. It’s just plain absent.

Moreover, if any policy is to be based on these particular representations of file type and creator, there must be some way for a newly created file to be assigned these values. Since UTIs are primarily derived from filename extensions, applications have a way—albeit a heinous and barbaric one—to add this information to a file. But there is no Mac OS X API for assigning the bundle identifier of the creating application to a file.

Well, so much for our forward-looking Launch Services application binding policy. Though we can get file type information in the form of a UTI, it’s not stored directly, and its derivation is based on another policy that may change independently. And we can’t get the file creator information in the form of a bundle identifier at all.

This situation at least partially explains the historic Mac OS X application binding policy that referenced all available information: type and creator codes, file name extensions, and even a dedicated, per-file “ultimate override” field. The “legacy” type and creator codes may have been deprecated since Mac OS X 10.0, but when they’re the most reliable data available on disk, they tend to win by default.

The big change in Snow Leopard is that Launch Services no longer references file creator metadata at all in its application binding policy. There remains no way to assign or retrieve the bundle identifier of the application that created a file, and Launch Services no longer looks at the classic Mac OS creator code.

My opinion of this decision aside, it’s important that people understand the parameters of the debate. There are three very distinct things at play here. There are the abstract concepts of file type and creator. What kind of data does this file contain? Which application created this file? Then there’s the concrete representation of this information: classic Mac OS type and creator codes, UTIs, and bundle identifiers. Finally, there’s the Launch Services application binding policy, which may or may not reference any of these pieces of information when determining which application to use to open a file when it’s double-clicked in the Finder.

Let me repeat that list because it’s important: file type and creator as abstract concepts, their concrete representations, and the application binding policy based on them. Concept, representation, policy. These are three separate things. Conflating them leads to misplaced anger, unreasonable demands, and unhelpful recommendations.

Everything in its right place

Here’s my take on the Snow Leopard changes, application binding, and file metadata in Mac OS X in general.

  1. The classic Mac OS representations of file type and creator information should be retired. Those four-byte strings are vestiges of a bygone era when memory and disk space parsimony were more important than clarity and flexibility. The APIs for getting and setting these values and the file system storage locations for them can remain indefinitely, but their use should be heavily discouraged and removed from all Apple frameworks and applications.

  2. Apple should add an official storage location for file creator information stored as a bundle identifier, plus APIs to set and get this value. All Apple frameworks and applications should set this value appropriately when creating or modifying a file.

  3. Apple should add an official storage location for file type information stored as a UTI, plus APIs to set and get this value. All Apple frameworks and applications should set this value appropriately when creating or modifying a file.

  4. The default Launch Services application binding policy should reference both file type and creator metadata, preferring the “modern” representation of each and falling back to the classic Mac OS representations in their absence. The per-file “ultimate override” setting should be considered first. If that is not set, then the file should be opened by the application that created it. If that application is not installed, then the file should be opened by the default application assigned to that file type.

  5. The Launch Services application binding policy should be configurable. Apple could provide a preference pane to do this, or it could merely expose the policy in the form of a property list and let third-party developers create a friendly interface.

  6. The storage location for all of the above—type, creator, per-file application binding override—should be in appropriately named extended attributes.
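
The lookup order in recommendation 4 can be sketched in a few lines of Python. Everything here is hypothetical: the FileMetadata fields, the choose_app function, the lookup tables passed to it, and the com.example bundle identifiers are all invented for illustration. This is a model of the proposed policy, not a description of how Launch Services works.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class FileMetadata:
    """Hypothetical per-file metadata; any field may be absent."""
    override_app: Optional[str] = None       # per-file "ultimate override"
    creator_bundle_id: Optional[str] = None  # modern creator representation
    creator_code: Optional[str] = None       # classic four-character creator
    uti: Optional[str] = None                # modern type representation
    type_code: Optional[str] = None          # classic four-character type


def choose_app(
    meta: FileMetadata,
    installed: Set[str],
    creator_code_to_app: Dict[str, str],
    type_code_to_uti: Dict[str, str],
    default_app_for_uti: Dict[str, str],
) -> Optional[str]:
    """Return the bundle identifier of the app that should open the file."""
    # 1. The per-file override is considered first.
    if meta.override_app is not None:
        return meta.override_app

    # 2. Creator: prefer the bundle identifier, fall back to the classic code.
    creator = meta.creator_bundle_id
    if creator is None and meta.creator_code is not None:
        creator = creator_code_to_app.get(meta.creator_code)
    if creator is not None and creator in installed:
        return creator

    # 3. Type: prefer the UTI, fall back to the classic type code.
    uti = meta.uti
    if uti is None and meta.type_code is not None:
        uti = type_code_to_uti.get(meta.type_code)
    if uti is not None:
        return default_app_for_uti.get(uti)
    return None


if __name__ == "__main__":
    defaults = {"public.plain-text": "com.example.Viewer"}
    meta = FileMetadata(creator_bundle_id="com.example.Editor",
                        uti="public.plain-text")
    # Creating app installed: it opens its own file.
    print(choose_app(meta, {"com.example.Editor"}, {}, {}, defaults))
    # Creating app missing: fall through to the default app for the type.
    print(choose_app(meta, {"com.example.Viewer"}, {}, {}, defaults))
```

Because the policy is just an ordered series of checks over stored metadata, making it configurable (recommendation 5) would mostly be a matter of letting the user reorder or disable the steps.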

Most of these are not new recommendations. I’ve wanted the officially blessed representation of file type information in Mac OS X to be something, anything, other than file name extensions since the earliest days of Mac OS X’s development. When UTIs arrived in Tiger, they were so clearly superior to both file name extensions and classic Mac OS type codes that it was all the more tragic that they remained a derived, in-memory concept only.

The change in application binding policy in Snow Leopard pales in comparison to these earlier, more fundamental sins. Policy decisions can be reversed instantly in a point release. Official representations, storage locations, and APIs for getting and setting file metadata take much longer to create, document, and achieve widespread adoption. Poor early decisions in these areas are now coming home to roost. Apple clearly wants to close the door on type and creator codes, and I don’t blame them. But that can’t be done cleanly without putting their modern replacements on equal footing in terms of storage and APIs.

As for policy decisions, Apple continues to have a tin ear in this area. At best, we can hope for some timely backpedaling. At worst, not enough people care and we’ll be stuck with a crappy, non-configurable application binding policy for years to come.

Anyway, I hope we’re all on the same page now. You may disagree vehemently with one or more of my recommendations, and that’s fine. As long as your desires and frustrations are expressed in the context of some minimal understanding of the concepts described above, I will declare victory.

This article originally appeared at Ars Technica. It is reproduced here with permission.