The structure of content and metadata

I was explaining what a database is to my gf the other day (she got it in about 90 seconds), and I've been thinking about structured content a lot lately. Here are some thoughts (some extremely basic).

Unstructured content.
Anything you type in your word processor program is structured content, unless you are one of those thousand monkeys typing for a thousand years (and even then). It is structured because you write sentences and paragraphs, you draw relationships in your head between the things you write, and you assign meaning. The problem is that the computer doesn't know that - it doesn't know which part is the title of the piece, for example - so it can't do much with this structure in your (or the readers') head. That's why we call this unstructured content: it is unstructured for the computer. Most of the following structures are attempts to structure things for computers. None of these structures (even ontologies) can capture all the finesse and subtleties of the structures in our heads (most structure exists in our heads, not in the world out there).

Metadata.
Not just data about data, metadata is really defined by its use. If you use it like metadata, it is metadata. Sometimes data can be metadata in some circumstances, and plain data in other circumstances.

An ordered list
An ordered list consists of elements in a certain order. Separating things (into elements) is useful for the computer, because now it knows there is more than one thing, not just one big Word file, and it can start doing nice things for you, like sorting these elements alphabetically. The computer remembers the order you put things in even though it doesn't understand that order (it's not anything logical like an alphabetical ordering), because the order may have meaning to you (the most important things are at the top).
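To make that concrete, here is a minimal Python sketch. The to-do items are made up for illustration; the point is only that the computer can re-sort the elements while still keeping the order you typed them in.

```python
# Hypothetical to-do items, in the order they were typed
# (most important first).
items = ["pay rent", "call the bank", "buy milk"]

# Because the items are separate elements, the computer can sort them...
alphabetical = sorted(items)

# ...while the original list still remembers your own order,
# even though the computer has no idea why "pay rent" comes first.
most_important = items[0]

print(alphabetical)
print(most_important)
```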

Databases.
A relational database is different from a list, not only in that you can manipulate things (sort them alphabetically, say) more easily than if you were to just type a list in a word processor, but also in that things are related to each other (i.e., a person is related to an address, or a product is related to a price, so you can list all products of a certain price, for example).
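The "list all products of a certain price" example can be sketched with Python's built-in sqlite3 module. The table names, column names and data below are assumptions made up for this illustration, not a recommended schema.

```python
import sqlite3

# An in-memory database with two related tables (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE prices (product_id INTEGER REFERENCES products(id), amount INTEGER)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [(1, "pen"), (2, "notebook"), (3, "eraser")],
)
conn.executemany(
    "INSERT INTO prices VALUES (?, ?)",
    [(1, 2), (2, 5), (3, 2)],
)

# Follow the relationship from price back to product:
# which products cost exactly 2?
rows = conn.execute(
    "SELECT p.name FROM products p "
    "JOIN prices pr ON pr.product_id = p.id "
    "WHERE pr.amount = ? ORDER BY p.name",
    (2,),
).fetchall()
cheap = [name for (name,) in rows]
print(cheap)
```

The sorting is still there (ORDER BY), but the interesting part is the JOIN: that is the relationship doing the work.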

XML
An XML document also has structure, but isn't particularly relational. Imagine an article with a title, an author, a header, an introduction and the main body (divided in paragraphs). If you put tags (much like HTML tags) around all of those (and follow some rules), you have XML. XML is great at structuring content (and even better at exchanging stuff between applications), but not particularly good at relating content like a relational database does. Structure and relationships are the two basic elements we are discussing: they are different things. Get your head around them.
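A small sketch of the article example, using Python's standard xml.etree.ElementTree module. The tag names and the author value are assumptions chosen for illustration; XML lets you pick whatever tags suit your content.

```python
import xml.etree.ElementTree as ET

# A hypothetical article, with tags around each part.
doc = """<article>
  <title>The structure of content and metadata</title>
  <author>Example Author</author>
  <intro>Some thoughts on structured content.</intro>
  <body>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</article>"""

root = ET.fromstring(doc)

# Now the computer does know which part is the title...
title = root.findtext("title")

# ...and which parts are the paragraphs of the main body.
paragraphs = [p.text for p in root.findall("body/p")]

print(title)
print(paragraphs)
```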

Ontologies
Ontologies really expand the relational model: not only are things related in complex ways, they are related in different ways: there is more than one type of relationship. You don't just draw an arrow between Peter and information architecture, you say: "Peter has as profession information architecture". Once you build a complex model like this, complex programs can take advantage of that information, and, for example, if you are looking for information architects, the program would know that Peter is the person you should talk to. The problem with ontologies is that they are so darn, well, complex. They are hard to get your head around, hard to create and especially hard to write programs for because they are so flexible. Many ontologies start by setting limits: someone decides to only use these relationships and these types of elements. Often this is directly related to how this information will be used in the interface, although purists say that you should create an ontology without worrying how the information will be used (I disagree).
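One common way to picture typed relationships is as (subject, relationship, object) triples. The sketch below is a toy, not a real ontology language; the relationship names and the second person are invented for the example.

```python
# Each fact names the type of relationship, not just "an arrow".
# "Ann" and the relationship names are hypothetical.
triples = [
    ("Peter", "has_profession", "information architecture"),
    ("Ann", "is_interested_in", "information architecture"),
    ("Ann", "has_profession", "graphic design"),
]

def subjects(relationship, obj):
    """Return everyone connected to obj by this specific relationship type."""
    return [s for (s, r, o) in triples if r == relationship and o == obj]

# Looking for information architects: only the "has_profession"
# relationship counts, so Ann (merely interested) is not returned.
architects = subjects("has_profession", "information architecture")
print(architects)
```

With a plain untyped arrow, Peter and Ann would look the same to the program. The relationship type is what lets it tell them apart.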

Topicmaps.
A topicmap is a structure in which you can build ontologies. A topicmap provides a standard structure (topics, relationships and occurrences) and a technical environment (an XML language to express your topicmaps, a query language, tools, ...). So it is easier to build ontologies with topicmaps because a lot of the complex, hard work has been done already. A key advantage of topicmaps is that they have merging capabilities built in - a very useful feature. Topicmaps are cool, but haven't taken off in a big way yet. I believe that will change within a year or two, although the fact that topicmaps decrease lock-in effects means that adoption by vendors of corporate technology will be problematic.
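To give a feel for why merging is useful, here is a very loose sketch. It is not the topicmap standard's actual merging procedure, just the general idea that two topics describing the same subject collapse into one. The identifiers, names and URLs are all made up.

```python
def merge(map_a, map_b):
    """Combine two toy topicmaps; topics with the same subject id become one."""
    merged = {}
    for topic_map in (map_a, map_b):
        for subject_id, topic in topic_map.items():
            target = merged.setdefault(
                subject_id, {"names": set(), "occurrences": set()}
            )
            target["names"] |= topic["names"]
            target["occurrences"] |= topic["occurrences"]
    return merged

# Two hypothetical maps, built by different people, that both mention Peter.
map_a = {
    "person:peter": {
        "names": {"Peter"},
        "occurrences": {"http://example.org/peter"},
    },
}
map_b = {
    "person:peter": {
        "names": {"P."},
        "occurrences": {"http://example.org/blog"},
    },
    "topic:ia": {"names": {"information architecture"}, "occurrences": set()},
}

merged = merge(map_a, map_b)
print(sorted(merged))
print(sorted(merged["person:peter"]["names"]))
```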

Taxonomy.
Taxonomy is a word that is used differently by people with different backgrounds, so in this discussion I will use it in a generic way. A taxonomy is a tree-ish structure in which you can put metadata. Not the nicest of definitions, I realize :)

Topics/terms/nodes.
In a taxonomy, you have terms/topics/nodes. A topic is something that exists, but can have different words to describe it. A term is a word (or more than one word). A node is a term used by programmers to describe any point on the tree (a node with no children is called a leaf, another term used by programmers).

A tree taxonomy.
A tree taxonomy is the structure most websites are organized in. All nodes (except the root at the top) have exactly one parent.
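A tiny sketch of a site tree, stored as a child-to-parent mapping. The section names are hypothetical. Because every node has at most one parent, there is exactly one path from the top to any node, which is what a breadcrumb trail shows.

```python
# Each node points to its single parent; the root has none.
parent = {
    "Home": None,
    "Products": "Home",
    "Support": "Home",
    "Laptops": "Products",
}

def path_to(node):
    """Walk up the tree to the root and return the path from the top down."""
    trail = []
    while node is not None:
        trail.append(node)
        node = parent[node]
    return list(reversed(trail))

breadcrumb = path_to("Laptops")
print(" > ".join(breadcrumb))
```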

A polyhierarchical taxonomy.
A tree where nodes can have more than one parent. These structures are often necessary when creating large trees in which to organize things - such is the nature of classification (Yahoo is an example). Polyhierarchy means you can classify things better, so it will be easier for people to find stuff, but it is somewhat harder to implement (both in the backend code and in the interface).
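The change in the backend is small but real: each node now keeps a list of parents instead of one. A minimal sketch, with made-up directory categories:

```python
# Each node lists all of its parents (hypothetical categories).
parents = {
    "Directory": [],
    "Computers": ["Directory"],
    "Business": ["Directory"],
    "Software companies": ["Computers", "Business"],
}

def ancestors(node):
    """Collect every category above a node, following all parent links."""
    found = set()
    stack = list(parents[node])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found

above = ancestors("Software companies")
print(sorted(above))
```

Note that there is no longer a single path back to the top, which is part of why the interface gets harder: a breadcrumb now has to pick one route out of several.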

A faceted taxonomy.
A faceted taxonomy consists simply of multiple tree taxonomies, used together, with the rule that the individual taxonomies should be exclusive, i.e. that a topic/term in one facet cannot possibly belong to another facet. Faceted taxonomies happen to be one of the structures that are extremely useful on the web, because we have found ways to build interfaces around them that people find easy to use.
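A sketch of faceted browsing, simplified so that each facet is a flat list of values rather than a full tree. The recipes and facet names are invented for the example.

```python
# Each item gets one value per facet (hypothetical data).
# The facets are exclusive: "main" can never be a cuisine,
# and "Spanish" can never be a course.
recipes = [
    {"name": "gazpacho", "course": "starter", "cuisine": "Spanish"},
    {"name": "paella", "course": "main", "cuisine": "Spanish"},
    {"name": "lasagne", "course": "main", "cuisine": "Italian"},
]

def browse(items, **facets):
    """Return the names of items matching every selected facet value."""
    return [
        item["name"]
        for item in items
        if all(item[facet] == value for facet, value in facets.items())
    ]

# Facets can be combined in any order, or used on their own.
spanish_mains = browse(recipes, course="main", cuisine="Spanish")
all_mains = browse(recipes, course="main")
print(spanish_mains)
print(all_mains)
```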

Classification.
The act of saying: "This thing belongs to this category (for example subject, or location)". Classification is subtly different from assigning properties, where you say: "This thing has this property (for example creation date)". There is some overlap between classification and assigning properties (you could say assigning an author is giving something a property, or classifying it as being written by that author).

Classification systems.
Most websites will have multiple taxonomies used for various purposes. The combination of all these is called a classification system.

A controlled vocabulary.
We are going to get subtle for a second: a controlled vocabulary isn't so much about classifying things or assigning properties. It deals with the things within the system, called terms. Terms are words (or groups of words). That's why it's called a vocabulary. A CV controls the use of terms. There are various types of CVs. A simple example is the synonym ring: Term A = Term B = Term C. As you can see, it is a simple structure that controls the use of these terms. You can make this more complex by saying: Term A is preferred (you should use that instead of the other terms). CVs are often used to improve search engines.
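A sketch of a synonym ring with a preferred term, and of how a search engine might use it. The terms are made up, and real search engines handle this in their own ways; this only shows the idea.

```python
# Term A = Term B = Term C (a hypothetical synonym ring).
synonym_ring = {"sofa", "couch", "settee"}

# The more complex version: one term in the ring is preferred.
preferred = "sofa"

def normalize(term):
    """Replace any term in the ring by the preferred term."""
    t = term.lower()
    return preferred if t in synonym_ring else t

def expand_query(term):
    """For search: look for every term in the ring, not just the one typed."""
    t = term.lower()
    return sorted(synonym_ring) if t in synonym_ring else [t]

print(normalize("Couch"))
print(expand_query("settee"))
```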

A thesaurus.
An even more complex CV. A classic thesaurus has this structure: central is a preferred term, which can have a parent (a 'broader term'), siblings ('variant terms'), children ('narrower terms') and related terms. Some people think this type of thesaurus is the end-all of CVs, but you can keep expanding the types of relationships: you could define what types of related terms exist. You could add types of variant terms (acronym, Latin name (when doing species), ...). At some point, you'd realize you need the ability to keep defining different types of relationships in your model, and you would have created an ontology.
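The classic structure described above fits in a small dictionary. This is a sketch with invented terms; the key names are my own labels for the four relationship types.

```python
# One entry per preferred term, with the four classic relationship types.
thesaurus = {
    "dogs": {
        "broader": ["mammals"],
        "narrower": ["terriers", "poodles"],
        "variants": ["canines"],
        "related": ["dog training"],
    },
}

def lookup(term):
    """Return the preferred term for a preferred or variant term, else None."""
    for preferred, entry in thesaurus.items():
        if term == preferred or term in entry["variants"]:
            return preferred
    return None

# Someone types a variant term; we land on the preferred term
# and can offer its narrower terms.
found = lookup("canines")
print(found, thesaurus[found]["narrower"])
```

Adding a new kind of relationship here means adding another key to every entry, which hints at where the road to ontologies begins.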

Yeah baby.
If you read this far, kudos to you. There are many types of structure and relationships that we can use to design websites. Which ones you choose depends on how you are going to use them. Most structures mentioned above have been found to be useful for web development (ontologies are still rare). There is still a lot of work to be done to identify the best structures for web design, to develop interfaces for them and to develop efficient ways of populating them.

By the way, after finishing this I realized I had been inspired by Victor's excellent metadata glossary :)

# Jun 7, 2003