Automated classification: in response to a question about Teragram Categorizer, which combines automatic classification with rule-based classification, Seth Earley wrote the following on the SIGIA-L list. The list, it seems, isn't being archived anymore, which is why I'm reposting it here:
"IBM spent a lot of time and energy developing Discovery Server which was supposed to do clustering, automatic categorization and taxonomy generation. The terms were machine generated and needed intervention by human indexers. The algorithms were supposed to learn from changes to categories and manual reindexing but this process tended to poison the algorithms. Training sets needed to be very large and have good data. I co authored a book about the technology (with Wendi Pohs of IBM). The technology was largely abandoned but some of the DNA is now part of IBM's Omni Search."
(I couldn't find the book referenced.)
The common wisdom among information architects has always been that automated classifiers can be useful, but only if your data is fairly clean and structured (news articles are, an intranet usually isn't), or if you put in a lot of work developing rules. Does this still hold, or has the technology evolved?