Using taxonomies and ontologies to improve blog search

One of the challenges in reaching a larger audience is that potential readers might use a variety of words to search for the same thing. Semantic processing techniques that leverage taxonomies and ontologies create rich metadata that can help grow readership and make it easier for people to find relevant information on a blog.

The WordPress platform has plugins for custom taxonomies, which make it easier to create tags and labels for content. But much richer semantic processing is available from various tools that can be run in house or in the cloud, from companies like WAND Inc., AlchemyAPI, and SAS Institute.

Elliot Turner, CEO & Founder of AlchemyAPI, noted that people outside of academic circles tend not to draw major distinctions between the terms “taxonomy” and “ontology,” and use them interchangeably. But he clarified:

“The reality is that they are fairly different. Taxonomies subdivide a topical area in increasingly granular ways via a hierarchical structure. Ontologies can offer a much broader perspective, incorporating axioms, rules, and many other types of information. Ultimately publishers use taxonomies and ontologies to organize content into topical hierarchies that allow people to more easily find the information that matters to them.”

Taxonomies complement ontologies

Mark Leher, COO at WAND, Inc., said,

“Taxonomies and ontologies are similar in that they are both sets of concepts within a given topical area that are used to organize information. They can both be thought of as data models to organize information. Taxonomy is a simpler model. Ontology is a more complex model.”

Taxonomies are structured as a tree, with concepts connected as either broader or narrower concepts. For example, in a taxonomy of furniture, “Bedroom Furniture” would have narrower terms (or child terms) of “Beds” and “Nightstands.” “Beds” would have a further child term of “Sleigh Beds.”
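To make the tree idea concrete, here is a minimal Python sketch of the furniture example. The dictionary layout and the function names are assumptions made for illustration, not how any particular taxonomy tool stores its data.

```python
# Hypothetical furniture taxonomy: each term maps to its broader (parent) term.
# The top-level term has no parent.
BROADER = {
    "Bedroom Furniture": None,
    "Beds": "Bedroom Furniture",
    "Nightstands": "Bedroom Furniture",
    "Sleigh Beds": "Beds",
}


def narrower_terms(term):
    """Return the direct children of a term, sorted for stable output."""
    return sorted(child for child, parent in BROADER.items() if parent == term)


def path_to_root(term):
    """Return the chain from a term up to its top-level ancestor."""
    path = []
    while term is not None:
        path.append(term)
        term = BROADER[term]
    return path


print(narrower_terms("Bedroom Furniture"))  # ['Beds', 'Nightstands']
print(path_to_root("Sleigh Beds"))  # ['Sleigh Beds', 'Beds', 'Bedroom Furniture']
```

Storing only the broader link keeps the structure a strict tree: every term has at most one parent, and the narrower terms can always be derived from it.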

Leher said ontologies allow for more complex modeling of the relationships between the concepts, well beyond broader and narrower relationships. If taxonomies are trees, ontologies can best be represented as webs. For example, “Nightstands” could have a relationship to “Oak” of “Is Made Of” and “Drawers” could have a relationship of “Is a Part Of” to “Nightstands.”
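One common way to picture such a web is as a list of subject, relationship, object triples. The sketch below assumes that representation; the “Is a Kind Of” triple and both helper functions are hypothetical additions for illustration.

```python
# Hypothetical ontology fragment stored as (subject, relationship, object) triples.
TRIPLES = [
    ("Nightstands", "Is Made Of", "Oak"),
    ("Drawers", "Is a Part Of", "Nightstands"),
    ("Nightstands", "Is a Kind Of", "Bedroom Furniture"),  # assumed extra link
]


def objects_of(subject, relationship):
    """Answer 'subject --relationship--> ?' by scanning the triples."""
    return [o for s, r, o in TRIPLES if s == subject and r == relationship]


def subjects_of(relationship, obj):
    """Answer '? --relationship--> obj' by scanning the triples."""
    return [s for s, r, o in TRIPLES if r == relationship and o == obj]


print(objects_of("Nightstands", "Is Made Of"))     # ['Oak']
print(subjects_of("Is a Part Of", "Nightstands"))  # ['Drawers']
```

Because relationships are named rather than limited to broader and narrower, a question such as “what are nightstands made of?” becomes a simple lookup over the triples.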

In both cases, it’s common to have synonyms to make sure that different ways of saying the same concepts are connected. For example, “Laptop Computers” has a synonym of “Notebook Computers.”

Taxonomies are perfect for simpler needs like providing a browsable structure for navigating a website. For example, Amazon uses a taxonomy to organize its product pages, said Leher. Ontologies are more important when you want to get into analytics or question answering.

A furniture taxonomy would be great if you want to help a user find a nightstand to purchase on a website, noted Leher. A furniture ontology would be great if you wanted to answer a question such as “what are nightstands made of?”

Improving blog search

Leher believes that taxonomies are used primarily in two ways for blogs, websites, and portals. First, taxonomies can improve the menu navigation of the website. Most menu navigation starts with broad concepts and then allows the user to drill down to more specific concepts.

Taxonomies can also be used to support keyword search of unstructured content, added Leher. A user can do a keyword search for whatever they want and get a set of results. Then, on the left-hand side, taxonomy terms (which the content has been tagged with) can be presented to allow the user to narrow or filter the result set based on the concepts within the taxonomy. “This is a great supplementary tool to use to enhance the relevance of a search engine,” said Leher.
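The keyword-search-plus-filter flow described here can be sketched in a few lines of Python. The posts, the naive title matching, and the function names are all assumptions for illustration; a real search engine would do the matching and counting itself.

```python
from collections import Counter

# Hypothetical blog posts, each already tagged with taxonomy terms.
POSTS = [
    {"title": "Choosing an oak nightstand", "tags": {"Nightstands"}},
    {"title": "Sleigh beds for small rooms", "tags": {"Beds", "Sleigh Beds"}},
    {"title": "Oak sleigh bed care guide", "tags": {"Beds", "Sleigh Beds"}},
    {"title": "Bedroom lighting basics", "tags": set()},
]


def keyword_search(query):
    """Naive keyword match: keep posts whose title contains every query word."""
    words = query.lower().split()
    return [p for p in POSTS if all(w in p["title"].lower() for w in words)]


def facet_counts(results):
    """Count how many results carry each taxonomy term (the sidebar facets)."""
    return Counter(tag for p in results for tag in p["tags"])


def filter_by_term(results, term):
    """Narrow a result set to posts tagged with the chosen taxonomy term."""
    return [p for p in results if term in p["tags"]]


results = keyword_search("oak")
print(facet_counts(results))
print([p["title"] for p in filter_by_term(results, "Beds")])
```

The facet counts are what a sidebar would display next to each taxonomy term, and choosing a term simply re-filters the existing result set.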

In contrast, ontologies are used more frequently to enable more sophisticated search queries. For example, a Google search on “Who was the 16th President?” brings up a picture and a brief bio of Abraham Lincoln. This is not a website result but a direct answer to the question. “This type of search result is likely powered by an ontology in the background,” explained Leher.

When taxonomies and ontologies are applied, they comb through unstructured text data to identify, extract, and categorize blog, website, and portal content, explained Fiona McNeill, principal global marketing manager for analytics at SAS Institute.

The tools essentially declare whether a concept in the taxonomy exists or not, or whether an ontological relationship exists or not. That declaration creates new metadata about the blog, website, or any other kind of text content, giving it new tags that can be included in search and retrieval routines.
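As a rough illustration of that yes-or-no declaration, the sketch below checks a piece of text against a small phrase list and emits the matching concepts as tags. The phrase lists and the simple regular-expression matching are assumptions; the commercial tools mentioned in this article use far more sophisticated linguistic models.

```python
import re

# Hypothetical taxonomy concepts with the surface phrases that signal them.
CONCEPT_PHRASES = {
    "Nightstands": ["nightstand", "bedside table"],
    "Beds": ["bed", "sleigh bed"],
    "Laptop Computers": ["laptop", "notebook computer"],
}


def tag_text(text):
    """Return the sorted list of concepts whose phrases occur in the text."""
    found = set()
    for concept, phrases in CONCEPT_PHRASES.items():
        for phrase in phrases:
            # Whole-word match with an optional plural "s", case-insensitive.
            pattern = r"\b" + re.escape(phrase) + r"s?\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.add(concept)
                break
    return sorted(found)


post = "We paired two oak nightstands with a new sleigh bed."
metadata = {"tags": tag_text(post)}
print(metadata)  # {'tags': ['Beds', 'Nightstands']}
```

The resulting tags are derived from the content itself, so they can be indexed alongside the post without the author adding them by hand.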

McNeill said,

“Such metadata is extremely useful for disambiguation—to quickly converge on content and delineate the writer’s intent. For example, distinguishing the term ‘jaguar’ as a vehicle vs. an animal. This additional metadata can be represented in similar fashion to the Wikipedia info boxes for the end user.

More about the content is understood from the new, derived metadata (based on the content itself) and queries that are made against this enriched set of tags/indexed metadata drive better, more relevant results.”

Going beyond tags

Tags can greatly simplify searching through blogs for relevant content, but they are difficult to maintain as the number of tags grows. This limits their precision and their reach into the depth of blog content. AlchemyAPI’s Turner explained that keywords provide utility in organizing information but have some key limitations.

Namely, there are many different keywords that can refer to the same topic (for example, “dogs,” “canine,” “pooch,” etc.). A software system relying on keywords alone will ultimately treat each of these words as a different topic, whereas a system leveraging taxonomies or ontological knowledge can collapse these different words into the same topic (“dogs”).
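The collapsing step can be sketched as a lookup from each variant to one preferred label. The mapping below is a hypothetical hand-built example; an automated system would derive these links rather than list them manually.

```python
from collections import Counter

# Hypothetical synonym ring: every variant maps to one preferred topic label.
PREFERRED = {
    "dogs": "dogs",
    "canine": "dogs",
    "pooch": "dogs",
    "laptop computers": "laptop computers",
    "notebook computers": "laptop computers",
}


def normalize(keyword):
    """Map a keyword to its preferred topic; unknown keywords pass through."""
    key = keyword.strip().lower()
    return PREFERRED.get(key, key)


raw_tags = ["Dogs", "canine", "pooch", "Notebook Computers"]
print(Counter(raw_tags))                        # four separate "topics"
print(Counter(normalize(t) for t in raw_tags))  # two topics after collapsing
```

Counting the raw tags shows four unrelated keywords, while counting the normalized tags shows that three of them are really about dogs.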

Taxonomies and ontologies can also help identify similarities between topics that are not directly synonymous like “basketball” and “football,” Turner added. While these words each refer to a different activity (basketball and football), they’re both referring to a type of sport—a relationship that would be captured in the knowledge hierarchy provided by a taxonomy or ontology.

This idea is similar to a family tree. If we go back far enough, we find that we are all related somehow. This same analogy applies to how content is tagged. If you walk up the tree, you’ll find how terms are connected.
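Walking up the tree can be written as a short loop. In this sketch the sports taxonomy, including the intermediate “Team Sports” level, is a hypothetical example, and the function names are made up for illustration.

```python
# Hypothetical sports taxonomy: each term maps to its broader (parent) term.
PARENT = {
    "Sports": None,
    "Team Sports": "Sports",
    "Basketball": "Team Sports",
    "Football": "Team Sports",
    "Tennis": "Sports",
}


def ancestors(term):
    """Return the term followed by each broader term up to the root."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENT[term]
    return chain


def closest_common_ancestor(a, b):
    """Walk up from a and return the first term that is also above b."""
    above_b = set(ancestors(b))
    for term in ancestors(a):
        if term in above_b:
            return term
    return None


print(closest_common_ancestor("Basketball", "Football"))  # Team Sports
print(closest_common_ancestor("Basketball", "Tennis"))    # Sports
```

The closer the shared ancestor sits to the two terms, the more closely related they are, much like siblings versus distant cousins in a family tree.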

These concepts are particularly important for improving a blog’s stickiness and engagement, Turner said.

“Goals of publishing websites are time-on-site (stickiness) and engagement with content and advertisements. A publisher wants to deepen their relationships with readers by recommending content of interest that keeps them on the site and gets them to look at more pages (and more ads).

Ultimately, as publishers dive into long-tail keywords, or more specific phrases that buyers are likely to use when they are closer to a purchase, we see taxonomies and ontologies going above-and-beyond what surface level tags can do.”

One of the main challenges for large-scale publishers is that, with hundreds or thousands of authors all generating content, it is hard to provide meaningful recommendations and cross-navigation across all of those articles. One approach is to try to force authors to abide by certain tagging policies. Such policies require lots of training and the development of strict standards, and such a process simply doesn’t scale for open publishing platforms, noted Turner.

He explained,

“We think a better approach is using a taxonomy to automatically ‘walk up’ categories to find common links to ideas expressed in slightly different ways by different authors. This allows a publisher to dive deeper into their site’s inventory and pull highly targeted content that will interest a reader. In turn, this increases the value of content that’s out there over time. Instead of an immediate peak of interest when the article is first published followed by a steep drop-off, we see longer content lifespans that increase the revenue generated from any single article.”
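A toy version of this “walk up” recommendation might look like the following Python sketch. It is not AlchemyAPI’s method; the pet taxonomy, the articles, and the ranking by number of steps are all assumptions chosen to show the idea.

```python
# Hypothetical taxonomy (term -> broader term) and articles tagged by different authors.
PARENT = {
    "Pets": None,
    "Dogs": "Pets",
    "Dog Training": "Dogs",
    "Dog Nutrition": "Dogs",
    "Cats": "Pets",
    "Cat Toys": "Cats",
}

ARTICLES = {
    "Teaching recall in a week": "Dog Training",
    "Reading a kibble label": "Dog Nutrition",
    "Best toys for indoor cats": "Cat Toys",
}


def ancestors(term):
    """Return the term followed by each broader term up to the root."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENT[term]
    return chain


def steps_to_shared_category(tag_a, tag_b):
    """Count the steps up from tag_a to the first category also above tag_b."""
    above_b = set(ancestors(tag_b))
    for steps, term in enumerate(ancestors(tag_a)):
        if term in above_b:
            return steps
    return None


def recommend(title):
    """Rank the other articles by how few steps up the tree link them to this one."""
    tag = ARTICLES[title]
    scored = []
    for other, other_tag in ARTICLES.items():
        if other == title:
            continue
        steps = steps_to_shared_category(tag, other_tag)
        if steps is not None:
            scored.append((steps, other))
    return [other for steps, other in sorted(scored)]


print(recommend("Teaching recall in a week"))
```

Even though no two articles share a tag, the dog nutrition piece ranks ahead of the cat toys piece because it meets the training article one level up rather than two.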

Leveraging tools for taxonomies

There was a time when people worked manually to construct hierarchical representations of topical domains. In some niche areas with topics that do not change often, this process can work. However, in most industries, topics change often and cannot be constantly monitored by people. Turner noted,

“We’ve seen a big shift of companies moving towards automated tools such as AlchemyAPI or Synaptica that can create taxonomies on the fly without human intervention.”

Standards for many taxonomies often center on how content is represented in the advertising space. For example, the Interactive Advertising Bureau’s Quality Assurance Guidelines taxonomy is a common structure that allows different parties to normalize their tagging and exchange data effectively.

Non-standardized methods like Google’s advertising are widely embraced by content managers who rely on search engine optimization and organic search.

Internet content providers are also starting to leverage web standards for representing semantic web data, like OWL, RDF, and SKOS, said WAND’s Leher. He sees blogs and websites also leveraging a variety of tools for managing taxonomies and ontologies, including Synaptica, Smartlogic Semaphore, Expert System, and PoolParty.
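To give a feel for what one of these standards looks like, the sketch below writes a tiny taxonomy out as SKOS statements in Turtle syntax using plain string formatting. The SKOS namespace and the property names (prefLabel, altLabel, broader) are standard; the base URI, the slugs, and the concepts are hypothetical, and a real project would more likely use an RDF library or one of the tools named above.

```python
# The SKOS namespace and property names are standard; the base URI,
# slugs, and concepts below are hypothetical.
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/taxonomy/"

CONCEPTS = {
    "bedroom-furniture": {"label": "Bedroom Furniture", "alt": [], "broader": None},
    "beds": {"label": "Beds", "alt": [], "broader": "bedroom-furniture"},
    "laptop-computers": {
        "label": "Laptop Computers",
        "alt": ["Notebook Computers"],
        "broader": None,
    },
}


def to_turtle(concepts):
    """Serialize the concepts as SKOS statements in Turtle syntax."""
    lines = [f"@prefix skos: <{SKOS_NS}> .", f"@prefix ex: <{BASE}> .", ""]
    for slug, info in concepts.items():
        label = info["label"]
        statements = ["a skos:Concept", f'skos:prefLabel "{label}"@en']
        for alt in info["alt"]:
            statements.append(f'skos:altLabel "{alt}"@en')
        if info["broader"] is not None:
            broader = info["broader"]
            statements.append(f"skos:broader ex:{broader}")
        lines.append(f"ex:{slug} " + " ;\n    ".join(statements) + " .")
    return "\n".join(lines)


print(to_turtle(CONCEPTS))
```

Here the broader link carries the tree structure and the alternative label carries the synonym, so the same file can feed both navigation and search.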

Leher added, “Some text mining tools have good taxonomy management built in, but in many cases a company would be well served to have a standalone ontology management tool that can publish the ontology to the mining or analytics tool.”

Challenges of taxonomies

One of the challenges is that many of these tools need to be trained on text data before they can be used efficiently, said SAS Institute’s McNeill.

To address this problem, SAS has developed a unique method of applying machine learning techniques to discover initial semantic relationships and then using those to help delineate categories and concepts.

McNeill explained,

“The historical hurdle has been the need to train them on data, or manually classify a sample of documents to create a training set upon which text models are built. However, advances that SAS has made in this area with SAS Contextual Analysis alleviate this historic requirement.

This makes it possible to use text mining to develop a starter taxonomy that is directly refined with linguistic rules, within the same software interface. SAS Text Miner also provides methods of active learning whereby analysts can start with sparse training but they can refine their models through relevance feedback that teaches the model with every refinement.”

Another challenge is balancing the need to create the complexity necessary to get interesting insights during text mining with the need for a simple way to present the information back to users, said Leher. He explained that complex ontologies are better for mining information, while simpler taxonomy trees are better for presenting information. So, finding that balance is a challenge.

Taxonomies and ontologies can reflect the biases, world view and experience of the people creating them, said AlchemyAPI’s Turner.

“If the content or article is discussing a hobby or sport that is relevant in one region but not another, you will likely receive no result or an incorrect result.”

Taxonomies and ontologies also face challenges around recency. Turner explained,

“New ideas, memes, words and content are being created daily and old ways of tagging may not pick up on new terms or tag it correctly. Ultimately, a system needs to recognize relationships and place these into a related category that makes sense to readers and grow over time to respond to those dynamic changes.”

A move to automated solutions lightens the weight of these challenges and allows humans to provide guidance while relying on machines that are better equipped for large-scale content categorization. For example, one group of engineers looking at human-created taxonomies found that “Kings of Leon” was categorized as “Dignitaries and World Leaders” instead of “Music” or “Band.”

An automated taxonomy would read text about “Kings of Leon” and understand over time that it needs to put this into a music-related category, but the people who built this particular taxonomy had not yet heard of the band and thus it was incorrectly classified.

The future of auto-tagging

“As the amount of content continues to grow, having good metadata and a good metadata structure (meaning a taxonomy) will continue to be more important,” said WAND’s Leher. “If a blogger tags content and organizes those tags, it will make it easier for readers to navigate through content on the blog. I hope and expect that blogging platforms will improve support for taxonomies beyond simple free-form keyword tagging.”

SAS Institute’s McNeill expects that new tools for blog search will eliminate the need for traditional pre-definitions of taxonomies and ontologies altogether and will allow linguistic models to evolve with content.

Past ways of building and maintaining taxonomies and ontologies were confusing, complex, and required expensive domain experts. AlchemyAPI’s Turner predicts, “As the market grows and early-adopters prove the usefulness of automated systems, we see many industries approaching this problem with newfound energy and excitement.”