If you work in news publishing, there’s a good chance you already use some type of tags in your published content. After all, Parse.ly found that 70 percent of digital media publishers use tags and that was in 2015. The number is surely even higher in 2022. But as our research at Geneea has shown, different publishers adopt wildly different approaches to tagging content, with very different results. So whether or not you already use tags, an overview of what tags are and what they can do might be useful. In this post, I’ll try to summarize not only that but also why using Geneea NLP (natural language processing) for tagging might be a good idea.
What are tags?
In the context of news content, tags are a type of metadata. A news article is usually accompanied by metadata, such as the post’s author, timestamp, and info about which of the site’s sections it belongs to (such as news, business, culture, etc.). Tags—sometimes also referred to as topics or keywords—can supplement them.
Before talking about tags themselves, it’s a good idea to briefly discuss the sections that make up a (news) site. Sections form the structure of a news site. The sections can be designed in-house or they can be based on an existing standard. There are several comprehensive classification systems, such as IPTC or IAB, which publishers can use or customize to suit their own needs. These systems consist of a controlled vocabulary of categories organized into a hierarchy of topics. They comprise hundreds of categories, so it makes sense to choose the depth that suits your coverage. A newspaper that covers general news might only use top-level categories (e.g., politics, sport, or science), while a more specialized magazine will delve deeper into their particular field of interest. We at Geneea work with several of these popular categorization systems and can offer customers automatic classification into a customized selection of categories that fit their preferences.
Sections are designed so that any article will fit into one or more of the pre-selected categories, which means that unless, I don’t know, a global pandemic hits, the number of sections used on a site is usually limited. This makes perfect sense for the structure of a news site but it also means that articles about a variety of topics will end up in the same section. It’s great to know that an article belongs to, let’s say, “World politics,” but it’s not exactly enough to tell anyone what the article is about. Is it about the relationship between the US and Russia? Or elections in Germany? That’s where “keyword tags” come in.
“Keyword”, “topic”, or “subject matter” tags are the most commonly used tags. They allow for a fine-tuned classification of articles. Yet, other types of tags can be useful too—for instance, tags based on content type, sentiment, tone, length, etc. Even the so-called “keyword tags” can go into more detail than simple keywords. They can include more detailed information about the particular subject, such as whether it refers to a person or an organization and how it relates to other entities in the real world.
A great thing about tags is that they are metadata accompanying the article, which means they can be used without having to access the article itself. But what exactly can they be used for?
What are tags good for?
Many publishers think of tags in terms of how they will benefit their readers. Keyword tags allow you to easily link related articles so that readers can continue to explore a topic they’re interested in. For example, when there is a new development in the Israeli-Palestinian conflict, you can use tags to quickly find articles on this topic in your archive and create a timeline so that your readers get a better picture of what led to the current situation. You can even offer your readers the opportunity to subscribe to a certain tag so that they receive an email every time a new article on that topic is published. Such features can make your site more pleasant to navigate for readers. But how can tags benefit publishers themselves?
In general terms, tags allow for better analysis. Articles without tags are hard to analyze based solely on main categories like sport, news, or business. Specific tags allow for granular data analysis of readership engagement, improved archive organization, and more. If publishers are interested in readership analytics, they can use tags to measure which topics lead to the most page views, which have the best engagement rates and are getting shared the most, or, conversely, which are losing readers the fastest. Additionally, if your site has a paywall for certain content, tags can be useful for determining what should be free and what should only be accessible to paying subscribers.
Tags can also be useful for advertising. They can help match ads to article topics and avoid embarrassing situations when inappropriate ads are matched to sensitive content. Similarly, tagging sponsored content allows publishers to track performance metrics, and, if relevant, they can then share this data with advertisers. On top of that, tags help with search engine optimization (SEO). Tags create links between articles and give your site a structure that will allow search engines to have a better chance of making the same semantic connections.
Thinking about tags differently in terms of how they benefit readers and publishers means we can consider them as two different types of tags. Simpler, external tags for readers can be selected from more complex, internal tags. As I’ve already mentioned above, tags can contain additional information about the subject which can be used to create taxonomies and link related tags together. For instance, if your tags for Theresa May and Boris Johnson contain the information that they have both served as PMs of the UK, you can then use this information to include both their tags in the same category and in this way create more complex tag hierarchies and a better structure on your site and in your archive. The tags we at Geneea provide to our customers contain a variety of useful information thanks to our robust knowledge base (but more on that later).
Choosing how to tag
Choosing the “right” tags for a news article is not as easy as it may seem. When a publisher doesn’t have a well-thought-out tagging strategy in place, this can lead to problems, as haphazardly applied tags make meaningful analysis much more difficult. Often, articles either miss important tags or have redundant or irrelevant tags. Also, the level of detail can vary greatly. There are some good guidelines that explain how to apply tags properly (you can check out one of these on this site), but even when you have a strategy in place, there may still be trouble ahead. Designing a strategy is one thing; sticking to it in practice is another.
Manual tagging
Even with thorough checks, the fact remains that when journalists write tags manually, they are prone to making mistakes. Too often, manually applied tags suffer from duplicities and typos. When the Süddeutsche Zeitung took a look at their tags, they realized they had a lot of unnecessary duplicated tags. To take just one example, ‘“Chancellor Merkel,” “Chancellor Angela Merkel,” “German Chancellor Angela Merkel,” and “A Merkel” were all tags being used on SZ. Similarly, it’s typical that the same tag is often used in both the singular and plural form (“iPhone” vs. “iPhones”).
Additionally, as no two people are the same, journalists will often choose different tags for similar articles. We at Geneea have conducted our own research into this issue, which showed that the consistency between human journalists tagging the same articles is less than 20 percent. Different journalists use synonyms for the same concept or have different ideas about how specific tags should be. As a result, one journalist may use the tag “human rights,” while another uses “LGBT rights”; one will put “Middle East,” while another uses “Palestine” and “Israel,” and so on. All of these inconsistencies will obviously lead to issues in any analysis or anywhere else where tags are used.
It’s also worth mentioning that tagging is a task that many journalists loathe. So is there another way to identify tags that are consistent, relevant, unique, and contain important details? Fortunately, there is.
Tagging with the help of NLP
The best way to identify tags successfully is to use natural language processing (NLP). NLP software like the one we develop at Geneea is an AI system trained to read texts like a human. Thanks to functions like part-of-speech tagging and parsing, our AI understands syntactic structures not only at the level of one sentence but also between sentences. This means it can decode the entity (whether it’s a person, an organization or anything else) that represents the main subject of an article—even when it’s only mentioned by name once. Similarly, it can give more weight to entities that serve as subjects rather than objects in the text. That way, it goes beyond simply counting how many times someone or something is mentioned in a text.
What’s more, thanks to named-entity recognition (NER), our NLP software can distinguish between named and unnamed (general) entities and, at the same time, it assigns the named entities several types (such as the person, place, etc.). This classification can then be used to prioritize certain entities over others in tags, depending on the desired strategy. The strategy can vary for different types of articles, so, for instance, the software can favor people and organizations in political articles, products such as cars or mobile phones in technical articles, and general concepts in scientific articles. In a way, our NLP AI works with the text like a human, but its advantage is that once it’s given a strategy, it will stick to it, which means it will always choose the same tags for the same article. You can try this yourself in our online demo on your own article.
Another big advantage of our NLP engine is that it’s linked to the Geneea knowledge base (GKB). The GKB currently contains approximately 10 million items from various sources and it grows every day. Thanks to our knowledge base, the AI recognizes one entity even when it’s referred to by different names and can always provide the exact same tag for it (to avoid the above-mentioned issue with duplicates). Our knowledge base can also be used to provide additional information about the tagged entities and their relations to others which makes tags more capable of providing better functions than simple keywords. At the same time, our AI uses the information it has to disambiguate different entities with the same name so that articles about Adam Scott, the American actor, won’t get confused with articles about Adam Scott, the Australian golfer. It’s also no small thing that we can add customer-specific entities and information to our knowledge base so that you get exactly the data you need.
These are but a few of the many advantages of NLP-assisted tagging, so it’s no surprise that more and more publishers are turning to some form of automated or semi-automated tagging. While some bigger media houses can afford to develop their own solution (like The New York Times), for many publishers this isn’t feasible. Fortunately, Geneea is there to provide this service to publishers. Our service is easy to integrate into your CMS and flexible enough to fit your particular needs. Thus it will help you get consistent, highly relevant tags and, at the same time, save your journalists time doing work they often consider tedious—time they can better spend doing things that only people can do (for now).
Are you interested in using automatic tags? Visit geneea.com/media/tagger for more information and see a case study with one of our major customers, Czech media company VLM, to get a better idea of how this can work in practice.