Tuesday 17 May 2011

Are search engines and content aggregators stealing from content providers?

Content creation and monetisation

The demand for content on the internet is insatiable, whether for news, fiction, opinion pieces, movies, or music. Those who create this content often want to monetise it, either through paywalls or advertising. Those who consume the content often want it as cheaply as possible, or even free. Very popular sites often try a mix of "freemium", paywalls, advertising and sometimes simply appeals for donations.

Tension between creators, search engines and content aggregators

Making content easy to find is a cause of tension between the content creators, search engines and news aggregators. On the one hand, the content creators want their content to be discoverable, which means being featured in search engines and possibly having RSS feeds of articles or some other form of syndication; on the other hand, they may not want to allow entire articles, or significant parts of them, to be indexed and stored, and thus lose the chance to control access via paywalls or to display adverts.

Search engines

Most people are aware of the spats between content providers and search engines, so I will tackle that first.

For example, if a search engine scans and caches the entire page, a person finding it may realise they can read the cached copy rather than log in through the paywall. The good news is that there are technical measures by which the content provider doesn't have to block the search engine entirely: it can recognise the engine's spider (by its user-agent string or IP addresses), serve the full article text for indexing, and yet instruct the engine not to cache the article.
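As a concrete illustration, here is a minimal sketch of that approach in Python (WSGI). The user-agent test, port and page body are illustrative only; the "noarchive" directive and the X-Robots-Tag header are the real mechanisms the major search engines honour.

    from wsgiref.simple_server import make_server

    def application(environ, start_response):
        user_agent = environ.get('HTTP_USER_AGENT', '')
        headers = [('Content-Type', 'text/html; charset=utf-8')]
        # "noarchive" asks the engine to index the page but not to offer a
        # public cached copy; X-Robots-Tag is the HTTP-header form of the
        # <meta name="robots" content="noarchive"> tag.
        if 'Googlebot' in user_agent:
            headers.append(('X-Robots-Tag', 'noarchive'))
        start_response('200 OK', headers)
        return [b'<html><body>Full article text, visible to the spider.</body></html>']

    if __name__ == '__main__':
        make_server('', 8000, application).serve_forever()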

It's therefore reasonable to suggest that search engines don't steal content except through careless misconfiguration on the content provider's part, and even then the cached content can be removed from the search engine on request.

Sometimes a content provider can make amazingly stupid decisions about controlling search-engine access; one of the best examples was when the Belgian newspapers had themselves removed from Google, only to ask to be relisted months later!



Content Aggregators

A growing concern for content creators is their loss of control due to content aggregators, which act as intermediaries between the consumer and the original source. For example, a consumer reading an RSS feed of an article may do so via an aggregator like Pulse, Flipboard or Taptu, and never visit the originating web site at all. With fewer visitors to the web site, the opportunity to display adverts is reduced or even lost, particularly if the aggregator "deep links" straight to individual articles.
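To see the mechanics, here is a minimal sketch of what an aggregator does with a feed, written in Python with the feedparser library (the feed URL is hypothetical):

    import feedparser  # third-party feed-parsing library

    feed = feedparser.parse('http://example.com/articles.rss')
    for entry in feed.entries:
        print(entry.title)               # shown in the aggregator's own UI
        print(entry.get('summary', ''))  # often enough to satisfy the reader
        print(entry.link)                # a "deep link" straight to the article

The reader gets the title and summary straight from the feed, and if they follow the link at all, it is a deep link past the site's front page and its adverts.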

A further problem is that, as intermediaries, many aggregators use caching, which reduces the hit count on the origin site and feeds stale data to the consumer. This skews the site's statistics, causing it to under-report the level of activity; that in turn can reduce the value of the site in terms of readership, diminish the interest of advertisers, cause stale unregistered adverts to be shown, and prevent the site from registering advert fill-rates.
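One common form of this caching is the standard HTTP conditional GET, sketched below in Python (the URL and date are hypothetical). A single such request can serve hundreds of readers from the aggregator's cache, which is exactly why the origin's logs under-count the real readership:

    import urllib.request
    import urllib.error

    req = urllib.request.Request(
        'http://example.com/articles.rss',
        headers={'If-Modified-Since': 'Tue, 17 May 2011 08:00:00 GMT'})
    try:
        feed_xml = urllib.request.urlopen(req).read()  # feed changed: refresh the cache
    except urllib.error.HTTPError as err:
        if err.code == 304:
            feed_xml = None  # 304 Not Modified: serve the cached copy again
        else:
            raise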

Is there an answer?

The aggregator could provide their statistics to the content provider, if an arrangement could be made and assuming such statistics were kept and were useful. Also, in theory, the aggregator could artificially "hit" the RSS feed when the consumer refreshed their feed, and likewise "hit" the article to match each consumer's read of it. However, that idea isn't particularly practical: it requires the aggregator to track the consumer (which may not be possible, or acceptable to the consumer), and it becomes somewhat useless when the consumer uses an offline reader.
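Purely to illustrate that idea (no aggregator is known to work this way, and every name and URL below is hypothetical), the forwarding might look like this in Python:

    import urllib.request

    def forward_hit(article_url):
        # Touch the origin so its logs record one read; the body is discarded.
        try:
            urllib.request.urlopen(article_url, timeout=5).close()
        except IOError:
            pass  # best effort only; never delay the reader

    def show_article(cached_html, article_url):
        forward_hit(article_url)  # mirror the consumer's read at the origin
        return cached_html

Even this sketch shows the weakness: the forwarded hit has to happen online, at read time, per consumer, which is precisely what an offline reader prevents.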


Conclusion

I thus conclude that content creators have far more to fear from uncontrolled aggregators acting to disintermediate them from their consumers than they ever have from search engines.

2 comments:

  1. Seems like the Belgian newspapers are still a PITA for Google:
    http://blogs.ft.com/fttechhub/2011/07/belgium-proves-a-headache-for-google/

    "Google interpreted the court order to mean it could no longer include the Belgian newspapers in either Google News or in its main index. Le Soir, La Capitale and La Libre were removed from the index on Friday, so that searches for these papers no longer turned up any results.

    The papers were horrified. They wanted exclusion from Google News, but not the main index, which is a huge source of traffic for them. They accused Google of censoring them in retaliation for the court case."

  2. An interesting recent case in Canada found against "screen scraping"
    http://www.michaelgeist.ca/content/view/5996/125/
