For publishers, Big Data may only be around as long as print is. It is revered – but should it be revered to the exclusion of all other insight?
Tim Harford is the FT’s Undercover Economist, and his articles are generally well worth reading – both in the FT and on his blog. He can also be found on Radio 4 hosting More or Less, a show described as, “Tim Harford explains – and sometimes debunks – the numbers and statistics used in political debate, the news and everyday life”.
In this well-crafted and well-researched article for the FT he turns his attention to Big Data. He writes:
“Cheerleaders for big data have made four exciting claims: that data analysis produces uncannily accurate results; that every single data point can be captured, making old statistical sampling techniques obsolete; that it is passé to fret about what causes what, because statistical correlation tells us what we need to know; and that scientific or statistical models aren’t needed because, to quote “The End of Theory”, a provocative essay published in Wired in 2008, “with enough data, the numbers speak for themselves”.
Unfortunately, these four articles of faith are at best optimistic oversimplifications. At worst, according to David Spiegelhalter, Winton Professor of the Public Understanding of Risk at Cambridge University, they can be “complete bollocks. Absolute nonsense.”
The media world is abuzz with the promise of big data and such sobering words might read like an unwelcome debunking of the latest big thing. Big Data is the future, right?
Those who have worked in print media for many years have been part of a world in which Big Data already exists, in the sense that census data (not sample-based research) is available for circulation sales.
Publishers know how many copies of a newspaper or magazine have been sold. They know how many they printed, shipped out and received back unsold. They employ the Audit Bureau of Circulations to ensure the count is rigorous.
Often the trends shown in this sales data and in the readership data don’t tally. Readership figures relate to the number of people who read a publication. In an ideal world there would be a very strong correlation between the number of copies sold (circulation) and the number of people who read them (readership); roughly 2 to 3 readers per copy would seem normal.
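To make the readers-per-copy idea concrete, here is a minimal sketch of the arithmetic. The function name and the figures are hypothetical, chosen only for illustration; they are not taken from any real title.

```python
def readers_per_copy(readership: float, circulation: float) -> float:
    """Average number of readers for each copy sold."""
    if circulation <= 0:
        raise ValueError("circulation must be positive")
    return readership / circulation


# Hypothetical title: 200,000 copies sold and 500,000 estimated readers.
rpc = readers_per_copy(500_000, 200_000)
print(f"Readers per copy: {rpc:.1f}")  # prints "Readers per copy: 2.5"
```

If circulation falls while readership holds steady, this ratio rises, which is one simple way the two trends can appear out of step.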
Readership figures are collected using a sample-based research study called the National Readership Survey. Being sample-based, it is often blamed when circulation trends and readership trends are out of step. Sampling error and statistical variation are viewed with suspicion, and not without reason.
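As a rough illustration of why sampling error attracts suspicion, the sketch below computes an approximate 95% margin of error for a readership estimate. It assumes a simple random sample, which is a simplification and not a description of how the National Readership Survey is actually designed; the sample size and the 5% reach figure are hypothetical.

```python
import math


def margin_of_error(proportion: float, sample_size: int, z: float = 1.96) -> float:
    """Approximate margin of error for a sample proportion.

    Assumes a simple random sample; z = 1.96 corresponds to roughly
    95% confidence under the normal approximation.
    """
    if not 0 <= proportion <= 1:
        raise ValueError("proportion must be between 0 and 1")
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


# Hypothetical survey: 5% of a 30,000-person sample say they read the title.
reach = 0.05
moe = margin_of_error(reach, 30_000)
print(f"Estimated reach: {reach:.1%} plus or minus {moe:.2%}")
```

Under these assumptions the margin is about a quarter of a percentage point either way, so a small movement in a title's readership from one period to the next can fall entirely within the noise.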
In fact, the relationship between readership and circulation is quite complicated. In his 1993 Readership Symposium paper, Guy Consterdine identified 20 separate factors which could influence RPC (Readers per Copy). He concluded that ‘it is impractical to build a mathematical model which predicts the Readers Per Copy for a given publication with useful accuracy’.
A 1996 Admap article by Jane Perry, then the European Head of Research at Young and Rubicam, made this point about the relationship between Circulation and Readership:
“The underlying assumption of most of these [symposia] papers is that circulation is the independent variable. A circulation figure is an independently audited census of sales. Readership is based on a sample, and uses a questioning technique which has been demonstrated endlessly at these symposia to produce varied results depending on all kinds of factors. The circulation measurement must seem more valid. If readership figures do not change in the same way as circulation, and at the same time, then there must be something wrong with the readership measurement technique.
In practice, I have found that the evidence is not that clear-cut. Readership figures may sometimes be more consistent, logical and helpful than circulation trends, at least as far as advertisers are concerned. Changes in readership may even precede changes in circulation, however illogical that may appear.
The reason for this is that circulation figures are much more closely under the control of the publisher than is readership. Readership figures may therefore reflect real changes in a title’s readership more impartially than circulation”.
Two examples. You want to buy a bottle of water and the newsagent tells you that a bottle comes free when you buy a certain newspaper. You take the paper and the free bottle, drink the water but don’t have time to read the paper. The publisher has been able to drive your purchase with the water but couldn’t make you read.
Example two. The publisher distributes a certain number of copies but demand is high that day and people end up borrowing copies from friends. The publisher’s control over print numbers capped sales but did not fully cap the title’s popularity. In both examples the publisher influenced the level of sales that day but had less direct control over readership levels.
If you just want to know how many papers were sold that day, the circulation figure will suffice. If you’re using circulation or readership figures as an indication of brand health or exposure to the advertising, the readership figures have something to offer.
Fast forward from the 1990s Readership Symposia to the present day. The Big Data that many media companies now have is not even census data. Here’s David Brennan writing in his Media Native column for Mediatel:
“Whatever happens, it is now a given that we can no longer have ‘census’ data from behavioural tracking. However the data is collected, it will almost certainly come from a sample of consumers – hopefully a very large sample, but a sample nonetheless – which will have to be modelled against the wider population and we will need to understand how those who elect not to be part of this personal data economy can be represented by those who are”.
The article explores the likelihood of data sets being even less complete in the future because of privacy concerns and the loss of potency of the cookie in behavioural tracking. He concludes that:
“All of the evidence suggests that the analytics world can learn far more from the world of sample-based research than vice-versa. But then, all of the evidence suggests it will all be sample-based in the future. In which case, let’s hope that the size of the sample will be matched by the quality of the data it produces”
Tim Harford might add:
“Statisticians are scrambling to develop new methods to seize the opportunity of big data. Such new methods are essential but they will work by building on the old statistical lessons, not by ignoring them”.
Print legacy media has a history of revering census data in the shape of circulation figures. Jane Perry’s work shows that this reverence is not always due. Publishers should never dismiss sample-based data – it has a role to play in forming a clear picture of consumers. Also, contrary to popular belief, media companies are going to have to become more, not less, comfortable with sample-based figures.