Big Data – What It Means For The Digital Analyst

Big Data for Digital Analysts

Cardinal Path article

I just Googled “Big Data” and I got 19,600,000 results... Where there was virtually nothing two years ago, there is now unprecedented hype. While the most serious sources are IBM, McKinsey, and O’Reilly, most articles are marketing rants, uninformed opinions or plain wrong. I had to ask myself… what does it mean for the digital analyst?

Below is a figure from Google Trends showing the growth of search interest for "big data" as compared to "web analytics" and "business intelligence."

Big Data and Analytics Trends

Big Data - Hype

It’s not surprising Gartner positions “Big Data” between “social TV” and “mobile robots”, midway toward the peak of inflated expectations – two to five years before reaching a more mature stage. The number of products boasting the “big data” mantra is exploding and mass media is entering the fray, as exemplified by the New York Times article “The Age of Big Data” and a series on Forbes entitled “Big Data Technology Evaluation Checklist”.

Big Data Hype Cycle

On the brighter side, concepts of Big Data are spurring cultural shifts within organizations, challenging outdated “business intelligence” approaches and raising awareness of “analytics” in general.

Innovative technologies being built for Big Data can readily apply to environments such as digital analytics. It is worth noting that interest in traditional web analytics seems to be diminishing as organizations mature toward broader, more complex and more valuable uses of advanced business analysis.

Big Data - Definition

There is no universal definition of what constitutes “Big Data” and Wikipedia offers only a very weak and incomplete one: “Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time”.

IBM offers a good, simple overview:

Big data spans three dimensions: Volume, Velocity and Variety.

  • Volume – Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.
  • Velocity – Often time-sensitive, big data must be used as it is streaming in to the enterprise in order to maximize its value to the business.
  • Variety – Big data extends beyond structured data, including unstructured data of all varieties: text, audio, video, click streams, log files and more.

Bryan Smith of MSDN adds a fourth V:

  • Variability – Defined as the differing ways in which the data may be interpreted. Differing questions require differing interpretations.

Big Data - Technological Perspective

Big Data encompasses several aspects also commonly found in business intelligence: data capture, storage, search, sharing, analytics and visualization. In his book entitled “Big Data Glossary”, Pete Warden covers sixty innovations and provides a brief overview of technological concepts relevant to Big Data.

  • Acquisition: Refers to the various data sources, internal or external, structured or not. “Most of the interesting public data sources are poorly structured, full of noise, and hard to access.”
    Technologies: Google Refine, Needlebase, ScraperWiki, BloomReach.
  • Serialization: “As you work on turning your data into something useful, it will have to pass between various systems and probably be stored in files at various points. These operations all require some kind of serialization, especially since different stages of your processing are likely to require different languages and APIs. When you’re dealing with very large numbers of records, the choices you make about how to represent and store them can have a massive impact on your storage requirements and performance.”
    Technologies: JSON, BSON, Thrift, Avro, Google Protocol Buffers.
  • Storage: “Large-scale data processing operations access data in a way that traditional file systems are not designed for. Data tends to be written and read in large batches, multiple megabytes at once. Efficiency is a higher priority than features like directories that help organize information in a user-friendly way. The massive size of the data also means that it needs to be stored across multiple machines in a distributed way.”
    Technologies: Amazon S3, Hadoop Distributed File System.
  • Servers: “‘The cloud’ is a very vague term, but there’s been a real change in the availability of computing resources. Rather than the purchase or long-term leasing of a physical machine that used to be the norm, now it’s much more common to rent computers that are being run as virtual instances. This makes it economical for the provider to offer very short-term rentals of flexible numbers of machines, which is ideal for a lot of data processing applications. Being able to quickly fire up a large cluster makes it possible to deal with very big data problems on a small budget.”
    Technologies: Amazon EC2, Google App Engine, Amazon Elastic Beanstalk, Heroku.
  • NoSQL: In computing, NoSQL (which really means "not only SQL") is a broad class of database management systems that differ from the classic model of the relational database management system (RDBMS) in some significant ways, most important being they do not use SQL as their primary query language. These data stores may not require fixed table schemas, usually do not support join operations, may not give full ACID (atomicity, consistency, isolation, durability) guarantees, and typically scale horizontally (i.e. by adding new servers and spreading the workload rather than upgrading existing servers).
    Technologies: Apache Hadoop, Apache Cassandra, MongoDB, Apache CouchDB, Redis, BigTable, HBase, Hypertable, Voldemort. See http://nosql-database.org/ for a complete list.
  • MapReduce: “In the traditional relational database world, all processing happens after the information has been loaded into the store, using a specialized query language on highly structured and optimized data structures. The approach pioneered by Google, and adopted by many other web companies, is to instead create a pipeline that reads and writes to arbitrary file formats, with intermediate results being passed between stages as files, with the computation spread across many machines.”
    Technologies: Hadoop & Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum.
  • Processing: “Getting the concise, valuable information you want from a sea of data can be challenging, but there’s been a lot of progress around systems that help you turn your datasets into something that makes sense. Because there are so many different barriers, the tools range from rapid statistical analysis systems to enlisting human helpers.”
    Technologies: R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, Bigsheets, Tinkerpop.
    Startups: Continuuity, Wibidata, Platfora.
  • Natural Language Processing: “Natural language processing (NLP) … focus is taking messy, human-created text and extracting meaningful information.”
    Technologies: Natural Language Toolkit, Apache OpenNLP, Boilerpipe, OpenCalais.
  • Machine Learning: “Machine learning systems automate decision making on data. They use training information to deal with subsequent data points, automatically producing outputs like recommendations or groupings. These systems are especially useful when you want to turn the results of a one-off data analysis into a production service that will perform something similar on new data without supervision. Some of the most famous uses of these techniques are features like Amazon’s product recommendations.”
    Technologies: WEKA, Mahout, scikits.learn, SkyTree.
  • Visualization: “One of the best ways to communicate the meaning of data is by extracting the important parts and presenting them graphically. This is helpful both for internal use, as an exploration technique to spot patterns that aren’t obvious from the raw values, and as a way to succinctly present end users with understandable results. As the Web has turned graphs from static images to interactive objects, the lines between presentation and exploration have blurred.”
    Technologies: GraphViz, Processing, Protovis, Google Fusion Tables, Tableau Software.
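
To make the MapReduce idea above a little more concrete, here is a minimal, single-machine sketch in plain Python. It only imitates the three stages (map, shuffle, reduce) on a handful of made-up page-view log lines; the log format, the sample data and the function names are assumptions for illustration, not the API of Hadoop or any other framework listed above. In a real cluster, each stage would run across many machines and the intermediate results would be passed along as files.

```python
from collections import defaultdict

# Hypothetical page-view log lines in the form "visitor_id<TAB>page".
LOG_LINES = [
    "v1\t/home",
    "v2\t/home",
    "v1\t/pricing",
    "v3\t/home",
    "v2\t/pricing",
    "v3\t/blog",
]


def map_phase(line):
    """Map step: emit one (page, 1) pair for a single log line."""
    _visitor, page = line.split("\t")
    yield page, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one page."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on one machine."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


page_views = run_job(LOG_LINES)
print(page_views)  # {'/home': 3, '/pricing': 2, '/blog': 1}
```

The point of the pattern is that the map and reduce functions never need to see the whole data set at once, which is what allows a framework to spread the work over many servers.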

Big Data - Challenges

Big Data was discussed at the recent World Economic Forum, where they identified several opportunities where Big Data can be applied, but also two main concerns and obstacles on the path of data commons.

1. Privacy and Security

As Craig & Ludloff put it in “Privacy and Big Data”, the conditions for a perfect storm are being set and Big Data blankets many aspects of right to privacy, Big Brother, international regulations, right to privacy vs security vs commodity, the impact on marketing and advertising…

Just think about EU cookie regulations, or more simply, a startup scavenging the social web to build very complete profiles of people – with email, name, location, interests and such. Scary! (I will share this little story in an upcoming post).

2. Human Capital

McKinsey Global Institute projects that the US will need 140,000 to 190,000 more workers with “deep analytics” expertise and 1.5M more data-literate managers.

Finding skilled “web analytics” resources is already a challenge. Considering how steep the climb is to serious analytics skills, human capital is certainly the other big challenge.

Big Data - Value Creation

All sources mention value creation, competitive advantage and productivity gains. There are five broad ways in which using big data can create value.

  • Transparency: Making data accessible to relevant stakeholders in a timely manner.
  • Experimentation: Enabling experimentation to discover needs, expose variability, and improve performance. As more transactional data is stored in digital form, organizations can collect more accurate and detailed performance data.
  • Segmentation: More granular segmentation of populations can lead to customized actions.
  • Decision Support: Replacing/supporting human decision making with automated algorithms which can improve decision making, minimize risks, and uncover valuable insights that would otherwise remain hidden.
  • Innovation: Big Data enables companies to create new products and services, enhance existing ones, and invent or refine business models.

Industry Sectors Growth: Each of those important outcomes can only become reality if sufficient and properly trained human capital is available.

Big Data Sectors

Areas Of Opportunities For Digital Analysts

With the evolution from “web analytics” to “digital intelligence”, there is no doubt digital analysts should gradually shift from website-centricity and channel-specific tactics – expert as we are – to a more strategic, business-oriented (Big) Data expertise.

The primary focus of digital analysts should not be on the lower-layers of infrastructure and tools development. The following points are strong areas of opportunities:

  1. Processing: Mastering the proper tools for efficient analysis under different conditions (different data sets, varied business environments, etc.). Although we current web analysts are undoubtedly experts at leveraging web analytics tools, most of us lack broader expertise in business intelligence and statistical analysis tools such as Tableau, SAS, Cognos and such.
  2. NLP: Developing expertise in unstructured data analysis such as social media, call center logs and emails. From the perspective of Processing, the goal should be to identify and master some of the most appropriate tools in this space, be it social media sentiment analysis or more sophisticated platforms.
  3. Visualization: There is a clear opportunity for digital analysts to develop an expertise in areas of dashboarding and more broadly, data visualization techniques (not to be confused with the marketing frenzy of “infographics”).
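
As a rough illustration of the NLP point above, a first pass at sentiment analysis of unstructured comments could look like the sketch below. It is a deliberately naive word-list approach: the positive and negative word lists and the sample comments are invented for this example, and real sentiment tools rely on far richer language models.

```python
import re
from collections import Counter

# Hypothetical word lists, made up for this illustration only.
POSITIVE = {"great", "love", "fast", "helpful"}
NEGATIVE = {"slow", "broken", "hate", "confusing"}

# Hypothetical customer comments (e.g. from social media or call center logs).
COMMENTS = [
    "Love the new checkout, really fast!",
    "The search is slow and the filters are broken.",
    "Support was helpful but the site is confusing.",
]


def score(text):
    """Return (positive word count) minus (negative word count)."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives


def label(text):
    """Turn the numeric score into a coarse sentiment label."""
    value = score(text)
    if value > 0:
        return "positive"
    if value < 0:
        return "negative"
    return "neutral"


# Tally how many comments fall under each label.
summary = Counter(label(comment) for comment in COMMENTS)
for comment in COMMENTS:
    print(label(comment), "|", comment)
```

Even a toy like this shows why the area rewards real expertise: the third comment mixes praise and complaint, and a simple tally calls it neutral.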

Action plan

One of the greatest challenges will be to balance the demand for and the supply of skilled resources. The current base of “web analysts” is generally not sophisticated enough to really leverage Big Data; filling the skills gap will necessarily involve growing “web analysts” into “digital analysts”.

This is where you can help:

  • Identify thought leaders;
  • Identify skills gap;
  • Identify learning opportunities and curriculum;
  • Share your thoughts, tools of the trade and tips with the community (through the Twitter #measure hashtag and @SHamelCP or reach me on Google+ as Stéphane Hamel).


Brendan Regan | June 2012

Great overview, and thanks for helping to undo some of the "hype." In terms of IBM's definition of BD, especially the "Variety" aspect, I read some other vendor's assertion that 90% of big data is unstructured. I think that approximate 90/10 ratio is key to understanding big data.

S.Hamel | June 2012

Thanks Brendan - 90% of Big Data might be unstructured, and probably also "unclean". For digital analysts, I think oftentimes we don't need more data, we simply need to make better use of the data that sits there, untapped (for example, merging multiple data sources, so more on the Variety side of things). Variability is certainly a big element since the online data we work with often varies a lot (for example, the number of campaigns and page names is ever expanding, social media channels evolve and constantly change, etc.)

Thomas | June 2012

Well written article Stephane. Thanks for putting this together.

S.Hamel | June 2012

Thanks Thomas - I spent a fair amount of time researching the info and wrapping my head around this article!

Renco Smeding | June 2012

Thanks Stephane for a great overview of big data technologies and concepts. Hearing the term 'big data' being used more and more in the industry, I think it's a great initiative to start the discussion on how it relates to digital marketing intelligence and identify opportunities to bridge the gap. I wanted to add my thoughts in the comments but it became too long so I created a separate post called: "From Digital Analyst to Big Data Expert" on my blog:

http://qlikmetrics.com/2012/06/from-digital-analyst-to-big-data-expert/

Hopefully it is useful and I am looking forward to future posts about big data technologies and digital analytics!

S.Hamel | July 2012

Thanks Renco - I really appreciate your input on the role of Big Data for the digital analyst. When I read your definition of "data scientist" I thought the role you described was that of a race car mechanic - while the digital analysts don't want/need to understand all the underlying technologies, they still want to be 1st at the finish line by knowing how to get the most of it.

I agree with your conclusion - the role of digital analyst and data scientist are very different and learning Qlikview and Tableau is certainly a good starting point for digital analysts. At eMetrics Chicago last week, the closing remarks focused on Big Data - we asked the audience about Tableau or similar - only a couple of hands went up... We asked who uses R... and had to explain what it is!

Matt Gershoff | July 2012

Good post. Thanks. Not sure about part of the definition here.
"...big data must be used as it is streaming in to the enterprise in order to maximize its value to the business." I'm not sure about this one. Sure, streaming is often needed for online/real-time learning. But MapReduce jobs are not streaming - they are batch.

I th

Web Analytics Europa | July 2012

Very good overview of the Analyst requirements in the Big Data World.
Every source says there is/will be a lack of experts around digital analytics. A great future of full data-power lies ahead of us as soon as we fill the human "gaps".

Doug Laney | July 2012

Great piece. Good to see the industry (e.g. IBM) finally adopting the "3V"s of big data over 11 years after Gartner first defined them. For future reference, and a copy of the original article I wrote in 2001, see: Deja VVVu: Others Claiming Gartner’s Construct for Big Data. --Doug Laney, VP Research, Gartner, @doug_laney

Doug Laney | July 2012

And Gartner's updated definition of Big Data just published in a research note by Mark Beyer and me:

"Big Data are high volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision-making."

This reflects both our original "3Vs" plus the value-side of big data.

Ken P | July 2012

I cannot see a BI report of 3 million or more rows being handled too well in Excel. What is the standard process for processing such massive amounts of data, from source to Excel, if that's even in the chain?

S.Hamel | August 2012

@Matt: I agree with you - not everything needs to be streamed and real time. If we don't view it from the technical standpoint of "streaming", the notion of "streaming into the enterprise" would refer to the ease of access and ubiquity of leveraging this data at the right time - like a river of data flowing into every aspect of the business (isn't it poetic!)

@Web Analytics Europa: everyone cites the same source... but based on the market demand for web analytics and the growing interest in Big Data, we can easily foresee a huge demand for anything that relates to data, analytics, business analysis and optimization.

@Doug: Thanks for chiming in! This updated definition sounds good to my ears and is much more relevant than merely stating something like "too much data for what you can handle"!

@Ken P: One of the big differences I noticed is a shift away from traditional BI, which dictates a long planning period to define the data model, set up the technological infrastructure and bring the data in. By the time it's done the business has evolved, and data warehousing and BI become a money pit... Now I frequently end up using an ETL tool (see Pentaho Kettle) to get just the right data from the source, transform it and store it in a locally optimized data store. Tools like Tableau/Qlikview/Spotfire can load millions of rows and optimize them to a state where it's reasonable to play with live, locally.

I have joked that the simplest definition of "Big Data" is "it doesn't fit in Excel" - and when you think of it, it's true for most people who wonder how to make the shift from a traditional approach to a Big Data one. Shifting away from Excel forces the analyst to change his approach, view the data differently, and explore new solutions.

And that's a whole lot of fun to do! :)

Ken P | July 2013

@S.Hamel I cannot believe this was from a year ago. As funny as it may seem, I have just finished the Introduction to Data Science course on Coursera and have been heavily developing applications in R since Jan 2013. The journey from digital analytics tools to broader data analytics has been awesome. You will realize what big data is very quickly and realize it's just data from many sources that will either fit into local memory or will need a cloud-based system to process. I assume the iterations of BD definitions come from the scale of each organization's challenges. Either way it's a lot of floppy disks and no one-size-fits-all MDM.

If I could suggest a few tips and need-to-know skills to ease the transition for people: R is an amazing platform, and I can't stress enough that every web analyst should obtain basic skills in it. Statistics and advanced statistics: you need to live and breathe stats. There are plenty of free MOOCs and courses online that will get you up to speed. You can get lost in stats and R, so use sites like stats.stackexchange and stackoverflow to get over the brick walls. Tableau is nice and quick but learn ggplot2 in R and D3. flowingdata.com is a great starter and there is enough documentation online to achieve anything within R. To summarize, look at 3 major buckets: visualization, programming, and statistics.

Before you know it you will be working with basic SVMs and GLMs to predict events and ARIMAs for forecasting. The complex models should be left up to the PhDs.

If using publicly available datasets from repositories isn't enough, then you will be happy to know you can connect to GA and Adobe with RGoogleAnalytics and RSiteCatalyst to pull in your own data from the vendors and start being awesome.

I found a great data science article on web analytics the other day that is a great resource for aspiring analysts: http://magazine.amstat.org/blog/2011/09/01/webanalytics/

Stephane Hamel | July 2013

Thanks Ken - things continue to evolve quickly - after the initial craziness of Big Data the "new shiny object" syndrome is fading away and people are gradually finding reasonable and useful uses for Big Data technology and concepts. It seems we don't see as many "look how cool it is" type of articles and more "here's how you can actually do it" ones - it's gone from "cool" to "useful".

Thanks for all the useful links - I strongly recommend everyone to have a look!

Jonathan | August 2012

A really nice piece Stephane cutting through some of the hype surrounding Big Data. With regards to areas of opportunity for digital analysts + the technology you have listed, we (and when I say we I have got my BIME Analytics hat on) believe the launch of Google BigQuery was a gamechanger for Big Data.

The problem with Big Data solutions is that they are often expensive or very complex. BigQuery however is easy to use, easy to manage (nothing to do), cost effective (pay-as-you-go) and virtually limitless in terms of scalability. I think it opens up Big Data analytics to the majority of people and makes it a viable option where it wasn't previously possible. With tools like BIME as a front end on top of BigQuery it is then possible to create powerful queries and visualize this data. What do you think of Google BigQuery?

Nico Roddz | October 2012

Thanks Stephane!

I spent the whole night reading and learning more about this topic :)

Also, I wrote an article on my blog (in Spanish) discussing the main ideas of this post:

http://nicoroddz.com/big-data-un-gran-reto-y-una-gran-oportunidad/

S.Hamel | October 2012

I noticed - thanks for sharing! :)

Ganeshan Nadarajan | January 2013

Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.

OT Support | June 2013

Hi Stephane, great definition/ simplification of defining Big Data, humorous & true at same time. We've added your definition to the (growing) list, now over 30 definitions of Big Data;
http://www.opentracker.net/article/25-definitions-big-data
Please let us know if you want to edit anything or add a visual, etc.

Stephane Hamel | June 2013

Thanks "OT Support" :)
The first time I mentioned this analogy was in a Tweet dated July 2012 (https://twitter.com/SHamelCP/status/220259813865701379)

I was joking but it struck a chord! Since then it has been repeated many times, including in an article on CMS Wire entitled "What is Big Data? Anything that won't fit in Excel #emetrics" at http://www.cmswire.com/cms/information-management/what-is-big-data-anyth...

Sarah Trell | July 2013

"Some sectors are positioned for greater gains from the use of big data" is right. Great piece Stephane. May I propose another V, for viability. - http://www.pros.com/big-vs-big-data/

Online Behavior © 2012