Are You Curating the Best Data for Content Analytics? (Part I)

There is no denying that data science is the next big step for content analytics. Yet for many content publishers it is out of reach with their current data sets. Vanity metrics may be readily available, but they offer little flexibility, especially when unique users and pageviews are not enough to demonstrate growth and stickiness. “Beware of Vanity Metrics” (HBR) points to other pitfalls of relying on surface-level bean counting. Most likely, medium to large content publishers are moving away from, or adding metrics to, WordPress plugins or Google Analytics tools.[1]

Building data science structures is a harrowing endeavor because it encompasses all aspects of your business and technical components. Stay motivated through data science stories from BuzzFeed’s Blog and Intelligence Refinery29, two publishers that have devoted time, energy, and people to data intelligence.

I am sharing my retrospective from my year with a growing digital publisher, which included setting up data pipelines, dashboards, and analytics to find business insights, plus tidbits from consulting with early-stage start-ups. The reflections are my own opinions. The exact data is not shareable, so open source datasets from the UCI Machine Learning Repository are used to discuss practical insights. The most relevant datasets in the UCI repository are (1) Mashable’s Online News Popularity and (2) Bank Marketing. Download and take a gander at the datasets before proceeding to the next part of this article.

A. What Data Should I be Collecting for Success? 

  1. Focus on targets and goals:
    1. What behaviors are important to identify, understand, and potentially influence in your business practices? In the sample data, the Online News Popularity data targets “shares” by article and the Bank Marketing data targets “purchases” by user.
  2. Separate audience and assets for tracking metrics:
    1. How are entities broken down in your business? Consider the audiences and entities driving growth, revenue, and expenses in the business. These can range from audiences, businesses, and employees to customer service calls, articles/content, shares, and app features. In the sample data, Online News Popularity evaluates “articles” and Bank Marketing analyzes “users” as their central entities.
  3. Get the most granular metric without breaking the bank:
    1. Can the way things are being counted and tracked be broken down to a smaller value? For instance, the smallest unit of time is seconds and the lowest level for clicks is by user. When all counts are kept at the most detailed level, your data team can mix and match different targets/goals (from #1) and audiences/assets (from #2) in many different ways.
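
As a minimal sketch of why granularity matters, suppose every interaction is logged as one row per event. The field names (`user_id`, `article_id`, `action`, `ts`) and the sample rows are hypothetical and are not taken from either UCI dataset. The same event log can then be rolled up by asset, by audience, or by time:

```python
from collections import Counter
from datetime import datetime

# Hypothetical event log: one row per user action, timestamped to the second.
events = [
    {"user_id": "u1", "article_id": "a1", "action": "view",  "ts": "2016-03-01T09:15:02"},
    {"user_id": "u1", "article_id": "a1", "action": "share", "ts": "2016-03-01T09:16:40"},
    {"user_id": "u2", "article_id": "a1", "action": "view",  "ts": "2016-03-01T10:01:13"},
    {"user_id": "u2", "article_id": "a2", "action": "share", "ts": "2016-03-02T18:30:55"},
    {"user_id": "u3", "article_id": "a2", "action": "share", "ts": "2016-03-02T19:05:09"},
]


def count_by(rows, key, action=None):
    """Count events grouped by `key`, optionally keeping only one action type."""
    return Counter(r[key] for r in rows if action is None or r["action"] == action)


# Asset view: shares per article (the target in Online News Popularity).
shares_per_article = count_by(events, "article_id", action="share")

# Audience view: shares per user, computed from the very same rows.
shares_per_user = count_by(events, "user_id", action="share")

# Time view: events per calendar day, derived from second-level timestamps.
events_per_day = Counter(
    datetime.fromisoformat(r["ts"]).date().isoformat() for r in events
)
```

Had the log only stored daily share totals per article, the per-user view could not be recovered afterwards, which is the practical argument for collecting at the lowest level you can afford.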

B. How to Design Metrics? Evolve from Vanity Metrics: 

Designing your own metrics effectively merges business knowledge and data resources. The cycle of metrics design, data collection, and analysis is ever evolving, so even if things are planned perfectly, metrics may be discarded at later stages. Stay open-minded and flexible. In the sample datasets, measures are explicitly designed for certain analyses, and we will dive into these together. Selected metrics from the sample data sets are listed to give you ideas for personalizing metrics to your business.

Question 1: Why were the sample datasets created in the first place? Ask your business the same question about your existing data sets. Make sure you ask the follow-up question on how to connect new and existing data sets across the business.

Answer 1: Context of datasets created within the business and data science:

  • Online News Popularity: Which attributes of the content affect the sharing behavior of each article? This question assumes that content attributes function independently of the users sharing the content. You’ll notice that there is no information about users, except for how many times the article was shared.
  • Bank Marketing: When and why have customers responded to a specific marketing offer? Can the business predict customers’ response to offers based on demographics, financial situation, and previous marketing interactions? The actionable outcomes from these insights were determining the allocation of resources, creating customer engagement strategy, and determining the bank’s best customer acquisition profile.

Selected metrics from Mashable’s Online News Popularity: Each column is meticulously designed by Mashable’s data team, and some are outputs of completely unrelated data science algorithms. For instance, measuring sentiment requires natural language processing (NLP), and LDA (latent Dirichlet allocation) is a method of identifying topics with an unsupervised learning model. The straightforward metrics, by contrast, count the number of words (n-tokens) and references, and label the article by day of the week or topic category.

  • Sentiment measures, such as average polarity, positivity, negativity
  • N-tokens in the title and content: counts, averages, unique tokens, stop words removed, keywords, etc.
  • Number of references to other sites and self
  • Number of interactive features, such as pictures and videos
  • Data channels, such as entertainment, world news, and tech. Mashable has more categories than those listed here and decided to be selective when developing the dataset.
  • Day of the Week Categories
  • LDA columns are outputs of an unsupervised model that, at a high level, determines how the articles in the dataset relate to one another. LDA is commonly applied to a bag-of-words representation to identify groups of related topics.
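
The straightforward, count-style metrics in this list can be sketched in a few lines. This is only an illustration under stated assumptions: the regex tokenizer, the function name `simple_article_metrics`, and the sample article are hypothetical, and the feature names are loosely modeled on the dataset’s columns rather than a reproduction of Mashable’s pipeline.

```python
import re
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple rule)."""
    return re.findall(r"[A-Za-z0-9']+", text.lower())


def simple_article_metrics(title, body, links, published, own_domain):
    """Compute a few count-style features for one article."""
    title_tokens = tokenize(title)
    body_tokens = tokenize(body)
    return {
        "n_tokens_title": len(title_tokens),
        "n_tokens_content": len(body_tokens),
        # Share of distinct words in the body (0.0 for an empty body).
        "rate_unique_tokens": (
            len(set(body_tokens)) / len(body_tokens) if body_tokens else 0.0
        ),
        "num_hrefs": len(links),
        # Links whose URL contains the publisher's own domain.
        "num_self_hrefs": sum(own_domain in url for url in links),
        # date.weekday() returns 0 for Monday through 6 for Sunday.
        "weekday": WEEKDAYS[published.weekday()],
    }


metrics = simple_article_metrics(
    title="Five Ways to Measure Content",
    body="Measure what matters. Measure shares, not only pageviews.",
    links=["https://example.com/a", "https://other.example.org/b"],
    published=date(2016, 3, 1),
    own_domain="example.com",
)
```

Sentiment and LDA columns would need an NLP library and a fitted topic model, so they are left out of this sketch.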

Selected Bank Marketing Metrics: The bank marketing data is interesting because it combines two types of measurements: customers’ interactions with marketing campaigns and customer demographics.

  • Demographic measures of customers: gender, marital status, job, education, and age
  • Financial attributes of customers, which is more relevant to financial institutions: loan, housing, balance, default
  • Bank marketing activities and interactions with the customer: day, month, contact, days since the previous contact (pdays), previous outcome (poutcome)
  • External financial indicators: the dataset includes an optional set of external market attributes for each record, such as GDP, consumer confidence, etc.
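
Because demographics and campaign history sit side by side in each record, the same rows can be sliced by either kind of attribute. The sketch below assumes rows shaped like the Bank Marketing data, with `y` holding the response to the offer; the six sample rows and the helper name `response_rate_by` are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows shaped like the Bank Marketing data: demographics and
# campaign history in one record, with `y` as the response to the offer.
customers = [
    {"job": "student",    "marital": "single",  "poutcome": "success", "y": "yes"},
    {"job": "student",    "marital": "single",  "poutcome": "unknown", "y": "no"},
    {"job": "management", "marital": "married", "poutcome": "failure", "y": "no"},
    {"job": "management", "marital": "married", "poutcome": "success", "y": "yes"},
    {"job": "retired",    "marital": "married", "poutcome": "unknown", "y": "yes"},
    {"job": "management", "marital": "single",  "poutcome": "unknown", "y": "no"},
]


def response_rate_by(rows, column):
    """Return {segment: share of rows with y == 'yes'} for the given column."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        totals[row[column]] += 1
        if row["y"] == "yes":
            positives[row[column]] += 1
    return {segment: positives[segment] / totals[segment] for segment in totals}


# Demographic slice and campaign-history slice of the same records.
rate_by_job = response_rate_by(customers, "job")
rate_by_poutcome = response_rate_by(customers, "poutcome")
```

With only a handful of made-up rows the rates mean nothing on their own; the point is that one table answers both the "who" and the "what happened before" questions.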

Question 2: What does reviewing other content and marketing data sets elucidate about digital content analytics and data science? 

Answer 2:

  1. Learn what is being successfully implemented by other content and marketing businesses.
  2. Can your business currently mimic the sample data sets? If so, run and compare models for reference and benchmarking.
  3. What meaningful attributes are being tracked? Are there any perceived gaps after comparing your list to sample data sets?
  4. Are data sets across your business able to be combined and grouped? The bank marketing data set, for example, brought user data together with campaign tracking. In the current landscape, connecting users by email or name can tie together Facebook advertising, Google ad roll, and your internal user database.
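
To make the last point concrete, here is a minimal sketch of tying two sources together on a normalized email key. The two extracts, their field names, and the helper `normalize_email` are hypothetical; real exports from an ad platform or an internal database will look different.

```python
def normalize_email(email):
    """Lowercase and trim an email so the same person matches across sources."""
    return email.strip().lower()


# Hypothetical extracts from an internal user database and an ad platform.
internal_users = [
    {"email": "Ana@Example.com", "plan": "premium"},
    {"email": "bo@example.com",  "plan": "free"},
]
ad_clicks = [
    {"email": "ana@example.com ", "campaign": "spring_promo"},
    {"email": "cy@example.com",   "campaign": "spring_promo"},
]

# Index one source by the normalized key, then look up rows from the other.
users_by_email = {normalize_email(u["email"]): u for u in internal_users}

joined = []      # clicks that could be tied to a known user
unmatched = []   # clicks with no counterpart in the internal database
for click in ad_clicks:
    key = normalize_email(click["email"])
    user = users_by_email.get(key)
    if user is None:
        unmatched.append(key)
    else:
        joined.append(
            {"email": key, "plan": user["plan"], "campaign": click["campaign"]}
        )
```

The `unmatched` list is worth keeping: it shows how much of one data set cannot yet be connected to the other.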

[1] WordPress says that more than 50% of digital content is hosted on its platform, while Google Analytics ranks among the top 4 most used traffic analysis tools.

William Felker
