How Can Data Improve Online Content Strategy? A Guide to Implementing Data Science for Online Content

There is no denying that data science helps with online content, but for most content publishers it is often unattainable with their current datasets. Vanity metrics may be readily available, but they are less flexible. For instance, vanity metrics may not be able to tell you about unique users, and aggregated pageview counts are generally not enough to demonstrate growth and stickiness. Beware of Vanity Metrics (HBR) points to other pitfalls of relying on counting beans on the surface. Most likely, medium to large content publishers are moving from, or adding metrics to, WordPress plugins or Google Analytics tools.[1]
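To make the gap between pageviews and unique users concrete, here is a minimal sketch in Python. It assumes a raw event log in which every pageview carries a user identifier. The field names (`user_id`, `url`) and the records are made up for illustration and do not come from any real analytics tool.

```python
from collections import Counter

# Hypothetical raw pageview events; field names and values are illustrative only.
events = [
    {"user_id": "u1", "url": "/post-a"},
    {"user_id": "u1", "url": "/post-a"},
    {"user_id": "u1", "url": "/post-b"},
    {"user_id": "u2", "url": "/post-a"},
    {"user_id": "u3", "url": "/post-a"},
]


def pageviews_by_url(events):
    """Aggregated count per URL: the typical vanity metric."""
    return Counter(e["url"] for e in events)


def unique_users_by_url(events):
    """Distinct users per URL, which requires user-level data to compute."""
    users = {}
    for e in events:
        users.setdefault(e["url"], set()).add(e["user_id"])
    return {url: len(ids) for url, ids in users.items()}


pageviews = pageviews_by_url(events)
uniques = unique_users_by_url(events)
for url in sorted(pageviews):
    print(url, "pageviews:", pageviews[url], "unique users:", uniques[url])
```

The aggregated count alone cannot be turned back into the unique-user count, which is why the raw, user-level events need to be collected in the first place.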

Building data science structures is a harrowing endeavor because it touches every aspect of your business and its technical components. You can stay motivated through data science stories from BuzzFeed’s Blog and Intelligence Refinery29, two publishers that have devoted time, energy, and people to data intelligence.

I am sharing my retrospective from my year with a growing digital publisher, which included setting up data pipelines, dashboards, and analytics to find business insights, plus some experience from consulting with early-stage start-ups. The reflections are my own opinions, and the data itself is not openly shareable. For these discussions I used open source datasets from the UCI Machine Learning Repository to show insights in practice. The most relevant datasets in the UCI repository were (1) Mashable’s Online News Popularity and (2) Bank Marketing.

This 4-part guide is split into focused sections on turning content metrics into data science:

  1. Part I: Are you Curating the Best Data for Content Analytics?
  2. Part II: Are you Asking the Right Business Questions for Data Exploration?
  3. Part III: Are you Making these Mistakes when Analyzing content data?
  4. Part IV: What are the wins necessary to stay motivated during implementation?

In the Context of the Content Business 

The content business comes with many caveats. It can bring both tangible and intangible benefits to a business. The tangible results are that articles attract users, pageviews, ad clicks, and shares. A little less trackable is how digital content defines a brand’s identity. Content can reinforce whatever a business wants a user to do online (aka conversions), especially outside the channels immediately controlled by the business. Adding more resources to the Google and Facebook advertising platforms will only get you so far.

Interestingly, while content remains king for some, it is not viewed as a sustainable model for tech start-ups. Content-focused start-ups are not likely to secure funding, since VCs do not see content as a “core” asset, consider competition from established institutions too fierce to overcome, and categorize content as “lifestyle” rather than technological innovation (TechCrunch). The last point about content and lifestyle points to the qualitative aspects of writing that matter to audiences, to name a few: personality, authenticity, and relevance.

Content sharing and exchanges are evolving as well. The native or sponsored content model is a newer way for content producers to get paid and for advertisers to push paid content to a site's existing users. This approach can be executed well when the right site and brand match up. In this model, the reader is not guaranteed unbiased views, which is why the Federal Communications Commission and Federal Trade Commission want brands to make it “clear and conspicuous” to users which content is endorsed.

Consider using the insights below to answer the following question:

  1. What data should the business collect?
       1. Collect the right targets and goals: Which behaviors are important to identify, understand, and potentially influence in business practices? In the sample data, the Mashable data targets “shares” by article and the Bank Marketing data targets “purchases” by user.
       2. Separate audience and assets for tracking metrics: How are entities broken down in your business? Consider the audiences and entities driving growth, revenue, and expenses in the business. These can range from audiences, businesses, and employees to customer service calls, articles/content, shares, and app features.
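As a rough sketch of the two points above, the Python below keeps each entity in its own record type and derives the target ("shares") per article or per user from a separate event record. All class names, field names, and records are hypothetical, chosen only to illustrate the separation, and are not taken from either sample dataset.

```python
from collections import Counter
from dataclasses import dataclass


# Each entity gets its own record type (hypothetical names and fields).
@dataclass(frozen=True)
class Article:
    article_id: str
    channel: str


@dataclass(frozen=True)
class User:
    user_id: str


# The behavior we care about links the two entities.
@dataclass(frozen=True)
class ShareEvent:
    user_id: str
    article_id: str


articles = [Article("a1", "tech"), Article("a2", "world")]
users = [User("u1"), User("u2")]
shares = [ShareEvent("u1", "a1"), ShareEvent("u2", "a1"), ShareEvent("u1", "a2")]


def shares_per_article(articles, shares):
    """Target framed by asset: how many times was each article shared?"""
    counts = Counter(s.article_id for s in shares)
    return {a.article_id: counts.get(a.article_id, 0) for a in articles}


def shares_per_user(users, shares):
    """Same events, framed by audience: how many shares did each user make?"""
    counts = Counter(s.user_id for s in shares)
    return {u.user_id: counts.get(u.user_id, 0) for u in users}


article_targets = shares_per_article(articles, shares)
user_targets = shares_per_user(users, shares)
print(article_targets)
print(user_targets)
```

Keeping the events separate from the entities means the same raw data can answer an article-level question and a user-level question without being collected twice.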

What Data Should Be Collected?

Before the analysis can go where the data takes it, the data team or scientist must design the inputs. View some of the dataset columns below to consider ways to frame your business and content. Notice that each metric in the sample datasets is explicitly designed for modeling and analysis. Oftentimes, models and algorithms are created just to narrow down and choose metrics.

Selected Mashable metrics:

    • Sentiment measures, such as average polarity, positivity, and negativity
    • Number of tokens in the title and content, with averages, unique counts, counts with stop words removed, keywords, etc.
    • Number of references to other sites and self
    • Number of interactive features, such as pictures and videos
    • Data channels, such as entertainment, world news, and tech. Mashable has more categories than those listed here, and the dataset was selective about which to include.
    • Day-of-the-week categories
    • LDA (Latent Dirichlet Allocation) columns are outputs of an unsupervised model that, at a high level, describes how each article in the dataset relates to a set of topics. LDA is a common model to apply to a bag of words to identify groups of related topics.
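To show how token-style metrics like those above might be derived from raw text, here is a minimal Python sketch. The tokenizer, the tiny stop-word list, and the feature names are my own assumptions, loosely modeled on the kinds of columns described above; they are not the method used to build the Mashable dataset.

```python
import re

# A tiny illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def token_features(title, body):
    """Compute a few per-article token metrics (hypothetical feature names)."""
    title_tokens = tokenize(title)
    body_tokens = tokenize(body)
    non_stop = [t for t in body_tokens if t not in STOP_WORDS]
    n = len(body_tokens)
    return {
        "n_tokens_title": len(title_tokens),
        "n_tokens_content": n,
        # Share of distinct tokens among all content tokens.
        "rate_unique_tokens": len(set(body_tokens)) / n if n else 0.0,
        "n_non_stop_tokens": len(non_stop),
    }


features = token_features(
    "The Rise of Data in Publishing",
    "Data is the key to content. Content is the key to growth.",
)
print(features)
```

Each article becomes one row of numbers, which is the shape a model needs before it can relate article attributes to shares.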

Selected Bank Marketing Metrics:

  • Demographic measures of customers: gender, marital status, job, education, and age
  • Financial attributes of customers, which are more relevant to financial institutions: loan, housing, balance, default
  • Bank marketing activities and interactions with the customer: day, month, contact, days since the previous contact (pdays), previous outcome (poutcome)
  • External financial indicators. This dataset includes an optional set of external market attributes for each record, such as GDP, consumer confidence, etc.

Why might the sample datasets have been created in the first place?

  • Mashable: What attributes of each article affect how often it is shared? This assumes that article attributes function independently of the users sharing the content. You’ll notice that there is no detailed information about users.
  • Bank Marketing: Predict why customers responded to a specific offer based on demographics, financial situation, and previous interactions. Actionable outcomes were probably the allocation of resources to telemarketing, targeting customers who will respond to certain offers, and signaling contact fatigue.
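One simple way to start exploring a question like the Bank Marketing one is to compare response rates across groups. The sketch below is a minimal Python illustration. The records are toy values invented for this example, not rows from the UCI dataset, and the `contacts` and `purchased` field names are assumptions. Only `poutcome` mirrors a column named above.

```python
from collections import defaultdict

# Toy records, made up for illustration; not rows from the UCI dataset.
records = [
    {"poutcome": "success", "contacts": 1, "purchased": True},
    {"poutcome": "success", "contacts": 2, "purchased": True},
    {"poutcome": "failure", "contacts": 1, "purchased": False},
    {"poutcome": "failure", "contacts": 5, "purchased": False},
    {"poutcome": "unknown", "contacts": 1, "purchased": True},
    {"poutcome": "unknown", "contacts": 5, "purchased": False},
    {"poutcome": "unknown", "contacts": 5, "purchased": False},
    {"poutcome": "failure", "contacts": 2, "purchased": True},
]


def response_rate_by(records, key):
    """Fraction of records with a purchase, grouped by the given field."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        wins[r[key]] += int(r["purchased"])
    return {group: wins[group] / totals[group] for group in totals}


by_outcome = response_rate_by(records, "poutcome")
by_contacts = response_rate_by(records, "contacts")
print(by_outcome)
print(by_contacts)
```

In real data, a response rate that drops as the number of contacts rises would be one possible sign of contact fatigue, though a grouped rate like this is only a starting point and not a predictive model.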


[1] WordPress says that more than 50% of digital content is hosted on its platform, while Google Analytics ranks among the top 4 most used traffic analysis tools.

