Chapter 1 Introduction

In 2017, I changed industries and joined a startup company where I was responsible for building up a data science discipline. While we already had a solid data pipeline in place when I joined, we didn’t have processes in place for reproducible analysis, scaling up models, and performing experiments. The goal of this book is to provide an overview of how to build a data science platform from scratch for a startup, providing real examples using Google Cloud Platform (GCP) that readers can try out themselves.

This book is intended for data scientists and analysts who want to move beyond the model training stage and build data pipelines and data products that can be impactful for an organization. However, it could also be useful for people in other disciplines who want a better understanding of how to work with data scientists to run experiments and build data products. It is intended for readers with programming experience, and will include code examples primarily in R and Java.

1.1 Why Data Science?

One of the first questions to ask when hiring a data scientist for your startup is: how will data science improve our product? At my previous startup, Windfall Data, our product was data, so the goal of data science aligned well with the goal of the company: to build the most accurate model for estimating net worth. At other organizations, such as a mobile gaming company, the answer may not be so direct, and data science may be more useful for understanding how to run the business than for improving products. Even so, in these early stages it’s usually beneficial to start collecting data about customer behavior, so that you can improve products in the future.

Some of the benefits of using data science at a startup are:

  • Identifying key business metrics to track and forecast
  • Building predictive models of customer behavior
  • Running experiments to test product changes
  • Building data products that enable new product features

Many organizations get stuck on the first two or three of these, and do not utilize the full potential of data science. A goal of this book is to show how small teams can use managed services to move beyond data pipelines that only calculate run-the-business metrics, and transition to an organization where data science provides key input for product development.

1.2 Book Overview

Here are the topics I am covering in this book. Many of these chapters are based on my blog posts on Medium.

  • Introduction: This chapter provides motivation for using data science at a startup and provides an overview of the content covered in this book. Similar posts include functions of data science, scaling data science and my FinTech journey.
  • Tracking Events: Discusses the motivation for capturing data from applications and web pages, proposes different methods for collecting tracking data, introduces concerns such as privacy and fraud, and presents an example with Google PubSub.
  • Data Pipelines: Presents different approaches for collecting data for use by an analytics and data science team, discusses approaches with flat files, databases, and data lakes, and presents an implementation using PubSub, DataFlow, and BigQuery. Similar posts include a scalable analytics pipeline and the evolution of game analytics platforms.
  • Business Intelligence: Identifies common practices for ETLs, automated reports/dashboards and calculating run-the-business metrics and KPIs. Presents an example with R Shiny and Data Studio.
  • Exploratory Analysis: Covers common analyses used for digging into data such as building histograms and cumulative distribution functions, correlation analysis, and feature importance for linear models. Presents an example analysis with the Natality public data set. Similar posts include clustering the top 1% and 10 years of data science visualizations.
  • Predictive Modeling: Discusses approaches for supervised and unsupervised learning, presents example classification models, and methods for evaluating offline model performance.
  • Model Production: Shows how to scale up offline models to score millions of records, and discusses batch and online approaches for model deployment. Similar posts include Productizing Data Science at Twitch, and Productizing Models with DataFlow.
  • Experimentation: Provides an introduction to testing product deployments, discusses how to use staged rollouts for running experiments, and presents an example analysis with R and bootstrapping. Similar posts include A/B testing with staged rollouts.
  • Recommendation Systems: Introduces the basics of recommendation systems and provides example implementations of recommender systems in four different programming languages. Similar posts include prototyping a recommender.
  • Deep Learning: Provides a light introduction to data science problems that are best addressed with deep learning. Demonstrates how deep learning can be applied to shallow learning problems with custom loss functions and presents an example for predicting home values.

1.3 Tooling

Throughout the book, I’ll be presenting code examples built on Google Cloud Platform. I chose this cloud option because GCP provides a number of managed services that make it possible for small teams to build data pipelines, productize predictive models, and utilize deep learning. It’s also possible to sign up for a free trial with GCP and get $300 in credits. This should cover most of the topics presented in this book, but the credits will run out quickly if your goal is to dive into deep learning on the cloud.

For programming languages, I’ll be using R for scripting and Java for production, as well as SQL for working with data in BigQuery. I’ll also present other tools such as R Shiny. Some experience with R and Java is recommended, since I won’t be covering the basics of these languages.

This book is based on my blog series “Data Science for Startups”. I incorporated feedback from these posts into book chapters, and authored the book using the excellent bookdown package (Xie 2018). All of the code examples for this book, along with the R markdown files used to author the text, are available online.

References

Xie, Yihui. 2018. Bookdown: Authoring Books and Technical Documents with R Markdown. https://github.com/rstudio/bookdown.