No Nonsense Intro to Data, Analytics, & Data Stacks
The following is based on a talk we gave to the Y Combinator W21 batch on how to get started with data, analytics, and business intelligence. If you’re interested in learning more or simply want to say hi, drop us a line at hello (at) prequel.co!
“Data” is a nebulous, fast-moving space. New products come out all the time, and best practices are still in flux. This makes it difficult to navigate as a startup founder or as a newcomer to the space. The two questions we hear the most from friends who are first learning about it are: what’s the minimum I need to know to get started, and how do I ramp up quickly so that I can leverage data for my own business?
Our goal with this series of blog posts is to answer those two questions. As a quick caveat, there is often a tradeoff between conciseness and completeness. We don’t claim the following to be The Authoritative & Comprehensive Guide To Navigating Data Stacks™. It’s more of a resource we wish existed for our friends and ourselves.
What do we mean by data?
To be honest, almost anything. But to be more specific, in this context, data usually refers to information used to gain insight into the business. Other words you might’ve heard to refer to it: analytics, business intelligence, BI. “Data” happens to sound rigorous and carry good branding, so the industry has reoriented around the term.
The type of insights derived from this data cover everything from hard numbers about the business such as revenue, to standard metrics that operators often care about such as customer acquisition cost (CAC) and lifetime value (LTV), to custom metrics that are specific to a given business (for an on-demand car service business, this might be something like average time to match a rider with a driver).
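For concreteness, the standard metrics mentioned above often reduce to simple arithmetic once the underlying data is available. A toy calculation, with entirely made-up numbers, might look like:

```python
# Toy CAC / LTV calculation — all figures are invented for illustration.
ad_spend_usd = 50_000          # total acquisition spend in the period
new_customers = 500            # customers acquired in the period
avg_monthly_revenue = 40.0     # average revenue per customer per month
avg_lifetime_months = 24       # average customer lifetime

cac = ad_spend_usd / new_customers               # customer acquisition cost
ltv = avg_monthly_revenue * avg_lifetime_months  # lifetime value

print(f"CAC: ${cac:.2f}, LTV: ${ltv:.2f}")
```

The formulas are simple; the hard part in practice is getting clean, trustworthy inputs — which is exactly what a data stack is for.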
The important takeaway here is that data doesn’t HAVE to involve data science or machine learning. Of course, it can, but data often still refers to business insights, and that’s how the word is used by the bulk of the modern data stack industry.
What type of data are we talking about / where does the data come from?
Businesses today use many different software tools for their operations. Depending on their business model, most will use tools like a payment processor (eg Stripe), a CRM (eg Salesforce), their own production database(s) (eg MySQL / Postgres / NoSQL), some kind of user-event tracking (eg Segment), a web storefront (eg Shopify), and some ad providers (eg Google Ads), to name a few.
Each of these tools contains data that’s directly relevant to and can give insight into the business. That data is usually raw — for example, Stripe data might consist of a series of recorded transactions — and needs to be massaged to derive some insights from it.
What exactly is a data stack?
In its simplest version, a data stack is whatever tool or tools you’d use to turn raw data from across the business into useful, meaningful insights. So is Excel a data stack? Technically yes, I guess. But people usually refer to slightly more purpose-built tooling when they use the term.
Then what’s the modern data stack?
Data practitioners are aligning on a somewhat standard “flow” to turn raw data into insights. First, the data is (1) replicated (copied) from the various source systems (Stripe, Salesforce, prod db, etc.) to a single location (2) where it gets stored. Once centralized, (3) the data is transformed — this turns it from raw data (Stripe transaction line items) into meaningful insights (say monthly recurring revenue, with nuances of how much revenue was added/churned/upsold in a given month). Finally, those insights are (4) translated into a visual format to make them easier to consume — people typically much prefer to look at revenue on a graph than slog through a table.
As you’ll notice, the journey above has four specific steps/pillars. Together, these pillars form the foundation of the modern data stack. Each pillar has an industry-standard name and set of tools associated with it. They are:
- Extraction & Loading: this is the part responsible for extracting data from source systems and replicating it to a central location. This is often abbreviated to EL(T). If you hear someone talk about ETL, they’re probably referencing this as well — it’s now a misnomer but we can get into that some other time.
Sample popular tools: Fivetran, Stitch, Airbyte, Matillion
- Warehousing: this is the central location where all the data now gets stored.
Sample popular tools: Snowflake, Redshift (AWS), BigQuery (GCP)
- Transformation: this is where the raw data is turned into clean data and eventually insights.
Sample popular tools: dbt, Dataform (GCP)
- Visualization: this is where tabular data is turned into charts and dashboards to make it more consumable.
Sample popular tools: Looker, Tableau, Mode, Metabase
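To make the four pillars concrete, here is a toy end-to-end sketch in Python. It uses the built-in sqlite3 module as a stand-in for a real warehouse and a crude text chart as a stand-in for a BI tool; all data, amounts, and table names are made up for illustration:

```python
import sqlite3

# Toy "source system": raw Stripe-like transaction records (made-up data).
raw_transactions = [
    {"customer": "acme", "amount_usd": 100, "month": "2021-01"},
    {"customer": "acme", "amount_usd": 100, "month": "2021-02"},
    {"customer": "globex", "amount_usd": 50, "month": "2021-02"},
]

# (1) + (2) Extraction & Loading into a central "warehouse"
# (sqlite stands in for Snowflake / Redshift / BigQuery here).
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE stripe_transactions (customer TEXT, amount_usd REAL, month TEXT)"
)
warehouse.executemany(
    "INSERT INTO stripe_transactions VALUES (:customer, :amount_usd, :month)",
    raw_transactions,
)

# (3) Transformation: turn raw line items into a monthly revenue metric.
warehouse.execute(
    """
    CREATE TABLE monthly_revenue AS
    SELECT month, SUM(amount_usd) AS revenue_usd
    FROM stripe_transactions
    GROUP BY month
    """
)

# (4) Visualization: a BI tool would chart monthly_revenue;
# a crude text bar chart stands in here.
for month, revenue in warehouse.execute(
    "SELECT month, revenue_usd FROM monthly_revenue ORDER BY month"
):
    print(f"{month} {'#' * int(revenue // 25)} (${revenue:.0f})")
```

Each real tool in the list above owns one of these steps at production scale; the flow itself is the same.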
We’ve mentioned that the data space is evolving quickly. There are categories of tools emerging that are not captured above, yet are increasingly often included in data stacks — for example, data catalogs, reverse ETLs, and data observability tools. But the distinguishing factor of a modern data stack is the combination of the four core pillars listed. You can think of those as the starter pack.
Note that moving forward, we’ll use the terms data stack and modern data stack interchangeably.
What are the benefits of using a (modern) data stack?
This question in particular deserves its own post, but we should at least mention why teams bother setting up modern data stacks in the first place. Of the many benefits, the three most commonly cited are:
- Data is centralized, meaning analysis can be performed across data sources. This is powerful, and key to generating useful insights. Let’s take an example: by looking at Stripe data in isolation, it’s possible to track revenue. This is useful, but it’s unlikely to help make decisions about how to run a business.
However, by joining (merging) Stripe data with data from a CRM, an analyst might uncover that customers acquired through channel A churn at 3x the rate of customers acquired through channel B, and so channel B should really be prioritized. That’s a real-life business insight that can be used to drive tighter operations and yield tangible benefits.
- There is a single source of truth for important metrics. By using data transforms, one can turn raw data into metrics that get saved to the data warehouse. The result of those transforms is available to everyone, and can be used as the basis for all analysis. For example, instead of letting everyone define active users their own way, it’s now possible to define an active users table once and let everyone leverage that. Have you ever shown up to a meeting where various stakeholders had different numbers for the same metric? Us too. This lets you avoid that.
- Data is in a format that’s ready for analysis. Again, thanks to the power of transforms, data can be cleaned (eg remove test data, PII) and massaged into a format that’s easy to work with. This reduces query complexity, and reduces the number of errors people will make when generating insights. Numbers are good, but accurate numbers are great, especially when it comes to making decisions based on them.
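As a toy illustration of the first two benefits (again using Python’s built-in sqlite3 as a stand-in warehouse, with made-up customers and churn flags): the transform below joins replicated Stripe and CRM tables and materializes churn-by-channel as a table, so the definition of that metric lives in exactly one place that everyone can query:

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")

# Hypothetical replicated source tables: CRM accounts tagged with their
# acquisition channel, and a per-customer churn flag derived from Stripe.
warehouse.execute("CREATE TABLE crm_accounts (customer TEXT, channel TEXT)")
warehouse.execute(
    "CREATE TABLE stripe_subscriptions (customer TEXT, churned INTEGER)"
)
warehouse.executemany("INSERT INTO crm_accounts VALUES (?, ?)", [
    ("a1", "A"), ("a2", "A"), ("a3", "A"), ("a4", "A"),
    ("b1", "B"), ("b2", "B"), ("b3", "B"), ("b4", "B"),
])
warehouse.executemany("INSERT INTO stripe_subscriptions VALUES (?, ?)", [
    ("a1", 1), ("a2", 1), ("a3", 1), ("a4", 0),  # channel A: 3 of 4 churned
    ("b1", 1), ("b2", 0), ("b3", 0), ("b4", 0),  # channel B: 1 of 4 churned
])

# The transform: join the two sources and materialize churn rate per
# channel as a table — the single definition everyone queries.
warehouse.execute("""
    CREATE TABLE churn_by_channel AS
    SELECT c.channel, AVG(s.churned) AS churn_rate
    FROM crm_accounts c
    JOIN stripe_subscriptions s ON s.customer = c.customer
    GROUP BY c.channel
""")
```

Neither source alone could have produced this table: Stripe doesn’t know about channels, and the CRM doesn’t know about churn. In a real stack, this transform would typically be a dbt model rather than hand-written SQL in a script.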
Again, there are many benefits to the modern data stack, and we’re only scratching the surface here.
That’s it for now. Keep an eye out for our next post about data stacks and how to get up to speed as quickly as possible. In the meantime, if there’s anything you’d like to see us cover, or any question you’d like for us to answer, drop us a line at hello (at) prequel.co.
Curious about what we do?
Prequel is a managed modern data stack. Instead of setting up, configuring, and maintaining all the individual components yourself, Prequel provides you with a data stack that just works. It was designed to be maintainable by a single data analyst, no data engineers or engineering favors required, so you and your team can focus on generating insights.