The modern data stack is not dead, despite the many conversations around the internet claiming otherwise. Challenging? For sure. Bloated? Yes. Out of touch with reality? Most of the time. But consider the purpose of these combined technologies, using data for analytics and activation: the need is greater than ever. Maybe we will start to call it something different, but this set of technologies and the purposes they serve still represent a major growth opportunity for most businesses today, especially for brands in the DTC space, which grew quickly without time to get the right infrastructure into place.
Yes, data is complex. Data modeling is complex. Finding the right tools to get value from data is…complex. A lot of the time it’s more like a swamp than a stack. Regardless, it’s just as important as ever (maybe more so) to know what customers want and what they are going to want in the future. Analytics are invaluable in making critical decisions and data activation is still the best way for operations and marketing to be more effective.
For most DTC brands today, data confidence is so far down the list of tasks that it’s barely even visible on the horizon.
The mature stack you see today usually looks something like this: data sources > data integration > data lake/warehouse > reverse ETL > destination. Instead of a data lake/warehouse there might be a customer data platform (CDP). More likely, though, unless you’re at a large enterprise with sophisticated technical capabilities, your stack looks a lot like the one above. Very occasionally you will see a “data quality” component included, which will typically test and document data issues for the internal team to fix. Data confidence is an afterthought, if it’s considered at all.
Depending on what solution is being used for each of these components, there might be some level of data cleaning or even identity resolution built into one or more of these layers — something that makes you believe that the data your business relies on is “good enough.”
Maybe you have a data monitoring solution that filters out data that doesn’t meet certain criteria into a separate workflow, or potentially you’ve even spent the time writing the code to add this to your data warehouse. Maybe you have someone internally who manually cleans data or has spent the time to build a custom solution. Perhaps you know the data is a mess, but there are just too many other customer-facing priorities, and it’s not urgent. With the current talent shortage and lack of available expertise in this field, it’s easy to deprioritize.
For DTC brands that are trying to scale amidst heavy competition, data confidence just might be the edge that will allow you to leapfrog the competition. According to Gartner, “Every year, poor data quality costs organizations an average $12.9 million. Apart from the immediate impact on revenue, over the long term, poor quality data increases the complexity of data ecosystems and leads to poor decision making.”
On the surface, it might seem like raw first-party data would give a complete and accurate picture of what’s happening for a business — that every data point would represent something real that happened and could therefore be turned into information by asking the right questions. In truth, data is messy and full of errors.
In the DTC ecommerce industry, there are more than 75 pieces of data associated with the average customer. This is a combination of accurate/valid data and inaccurate/invalid data; more than two-thirds of it typically requires some cleaning or validation. This number is even higher for businesses that rely on a lot of promos, have switched vendors or run multiple marketing campaigns at once. And when the data is spread across systems that don’t integrate with each other, it’s obvious how a first-party dataset can have major confidence issues.
Duplicate data is one small example that can have a major business impact. Lifetime value is an important metric that DTC brands use, and even 3% duplicate customer records can mean the calculation is off. This can impact valuation, profitability calculations and other very important metrics. There are endless examples just like this one.
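To make the arithmetic concrete, here is a minimal sketch of how a single duplicate customer record shifts an average lifetime value calculation. The order data, the `average_ltv` helper and the choice of a normalized email as the customer key are all assumptions made for illustration, not a description of any particular brand's numbers.

```python
from collections import defaultdict

# Hypothetical order records: (customer_record_id, email, order_total).
# Records 2 and 4 are the same person, entered with different casing and spacing.
orders = [
    (1, "ana@example.com", 120.0),
    (2, "ben@example.com", 80.0),
    (3, "cy@example.com", 100.0),
    (4, " Ben@Example.com", 100.0),
]


def average_ltv(orders, key):
    """Average total revenue per customer; `key` decides who counts as one customer."""
    totals = defaultdict(float)
    for record_id, email, amount in orders:
        totals[key(record_id, email)] += amount
    return sum(totals.values()) / len(totals)


# Naive view: every customer record is treated as a distinct customer.
naive_ltv = average_ltv(orders, key=lambda record_id, email: record_id)

# Deduplicated view: records sharing a normalized email count as one customer.
deduped_ltv = average_ltv(orders, key=lambda record_id, email: email.strip().lower())

print(f"naive: {naive_ltv:.2f}, deduplicated: {deduped_ltv:.2f}")
```

In this toy dataset the duplicate inflates the customer count and understates the average lifetime value. The direction and size of the error in a real dataset depend on which customers happen to be duplicated.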
For most companies that have a mature data stack, data is cleaned by some sort of rule-based filtering after it’s ingested into the data warehouse or lake. But data cleaning is complicated, and there are hundreds of specific scenarios data scientists need to plan for just to make the data accurate. Then it still needs to be verified, linked across data sources and unified around real customers to return useful analytics that result in effective decisions.
If these steps are done at all, they’re typically handled by an internal resource. The first step would be to manually verify every data point, which can be an endless task. (Do you know if that phone number belongs to the White House, or if the area code is even real?) Then they might write some code that can match on unique identifiers like email or phone across sources, and maybe even outsource identity resolution, if they can afford it.
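A rough sketch of what that hand-written verification code often looks like follows. Everything here is an assumption for illustration: the function names, the tiny blocklist and the loose email pattern are hypothetical, and the phone checks only cover the basic shape of a US number (whether an area code is actually assigned would need a lookup table).

```python
import re

# Illustrative blocklist of publicly listed numbers (a real one would be far larger).
PUBLIC_NUMBERS = {"2024561414"}  # widely published White House switchboard number

# Deliberately loose shape check: something@something.something
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def normalize_phone(raw):
    """Strip punctuation and a leading US country code; return 10 digits or None."""
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits if len(digits) == 10 else None


def phone_issues(raw):
    """List the reasons this phone number should not be trusted as an identifier."""
    digits = normalize_phone(raw)
    if digits is None:
        return ["not a 10-digit US number"]
    issues = []
    # North American Numbering Plan: area code and exchange cannot start with 0 or 1.
    if digits[0] in "01":
        issues.append("invalid area code")
    if digits[3] in "01":
        issues.append("invalid exchange")
    if digits in PUBLIC_NUMBERS:
        issues.append("publicly listed number")
    if len(set(digits)) == 1:
        issues.append("repeated digit filler")
    return issues


def email_issues(raw):
    """Flag emails that do not even have the basic shape of an address."""
    value = (raw or "").strip().lower()
    return [] if EMAIL_RE.match(value) else ["malformed email"]


print(phone_issues("(202) 456-1414"))
print(phone_issues("123-456-7890"))
print(email_issues("ana@example"))
```

Even this small example hints at why the task feels endless: each rule covers one scenario, and every new data source tends to bring scenarios the rules did not anticipate.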
A data confidence layer completely removes the hands-on time (weeks or even months) spent by employees, and also the margin for error that comes along with creating something from scratch. It’s not tied to a legacy system, won’t care if you change or add solutions as the business grows and does not require a data scientist to understand or use. This means that the marketing department can have its data cleaned before the next big campaign, without requiring a second thought from IT.
If your business uses a CDP to activate data — an expensive option that’s only effective if you’re making use of all of the benefits — there’s probably a level of identity resolution based on deterministic matching. While matching based on exact pieces of information is an important first step in identity resolution, it leaves a lot on the table. Yes, there are unique phone numbers and email addresses across systems that can be resolved easily to specific users. It is critical to do this step. But if your CDP is only matching deterministically, it’s leaving a large portion of data still out there unresolved.
One of the most obvious ways deterministic identity resolution fails to find matches is when customers use different phone numbers or email addresses (or make mistakes) when interacting with a brand. Deterministic matching can also make incorrect matches if the assumed unique identifiers are not valid. For example, many people use fake or public phone numbers or email addresses, and unrelated customers who enter the same placeholder can then be wrongly merged under a common user ID.
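Both failure modes can be seen in a small sketch of deterministic matching. This is an illustrative implementation under stated assumptions, not how any specific CDP works: the `resolve_deterministic` function, the record fields and the `ignore` parameter are hypothetical, and records are grouped whenever they share an exact email or phone value.

```python
def resolve_deterministic(records, ignore=frozenset()):
    """Group record ids that share an exact email or phone value (union-find)."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    first_seen = {}  # (field, value) -> first record id carrying that value
    for r in records:
        for field in ("email", "phone"):
            value = r.get(field)
            if not value or value in ignore:
                continue
            key = (field, value)
            if key in first_seen:
                parent[find(r["id"])] = find(first_seen[key])
            else:
                first_seen[key] = r["id"]

    groups = {}
    for r in records:
        groups.setdefault(find(r["id"]), set()).add(r["id"])
    return sorted(sorted(g) for g in groups.values())


# Hypothetical records. 1, 2 and 3 are the same person; 4 and 5 are different people
# who both typed the same publicly listed phone number as a placeholder.
records = [
    {"id": 1, "email": "ana@example.com", "phone": "4155550134"},
    {"id": 2, "email": "ana@example.com", "phone": None},
    {"id": 3, "email": "ana.r@example.org", "phone": None},
    {"id": 4, "email": "ben@example.com", "phone": "2024561414"},
    {"id": 5, "email": "cy@example.com", "phone": "2024561414"},
]

print(resolve_deterministic(records))
print(resolve_deterministic(records, ignore={"2024561414"}))
```

The first call misses record 3 (same person, new email) and wrongly merges records 4 and 5 through the shared placeholder phone. Passing the known-bad value through `ignore` removes the false merge, but nothing in exact matching can recover the missed one.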
While there are clear problems with only using deterministic matching, many probabilistic matching algorithms are not the solution. Many of them only allow data to be compared within a single source. They are also very computationally intensive, can require hand labeling and need frequent fine-tuning.
For identity resolution to create a full and accurate customer picture, it requires both types of data matching: first matching on deterministic data, then using advanced machine learning to do probabilistic matching. This isn’t offered today by any warehouse, and fewer than a handful of CDPs provide it at an affordable price. Decoupling these solutions is likely a cheaper and more effective option for your brand.
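The two-pass idea can be sketched in a few lines. To be clear about the assumptions: this is not a machine learning model. A plain string-similarity score from Python's standard library stands in for the probabilistic step, and the `resolve` function, the `match_score` rule (name similarity, counted only when postal codes agree), the 0.85 threshold and the sample records are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def match_score(r1, r2):
    """Hypothetical score: name similarity, counted only when postal codes agree."""
    if r1["zip"] != r2["zip"]:
        return 0.0
    return SequenceMatcher(None, r1["name"].lower(), r2["name"].lower()).ratio()


def resolve(records, threshold=0.85, probabilistic=True):
    """Pass 1: exact email match. Pass 2 (optional): fuzzy match on what is left."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Pass 1: deterministic matching on a normalized email address.
    by_email = {}
    for r in records:
        email = (r.get("email") or "").strip().lower()
        if not email:
            continue
        if email in by_email:
            parent[find(r["id"])] = find(by_email[email])
        else:
            by_email[email] = r["id"]

    # Pass 2: score only pairs that pass 1 left in different groups.
    # Comparing every pair is quadratic, which is one reason this step gets expensive.
    if probabilistic:
        for r1, r2 in combinations(records, 2):
            if find(r1["id"]) != find(r2["id"]) and match_score(r1, r2) >= threshold:
                parent[find(r1["id"])] = find(r2["id"])

    groups = {}
    for r in records:
        groups.setdefault(find(r["id"]), set()).add(r["id"])
    return sorted(sorted(g) for g in groups.values())


records = [
    {"id": 1, "name": "Jonathan Smith", "email": "jon@example.com", "zip": "94110"},
    {"id": 2, "name": "Jon Smith", "email": "JON@example.com", "zip": "94110"},
    {"id": 3, "name": "Jonathan Smyth", "email": "jsmith@example.org", "zip": "94110"},
    {"id": 4, "name": "Joan Smithers", "email": "joan@example.com", "zip": "94110"},
    {"id": 5, "name": "Jonathan Smith", "email": "other@example.com", "zip": "10001"},
]

print(resolve(records, probabilistic=False))
print(resolve(records))
```

Running the deterministic pass first means the fuzzy pass only has to decide the uncertain cases, and the threshold controls the trade-off between missed matches and false merges. A production system would typically replace `match_score` with a trained model and limit which pairs get compared.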
No matter where the modern data stack goes, confidence is a critical piece of the puzzle for DTC brands. Given the purpose of modern data — analytics and activation — there’s really no point if the data isn’t representing what’s actually happening IRL. If your brand doesn’t fully understand the customer journey, making decisions that result in sustainable and scalable growth is a shot in the dark.
Orita co-founder and CEO Daniel Brady (DB) loves a challenge, as evidenced by his PhD in neurobiology from Harvard University. After solving the messy data problem over and over for ecommerce brands, he set out with partner Zack Gow to solve it once and for all. With advanced machine learning technology plus plenty of experience doing it the hard way, they’ve made it easier and cheaper than ever for customer data to be cleaned, verified and unified around IRL customers. Today, DB spends his time helping DTC brands get the clean data they need to inform better decisions.