The next BriefingsDirect big data innovation case study interview highlights how InfoScout in San Francisco gleans new levels of accurate insights into retail buyer behavior by collecting data directly from consumers’ sales receipts.
In order to better analyze actual retail behaviors and patterns, InfoScout provides incentives for buyers to share their receipts, but InfoScout is then faced with the daunting task of managing and cleansing that essential data to provide actionable and understandable insights.
To learn more about how big — and even messy — data can be harnessed for near real time business analysis benefits, please join me in welcoming our guests, Tibor Mozes, Senior Vice President of Data Engineering, and Jared Schrieber, the Co-founder and CEO, both at InfoScout, based in San Francisco. The discussion is moderated by me, Dana Gardner, Principal Analyst at Interarbor Solutions.
Here are some excerpts:
Gardner: In your business you’ve been able to uniquely capture strong data, but you need to treat it a lot to use it and you also need a lot of that data in order to get good trend analysis. So the payback is that you get far better information on essential buyer behaviors, but you need a lot of technology to accomplish that.
Tell us why you wanted to get to this specific kind of data, and then about your novel way of acquiring it.
Schrieber: A quick history lesson is in order. In the market research industry, consumer purchase panels have been around for about 50 years. They started with diaries in people’s homes, where they had to write down exactly every single product that they bought, day-in day-out, in this paper diary and mail it in once a month.
About 20 years ago, with the advent of modems in people’s homes, leading research firms like Nielsen would send a custom barcode scanner into people’s homes and ask them to scan each product they bought and then thumb into the custom scanner the regular price, the sales price, any coupons or deals that they got, and details about the overall shopping trip, and then transfer that electronically. That approach has not changed in the last 20 years.
With the advent of smartphones and mobile apps, we saw a totally new way to capture this information from consumers that would revolutionize how and why somebody would be willing to share their purchase information with a market research company.
Gardner: Interesting. What is it about mobile that is so different from the past, and why does that provide more quality data for your purposes?
Schrieber: There are two reasons in particular. The first is, instead of having consumers scan the barcode of each and every item they purchase and thumb in the pricing details, we’re able to simply have them snap a picture of their shopping receipt. So instead of spending 20 minutes after a grocery shopping trip scanning every item and thumbing in the details, it now takes 15 seconds to simply open the app, snap a picture of the shopping receipt, and be done.
The second reason is why somebody would be willing to participate. Using smartphone apps we can create different experiences for different kinds of people with different reward structures that will incentivize them to do this activity.
For example, our Shoparoo app is a next-generation school fundraiser akin to Box Tops for Education. It allows people to shop anywhere, buy anything, take a picture of their receipt, and then we make an instant donation to their kid’s school every time.
Another app is more of a Tamagotchi game called Receipt Hog, where if you download the app, you have adopted a virtual runt. You feed it pictures of your receipts and it levels up into a fat and happy hog, earning coins in a piggy bank along the way that you can then cash out at the end of the day.
These kinds of experiences are a lot more intrinsically and extrinsically rewarding to the panelists and have allowed us to grow a panel that’s many times larger than the next largest panel ever seen in the world, tracking consumer purchases on a day-in day-out basis.
Gardner: What is it that you can get from these new input approaches and incentivization through an app interface? Can you provide me some sort of measurement of an improved or increased amount of participation rates? How has this worked out?
Leaps and bounds
Schrieber: It’s been phenomenal. In fact, our panel is still growing by leaps and bounds. We now have 200,000 people sharing with us their purchases on a day-in day-out basis. We capture 150,000 shopping trips a day. The next largest panel in America captures just 10,000 shopping trips a day.
In addition to the shopping trip data, we’re capturing geolocation information, Facebook likes and interests from these people, demographic information, and more and more data associated with their mobile device and the email accounts that are connected to it.
Gardner: So yet another unanticipated consequence of the mobility trend that’s so important today.
Tibor, let’s go to you. The good news is that Jared has acquired this trove of information for you. The bad news is that now you have to make sense of it. It’s coming in, in some interesting ways, as almost a picture or an image in some cases, and at a great volume. So you have velocity, variability, and volume. So what does that mean for you as the Vice President of Data Engineering?
Mozes: Obviously this is a growing panel. It’s creating a growing volume of data that has created a massive data pipeline challenge for us over the years, and we had to engineer the pipeline so that it is capable of processing this incoming data as quickly as possible.
But we felt that we wanted to create a data pipeline that’s much faster, so we can bring data to our customers much faster. That’s how we arrived at Vertica. We looked at different solutions and found Vertica a very suitable product for us, and that’s what we’re using today.
Gardner: Walk me through the process, Tibor. How does this information come in, how do you gather it, and where does the data go? I understand you’re using the HP Vertica platform as a cloud solution in the Amazon Web Services Cloud. Walk me through the process for the data lifecycle, if you will.
Mozes: We use AWS for all of our production infrastructure. Our users, as Jared mentioned, typically download one of our several apps, and after they complete a receipt scan from their grocery purchases, that receipt is immediately uploaded to our back-end infrastructure.
We try to OCR that image of the receipt, and if we can’t, we use Amazon Mechanical Turk to try to make sense of the image and turn that image into text. At the end of the day, when an image is processed, we have a fairly clean version of that receipt in a text format.
In the next phase, we have to process the text and try to attribute various items on the receipt and make the data available in our Vertica data warehouse.
Then, our customers, using a business intelligence (BI) platform that we built especially for them, can analyze the data. The BI platform connects to Vertica, so our customers can analyze various metrics of our users and their shopping behavior.
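As an editorial illustration, the receipt flow Mozes describes (try OCR first, fall back to human transcription when OCR fails) might be sketched as below. This is a minimal sketch, not InfoScout's implementation: `transcribe_receipt`, `fake_ocr`, and `fake_human` are hypothetical names, and the stand-in functions only simulate an OCR engine and a Mechanical Turk task.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ReceiptText:
    receipt_id: str
    text: str
    source: str  # "ocr" or "human"


def transcribe_receipt(
    receipt_id: str,
    image: bytes,
    ocr: Callable[[bytes], Optional[str]],
    human: Callable[[bytes], str],
) -> ReceiptText:
    """Try automated OCR first; fall back to human transcription."""
    text = ocr(image)
    if text:  # OCR produced usable text
        return ReceiptText(receipt_id, text, "ocr")
    return ReceiptText(receipt_id, human(image), "human")


# Stand-in implementations, for illustration only (assumptions).
def fake_ocr(image: bytes) -> Optional[str]:
    # Pretend that very small images are unreadable.
    return image.decode("utf-8") if len(image) > 10 else None


def fake_human(image: bytes) -> str:
    # Pretend a human worker typed out the receipt.
    return "MILK 2.99"


clean = transcribe_receipt("r1", b"COLA 12PK 4.99", fake_ocr, fake_human)
blurry = transcribe_receipt("r2", b"???", fake_ocr, fake_human)
print(clean.source, blurry.source)
```

Passing the OCR and human steps in as callables keeps the fallback logic independent of whichever services actually do the work.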
Gardner: Jared, back to you. There’s an awful lot of information on a receipt. It’s supposed to be very complex, given not just the date and the place and the type of retail organization, but all the different SKUs, every item that’s possibly being bought. How do you attack that sort of a data problem from a schema and cleansing and extract, transform, load (ETL) and then making it therefore useful?
Schrieber: It’s actually a huge challenge for us. It’s quite complex, because every retailer’s receipt is different: the way that they structure the receipt, the level of specificity about the items on the receipt, and the existence of product codes, whether they are public product codes like the kind you see on a barcode for a soda product, internal product codes that retailers use as stock keeping units, or just a short description on the receipt.
One of our challenges as a company is to figure out the algorithmic methods that allow us to identify what each one of those codes and short descriptions actually represents in terms of a real-world product or category, so that we can make sense of that data on behalf of our clients. That’s one of the real challenges associated with taking this receipt-based approach and turning that into useful data for our clients on a daily basis.
Gardner: I imagine this would be of interest to a lot of different types of information and data gathering. Not only are pure data formats and text formats being brought into the mix, as has been the case for many years, but this image-based approach, the non-structured approach.
Any lessons learned here in the retail space that you think will extend to other industries? Are we going to be seeing more and more of this image-based approach to analysis gathering?
Schrieber: We certainly are. As an example, just take Google Maps and Google Street View, where they’re driving around in cars, capturing images of house and building numbers, and then associating that to the actual map data. That’s a very simple example.
A lot of the techniques that we’re trying to apply in terms of making sense of short descriptions for products on receipts are akin to those being used to understand and perform social-media analytics. When somebody makes a tweet, you try to figure out what that tweet is actually about and means, with those abbreviated words and shortened character sets. It’s very, very similar types of natural language processing and regular expression algorithms that help us understand what these short descriptions for products actually mean on a receipt.
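To make the idea of regular expressions applied to short receipt descriptions concrete, here is a minimal sketch. The abbreviation table, the size pattern, and the `normalize` function are all assumptions made for illustration; the interview does not describe InfoScout's actual rules, which would need far more vocabulary and retailer-specific handling.

```python
import re

# Hypothetical abbreviation table; a real one would be far larger.
ABBREVIATIONS = {
    "choc": "chocolate",
    "ic": "ice cream",
    "org": "organic",
    "van": "vanilla",
}

# Matches a pack size such as "12PK", "12 pk", or "16OZ".
SIZE_PATTERN = re.compile(r"\b(\d+)\s*(pk|oz|ct)\b", re.IGNORECASE)


def normalize(description: str) -> dict:
    """Expand abbreviations and pull out the pack size, if any."""
    size_match = SIZE_PATTERN.search(description)
    size = None
    if size_match:
        size = f"{size_match.group(1)} {size_match.group(2).lower()}"
        description = SIZE_PATTERN.sub(" ", description)
    tokens = re.findall(r"[a-z]+", description.lower())
    words = [ABBREVIATIONS.get(tok, tok) for tok in tokens]
    return {"name": " ".join(words), "size": size}


print(normalize("ORG VAN IC 16OZ"))
```

A lookup table like this only handles abbreviations it has seen before, which is one reason the natural language processing Schrieber mentions is needed alongside pattern matching.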
Gardner: So we’ve had some very substantial data complexity hurdles to overcome. Now we have also the basic blocking and tackling of data transport, warehouse, and processing platform.
Going back to Tibor, once you’ve applied your algorithms, sliced and diced this information, and made it into something you can apply to a typical data warehouse and BI environment, how did you overcome these issues about the volume and the complexity, especially now that we’re dealing with a cloud infrastructure?
Mozes: One of the benefits of Vertica, as we went into the discovery process, was the compression algorithms that Vertica is using. Since we have a large volume of data to deal with and build analytics from, it has turned out to be beneficial for us that Vertica is capable of compressing data extremely well. As a result of that, some of our core queries that require a BI solution can be optimized to run super fast.
You also talked about the cloud solution, why we went into the cloud and what is the benefit of doing that. We really like running our entire data pipeline in AWS because it’s super easy to scale it up and down.
It’s easy for us to build a new Vertica cluster, if we need to evaluate something that’s not in production yet, and if the idea doesn’t work, then we can just pull it down. We can scale Vertica up, if we need to, in the cloud without having to deal with any sort of contractual issues.
Schrieber: To put this in context, now we’re capturing three times as much data every day as we were six months ago. The queries that we’re running against this have probably gone up 50X to 100X in that time period as well. So when we talk about needing to scale this up quickly, that’s a prime example as to why.
Gardner: What has happened in just the last six months that’s required that ramp-up? Is it just because of the popularity of your model, the impactfulness and effectiveness of the mobile app acquisition model, or is it something else at work here?
Schrieber: It’s twofold. Our mobile apps have gotten more and more popular and we’ve had more and more consumers adopt them as a way to raise money for their kid’s school or earn money for themselves in a gamified way by submitting pictures of their receipts. So that’s driven massive growth in terms of the data we capture.
Also, our client base has more than tripled in that time period as well. These additional clients have greater demands for how to use and leverage this data. As those increase, our efforts to answer their business questions multiply the number of queries that we are running against this data.
Gardner: That, to me, is a real proof point of this whole architectural approach. You’ve been able to grow by a factor of three in your client base in six months, but you haven’t gone back to them and said, “You’ll have to wait for six months while we put in a warehouse, test it, and debug it.” You’ve been able to just take that volume and ramp up. That’s very impressive.
Schrieber: I was just going to say, this is a core differentiator for us in the marketplace. The market research industry has to keep up with the pace of marketing, and that pace of marketing has shifted from months of lead time for TV and print advertising down to literally hours of lead time to be able to make a change to a digital advertising campaign, a social media campaign, or a search engine campaign.
So the pace of marketing has changed and the pace of market research has to keep up. Clients aren’t willing to wait for weeks, or even a week, for a data update anymore. They want to know today what happened yesterday in order to make changes on-the-fly.
Reports and visualization
Gardner: We’ve spoken about your novel approach to acquiring this data. We’ve talked about the importance of having the right platform and the right cloud architecture to both handle the volume as well as scale to a dynamic rapidly growing marketplace.
Let’s talk now about what you’re able to do for your clients in terms of reports, visualization, frequency, and customization. What can you now do with this cloud-based Vertica engine and this incredibly valuable retail data in a near real-time environment for your clients?
Schrieber: A few things on the client side. Traditional market research providers of panel data have to put very tight guardrails on how clients can access and run reports against the data. These queries are very complex. The numerators and denominators for every single record of the reports are different and can be changed on-the-fly.
If, all of a sudden, I want to look at anyone who shopped at Walmart in the last 12 months that has bought cat food in the last month and did so at a store other than Walmart, and I want to see their purchase behavior and how they shop across multiple retailers and categories, and I want to do that on-the-fly, that gets really complex. Traditional data warehousing and BI technologies don’t support allowing general business-analyst users to be able to run those kinds of queries and reports on-demand, yet that’s exactly what they want.
They want to be able to ask those business questions and get answers. That’s been key to our strategy, which is to allow them to do so themselves, as opposed to coming back to them and saying, “That’s going to be a pretty big project. It will require a few of our engineers. We’ll come back to you in a few weeks and see what we can do.” Instead, we can hand them the tools directly in a guided workflow to allow them to do that literally on-the-fly and have answers in minutes versus weeks.
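The Walmart and cat food question above is essentially an intersection of two shopper segments, each with its own time window. A minimal sketch in plain Python, using made-up purchase records and a hypothetical `walmart_shoppers_buying_elsewhere` function (in practice this would run as a query in the data warehouse rather than in application code):

```python
from datetime import date, timedelta

# Hypothetical purchase records: (shopper_id, retailer, category, purchase_date)
purchases = [
    ("a", "Walmart", "snacks", date(2015, 1, 10)),
    ("a", "Target", "cat food", date(2015, 5, 20)),
    ("b", "Walmart", "cat food", date(2015, 5, 25)),
    ("c", "Kroger", "cat food", date(2015, 5, 28)),
]


def walmart_shoppers_buying_elsewhere(rows, today, category="cat food"):
    """Shoppers seen at Walmart in the last 12 months who bought the
    category somewhere other than Walmart in the last month."""
    year_ago = today - timedelta(days=365)
    month_ago = today - timedelta(days=30)
    shopped_walmart = {
        shopper for shopper, retailer, _, day in rows
        if retailer == "Walmart" and year_ago <= day <= today
    }
    bought_elsewhere = {
        shopper for shopper, retailer, cat, day in rows
        if retailer != "Walmart" and cat == category
        and month_ago <= day <= today
    }
    return shopped_walmart & bought_elsewhere


segment = walmart_shoppers_buying_elsewhere(purchases, today=date(2015, 6, 1))
print(segment)
```

Only shopper "a" qualifies here: "b" bought cat food at Walmart itself, and "c" never shopped at Walmart. Each new business question changes the two filters, which is why the denominators differ from report to report.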
Gardner: Tibor, how does that translate into the platform underneath? If you’re allowing for a business analyst type of skill set to come in and apply their tools, rather than deep SQL queries or other more complex querying tools, what is it that you need from your platform in order to accommodate that type of report, that type of visualization, and the ability to bring a larger set of individuals into this analysis capability?
Mozes: Imagine that our BI platform can throw out very complex SQL queries. Our BI platform essentially is using, under the hood, a query engine that’s going to run queries against Vertica. Because, as Jared mentioned, the questions are so complex, some of the queries that we run against Vertica are very different than your typical BI use cases. They’re very specialized and very specific.
One of the reasons we went with Vertica is its ability to compute very complex queries at a very high speed. We look at Vertica not as simply another SQL database that scales very well and that’s very fast, but we also look at it as a compute engine.
So as part of our query engine, we are running certain queries and certain data transformations that would be very complicated to run outside Vertica.
We take advantage of the fact that you can create and run custom user-defined functions (UDFs) that are not part of the ANSI SQL-99 standard. We also take advantage of some of the special functions that are built into Vertica, allowing data to be sessionized very easily.
Jared can talk about some of the use cases where we like to analyze users’ entire shopping trips. In order to do that, we have to stitch together different points in time that the user has gone through and shopped at various locations. And using some of the built-in functions in Vertica that are not standard SQL, we can look at shopping journeys, which we call trip circuits, and analyze user behavior along the trip.
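The sessionization idea can be illustrated outside the database. The sketch below groups each shopper's stops into trip circuits by starting a new circuit whenever the gap between consecutive stops exceeds a threshold. It is written in plain Python rather than with Vertica's built-in functions, and the three-hour gap, the sample data, and the `trip_circuits` name are assumptions, not details from the interview.

```python
from datetime import datetime, timedelta
from itertools import groupby

# Hypothetical stops: (shopper_id, retailer, time of purchase)
stops = [
    ("a", "Walmart", datetime(2014, 11, 28, 6, 0)),
    ("a", "Target", datetime(2014, 11, 28, 7, 30)),
    ("a", "Kroger", datetime(2014, 11, 29, 18, 0)),
    ("b", "Target", datetime(2014, 11, 28, 9, 0)),
]


def trip_circuits(rows, max_gap=timedelta(hours=3)):
    """Group each shopper's stops into circuits, split on long gaps."""
    circuits = []
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))
    for shopper, group in groupby(ordered, key=lambda r: r[0]):
        current, last_time = [], None
        for _, retailer, when in group:
            # A gap longer than max_gap closes the current circuit.
            if last_time is not None and when - last_time > max_gap:
                circuits.append((shopper, current))
                current = []
            current.append(retailer)
            last_time = when
        circuits.append((shopper, current))
    return circuits


circuits = trip_circuits(stops)
for shopper, retailers in circuits:
    print(shopper, " -> ".join(retailers))
```

The choice of gap threshold determines what counts as one journey, so it would need tuning against real shopping patterns.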
Gardner: Tibor, what other ways can you be using and exploiting the Vertica capabilities in the deliverables for your clients?
Mozes: Another reason we decided to go with Vertica is its ability to optimize very complex queries. As I mentioned, our BI platform is using a query engine under the hood. So if a user asks a very complicated business question, our BI platform turns that question into a very complicated query.
One of the big benefits of using Vertica is being able to optimize these queries on the fly. It’s easy to do this by running the database optimizer to build custom projections, making queries run much faster than we could before.
Gardner: I always think it’s more impactful for us to learn through an example rather than just hear you describe this. Do you have any specific InfoScout retail client use cases where you can describe how they’ve leveraged your solution and how some of these technical and feature attributes have benefited them, an example of someone using InfoScout and what it’s done for them?
Schrieber: We worked with a major retailer this holiday season to track in real time what was happening for them on Thanksgiving Day and Black Friday. They wanted to understand their core shoppers, versus less loyal shoppers, versus non-core shoppers, how these people were shopping across retailers on Thanksgiving Day and Black Friday, so that the retailer could try to respond in more real time to the dynamics happening in the marketplace.
You have to look at what it takes to do that, for us to be able to get those receipts, process them, get them transcribed, get that data in, get the algorithms run to be able to map it to the brands and categories and then to calculate all kinds of metrics. The simplest ones are market share; the most complex ones have to do with what Tibor had mentioned: the shopper journey or the trip circuit.
We tried to understand, when this retailer was the shopper’s first stop, what were they most likely to buy at that retailer, how much were they likely to spend, and how is that different than what they ended up buying and spending at other retailers that followed? How does that contrast to situations where that retailer was the second stop or the last stop of the day in that pivotal shopping day that is Black Friday?
For them to be able to understand where they were winning and losing among what kinds of shoppers who were looking for what kinds of products and deals was an immense advantage to them — the likes of which they never had before.
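One way to picture the first-stop versus later-stop comparison is to average a retailer's basket size by its position within each trip circuit. The following is a minimal sketch with invented retailers and spend figures; `average_spend_by_position` is a hypothetical helper, not a metric definition taken from InfoScout.

```python
from collections import defaultdict

# Hypothetical circuits: each is an ordered list of (retailer, spend) stops.
circuits = [
    [("RetailerX", 120.0), ("RetailerY", 40.0)],
    [("RetailerY", 60.0), ("RetailerX", 30.0)],
    [("RetailerX", 80.0)],
]


def average_spend_by_position(trips, retailer):
    """Average spend at `retailer`, keyed by stop position (1 = first stop)."""
    totals = defaultdict(lambda: [0.0, 0])  # position -> [sum, count]
    for trip in trips:
        for position, (name, spend) in enumerate(trip, start=1):
            if name == retailer:
                totals[position][0] += spend
                totals[position][1] += 1
    return {
        pos: total / count
        for pos, (total, count) in sorted(totals.items())
    }


by_position = average_spend_by_position(circuits, "RetailerX")
print(by_position)
```

In this toy data the retailer averages 100.0 when it is the first stop and 30.0 when it is the second, the kind of contrast the analysis above is meant to surface.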
Gardner: This must be a very sizable decision point for them, right? This is going to help you decide where to build new retail outlets, for example, or how to structure the experience of the consumer walking through that particular brick-and-mortar environment.
When we bring this sort of analysis to bear, this isn’t refining at a modest level. This could be a major benefit to them in terms of how they strategize and grow. This could be something that really deeply impacts their bottom line. Is that not the case?
Schrieber: It has implications as to what kinds of categories they feature in their television, display advertising campaigns, and their circulars. It can influence how much space they give in their store to each one of the departments. It has enormous strategic implications, not just tactical day-to-day pricing decisions.
Gardner: Now, that was a retail example. I understand you also have clients that are interested in seeing how a brand works across a variety of outlets or channels. Is there another example you can provide of somebody who is looking to understand a brand’s impact at a wider level, across a geography for example?
Schrieber: I’ll give you another example that relates to this. A retailer and a brand were working together to understand why the brand’s sales were down at this particular retailer during the summertime. To make it clear for you, this is a brand of ice cream. Ice cream sales should go up during the summer, during the warmer months, and the retailer couldn’t understand why their sales were underperforming for this brand during the summer.
To figure this out, we had to piece together the shopper journey over time, not only in the weeks during the summer months but year-round, to understand the dynamics of how they were shopping. What we were able to help the client quickly discover was that during the summer months people eat more ice cream. If they eat more ice cream, they’re going to want larger pack sizes when they go and buy that ice cream. This particular retailer tended to carry smaller pack sizes.
So when the summer months came around, even though people had been buying their ice cream at this retailer in the winter and spring, they now wanted larger pack sizes and they were finding them at other retailers, and switching their spend over to these other retailers.
So for the brand, the opportunity was a selling story to the retailer to give the brand more freezer space and to carry an additional assortment of products to help drive greater sales for that brand, but also to help the retailer grow their ice cream category sales as well.
Idea of architecture
Gardner: So just that insight could really help them figure that out. They probably wouldn’t have been able to do it any other way.
We’ve seen some examples of how impactful this can be and how much a business can benefit from it. But let’s go back to the idea of the architecture. For me, one of my favorite truths in IT is that architecture is destiny. That seems to be the case with you, using the combination of AWS and HP Vertica.
It seems to me that you don’t have to suffer the costs of a large capital outlay of having your own data center and facilities. You’re able to acquire these very advanced capabilities at a price point that’s significantly less from a capital outlay and perhaps predictable and adjustable to the demand.
Is that something you then can pass along? Tell me a little bit about the economics of how this architectural approach works for you?
Mozes: One of the benefits of using AWS is that it’s very easy for us to adjust our infrastructure on demand, as we see fit. Jared has referred to some of the examples that we had before. We did a major analysis for a large retailer on Black Friday, and we had some special promotions to our mobile app users going on at that point. Imagine that our data volume would grow tremendously from one day to the next couple of days, and then, after the promotion is over and the big shopping season is over, our volume would come down somewhat.
When you run an infrastructure in the cloud in combination with online data storage and data engine, it’s very easy to scale it up and down. It’s very cost efficient to run an operation where you can just add additional computing power as you need, and then when you don’t need that anymore, you can scale it down.
We did this during a time period when we had to bring a lot of fresh data online quickly. We could just add additional nodes, and we saw very close to linear scalability by increasing our cluster size.
Schrieber: On the business side, the other advantage is we can manage our cash flows quite nicely. If you think about running a startup, cash is king, and not having to do large capital outlays in advance, but being able to adjust up and down with the fluctuations in our business, is also valuable.
Gardner: We’re getting close to the end of our time. I wonder if you have any other insights into the business benefits from an analytics perspective of doing it this way. That is to say, incentivizing consumers, getting better data, being able to move that data and then analyze it at an on-demand infrastructure basis, and then deliver queries in whole new ways to a wider audience within your client-base.
I guess I’m looking for how this stands up both to the competitive landscape and to the past. How new and how innovative is this in marketing? Then we’ll talk about where we go next. Let’s try to get a level set as to how new and how refreshing this is, given what the technology enables on the cloud basis, on the mobility basis, and then on the core, underlying analytics platform basis.
Schrieber: We have an example that’s going on right now around a major new product launch for a very large consumer goods company. They chose us to help monitor this launch, because they were tired of waiting for six months for any insight in terms of who is buying it, how they were discovering it, how they came about choosing it over the competition, how their experience was with the product, and what it meant for their business.
So they chose to work with us for this major new brand launch, because we could offer them visibility within days or weeks of launching that new product in the market to help them understand who were the people who were buying, was it the target audience that they thought it was going to be, or was it a different demographic or lifestyle profile than they were expecting. If so, they might need to change their positioning or marketing tactics and targeting accordingly.
How are these people discovering the products? We’re able to trigger surveys to them in the moment, right after they’ve made that purchase, and then flow that data back through to our clients to help them understand how these people are discovering it. Was it a TV advertisement? Was it discovered on the shelf or display in the store? Did a friend tell them about it? Was their social media marketing campaign working?
We’re also able to figure out what these people were buying before. Were they new to this category of product? Or did they not use this kind of product before and were just giving it a try? Were they buying a different brand and have now switched over from that competitor? And, if so, how did they like it by comparison, and will they repeat purchase? Is this brand going to be successful? Is this meeting needs?
These are enormous decisions. Often, hundreds of millions of dollars are spent by major consumer goods companies on new brand launches. Getting this quick feedback in terms of what’s working and what’s not, who to target with what kind of messaging, and what it’s doing to the marketplace in terms of stealing share from competitors
or driving new people to the product category can influence major investment decisions along the lines of whether we need to build the new manufacturing facility, whether we need to change our marketing campaigns, or whether we should go ahead and invest in that TV Super Bowl ad, because this really has a chance to go big.
These are massive decisions that these companies can now make in a timely manner, based on this new approach of capturing and making use of the data, instead of waiting six months on a new product launch. They’re now waiting just weeks and are able to make the same kinds of decisions as a result.
Gardner: So, in a word it’s unprecedented. You really just haven’t been able to do this before.
Schrieber: It’s not been possible before at all, and I think that’s really what’s fueling the growth in our business.
Look to the future
Gardner: Let’s look to the future quickly. We hear a lot about the Internet of Things. We know that mobile is only partially through its evolution. We’re going to see more smartphones in more hands doing more types of transactions around the globe. People will be using their phones for more of what we have thought of as traditional business and commerce. That opens up a lot more information that’s generated, which you therefore need to gather and then analyze.
So where do we go next? How does this generate additional novel capabilities, and then where do we go perhaps in terms of verticals? We haven’t even talked about food or groceries, hospitality, or even health care.
So without going too far — this could be another hour conversation in itself — maybe we could just tease the listener and the reader with where the potential for this going forward is.
Schrieber: If you think about Internet of Things as it relates to our business, there are a couple of exciting developments. One is the use of things like beacons inside of stores. Now we can know exactly which aisle people have walked down and what shelf they’ve stood in front of, and what product they’ve interacted with. That beacon is communicating with their smartphone and that smartphone is tied to our user account in a way that we’re surveying these individuals or triggering surveys to them, in-the-moment, as they shop.
That’s not something that’s been doable before. It’s something that the Internet of Things, and very specifically beacons linking with smartphones, will allow us to do going forward. That will open up entirely new fields of research and consumer understanding about how people shop and make decisions at the shelf.
The same is true inside the home. We talk about the Internet of Things as it relates to smart refrigerators or smart laundry machines, etc. Understanding daily lifestyle activities and how people make the choice of which product to use and how to use them inside their home is a field of research that is under-served today, and one that the Internet of Things is really going to open up in the years to come.
Gardner: Just quickly, what are other retail sectors or vertical industries where this would make a great deal of sense?
Schrieber: I have a friend who runs an amazing business called Wavemark, which is basically an Internet of Things for medical devices and medical consumables inside of hospitals and care facilities, with the ability to track inventory in real time, tying it to patients and procedures, tying it back to billing and consumption.
Making all of that data available to the medical device manufacturers, so that they can understand how and when their products are being used in the real world in practice, is revolutionizing that industry. We’re seeing it in healthcare, and I think we’re going to see it across every industry.
Gardner: Last word to you, Tibor. Given what Jared just told us about the greater applicability, the model and the architecture come back to mind for me: the cloud, the mobile device, the data, the engine, and the ability to deal with that velocity, volume, and variability at a cost point that is doable and scales up and down. Are there any thoughts about this from an engineering perspective and where we go next?
Mozes: We see that with all these opportunities bubbling up, the amount of data that we have to process on a daily basis is just going to continually grow at an exponential rate. We continue to get additional information on shopping behavior and more data from external data sources. Our data is just going to grow. We will need to engineer everything to be as scalable as possible.