REAL stories about REAL ML challenges from REAL companies. In each episode, our hosts Almog Baku and Amit BenDor will bring you a different real-world story of deploying ML models to production. We'll learn from ML practitioners about the challenges that companies are facing, how they solve them, and what the future holds for us. --- Discord community - https://discord.gg/JcNRrJ5nqW Twitter - https://twitter.com/ai_infra_live Linkedin - https://linkedin.com/company/ai-infra-stories/ Almog - https://www.linkedin.com/in/almogbaku/ Amit - https://www.linkedin.com/in/amitbendor/
Tue, 31 Jan 2023 16:33
Feature Stores are a new and exciting innovation in AI infrastructure. They help organizations store, manage, and serve their machine learning features to support model serving.
In this episode of AI Infra Stories, we're thrilled to bring you a conversation with Hien Luu and Brian Seo from DoorDash. They'll take us through the story of how they built their Feature Store and why it was crucial to their operations. You'll hear about the challenges they faced while scaling it up and how they solved them. This is a rare opportunity to gain insight into one of the cutting-edge AI infrastructure projects in the industry and learn from the experts themselves. Tune in for a fascinating episode of AI Infra Stories!
---
Doordash's feature store blog post - https://doordash.engineering/2020/11/19/building-a-gigascale-ml-feature-store-with-redis/
Join our Discord community - https://discord.gg/JcNRrJ5nqW
Twitter - https://twitter.com/ai_infra_live
Linkedin - https://linkedin.com/company/ai-infra-stories/
Almog - https://www.linkedin.com/in/almogbaku/
Amit - https://www.linkedin.com/in/amitbendor/
Hien Luu - https://www.linkedin.com/in/hienluu/
Brian Seo - https://www.linkedin.com/in/brian-seo-1a246b30/
You're listening to AI Infra Stories, where we share the stories behind real ML challenges from real companies. This is Almog, I'm a cloud and ML infrastructure innovator. This is Amit, I'm an AI research leader. In each episode, we will host a different guest who will share a story from their journey: the challenge they faced, how they solved it, and what the future holds for us. Let's get started.

Hey, today we are with Hien. Hien is the head of ML platform at DoorDash. And with Brian. Brian is a software engineer on the ML platform team. Hi guys, how are you doing today?

Well, hey guys, thank you. Pretty good. I'm excited.

That's exciting. Brian, can you maybe tell us a little bit about yourself and your background?

Yeah, sure. I've been at DoorDash for about three and a half years now, and my career has been all over the place. I used to be a machine learning practitioner, built models, and switched over to the software engineering side about a year ago, just because I realized that I enjoyed scaling machine learning operations more than I did the actual model training. I'm really loving it so far; I found a niche in feature storage that I really like.

Cool. And Hien, what about you? What brought you to the ML infra world?

Yeah, so I've been with DoorDash for about two and a half years, ish. I joined during COVID-19, and I'm currently leading the machine learning platform team. It's been a fun ride so far, building an ML platform almost from the ground up. Before DoorDash, I was at LinkedIn and spent many years there focusing on data infrastructure, as well as machine learning infrastructure. I got really into machine learning infra toward the end of my time at LinkedIn, and the opportunity at DoorDash came up to come in and help build the team as well as the platform. This is an exciting area for me and I've been learning a lot. I'm happy to have a chat with you guys this morning.

Thank you. So this is a very exciting and vibrant area, and every other day we hear about new stuff, new titles, and new roles. What does your typical day look like?

My role is leading the team and setting the direction of where the platform will be, so a lot of it is managing the team as well as talking to customers. Right now we're doing next-quarter planning, looking toward 2023: how do we think about what the platform needs to be as we move forward? Like you mentioned, this field is evolving very quickly, so we want to see how we can incorporate some of the best practices and new technologies into the platform to best support our in-house customers and their use cases.

Yeah, as for me, I guess the first thing I do is go through our alerts, in a cold sweat, hoping that I don't see 500 alerts or messages next to a giant pile of onboarding requests. Then, if I get a chance, I spend some time thinking about what we can do next: how we can take feature storage to the next stage and improve our feature engineering so that we can really go to that next level. And I guess most importantly, trying to figure out how to reduce the volume of alerts and the amount of support requests, so that we can be truly self-service and users don't really have to think about it.

Awesome. Sounds a bit stressful, but also very interesting work.
Yeah, yeah, it's not as bad as it used to be, but for a while I was in a cold sweat every day. I hope it will change soon enough.

Awesome. So let's move on to the main part of the show, where we'd like you to present a real-life challenge that you had and, afterwards, the solution. Hien, can you tell us what challenge you chose to present?

Yeah, today we'll be discussing the challenge around scaling our feature store. It's a very fascinating story and journey, with a lot of learnings for us. These interesting challenges only come up when you start operating at scale. That's when you can't really find obvious solutions out there, and you have to tailor your solution based on the needs and use cases that you have to deal with. So that's the one we're going to be focusing on and sharing. And Brian is one of the folks who is actually looking at building out the next generation of this feature store, to help us scale better in the future as well as be more cost efficient.

So Hien, maybe an intro for those who don't know: what is a feature store and what is it used for?

Sure. At a high level, you can think of a feature store as a database. It could be a key-value store or whatnot, depending on your needs, but mostly people use key-value style storage engines for this. It's designed to store features, machine learning features that can be retrieved and fed into models for performing predictions, mostly for online prediction use cases, and we have a lot of those at DoorDash. The moment you log into the DoorDash website, you see a lot of recommendations and whatnot, and in order to generate those recommendations there have to be a lot of predictions in real time about what you like, what you don't like, and so on. That process requires fetching a lot of features from the online feature store at low latency. So one of the key aspects of the feature store is to provide very low-latency retrieval of features at scale, across the many millions of users ordering food each morning, at lunchtime, and so on.

So it's fairly widely used. What was the issue with scaling this infrastructure? It's also a pretty new, I would say newish, concept.

Yeah, I think the feature store concept is fairly new; however, there are a lot of companies now focusing on building commercial solutions and whatnot. Years ago there weren't many solutions, so a lot of companies like us, when we started the journey, were relying on some existing technology to build out the feature store solution. Originally we picked Redis as our feature store for a couple of reasons, one of them being the low-latency aspect, and it fit the key-value kind of storage engine very nicely. It worked well at the beginning, but as I mentioned, as we scaled in terms of the number of use cases and the number of features, and all of that grew, you start to see cracks in terms of efficiency, areas where we can do better. Brian can talk more about this; it's also highlighted in the blog post that I'm sure we'll reference at some point. Brian, maybe you can share some of the interesting challenges that you have seen and observed since you joined the team in the last year.
Yeah, so I think, just mirroring a lot of what Hien said, Redis is awesome: it's super fast and super performant. But it's just really expensive, because you're storing everything in memory, and like you said, once you get to scale, pretty much everything you put in goes up linearly in cost, and it was going up yearly at a pretty high rate. A lot of the main priorities when we first introduced the new storage format were to improve performance and to reduce costs, and it just so happened that those two coincided with the way that we decided to store our features.

It's really fascinating. Maybe we'll take a little step back. So your feature store is basically storing the data before you're doing the predictions. Are we talking about a feature store for only the online data, or are we also hosting the historical data in the feature store? What does your solution look like?

We're mostly storing this data for online models, just because that's really the only use case where you need to make predictions at that low latency. For offline storage we tend to rely on Snowflake and S3; it's just cheaper and easier to fetch it that way, especially with the nature of offline predictions, which tend to be fetched and batched, and OLAP storage works better for that kind of thing.

Yeah, I think the concept of a feature store is evolving for us as well, and also as an industry. Like you mentioned, ideally the feature store is designed to cover both online and offline or batch features, ideally providing that abstraction around all the features, regardless of where they are stored and how they are retrieved. I think that's an ideal way of thinking about it. For us, a lot of our use cases are online, so we think of our feature store as online. But for the big picture, it's good to think from that perspective, for both batch and online features.

Okay, so you said you had problems with Redis as the technology running the feature store, and you said there are different holes. Can you elaborate a little bit about the problems that you had, maybe with a story about that?

So I think we've been using Redis for a long time now, and the scale has drastically expanded since we wrote that blog post, but generally the number one issue is just cost. There's no getting around how expensive it is, and using memory as durable storage is just not super smart, because there are lots of modes of failure and you always have to maintain a fleet of replicas, so your cost ends up a multiple of the cost of the storage in general. Other issues are that we actually use Redis on ElastiCache in AWS, so we're not directly managing the clusters ourselves, and one of the issues with that is that when you need to upscale a cluster, it really eats up the CPU and causes a lot of shuffling, so the cluster is just in a degraded state for an extended period of time. The bigger the cluster gets, the longer that maintenance operation takes, so we have to do all sorts of side migration operations in the background.
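As a rough illustration of the key-value pattern described above, here is a minimal sketch using the redis-py client; the key layout, feature names, and TTL are hypothetical, not DoorDash's actual schema, and only meant to show why an in-memory store's cost scales with every byte written:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def write_features(entity_id: str, features: dict, ttl_seconds: int = 7 * 24 * 3600) -> None:
    """Store one entity's features as a Redis hash with an expiry.

    Everything lives in memory, so cost grows roughly linearly with
    the number of entities and the bytes stored per entity.
    """
    key = f"cx:{entity_id}"           # e.g. "cx:consumer_123" (illustrative)
    r.hset(key, mapping=features)     # one hash field per feature
    r.expire(key, ttl_seconds)        # evict stale features automatically

def read_features(entity_id: str, feature_names: list[str]) -> dict:
    """Low-latency fetch of selected features at prediction time."""
    key = f"cx:{entity_id}"
    values = r.hmget(key, feature_names)
    return dict(zip(feature_names, values))

# At serving time a prediction service might do something like:
# feats = read_features("consumer_123", ["orders_7d", "avg_order_value_30d"])
```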
Another issue is that because storage is so expensive on Redis, we have to set lots of short TTLs on expiring values, so if some feature pipeline goes down for seven days, or whatever TTL we set, the features start being served with default values and our models suffer degraded performance. Those are some of the core issues we've faced over the last few years.

Alright, I'm pretty intrigued. So we're saying that at DoorDash you guys built a feature store which is different, focusing on the online case, and you replaced Redis with something you built internally. So what did you build?

I wouldn't say we replaced Redis; we still use Redis a lot. But I think we realized that, in terms of cost per byte stored, not everyone needs ultra-low latencies or super high availability or anything like that, so we provided an alternate option with CockroachDB pretty recently, and we've been testing that out for the last nine months or so.

To add to what Brian said, before we explored CockroachDB there was an effort to step back and think about what innovations we can do to reduce the cost, because a number of the factors Brian brought up are about cost. So we took a step back and looked at the way we store features, how we organize the feature data on Redis itself. As highlighted in the blog post, we figured out a different way of storing features in a more efficient manner, so it uses fewer bytes and reduces the latency of retrieving the features. That was one optimization highlighted in the blog, and it helped a lot in terms of reducing costs as the first step. But as Brian highlighted, not all use cases need that very low latency of a few milliseconds, so we then pivoted to another solution that we can use for those kinds of use cases. That's what Brian alluded to with the work we've been doing of exploring CockroachDB, not as a replacement but in addition to Redis, because we want to support an even wider range of use cases.

And with that wider range of use cases, is there basically a unified API for data scientists to work with, even though under the hood it might work with one database at some point and another database at another?

We actually try to do our best to abstract away the storage layer from a lot of our data scientists. We just published another blog post recently on something called Fabricator, our internal feature registry. What we do is let users specify fields: some teams have very specific requirements, so they can set up those configurations for their features, while other users don't really have any specific requirements, so we just point them to general storage. And in our serving layer for online models that we expose to other teams, the storage is completely abstracted away: our feature prediction service will just figure out in the background where features are being stored and how to pull all of them together.

And who is using the features? Is it data scientists or data engineers?

Yeah, I guess it's a ton of people. It's not really data engineers, because it's mostly focused on online prediction. We have a couple of different types of practitioners: software engineers, machine learning engineers, and data scientists, and they all work together when they launch a service.
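The storage abstraction Brian describes can be pictured as a thin routing client in front of the backends. The sketch below is purely illustrative: the class name, registry format, key layout, and table schema are assumptions, not DoorDash's actual serving service; it only shows the idea of reading a feature from either Redis or a SQL store such as CockroachDB (which speaks the PostgreSQL wire protocol) depending on configuration:

```python
import redis
import psycopg2  # CockroachDB is wire-compatible with PostgreSQL

class FeatureStoreClient:
    """Routes feature reads to the right backend based on a registry config."""

    def __init__(self, registry: dict):
        # registry maps feature name -> backend, e.g. {"orders_7d": "redis"}
        self.registry = registry
        self.redis = redis.Redis(host="localhost", port=6379, decode_responses=True)
        self.crdb = psycopg2.connect("postgresql://root@localhost:26257/features")

    def get(self, entity_id: str, feature_names: list[str]) -> dict:
        out = {}
        for name in feature_names:
            if self.registry.get(name) == "redis":
                # Hot, low-latency features stay in memory
                out[name] = self.redis.hget(f"cx:{entity_id}", name)
            else:
                # Colder features go to cheaper disk-backed SQL storage
                with self.crdb.cursor() as cur:
                    cur.execute(
                        "SELECT value FROM features WHERE entity_id = %s AND name = %s",
                        (entity_id, name),
                    )
                    row = cur.fetchone()
                    out[name] = row[0] if row else None
        return out
```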
Sometimes they'll be working together in teams; the search team, for example, has a whole squad of software engineers, machine learning engineers, and research scientists. I would say the main distinction is that machine learning engineers are more focused on the implementation, on bringing a machine learning model up to the needs of a production-scale, externally facing service and understanding those challenges, so that the machine learning researchers can focus more on actually developing the model and finding signal for what features would be useful. So they tend not to be as involved with the actual deployment of infrastructure.

If I'm a data scientist or data engineer, how should I use it? Am I building my features with pandas and just saving them? Or am I using your internal library as part of the feature transformation? What does it look like?

Cool, yeah. So we have this internal library called Fabricator, and it's pretty much an abstraction for feature engineering and a feature registry. We support Snowflake-created features, PySpark, and pandas. What it does is you just save the data, transformed in whatever arbitrary way, as long as it conforms to the key-value lookup idea that we use. We have what we call entities: let's say your entity is an ID like consumer ID. We then use that to create a key-value pair for a specific feature for a user, or for some other entity like a restaurant ID or something like that. As long as it conforms to a basic data format, it all gets saved to S3 and our feature registry will detect it. Then we have a separate orchestrating service that will continuously check for features that exist in the registry and upload them to the store according to the rules set.

That's really interesting. So your feature store basically hides all of the internal implementation of how you store the data, how you create the key, and where you store it, and provides the data scientists with a unified API that connects transparently with your feature engineering library. Is that correct?

Yes, yeah. It used to be kind of the wild west, where when I first joined, every team was in charge of implementing their own database and storing their own features, and I think it's only since Hien joined that there's been a more focused effort to create a similar API for all teams.

And a whole challenge that I hear many companies are facing when introducing these new fancy, shiny technologies is onboarding. How do you guys onboard new users and new developers to this new platform?

Well, I would say it's constantly a challenge figuring out how to best onboard more people, because there's some in-between ground between making things really simple and enabling advanced use cases. We generally have a pretty white-glove approach, where we're very customer oriented and obsessed, and that's one of the core engineering values of DoorDash. So we try really hard to guide new teams through the information, and generally what we've found is that you just really need to onboard a couple of people on the team, and then the team will take care of itself over time, and we just maintain contact.
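Fabricator's actual API isn't shown in the episode, but the entity-keyed, key-value-friendly layout Brian describes might look roughly like this in pandas; the feature names, S3 path, and partitioning below are illustrative assumptions only:

```python
import pandas as pd

# One row per entity, one column per feature: the entity column (consumer_id)
# becomes the lookup key, each feature column becomes a value.
df = pd.DataFrame(
    {
        "consumer_id": [101, 102, 103],            # the "entity" key
        "orders_7d": [3, 0, 12],                    # hypothetical features
        "avg_order_value_30d": [24.5, 0.0, 31.2],
    }
)

# Persist in a columnar format to S3 (requires pyarrow and s3fs). A registry /
# orchestration job would later detect the new data and upload the rows to the
# online store according to the configured rules.
df.to_parquet(
    "s3://example-bucket/features/consumer/date=2023-01-31/part.parquet",
    index=False,
)
```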
I think the bigger challenge is trying to keep in tune with customer needs and with where the existing use cases don't really fit, because what happens when you have a service that's been around for a while is that people just get used to it and figure out ways to hack around it, instead of asking how we can improve it together. So I think the hard part after the onboarding is figuring out what's wrong with our onboarding, because the people being onboarded tend to be pretty new and don't really have strong opinions yet on what needs to be changed or done. So yeah, I think the hard part is actually the part after onboarding. We've also invested a ton into building up documentation, docstrings, tons of tests, and pretty much making it as foolproof as possible, but issues will still happen, so we have large dedicated support channels, staffed with a large army of engineers who are on support.

Yeah, that sounds like the story of my life: product-market fit and how you iterate around solutions. Awesome. Let's move to the last part. I would like to ask you guys: what's next for this project? What do you see as its future?

Yeah, I think it's all about staying ahead of the customers' needs, and the other part is managing costs. Like I mentioned, as we operate at scale, cost is becoming a big factor, and efficiency is very key to how we move forward. The other part that we might look into is insights into how features are being used across the various use cases. When we start spending more time generating and figuring out those insights, I suspect that we'll be able to be more innovative in figuring out how to be even more efficient, and that will help with reducing the cost and the support load and all that good stuff. Brian, maybe you can share what we're doing and where you see us going into 2023.

Yeah, so I think we've spent a lot of time building out a declarative framework, and we've made it super easy to engineer and upload features now. We've seen a huge explosion in actual storage costs, or rather in feature values being uploaded to the service, so there's been a strong emphasis on efficiency. For the future, aside from improving the efficiency, one of the things that really comes to mind is: when all you have is a hammer, everything looks like a nail. So I think it's about expanding the number of ways that we can store and serve features and expanding those use cases, especially when it comes to some middle ground between batch and online inference, and this concept of taking something from offline to online and making it as seamless as possible. Because we've abstracted away so many of the engineering components of feature storage, feature storage has now become an abstract idea that a lot of users don't really think about. And I think there is this unexplored area where whatever it takes to make a model train well is not what you need to actually serve it in production. So there's this gap between what a machine learning practitioner thinks they need versus what is actually needed. For example, if you think about what would make a model perform well for personalization for you, it's your entire purchase history and the entire purchase history of every other user we've ever had over the last three years.
But the reality is that we don't need to store the data of every user we've had in the last three years in online feature storage. I think the next idea is figuring out proactively which users are likely to come back and what we should be storing. And I think the next story is also how we can create features on demand in a really fast and predictable way, because if you really think about it, there are only a couple of different data streams that you actually need to ingest, and everything is some kind of time-based transformation or aggregation on a collection of features. Creating those in real time is a solvable engineering problem, and I think that would be our next big challenge, one I'm pretty excited about.

Yeah, that's very exciting. We're personally hearing many companies talking about making more reactive models and moving more to real time or online, which is basically what we do today in traditional software. Right, when you register on a website, you get the confirmation email right away; you don't wait a few minutes for a batch process to happen. So it's very exciting. And I have to say your solution is very exciting, so I have to ask: are you planning to open source it?

We haven't really spent time thinking about that yet. Given that we're building on top of very well-known technology, storage engines, why not? We would be interested in looking at the many open source solutions out there, like Feast and others, to see what we can complement there, but no, not for this one yet. So that answers your question directly.

And now for the question I've been waiting for all along, over the weekend, the days, and the nights: the surprise question. In this episode, we're going to do flash questions. I'm going to ask five questions very quickly; you should take turns and answer them with the first answer that comes to your mind. Are you ready? Pizza or burger? Burger. Morning or evening? Morning. Production or lab? Production. Open source or paid? Open source. And the last question: TensorFlow or PyTorch?

All right, that was fun. Thank you guys, and thank you for listening to AI Infra Stories. Almog, did you have a good time?

Yes, definitely. Amit, how can our listeners connect with us on LinkedIn?

You can like our show and invite us to connect.

Wait, wait, wait, what about our Discord community?

Yes, you are welcome to join our community. The link is in the episode description.

Wait, wait, wait, wait, wait, what about our Twitter?

Yes, please follow us on Twitter to hear when a new episode comes out.

Wait, wait, wait, wait, wait. What about rating us 15 out of five stars on Apple and Spotify?

Yes, please rate us with a nice five stars on Apple Podcasts and Spotify. See you next episode. Bye bye.