REAL stories about REAL ML challenges from REAL companies. In each episode, our hosts, Almog Baku and Amit BenDor, will bring you a different real-world story of deploying ML models to production. We'll learn from ML practitioners about the challenges that companies are facing, how they solve them, and what the future holds for us. --- Discord community - https://discord.gg/JcNRrJ5nqW Twitter - https://twitter.com/ai_infra_live Linkedin - https://linkedin.com/company/ai-infra-stories/ Almog - https://www.linkedin.com/in/almogbaku/ Amit - https://www.linkedin.com/in/amitbendor/
Tue, 25 Apr 2023 18:19
In this episode of AI Infra Stories, we had the pleasure of hosting Elad Cohen (Vice President of Core Research at Pagaya). Elad shared with us a common problem that many companies face: research chaos. He delved into the challenges of productizing models in their environment and how they overcame this obstacle through standardization and cookbooks. These approaches unified the work and enabled automation and growth. The episode provided valuable insights into streamlining research processes and optimizing productivity.
Hey, so we are here with Elad Cohen today. Elad is the head of core research at Pagaya. So Elad, tell us very briefly about Pagaya and about your role there.
Great. So Pagaya uses AI to underwrite loans, improving risk prediction and increasing financial inclusion. We offer unsecured personal loans, auto loans, credit lines, and point-of-sale credit. And we're a publicly traded company based in New York with the R&D as well. As the head of core research and underwriting, I lead a team of 50 data scientists and engineers focusing on promising research areas and promoting synergy between the different credit asset classes.
That sounds pretty cool. What actually brought you to AI and ML in the beginning?
So I studied physics, bachelor's and master's degrees. And after working in different research domains, about 10 years ago I found out more about machine learning. And I was really fascinated by the ability to use predictive modeling to improve different businesses. Since then, I've been working in ML. And I want to tell you about some of the challenges I had when taking machine learning to production scale at a previous employer.
That's great. And we're all gathered here to hear a real-life story. So can you tell us a little bit about the challenge and what we're going to talk about today?
Yeah. So I wanted to talk about a case I had at a previous employer. The scenario there was that we were making real-time decisions, and we had multiple models running in production, where each model was running on a different segment of the population. Now, training these models was a bit of an art. You had to almost handcraft the best possible training set, and then you also had to have the holdout test set that you want. And I'll give you some examples. You might want to exclude different times where the training set had outliers, or, say, different times during COVID that you wanted to remove.
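As a rough illustration of the training-set choices described here, the sketch below keeps rows inside a date window while dropping excluded periods. It is not the team's actual code: `TrainingWindow`, `select_training_rows`, the row layout, and the example dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TrainingWindow:
    """Hypothetical description of which dates may enter the training set."""
    start: date            # inclusive
    end: date              # exclusive
    excluded: tuple = ()   # (start, end) pairs to drop, end exclusive

    def contains(self, day):
        # Outside the overall window: never part of the training set.
        if not (self.start <= day < self.end):
            return False
        # Inside the window, but dropped if it falls in an excluded period.
        return not any(lo <= day < hi for lo, hi in self.excluded)


def select_training_rows(rows, window):
    """Keep only the rows whose 'date' is allowed by the window."""
    return [row for row in rows if window.contains(row["date"])]


# Illustrative dates only: a two-year window with one excluded period.
window = TrainingWindow(
    start=date(2019, 1, 1),
    end=date(2021, 1, 1),
    excluded=((date(2020, 3, 1), date(2020, 7, 1)),),
)
rows = [
    {"id": 1, "date": date(2019, 6, 1)},
    {"id": 2, "date": date(2020, 4, 15)},
    {"id": 3, "date": date(2020, 9, 1)},
    {"id": 4, "date": date(2021, 2, 1)},
]
print([row["id"] for row in select_training_rows(rows, window)])  # prints [1, 3]
```

Keeping these choices in one explicit object, rather than scattered across ad hoc filters, is one way to make the same selection repeatable on a later run.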
You might oversample some of the data to be more similar to what's happening in production right now. And there were a lot of these different decisions being made. Eventually someone would run through all these processes and make all these calls, and it could take three or four weeks until we finally had the best model possible, compared it against the benchmark that we had, and then moved on. The really tricky part was that we had one team that was focused on retraining these models, and we had other data science teams that were working on longer-term models and features that could then be utilized by all the different models. So because we had lots of models in production, and retraining each of them could take three or four weeks, just because of the capacity we had, what ended up happening is that you would only be able to retrain a model, say, once every six months or once every 12 months. And then when we had a new capability that we wanted to implement, okay, it's ready, and now it can take six months or a whole year until you've actually gotten the value out of it by retraining all of the models you had in production. So we were challenged to really shrink that time down significantly and get the time to market as short as possible. And that's the process I want to talk about.
So was the problem mostly the manual part, where the data scientists have to work on the model and then retrain, or was it more of an automation issue?
So it was actually a two-part problem. The first was that we allowed, I'd say, too much flexibility and too much customization. And it was really at the point where it was almost impractical to automate some of the things that were happening. So we really began by deciding that we're going to have to limit some of the flexibility for the sake of being reproducible and being able to automate. And then it's almost: okay, this is the set of parameters that you can change.
These are the things that you can do. Within this scope, you're going to find the best possible model that you can. And now, once we actually have a framework which dictates what limits and what constraints you have, we can go to each of those different components and talk about automating them. So when we started this, the first part was really defining what we called cookbooks. So we would say, okay, now go ahead, retrain the model. Train your model, take the three to four weeks that you're going to take in any case. But everything has to be completely organized and reproducible. We want everything to eventually be in one script. It can't be 10 different processes and different notebooks, where you're not really sure what order you ran them in and somebody's basically saving files like notebook 001, notebook 002. And if one of them is named incorrectly, then all of a sudden you have no idea how you got to the final model. So we started with standardizing that. At a later stage, once we got that together, we had discussions around what things you can't do. We actually removed them and said, okay, sorry, you know this creates a better model, but I'm going to decide that it's not worth it in terms of the trade-off. And then at a later stage, we actually got to the automation part.
That sounds really interesting, but let's maybe take a little step back. What did the process look like before you had the solution? What was challenging in this process?
So one of the difficulties that we had was that we created, say, hundreds of features in house. And the feature values would have to be updated after the specific case that we were looking at, because you had future data that would actually impact the value that you're going to be retraining on.
So the first thing we had to do was actually run what we called repopulations. So we'd repopulate all these features with the most up-to-date data. And because we were a very fast, dynamic startup, every time someone would create a new model with new types of features, they would have some kind of repopulation process. And it wasn't very standardized. It wasn't scalable either. So the data scientist who was going to retrain this would decide on some training set. They would then have to send it out to multiple different repopulation algorithms to get all the different features together. They'd have all of these hundreds of features calculated, and then they could actually start training different models. They would usually train even dozens of different models, do different comparisons, finally get to one, run hyperparameter tuning, and then at the end run a benchmark against what we had in production. We actually had a really interesting, difficult point there as well, because there were different metrics that we were looking at. There's always a case where you want to be a little more conservative: you're running something in production right now, you want to be sure that what you have is going to be better, and there are different ways that you can look at a specific model and say, is this necessarily better than what you have today? So just as a really simple example, you might say, I've got a classification model and I can look at the whole population, or you might say, actually there are four or five segments that are really important from a business perspective. I want to see the AUC of each of them. I want to validate that I'm improving, or at the very least not degrading, on each segment. And we actually created a validation report which had just too much information.
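The per-segment check described above could be sketched roughly as follows. This is a minimal illustration, not the validation report discussed in the episode: the function names, the `segment`, `label`, `prod_score`, and `cand_score` fields, and the `tolerance` parameter are assumptions, and AUC is computed with a plain pairwise comparison that only suits small samples.

```python
from collections import defaultdict


def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_segment(rows, score_key):
    """Group rows by their 'segment' field and compute one AUC per segment."""
    grouped = defaultdict(lambda: ([], []))
    for row in rows:
        labels, scores = grouped[row["segment"]]
        labels.append(row["label"])
        scores.append(row[score_key])
    return {seg: auc(labels, scores) for seg, (labels, scores) in grouped.items()}


def degraded_segments(rows, tolerance=0.0):
    """Segments where the candidate's AUC is below production's by more than tolerance."""
    prod = auc_by_segment(rows, "prod_score")
    cand = auc_by_segment(rows, "cand_score")
    return sorted(seg for seg in prod if cand[seg] < prod[seg] - tolerance)


# Made-up scores for two segments: the candidate improves on A but degrades on B.
rows = [
    {"segment": "A", "label": 1, "prod_score": 0.9, "cand_score": 0.9},
    {"segment": "A", "label": 0, "prod_score": 0.8, "cand_score": 0.4},
    {"segment": "A", "label": 1, "prod_score": 0.3, "cand_score": 0.7},
    {"segment": "A", "label": 0, "prod_score": 0.2, "cand_score": 0.2},
    {"segment": "B", "label": 1, "prod_score": 0.8, "cand_score": 0.4},
    {"segment": "B", "label": 0, "prod_score": 0.1, "cand_score": 0.6},
]
print(degraded_segments(rows))  # prints ['B']
```

An empty result from `degraded_segments` would be one way to express the "not degrading on each segment" condition as something a script can check without a meeting.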
We would add more and more tests and we would look at the data in different ways, and at a certain point that actually became this conversation where we have a few of the different managers and data scientists there, and we're all debating: is this model necessarily better than the other one? Everybody's looking through different prisms. So one of the things that we did in parallel was define a process and a backtest to say, you know, we have to come down to one single number. This is the number that we want to finally make a decision on and say this model is necessarily better than the one we had. Because otherwise you can't automate it. You can automate everything else beforehand, and then you're still going to have to have this debate in order to make a decision, and it's not scalable. Especially if you want to, let's say, retrain every one of your models every week. You can't have 20 or 30 debates every week, making these decisions. So there were really all these different components that we started coming up against and had to solve as part of this process, and they all took place in parallel.
I think there's a very interesting notion coming out of what you're saying, and it's something that I'm also preaching in my own company: the notion of simplicity, which might be counterintuitive, because it means limiting the flexibility, looking at maybe fewer things, as you mentioned. Usually, from the engineering side at least, we would like to have as much flexibility as possible to tune everything, and from the research side too, but there are too many decisions to make. So it's a big issue, I think. And it seems like it's something that is starting to be more popular as a way of...
Yeah, I think it requires a level of maturity to see that, because as a data scientist, you don't want anyone to take any potential tool away from you.
And I want to be able to run any library, any package, try something completely new from scratch. But when you get to a certain scale, you have to start standardizing, and at some point you need to constrain yourself just so that you are able to get the velocity higher. So I really agree with that.
I think that's one of the beauties of standardization, right? It allows you to not necessarily limit yourself, but to have a common language to communicate things like: this is how we onboard these packages, this is how we onboard these models or these features. So how did you standardize these processes?
So luckily we had really great data scientists working there, and they were able to decide and suggest what are the cases where it makes sense and what are the right packages that they're going to be reusing, the things that they are utilizing many times. We also had internal hackathons, we called them packathons, where basically we would take code that we see is going to be reused, standardize it, and put it into our repo, and now everyone in the department is going to be using that as the standard. And then we'd train everyone on using that. So in that sense, once everything is tightly crafted into the same functions, it's a bit of a trade-off. You want the functions to have enough parameters and enough flexibility so everyone will be able to use them and they won't go outside of the repo and say, oh, but I can make a way better model if I can just do my own thing. And we had those cases sometimes. And then in a few cases, it was almost like, you know, if you think you can do something way, way better, then take a week, try it. If you're successful, then we have a problem here and we're going to have to change some of the standardization around our models.
Usually it wouldn't happen, but there were scenarios, especially when we were facing a very different type of modeling problem, where, yeah, the standardization is actually just too limiting and you need the flexibility. So we were able to operate in both of those roles.
So what did it actually look like? Data scientists used to write these packathon packages, which I guess were like Python functions, if I understood correctly, and how did they deploy them? How did they learn how to write them?
So actually back then it was in R. We had our own internal local CRAN server where we would create our packages and deploy them. Since then it transitioned over to Python, all the repos switched over, and they created their own functions in Python. And for some of the different packages we did have basically one designated point of contact who was in charge of the design and had to make more of the architectural decisions, because in some cases there were different ways that we could implement and create the same functions. So they were in charge of making sure that we're doing everything in a kind of standard language, that all the functions have the same kind of code style and a style guide, and so on. And then, if we move to the later parts, when we finally had all the cookbooks and were running this, that's the point where the machine learning engineers got really engaged. And we were able to take all the different components and start automating them in Airflow. And there are two different ways to look at this, two different metrics that we were trying to reduce. One was just the gross time: how much time does it take, in calendar days, from the moment I start until I have a model, which in many cases is limited by the computation. What we cared about more was the time spent by a data scientist.
So if you have to spend five minutes to run a process, but you need to do it 10 times and the context switching ruins your whole day, then that's something that we definitely want to avoid. So we would just count how many steps were required and how many steps we could remove. And we actually started with the concept of just making it a one-click run, and we didn't necessarily care if it's now three days of computation; that would be a second phase, where we could get it down.
So basically you looked at a model as a full pipeline of stages, and you chose Airflow as the way to run all those different stages together.
Yeah, exactly. And the stages are things like I mentioned earlier: it could be different stages for repopulating different features. Then you bring them all together. Then how we choose the split into the training set and the test set, and we had to parameterize all this. So now if you're going to rerun this a month later, okay, you shift all the dates. Now it's going to be pulling in all the new features that we had. And then finally all the way to the point where we had the training, validating, running it, and getting your single answer. And of course, everything in the process is saved, so you can always look in more detail and see what it looks like.
That sounds really interesting. And I want to ask you also: if today you could go back to that point in time and rebuild the whole solution, is there anything you would do differently, in terms of processes, in terms of how the users, the data scientists, are onboarded?
So it's a really good question. I was in a more senior management position, so it's hard for me to talk about the technical decisions that were made, and we had a really strong team and I have full confidence that they made the best decisions with the information that they had at the time.
I think from my perspective, one of the things that I had underestimated was how long it would take us to actually recruit the machine learning engineering team. So we started with bringing on a team lead. And I had set a really high bar, and it took a long time, almost a year, until I got the right person, and then until he got his team in and ramped up. Once they were in, things started running really quickly. And one thing I think went very well, which I would definitely take with me to future positions, is very close collaboration between the machine learning team and the data science team. So we actually had one of the data scientists who was more technically oriented, and he worked with them side by side every day, where he would show them what's the process that he's doing, how he's doing it, and what are the things that they could automate. And they basically worked together for about three months until they were able to automate the first ones, and from there it flew really quickly. And then it was just a process of, okay, let's get more and more models onto this framework, and then incrementally build more features on it and make it even simpler for the data scientists to use. So that was really important in terms of being able to get very fast adoption. There were some additional open questions that I think we were debating and never completely reached an agreement on. One of them is the whole concept that if you have a capability of monitoring data, then it can be a trigger for retraining your model. And the flip side was, okay, but why don't I just retrain it every week or every month? How does this trigger really help me? Because maybe the new model is going to be better anyway, even if I don't get a trigger, and if not, then at worst I wasted some money on AWS computation resources, but that's fine; maybe it's easier than setting up a giant monitor.
Yeah, that's a good question.
I think optimization of computation is a big point, and I think for some companies it's very costly to retrain and maybe not worth it. Also, sometimes the model might be better, but sometimes it can also degrade because of different phenomena, and you might meet it at some point, but maybe a bit later. So that's also another point. Yeah, that's a whole big discussion, which is very interesting.
In our case, we also had a relatively long feedback time until we were able to get labels, so it could take us several weeks, and then if you retrain every week, you don't have that many new labels. So practically as fast as you can. So I agree it is a question of working out the computation costs, but for us at least it made more sense at that point to just retrain periodically and not have to run it only in response to a trigger.
That's really fascinating. I feel like we can talk about this kind of stuff for hours and hours. And there is this part I prepared myself for the whole week, which is the surprise questions. In this section, we're going to ask you questions that you haven't prepared for. So we're going to do flash questions. You need to just answer as fast as you can. Are you ready?
I'm ready.
Now, how many data scientists are on your team currently?
So in the department today, I think we've got like 70 or 80.
How many models do you have?
A lot.
Pizza or burger?
Definitely burger. Hands down.
Production or lab?
Production.
XGBoost or PyTorch?
If I could choose CatBoost, I'd take CatBoost, but, you know, you're forcing me to choose between them.
Thank you very much. It was a fascinating conversation. Thank you for coming.
Thank you very much. It was a pleasure being here.
And thank you for listening to AI Infra Stories. Almog, did you have a good time?
Yes, definitely. Amit, how can our listeners connect with us?
On LinkedIn, you can like our show and invite us to connect.
Wait, wait, wait, what about our Discord community?
Yes, you are welcome to join our community. The link is in the episode description.
Wait, wait, wait, wait, wait, what about our Twitter?
Yes, please subscribe on Twitter to hear when a new episode comes out.
Wait, wait, wait, wait, wait. What about rating us five out of five stars on Apple and Spotify?
Yes, please rate us with a nice five stars on Apple Podcasts and Spotify.
See you next episode. Bye bye.