The Changelog: Software Development, Open Source

Software’s best weekly news brief, deep technical interviews & talk show.

ANTHOLOGY — Open source AI

Wed, 24 May 2023 21:00

This week on The Changelog we’re taking you to the hallway track of The Linux Foundation’s Open Source Summit North America 2023 in Vancouver, Canada. Today’s anthology episode features: Beyang Liu (Co-founder and CTO at Sourcegraph), Denny Lee (Developer Advocate at Databricks), and Stella Biderman (Executive Director and Head of Research at EleutherAI). Special thanks to our friends at GitHub for sponsoring us to attend this conference as part of Maintainer Month.


Episode Transcript

Welcome back, friends. This week on The Changelog we’re taking you to the hallway track of The Linux Foundation’s Open Source Summit North America 2023 in Vancouver, Canada. This episode is part of our Maintainer Month celebration along with GitHub and many others, so check it out. Today’s anthology episode features Beyang Liu, co-founder and CTO at Sourcegraph; Denny Lee, developer advocate at Databricks; and Stella Biderman, executive director and head of research at EleutherAI. The common denominator of these conversations is open source AI. Beyang Liu and his team at Sourcegraph are focused on enabling more developers to understand code, and their approach to a completely open source, model-agnostic coding assistant called Cody has significant interest from us. Denny Lee and the team at Databricks recently released Dolly 2.0, the first open source instruction-following LLM that has been fine-tuned on a human-generated instruction dataset and is licensed for research and commercial use. And Stella Biderman gave the keynote address on generative AI and works at the base layer doing open source research, model training, and AI ethics. Her team trained the EleutherAI Pythia model family that Databricks used to create Dolly 2.0. A massive thank you to our friends at GitHub for sponsoring us to attend this conference as part of Maintainer Month. Okay, before the show kicks off, I’m here with one of our sponsors, DevCycle CTO and co-founder Jonathan Norris. So Jonathan, my main question, I guess, is if I’m handing off my feature flags to you all, is my uptime dependent on your uptime? Like, if you’re down, am I down? That’s a cardinal rule of the internet. We’ve designed resilience into all the SDKs and all the APIs. So all the SDKs have been designed with defaults and caching mechanisms and all that stuff in place, so that, yeah, if our CDN is down or our APIs are down, it’ll fall back to those defaults or those cached values in the SDKs. So that handles those blips pretty easily.
And then we rely on Cloudflare as our main high-load edge provider. So all of our edge APIs go through Cloudflare, and they also operate as our CDN for assets. So relying on a large provider like that, one that runs such a large percentage of the internet, means that, yeah, you’re not relying on our ability to keep AWS instances running properly. You’re relying on Cloudflare’s ability to make sure the internet still works, since they control such a large percentage of it. So yeah, we’ve architected it in a way that it doesn’t rely on our APIs or our databases being up all the time to have that good reliability. That’s good news. Okay. So how do you accomplish that? One of the core architectural decisions we made when we designed our platform was to move the decisioning logic of your feature flags as close to the end user and end device as possible. We did that with local bucketing server SDKs that use a shared WebAssembly core, and then we have edge-based APIs that are also powered by WebAssembly to serve the client SDK usage, so things like web and mobile apps. That’s one of our core principles: get that decisioning logic as close to the end device as possible. And this is probably one of the few use cases where performance really matters, because you want your feature flags to load really, really quickly so you can render your website or your mobile app really quickly. So yeah, we definitely understand that your feature flagging tool needs to be fast and really, really performant. So if you want a fast feature flagging tool that’s performant and is not going to impact your uptime, check out our friends at DevCycle: devcycle.com/changelogpod. And for those curious, they have a free forever tier that you can try out to prove to yourself and your team that this is going to work for you.
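The fallback behavior Jonathan describes (SDK-side defaults plus cached values that keep serving flags through a CDN or API outage) can be sketched roughly like this. The `FlagClient` class and its method names are illustrative stand-ins, not DevCycle's actual SDK:

```python
import json
import time
from pathlib import Path

class FlagClient:
    """Illustrative feature-flag client: serve from a local cache or
    code-supplied defaults whenever the remote config service is unreachable."""

    def __init__(self, defaults, cache_path="flags_cache.json", ttl=30):
        self.defaults = defaults              # hard-coded fallback values
        self.cache_path = Path(cache_path)    # last known-good config on disk
        self.ttl = ttl                        # seconds between refresh attempts
        self._flags = dict(defaults)
        self._last_fetch = 0.0
        self._load_cache()

    def _load_cache(self):
        # Cached config survives process restarts during an outage.
        if self.cache_path.exists():
            self._flags.update(json.loads(self.cache_path.read_text()))

    def _fetch_remote(self):
        # Placeholder for the real network call to the flag CDN/API.
        # Here we simulate an outage to show the fallback path.
        raise ConnectionError("config service unreachable")

    def variable(self, key, default=None):
        now = time.time()
        if now - self._last_fetch > self.ttl:
            self._last_fetch = now
            try:
                self._flags.update(self._fetch_remote())
                self.cache_path.write_text(json.dumps(self._flags))
            except ConnectionError:
                pass  # outage: keep serving cached/default values
        return self._flags.get(key, self.defaults.get(key, default))

client = FlagClient(defaults={"new-checkout": False})
print(client.variable("new-checkout"))  # prints False: served from defaults despite the "outage"
```

The point of the sketch is that the read path never depends on the network being up: a failed fetch is swallowed and the last good values win.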
So check it out: devcycle.com/changelogpod. And tell them we sent you. So, Cody. Yeah. Cody. What’s the big deal? We think it is. Yeah. So this is... at Sourcegraph 4.0 last year it was relaunched as the code intelligence platform, right? Because before it was mostly code search, which was cool, but there was a limit to code search, and you had to expand into the insights and the intelligence. And now, obviously, Cody is just one more layer on top of insights. Yeah, totally. So, as you know, Sourcegraph historically has been focused on the problem of code understanding. Right. Heavily inspired by tools like Code Search inside Google or TBGS inside Facebook. Right. This kind of system that indexed your company-wide codebase as well as your open source dependencies and made that easy to search and navigate. And that’s what’s been powering the business for the past 10 years. This is actually the 10th year of Sourcegraph. Congratulations. I was just wondering about that. Wow. Yeah. When we first met you, it had to be about a decade ago. Yeah. Sourcegraph either didn’t exist or had just started. Sourcegraph existed when we met. This was at GopherCon, I think 2014, the first or second GopherCon. GopherCon. Yeah. And you had this vision for Sourcegraph. And I’m wondering, 10 years later, have you achieved that vision? Has the vision changed, et cetera? You know, our mission was always to enable everyone to code. And we actually took a look at our seed deck recently, kind of a trip down memory lane. It’s very quaint. We were very bad at PowerPoint. You’re probably a lot better at it now. Not really. No, but better at the pitch, maybe. Maybe. You were refining it slightly. We were refining it. But largely, I could deliver that pitch today off that deck. It’s basically the same, even now.
I mean, it's just the pitch of Source Graph, which is like there's never been more code in the world. Most of your job as an engineer or software creator is understanding all the code that already exists in your organization. Yeah. Because that is all upstream of figuring out what code you want to write. And then once we actually figure out what you need to build, like, that's almost the easy part. It's also the fun part, right? Because you build new things and chip and stuff. But we help you get to that point of, you know, creation and enjoyment by helping you pick up all that context. So traditionally, that's been like search, right? Just like Google's been web search. But then these large language models have now come on the scene. And in some ways, they're disruptive to kind of like search engines, but in other ways, they're highly complimentary. So, you know, anyone who's used chat to be doing. Absolutely. I just, less, just less, right? It's more like the last thing you do when you can't get the answer elsewhere. Right. Yeah, I guess I'll go Google it. Yeah, although I technically... Google is a weird thing because I will search a product and they think I want to buy it, not research it. Right. I want to learn about the thing and those who are teaching about the thing and how it integrates other things, not where can I buy it and for how much. Yeah. So there's like zero context there. Like they're incentivized. It seems to point you to place that you can purchase it. Yeah. Not learn how to use it. Yeah, yeah. I mean, I think there's an interesting... Which is the opposite of chat GBT. Yeah. So there's, there's kind of like pluses and minuses to both, right? Like with Google, you get results to actual web pages and you can kind of judge them. Based on the domain, it's kind of like more primary source material, which is useful. It's also live. You know, you get results from 2023 rather than 2021. Sure. Whereas chat GBT. That'll change. 
That's a temporary thing, right? I mean, well, the delay will be temporary. Eventually, it'll catch up. Well, I mean, GBT 4 is still... It came out recently in 2021, right? It's still 2021. Right, but isn't the plug-ins and all that stuff where it's like, okay, the model is old, but it has access to data. Yeah, so the plug-ins is actually where it gets interesting because that's where things get really powerful, in my opinion, because if you ask chat GBT with the plug-ins enabled, it can go browse the web on your behalf. So it's not just the base model, you know, trying to answer your question from memory anymore. It's actually going stuff in, you know, essentially googling for things, right? Yeah, it's right. It has access to what you would do. Behind the scenes, the exactly. Exactly, yeah. Exactly. So it's the best of both worlds and essentially we're doing that with Cody, but in your editor for developers. So basically combining large language models like GPT-4 or N-ProPX-Clawed model and then combining that with power with the most advanced code search engine in the world. So it's the best of all worlds. It gives you highly context-aware and specific answers about your code and it can also generate code that's kind of tuned to the specific patterns in your code base, not just the kind of like median stack overflow or opensource code. How did you get there? How did you think? Obviously, LMS are a big deal, right? This new wave of intelligence that we have access to is how far back is this in the making? Is this been years or has it been like, wow, chat GPT is crazy? No, Vember. Chat GPT-3 is a bit of a different thing. Okay, we got to move. How far back is this go? Yeah, good question. Yeah, so for me personally, it's kind of a bit of a homecoming. So like my first interest in computer science actually was with machine learning and artificial intelligence, that's what I did. 
I spent a lot of my undergrad in the Stanford AI Lab doing vision research in those days, under Professor Daphne Koller; she was my advisor. And I did a lot of work there. It was super interesting and I felt really passionate about it. There’s a lot of elegant math that goes into things, and it feels like you’re poking at some of the hidden truths of the universe a little bit. But the technology at that point was nowhere near commercializable. And so I decided to pursue my other passion, which is developer productivity and dev tools, and stayed on top of the research as it came along. I think one of the inflection points for us was the release of GPT-3, because that was a step-function increase in the quality of the language models. And we started to see some potential applications to developer tools and code. And we really started in earnest maybe a little over a year ago, maybe 12 to 18 months ago, experimenting with the internal representations of the language models as a way to enhance code search. So we actually put out an experiment that uses embeddings to enhance the quality of the code search results you get. And that was pretty successful as an experiment. I think we released that around the middle of last year, so about a year ago. And that kind of started us down the road. And then of course when ChatGPT came out, that was another big inflection point. That’s when we started to think very seriously about a chat-based interaction that could happen in your editor, have all the advantages of ChatGPT, but know about the specific context of your code. Right. And so for Cody specifically, I think the first commit was December 1 or something like that. And by February we basically had a version that we were having users and customers try. And then March was when we rolled it out to our first enterprise customers.
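The embeddings experiment Beyang describes, using a model's internal representations to improve code search, comes down to nearest-neighbor ranking over vector embeddings. A minimal sketch, where the `embed` function is a toy character-bigram stand-in (a real system would call a trained embedding model):

```python
import math

def embed(text):
    # Stand-in embedding: character-bigram counts as a sparse vector.
    # A real system would use a trained neural embedding model here.
    vec = {}
    for a, b in zip(text.lower(), text.lower()[1:]):
        vec[a + b] = vec.get(a + b, 0) + 1
    return vec

def cosine(u, v):
    # Cosine similarity between two sparse vectors.
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def search(query, documents, k=2):
    # Rank documents by embedding similarity to the query.
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

docs = [
    "def parse_pdf(path): extract text from a PDF file",
    "def render_markdown(src): convert markdown to HTML",
    "class HttpServer: handle incoming requests",
]
print(search("parse a pdf file", docs, k=1))
```

The win over plain keyword search is that a good embedding model ranks by meaning, so a query phrased differently from the code's identifiers can still surface the right snippet.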
So it's just been like this whirlwind of development activity. And I don't know, I cannot remember a time where I've been more excited and just eager to build stuff. Because we're living through interesting times right now. That is. Yeah. This is the Eureka moment that we've all been waiting for basically. Right. I mean, this is the invention of the internet all over again, potentially the iPhone level. Yeah. It's a dramatic paradigm shift in how we think. As engineers and software developers, like how do we learn? How do we leverage? How do we augment? Yeah. You know, it's just insane what is available to somebody who doesn't have an understanding to quickly get understanding. And then be, you know, performing in a certain task or whatever because of the LLMs that are available and how it works. It's so crazy. Yeah. The chat interface is pretty simple though, right? The simplicity of a chat interface, did you expect this Eureka moment to be simply chat? Like as you've been, I mean, like it's a web app. It's not something else. It's a web interface. It's a chat interface. I think so. So, you know, I'm a programmer by background. So I've been like pushing, I've been trying to spread the gospel of textual based input for you know, as long as I can remember. Yeah, obviously it's mostly fallen on deaf ears because, you know, the non-programming world is like, you know, command line. That's what we in like the 1980s. But I actually think, philosophically, like textual input, the reason I like it is because if you think about just like the IO, like bitrate of human computer interaction, it's like, we live in a time where like, we have 4K screens running at, you know, 60 or 120 hertz. Like the sure amount of like data that computers can feed into us through eyeballs is huge. Whereas in kind of like the point click, you know, mouse world, it's like, how many bits per second can you really feed into the computer as a human? Right. 
And now, textual input doesn’t get us all the way there to 4K times 60 hertz, but it basically 10xes or more the input bitrate of what we can do to instruct machines. It’s a great win for human agency. We want to be programming the computers, not the other way around, right? And I think a lot of the technology that has emerged over the past 10, 15 years has been computers programming us as humans a little bit, in terms of all the stuff that we consume. And so yeah, I’m super excited for textual input. I think chat is a subset of that. The way we think about Cody evolving is really in the direction of this rich REPL. So it’s not necessarily going to be a human-like thing that you talk with conversationally. It’s more like: if you want to do a search, you type something that looks like a search query, it knows you want to do a search, it shows you search results. If you ask a high-level question, it knows you’re asking a high-level question, and it gives you an answer that integrates the context of your codebase. If you want to ask a question about your production logs, or maybe something someone said in chat, or an issue or code review, it should pull context from those sources and integrate that, and both synthesize an answer to your specific question and also refer you back to the primary sources, so that you can go dig deeper and understand more fully how it got to its answer. So we think chat is just the starting point. It’s really this rich REPL that’s going to integrate all sorts of context: whatever piece of information is relevant to you creating software, this is the thing that focuses that and pulls it all in. It really seems like that, at least as an interface, is what you’re seeing as the future of Sourcegraph, isn’t it?
Or is there more to Sourcegraph than that in the future? So the way we think about it is: we’ve spent the past 10 years building the world’s most advanced code understanding tools. We have the best code search, we have the best code graph, the global reference graph across all the different languages in the world. We have a large-scale code modification and refactoring system, and a system to track high-level insights. So there are all these backend capabilities that are really, really powerful. And what language models have done is given us a really nice, beginner-friendly interface to all that power. And I think you’re going to see this across all kinds of software. Historically, building power-user tools has been difficult, because the on-ramp to taking full advantage of those tools has been a little steep. There’s an education required. Yeah. So if you’re worried about the on-ramp, maybe you end up constraining your product a little bit just to make it simpler, dumbing it down for the beginning user, but you lose out on the power. I think that tradeoff is no longer going to be as severe with language models. And so at Sourcegraph we’re basically rethinking the user interaction of the entire experience. The underlying capabilities, the underlying tech, is not changing. If anything, it’s gotten more valuable now, because you can feed it into the language model and instantly get value out of it. But the entire user interaction layer, I think, needs to be rethought. And Cody, as your AI editor assistant, is the first iteration of that thought process. How did you iterate to the interface you have right now? And is it a constant evolution? Yeah. I mean, it’s pretty much like: hmm, I think that would be a good idea. Let me go hack it together and see how it plays. And you play around with it, and you experience it yourself.
You build conviction in your own mind, and then you maybe share it with one or two other teammates and see if they have the same wow moment. And if they do, that’s usually a pretty good sign that you’re onto something. There might be more details to hammer out to make it more accessible to everyone, but if you can convince yourself, and at least two or three other smart people, that there’s something worth investigating, that’s typically a pretty good sign that you’re onto something. How do you get access to Cody? Not so much get access, but how do you use it in the Sourcegraph world? How does it appear? How do you conjure it? Yeah, so it’s just an editor extension. You can download it from the VS Code marketplace. It’s available now, and it’s free to use. And we have other editors on the way: IntelliJ is a very high priority for us, also Neovim, and of course my editor of choice, Emacs. Of course. And we’re developing it completely in the open as well. Cody itself is completely open source and Apache licensed. And to get access to it, to start using it, you just install the extension into your editor and start using it. It opens up in the sidebar, and you can chat with it. We also do inline completions, so as you’re typing, we complete code, again taking advantage of the baked-in knowledge of the language model plus the context of your specific codebase, so it generates very high-quality completions. And yeah, it’s generally as simple as installing the extension, and then you’re off to the races. Probably a Sourcegraph account first, right? Yeah, so you do have to auth through Sourcegraph, because that’s how we... I mean, we wouldn’t be able to provide it for free if you didn’t auth through Sourcegraph, because on the backend we’re calling out to different language model providers, and we’re also running a couple of our own. Okay. So, accessible then.
But without having to install Sourcegraph and have it scan my repository, the traditional way you provide intelligence, which is to leverage literally Sourcegraph on my repo... I can just simply auth through Sourcegraph and have it as an extension in my VS Code or, in the future, Emacs. Exactly. They’re kind of loosely coupled. We don’t believe in strong coupling just for the sake of selling you more software. And I think with Cody, the design philosophy was: look, if you connect to Sourcegraph, it does get a lot better. It’s like if you gave a really smart person access to Google, they’re going to be a lot smarter about answering your questions. Yeah. But if you don’t give them Google, they’re still a smart person, and so Cody will still fetch context from your local code using non-Sourcegraph mechanisms if you’re just running it standalone. Yeah. How does it get this intelligence as an extension? Can you explain how that works? Like, I’ve got it on my local repo. Yeah. How does it get the intelligence from my codebase? Yeah. So basically, think of the way that you would understand or build a mental model of what’s going on in a codebase as a human. You might search for some pieces of functionality. You might read through the readme, click on a couple of search results. It does all that. It’s reading my readme right away? Yeah, basically. So when you ask a question, Cody will ping Sourcegraph: hey, what are the most relevant pieces of documentation or source code in this codebase? And then it will essentially, quote-unquote, read them as a language model and use that as context for answering the question. So if you ask a general-purpose question, it’ll typically read the readme.
If you ask a more targeted question, like, oh, how do you do this one specific thing, like read a PDF or whatever, it’ll go find the places in the source code where it processes PDFs, read that in, and then interpret that through the lens of answering your question. In real time? Yeah. Is there latency between the question and the gathering? Like, what’s the speed? If I asked, with that example, how does my application compile a PDF from a markdown file, for example? Yeah. So typically it gets back to you within one or two seconds, and most of the latency is actually just the language model latency. So it depends on what language model you’re choosing to use under the hood. All the Sourcegraph stuff is super fast; we spent the past 10 years making it very fast. And there are no billions of linear algebra operations happening with Sourcegraph. Sourcegraph is just classical CPU-based code and text. What about privacy? Yeah. So privacy is extremely important to us, both in terms of individual developers and our enterprise customers. Like, the last thing they want is to have their private code used as training data for some general-purpose model that’s going to leak their sensitive IP to the rest of the world. So we basically negotiated zero retention policies with all our proprietary language model providers, which means that your data is never going to be used as training data for a model. And not only that, the language model providers will forget your data as soon as the request is complete. So there is no persistence in terms of remembering the code that you sent over to complete the request; that just gets forgotten as soon as the language model generates a response for Cody.
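The question-answering flow described here (find the most relevant files, then hand them to the language model as context) is essentially retrieval-augmented generation. A minimal sketch of that orchestration, with a toy word-overlap ranking standing in for a real code search backend, and the resulting prompt shown instead of an actual model call:

```python
def search_codebase(question, files, k=2):
    # Stand-in for a real code-search backend: rank files by how many
    # of the question's words they contain.
    words = set(question.lower().split())
    scored = sorted(
        files.items(),
        key=lambda kv: sum(w in kv[1].lower() for w in words),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question, files):
    # Retrieved snippets become the context the model "reads" before answering.
    context = "\n\n".join(
        f"File: {name}\n{body}" for name, body in search_codebase(question, files)
    )
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"

repo = {
    "README.md": "This project converts markdown files to PDF reports.",
    "pdf.py": "def to_pdf(markdown): compile markdown into a PDF document",
    "auth.py": "def login(user): check credentials",
}
prompt = build_prompt("How does my application compile a PDF from a markdown file?", repo)
print(prompt)
```

For the PDF-from-markdown question, the ranking pulls in `pdf.py` and the README while leaving the unrelated `auth.py` out of the prompt, which is the same shape of behavior described above, just with Sourcegraph's search and an LLM doing the real work.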
And then for the rest of it, I mean, Sourcegraph has always taken user privacy and code privacy very seriously. It’s why we’ve been able to serve the sorts of enterprise customers that we do. For sure. I know why that’s important, but why spell it out? Why is it important, this zero retention policy? What’s the real breakdown of that privacy? Why is it important to the many users? So from a company’s point of view, it’s important because you don’t want to leak portions of your codebase or have them persist in the logs of some third-party data provider. As an individual developer, I think it’s just important to give you control over your own data. And I think that’s going to be an especially important thing in this new world that we’re living in, where, you know, even before, private data was valuable; it carries value. It tells you things about a person or the way they work, and it can be used for purposes both good and bad. Search history. It’s like search history, right? Exactly. You can tell a lot about a person by their search history. Yeah, exactly. Their whole history. Totally. Well, it’s useful for a whole other reason, right? Yeah. And I think it’s important to grant our users and customers control and ownership over that data, because it is your data. And with language models, they 10x the value, and the sensitivity, of that data. Because now, instead of just feeding it to a first-generation AI model or exposing it to some other human, you can feed it into one of these large language models and it can kind of memorize everything about you as a person or a programmer. And in some ways, maybe that’s good. If you’re open to that, if you’re willing to share your data, we could potentially train language models that emulate some of the best and brightest programmers in existence.
But ultimately, we think that should be your personal choice. Exactly. Exactly. And let’s see, is that in the sign-up, or the acceptance of the Cody license, now that it’s GA and in widespread use? How explicit are you with a new sign-up who says, I want to use Cody? Do you spell out privacy and all these things you just said, basically? How clear is that? So when you first install it, there is a terms of use that pops up, and you cannot use Cody unless you read through and accept it. How many words is in that, the ToS? It fits on basically one page without scrolling. Okay, so 1,000 words, maybe. 500. Yeah. 250? Maybe not 250. I think it’s probably 250 to 500. I’d have to go back and check specifically, but it’s digestible in a minute. Yeah. We’re not trying to be one of those companies that tries to hide stuff. What I mean by that is, I’m not trying to say, are you hiding it, but more, how clear are you being? Because it seems like you care to be clear. Yeah. Like it’s a paramount thing for you all to be so clear that you say, hey, privacy matters. Yes. Zero retention. It’s spelled out really clearly. It’s a bullet list basically saying exactly what you said: privacy matters, we don’t collect your data, we’re not using it... Yeah, basically. Well, Tammy, our wonderful legal counsel, wrote it. I didn’t write it. I didn’t write it. I’m just kidding. No, our ChatGPT wrote it. Okay. Actually, you know, that’s a great use case for ChatGPT. If you’re asked to accept one of these lengthy end-user agreements: have it summarized. Summarize it. Yeah. Telling you which things are fishy. Yes. That’s cool for sure. That’s the best. I cannot wait, honestly, for that to come out. What are the loopholes in this contract? I have a nefarious actor on the other side; what are my loopholes to get out? Right.
You know what I mean? Yep. They’d better be good. I guess you can use that on the bad side or the good side. But like, GPT-for-X, where X is literally everything. Right. Yeah, it’s going to be there. It’s going to be a model trained for law, a lawyer LLM. Yeah, yeah. I think, you know, language models will be a huge democratizing force in many domains. Democratizing the understanding of legal concepts, democratizing access to software creation. I think it’s going to be a huge expansion of the percentage of people who are able to access those knowledge domains. Right. So let’s say I’m a happy GitHub Copilot user. Mm-hmm. Oh, yeah. Am I going to install Cody alongside it and be happier? Would I be less happy? Are these competitive? Like, is this a zero-sum game? Do I need to go all in on Cody? What are your thoughts on that? I think it’s the exact opposite of a zero-sum game. Okay. I think there’s so much left to build that, you know, the market is huge and vastly growing. We do have features that Copilot doesn’t have. So currently they don’t have a chat-based textual input to ask high-level questions about the code. I think that’s coming in Copilot X to some extent, but it’s not out yet. It’s not out yet. If you look at the video, the kind of context fetching they’re doing is basically: take your currently open file, explain that. And Cody is already doing much, much more than that. Even if you ask it a question about the current file, it’ll actually go and read other files in your codebase that it thinks are related and use those to inform your answers. So we think the power of Sourcegraph gives us a bit of a competitive edge there, with the high-level questions and onboarding and the rubber-ducking use case.
And then for completions, you know, I think Copilot is great, but for completions we’re essentially doing the same thing. So the completions that Cody generates take into account that same context when completing code. That means it’s better able to mimic or emulate the patterns and best practices in your specific codebase. And again, because we’re open source and model agnostic, we’re just integrating all the best language models as they come online. So, you know, Anthropic... I don’t know when this episode is going out, but Anthropic today just... Okay, pretty quick, the 24th. Yeah, so Anthropic just announced today that they have a new version of Claude that has an incredible 100,000-token context window. It’s just like, wow. I think that’s orders of magnitude more than what was previously available. And by the time this episode goes online, it should be available in Cody. Whereas Copilot... maybe someone from GitHub can correct me if I’m wrong, but I think they’re still using the Codex model, which was released in, like, 2021 or something. And so it’s a much smaller model that only has around 2,000 tokens of context window and much more basic context fetching. It’s already incredibly useful, but I think we’re taking it to the next level a little bit. So, open source and model agnostic. Open source, model agnostic. We’re not locking you in to a vertical proprietary platform. Privacy friendly. Privacy friendly. Also enterprise friendly. You know, Sourcegraph, we’ve made ourselves easy to use in both cloud and on-premises environments. So we’re just trying to do the best thing for our customers and for developers at large. So because you’re model agnostic, does that mean you’re not doing any of the training of the base-layer models? Do you also sidestep legal concerns?
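The difference between a roughly 2,000-token and a 100,000-token context window shows up concretely when you pack retrieved snippets into a prompt. A minimal sketch of that packing step, using the common but approximate 4-characters-per-token rule of thumb rather than a real tokenizer:

```python
def estimate_tokens(text):
    # Rough heuristic: roughly 4 characters per token for English text and code.
    # Real systems use the model's actual tokenizer for exact counts.
    return max(1, len(text) // 4)

def pack_context(snippets, budget_tokens):
    """Greedily include snippets (assumed pre-ranked by relevance)
    until the model's token budget is spent."""
    packed, used = [], 0
    for snippet in snippets:
        cost = estimate_tokens(snippet)
        if used + cost > budget_tokens:
            break  # next snippet would overflow the context window
        packed.append(snippet)
        used += cost
    return packed, used

# Three snippets of ~1,000 tokens each.
snippets = ["x" * 4000, "y" * 4000, "z" * 4000]

small, _ = pack_context(snippets, budget_tokens=2_000)    # Codex-sized window
large, _ = pack_context(snippets, budget_tokens=100_000)  # Claude-sized window
print(len(small), len(large))  # prints: 2 3
```

With a 2,000-token budget only two of the three snippets fit and the rest must be dropped or summarized; a 100,000-token window takes all of them with room to spare, which is why the larger window matters so much for feeding whole-codebase context to the model.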
Because I know with Codex and Copilot there’s been at least one high-profile lawsuit that’s pending. Like, there are legal things happening; there are going to be things litigated. Yeah. And I’m wondering if you’re a target for that now with Cody, or if you’re just not, because they’re other people’s models. So, we’re very mindful of that, and we actually integrate models in a couple of different ways. We do it for the chat-based autocomplete; there’s a separate model we use for code completions; and there’s another model that we use for embedding-based code search and information retrieval. And it’s kind of a mix and match. Sometimes we’ll use a proprietary off-the-shelf model; other times we’ll use a model that we fine-tune. But for the models that we do rely on external service providers for, we’re very mindful of the evolving legal and IP landscape. And so one of the things that we’re currently building is basically copyrighted-code, or copied-code, detection. And if you think about it, Sourcegraph, as a code search engine, is in a great position to build this feature. If you emit a line of code, or you write a line of code, that is verbatim copied from somewhere else in open source, or even in your own proprietary codebase, where you might be worried about code duplication, we can flag that for you, because we’ve been building code search for the past 10 years. Yeah. Cool stuff, man. So, moving fast. What comes next? When are you going to drop Cody 2? It’s probably like a week from now, right? Yeah, that’s a great question. I mean, we are just firing on all cylinders here. We have a lot of interesting directions to explore. One direction, or one dimension, that we’re expanding in is just integrating more pieces of context.
So one of the reasons why we wanted to open source Cody is because we just want to be able to integrate context from wherever it is and not be limited to a single code host or a single platform. There's so much institutional knowledge that's in many different systems. It might be in Slack, it might be in GitHub issues, it might be in your code review tool, it might be in your production logs, and so we want to build integrations into Cody that just pull in all this context. And I think the best way to do that is to make this platform, this orchestrator of sorts, open source and accessible to everyone. Yeah. The other dimension that is very exciting to us is going deeper into the model layer. So we've already started to do this for the embeddings-based code retrieval, but I think we're exploring some models that are related to code generation, and potentially even the chat-based completions at some point. And that's going to be interesting, because it's going to allow us to incorporate pieces of Sourcegraph into that actual training process. And there's been some research there that shows that incorporating search engines into training language models actually yields very nice properties, in terms of lower latency but higher quality. And it's also important to a lot of our customers, because a lot of them are large corporations. They deploy on premises, and even a zero-retention policy, where the code is forgotten as soon as it's sent back over, is not good enough for some of our customers. So they want to be able to completely self-host this, and we plan to serve them as well. How high up the stack, like the conceptual stack, do you think Cody can get, or maybe any AI tooling with code gen, with regards to how I instruct it as a developer? Yeah. Right now we're very much like, okay, it's autocomplete. There's a function here, right? I can tell it, write me a thing that connects to an API and parses the JSON or whatever.
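The embeddings-based code retrieval mentioned above works roughly like this: embed code chunks and the query into vectors, then rank chunks by cosine similarity to the query. The vectors below are made up for illustration; a real system gets them from a learned embedding model.

```python
import math

# Toy embeddings-based retrieval: rank code chunks by cosine
# similarity between their vectors and a query vector.
# The embedding vectors here are invented for illustration.

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunks, k=2):
    """chunks: list of (chunk_text, embedding_vector) pairs."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

chunks = [
    ("def parse_json(s): ...", [0.9, 0.1, 0.0]),
    ("def render_html(t): ...", [0.1, 0.8, 0.2]),
    ("def fetch_api(url): ...", [0.8, 0.2, 0.1]),
]
# A query about calling an API and parsing JSON should rank the
# first and third chunks above the HTML-rendering one.
results = top_k([1.0, 0.0, 0.0], chunks, k=2)
```

The retrieved chunks are what get packed into the prompt, which is why retrieval quality matters as much as the language model itself.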
And it can spit that out. But how high up the stack can I get? Can I say, you know, write me a Facebook for dogs and be done? For instance, here are some user stories, and go from there. What do you think? That's a great question. I mean, we've all seen the Twitter demos by now where someone is, you know, prompting GPT-4, and boom, it's a working app and a website. I think if you've actually gone through and tried that in practice yourself, you soon realize, hey, you can get to a working app pretty quickly just through instructing it using English or natural language. But then you get a little bit further down that path and you're like, oh, I wanted to do this, I wanted to do that, can you add this bell and whistle? There's this combinatorial complexity that emerges as you add different features and you diverge from the common path, and then it falls apart. I actually tried this myself. I tried to write a complete app; it was actually a prototype for the next version of Cody. Okay. I tried to do it by not writing a single line of code, just by writing English. And I got like 80% of the way there in like 30 minutes, and I was like, this is amazing. This is the future. I'm never going to code again. And then the remaining 20% literally took like four hours, and I was banging my head against the wall, because I'd ask it to do one thing, and it did it, but then it kind of screwed up this other thing, and it became this whack-a-mole problem. So we're not all the way there yet. But the way we think about it is, Cody right now is at the point where... there's another thing I tried the other day. I wanted to add a new feature to Cody. Cody has these things called recipes, which are templated interactions with Cody. So, like, write a unit test, or generate a doc string, or, you know, smell my code, give me some feedback.
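Recipes as described above, templated interactions like "write a unit test" or "generate a doc string", can be sketched as named prompt templates wrapped around a selected piece of code. The template wording here is a guess for illustration, not Cody's actual prompts.

```python
# Hypothetical sketch of "recipes": named prompt templates applied
# to a selected piece of code before it is sent to a language model.

RECIPES = {
    "unit-test": "Write a unit test for the following code:\n\n{code}",
    "docstring": "Generate a docstring for the following code:\n\n{code}",
    "smell":     "Review the following code and point out code smells:\n\n{code}",
}

def apply_recipe(name: str, code: str) -> str:
    """Fill the chosen template with the user's selected code."""
    return RECIPES[name].format(code=code)

prompt = apply_recipe("docstring", "def add(a, b):\n    return a + b")
```

Adding a new recipe is then just adding one more template entry, which matches the anecdote that follows: the model could generate a new recipe by imitating the existing ones.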
Yeah. So I started to add a new recipe, and I basically asked Cody, hey, I want to add a new recipe to Cody, what parts of the code should I modify? And it basically showed me all the parts of the code that were relevant. And then it generated the code for the new recipe using the existing recipes as a reference point. And I basically got it done in like five minutes, and it was amazing. So I was still obviously in the hot seat there; I was still calling the shots. But it turned something that probably would have been at least 30 minutes, maybe an hour if I got frustrated or distracted, into something that was like five minutes. And that was actually the interview question we were using for interviewing on the AI team. So after that we had to go back and revamp that. It's just too easy now. Everything just got easier. Yeah. Do you think this is a step change in what we can do, and then we're going to plateau right here for a while and refine, and, you know, do more stuff, but kind of stay at this level of quote-unquote intelligence? Or do you think the sky is the limit from here on out? Which, I mean, is obviously just conjecture at this point. Challenging to predict. I mean, it's very challenging to predict. You know, I might be eating my words in another six months, but on the spectrum from, oh, it's just glorified autocomplete and it doesn't really know anything, all the way to, you know, AGI doomer, let's nuke the GPU data centers... Right. Oh my gosh. Don't give them ideas. Cancel, cancel, cancel. Honestly, I think a lot of the discourse on that spectrum has just gotten kind of crazy. Yeah. The way I view it is, this is a really powerful tool. It's an amazing new technology, and it can be used for evil, certainly, as any technology can, but I'm a techno-optimist, and I think this will largely be positively impactful for the world.
And I don't really see it replacing programmers. It might change the way we think about programming, or software creation. There's certainly going to be a lot more people that are going to be empowered to create software now. And I think there will be a spectrum of people, from those who write software just by describing it in natural language, all the way to the people who are building the core kernels, the operating systems of the future, that form the solid foundation, that pack in the really important data structures and algorithms and core architecture, around which everyone else can throw their ideas and stuff. So there will be a huge spectrum. We almost think of it the way we think of reading and writing now, where you have many different forms of reading and writing. There's people just reading and writing stuff on Twitter; that's one form of writing. And then there's other people who write long books that span many years of intense research. And I think the future of code looks something like that. It's the ultimate flattener. You see that book, The World Is Flat? Yeah, yeah. It's like that. For a while there it was outsourcing, and now it's just accessibility for everybody. People who don't know much about code can learn about code and level up pretty quickly, and so there's this access to a patient helper, whether it's a person or not. Like, I have conversations with ChatGPT, and I swear, I tell my wife, I'm literally talking to a machine, and I get it, but we go 30, 40 rounds back and forth on whatever it might be, and it's very much like a conversation I would have with Jared, if you would give me the time and patience, and if you wouldn't get frustrated, you know what I mean? So it's a better friend than I am.
I have this very patient... yeah, well, not necessarily, but, you know, the world now has access to a patient sidekick, yeah, that's quite intelligent, and that will get even more intelligent. Whether you call it artificial intelligence or not, it has intelligence behind its knowledge, and it's useful right now. I agree, humans are still necessary, thank you Lord, but wow, it's super flat now, and a lot more people have access to what could be, because of this, and that's a fantastic thing. I think of, you know, there's that Steve Jobs quote where he said computers are amazing because they're like a bicycle for the human mind. Yeah. He was drawing comparisons to how different animals get around, and how a human walking is very inefficient, but a human on a bicycle is more efficient than the fastest cheetah or whatever, right? I think what language models are capable of doing is, instead of a bicycle, now we each have a race car, or a rocket ship. Now, we're still in the driver's seat, right? We're still steering and telling it where to go, but it's just way more leverage for any given individual. So it's a great thing if you love being creative, you love dreaming up new ideas and ways to solve problems. One more question, on the business side of things: how has growth been because of Cody? That's a great question. Cody is... you almost would not believe it if I described it to you, but Cody is literally the most magical thing to happen to the Sourcegraph go-to-market, or sales motion, since basically when we started the company. Ever, basically. I've been paying attention, that's why I asked the question. Yeah, you had trouble getting growth, because you've got to install a server, and you've got to examine the code base, you've got to learn how to search the code, which are all friction points. So, transparently, one of the challenges that we had as a
business is, you know, we had a couple of subsets of the programmer population that were very eager to adopt Sourcegraph. Basically, if you've used a tool like Sourcegraph before, you want to use it again. So if you're an ex-Googler, ex-Facebooker, ex-Dropboxer, or, you know, ex-Microsoft, or at a couple other companies, you kind of got it immediately. And then everyone else is like, oh, is it like grep, or is it like Ctrl-F? And we would lose a lot of people along the way. I think with Cody, it's at the point where not only does any programmer get it right away, they're like, holy shit, you just asked it to explain this very complex code in English, and it gave me a really good explanation. Even non-technical stakeholders. As we sell to larger and larger companies, a lot of times in the room is someone like a CEO, or the board of directors, or a non-technical someone who's pretty distant from the code, traditionally speaking, and they get it too. We were in a pitch meeting the other week with a large Fortune 500 energy company, and there was not a programmer in the room; it was just high-level business owners, who were all very skeptical until we got to Cody. We opened up one of their open source libraries and asked Cody to explain what was going on, and one person leaned in and they're like, you know, I haven't coded in like 30 years, and even I would get value out of this. So yeah, it's just absolutely incredible. Your total addressable market got a lot bigger. Yeah, yeah, because, like, what is an engineer now? I think in a couple years almost every human in the world will be empowered to create software, in some fashion. You said before that Cody leverages all that Sourcegraph is today, the intelligence. Yep. Will that always be true? I guess that's maybe the more basic way to ask that
question. Because at some point, if this is the largest arc in your hockey stick growth, and all the up from here is, you know, not so much Cody-related but Cody-driven, really, does what Sourcegraph does at large now eventually become less and less important, and the primary interface really is this natural language Cody interface that explains my code? That's a great question. It's like, does AI just swallow all of programming at some point? At some point, do we cease to write traditional systems-oriented software in the von Neumann tradition? You wrote that code? What, you wrote a for loop, instead of just asking it nicely to repeat something? Forget code search, I don't even read code. Why are you reading code, let alone searching it, right? Yeah. You know, this is still very early days, so it's very difficult to predict, but the way I think about it is, maybe there are different types of computers that can exist in the world. A traditional PC, that's one type of computer. You can maybe say the human brain is another type of computer. And then these language models, I think they're a new type of computer, and they do some things a lot better than the PC type of computer did, and then some things much worse, like they're far less precise. I think I saw a tweet the other day where someone repeatedly asked GPT-4 whether four was greater than one, and at some point GPT-4 got unsure of itself and said, oh no, actually I was mistaken, one is greater than four, I apologize. Yeah, exactly. So I think these two types of computers are actually very complementary, and so the most powerful systems are going to be the ones that combine both, and feed the inputs of one into the outputs of the other, and synthesize them in a way that's truly powerful. And we're
already seeing early examples of this. Cody is one, where we use the Chomsky-style code understanding tech with the more Norvig-style language models. Bing search is another, where they're using ChatGPT for the AI part of it, but they're still relying on traditional Bing web search. And so I think we'll see a lot of hybrid systems emerge that combine the best of both worlds. Yeah, exciting times. Thanks for talking to us. Yeah, thanks for having me on. Good seeing you again. Pleasure chatting with you. Yeah, that was fun. That's exciting; you guys are good at this. Excited for you. So here in the breaks, I'm here with Tom, a dev advocate at Sentry on the Codecov team. So Tom, tell me about Sentry's acquisition of Codecov, and in particular, how is this improving the Sentry platform? When I think about the acquisition, when I think about how does Sentry use Codecov, or conversely, how does Codecov use Sentry, I think of Codecov and I think of the time of deploy. You yourself are a developer: you have your lifecycle, you write your code, you test your code, you deploy, and then your code goes into production, and then you sort of fix bugs. And I sort of think of that split in time as when you actually do the deploy. Now, where Codecov is really useful is before deploy time. It's when you are developing your code, it's when you're saying, hey, I want to make sure this is going to work, I want to make sure that I have as few bugs as possible, I want to make sure that I thought of all the errors and all the edge cases and whatnot. And Sentry is the flip side of that. It says, hey, what happens when you hit production, right? When you have a bug, and you need to understand what's happening in that bug, you need to understand the context around it, you need to understand where it's happening, what the stack trace looks like, what other local variables exist at that time, so that you
can debug that, and hopefully you don't see that error case again. When I think of, oh, what can Sentry do with Codecov, what can Codecov do for Sentry, it's sort of taking that entire spectrum of the developer lifecycle: okay, what can we do to make sure that you ship the least buggy code that you can, and when you do come to a bug that is unexpected, you can fix it as quickly as possible? Because as developers, we want to write good code. We want to make sure that people can use the code that we've written, we want to make sure that they're happy with the product, they're happy with the software, and it works the way that we expect it to. If we can build a product, you know, this Sentry plus Codecov thing, to make sure that you are de-risking your code changes and de-risking your software, then we've hopefully done the developer community a service. So Tom, you say bring your tests and you'll handle the rest. Break it down for me: how does a team get started with Codecov? So what you bring to the table is your testing, and you bring your coverage reports, and what Codecov does is we say, hey, give us your coverage reports, give us access to your code base so that we can overlay code coverage on top of it, and give us access to your CI/CD. And with those things, what we do, and what Codecov is really powerful at, is that it's not just, hey, this is your code coverage number. It's, here's a code coverage number, and your reviewer also knows it, and other parts of your organization know it as well. So it's not just you dealing with Codecov and saying, I don't really know what to do with this. We take your code coverage, we analyze it, and we throw it back to you into your developer workflow, and by developer workflow I mean your pull request, your merge request, and we give it to you as a comment, so that you can see, oh great, this was my code coverage change. But not only do you see this sort of information, your reviewer also sees it,
and they can tell, oh great, you've tested your code, or you haven't tested your code. And we also give you a status check, which says, hey, you've met whatever your team's decision on what your code coverage should be, or you haven't met that goal, whatever it happens to be. And so Codecov is particularly powerful in making sure that code coverage is not just a thing that you're doing on your own island as a developer, but that your entire team can get involved with and can make decisions on. Very cool. Thank you, Tom. So hey, listeners, head to Sentry and check them out: sentry.io, and use our code changelog. The cool thing is, as our listeners, you get the team plan for free for three months. Not one month, not two months: three months. Yes, the team plan for free for three months. Use the code changelog. Again, sentry.io, that's s-e-n-t-r-y dot IO, and use the code changelog. Also check out our friends over at Codecov, that's codecov.io, like code coverage but just shortened to Codecov. codecov.io. Enjoy. So now we're... now we're fine-tuned here. Okay, I see what you did there. Swine-tune, I think, is what you're trying to say. Well, no, I think it was a Dolly reference. Fine-tune. So yeah, it's a pun. It was a pun. Work with me, Jared; I thought we were on the same page. What the heck, man. Adam's puns are on point, always. He never misses with the pun. All right, thank you. All right, so we have Denny Lee, yes, from data bricks, or Databricks? Databricks. Databricks, yes. Is that the official stance, or an American thing? It's just Databricks. It's just Databricks. Here to talk about Dolly 2.0, but first, yes, you're a just-in-time conference presenter. Tell us what this means. Well, I think the context was that you were asking me, hey, what's your presentation? That's what you asked me first. I did. I was actually responding, I don't remember the name, nor do I remember... I do remember the concepts, at least, I do have that part, but I don't remember the name, nor are the slides done yet. And this is
normal. It starts in 30 minutes? No, no, no, no, no. Tomorrow. Tomorrow, okay. I'm just simply saying that it is common for me to not do a thing until 30 minutes before the actual presentation, to create the slides. So you're a procrastinator. Yes, and I'm a very good one. No, that's not procrastination, it's optimization. Yeah, why sweat over the details until you have to, exactly. But what if you start 30 minutes before, and you realize the details required 45 minutes? There was just one time, actually, where my buddy Thomas Kejser and I went ahead and did a presentation, where... so he's from Denmark, I'm from Seattle, and we're both in, I don't know, some other city, some other part of the world, to do the presentation. So we actually got together, but we realized we actually hadn't done squat on the slides until 30 minutes before the actual session. And guess what? Thirty minutes before, put together the slides, bam, we're good to go. So has it ever beaten you? I'm sure tomorrow... I'm sure at some point it will bite me. I guess the context is, I've gotten away with it so far, so I'm going to go with it. Enough times that you have full confidence. Yes, fair enough. Yes, or at least I know how to fake it. So what would you like to know about Dolly? About Dolly? Well, how you came about with Dolly 1.0, and also Dolly 2.0. Let's start with why. All right, let's start with why, and I'll get to how. All right, so let's go backwards a little bit. Now you're talking. All the way back, three weeks ago. Okay, roughly. The days of yore. Yeah, the days of yore: four weeks ago. All right, so one of the things... and I want to give credit where credit is due to Mike Conover, the guy who actually figured it out. Okay, now, we were using a much older particular model, and we were going, like, yeah, would this work, right? And what it boiled down to is that there's a supposition: could you take an older model, fine-tune it with good data, and still actually end up
getting good results, with the key point being that, hey, we're only going to pay $30, right, to actually train, as opposed to the tens of millions of dollars that you'd otherwise have to spend. Could you do it? That was the supposition for Dolly 1.0, and sure enough, we were right. Basically, it was about $30 worth of training time, on what is not considered public data. So that's why it's Dolly 1.0. Okay, so we could give you the weights, we could give you the model, but we couldn't give you the data, because the data itself was actually not public. But you owned it? No, no. In fact, I believe it was data generated by ChatGPT. Okay. So we could give you the weights, again, that's open source, right, but we can't give you the data, because the data is actually ChatGPT's. All right. And so then we're going, wait, we actually used only a tiny amount of data, and it still came out with some pretty decent results. Okay, so let's go ahead and say, why don't we generate our own data? So again, to give credit where credit is due, our founders went ahead and said, hey, why don't we just... we have about five thousand employees at Databricks now. This is my favorite part. Yeah, let's just go ahead and generate our own data. So for two weeks, that's literally all we did. We had basically a bunch of employees dumping in data, in a Q&A-type style format with seven different categories. It's all listed out there, so I don't remember all those details anymore. I worked on the t-shirts, so at least I was helpful on that part. Love the t-shirt, yeah, it's a good one. No one's seeing this right now, but... well, yeah, it is a podcast, so, that's right, that's right. Draw a word picture, Adam. Dude, a sheep, come on, man. It's a sheep. Dolly, Dolly, gosh, oh my goodness. He thought he was on point. Oh, okay, so Dolly the sheep, the clone, right? It's a clone, right? So that's the whole context. Yes. So we go ahead and actually get that
up and running, and then we're like, hey, now we've got a set of 15,000-plus Q&A-style pieces of information, all brand new, and we're publicly giving it away. So the actual data set, if you go to Hugging Face, or databrickslabs/dolly, or whatever the GitHub site is, basically all that data is there. Okay, all 15,000 lines... oh, sorry, not lines, 15,000 Q&As. Okay. And then we trained that data set, again using the same old model from two years ago, and we ran that. And what was really cool about this is that it cost us about $100 worth of training, and it's pretty good. If you ask some pointed questions on this stuff, it actually responds really, really well. For example, I've got some examples where I'm actually asking coffee questions, and the coffee question answers are okay. I'll give ChatGPT 4 a lot of credit: yeah, it is much more verbose than what Dolly 2.0 can provide, but in terms of correctness, it is correct. They're both at the same level of correctness, Dolly 2.0 and ChatGPT 4. I actually have it on my own GitHub somewhere, where I actually explain all that, mainly because I was actually running it on an M1 Mac, too, because I was goofing off, and it was just fine. Well, that's amazing right there. Yeah. Let me first just say, as a daily user of ChatGPT, sometimes verbose is not desirable. I actually will tell it to be brief, or, in one sentence, give me this, because I'm so sick of the word salad it spits out. I'm like, I just want the answer. The answers are useful, yes, but sometimes you're waiting for it to tell you the whole history of the thing. Like, why don't you give me the retrospective while you're at it? I'm being very sarcastic. Yes, you can't tell, it's a podcast, but we're all eye-rolling each other on that one. Yeah, that was major eye rolls. So, using it... let's say I've never used anything but ChatGPT's web UI. Sure. I'm a developer. Sure. And I want my
own... I want Dolly to answer my questions. Yes. What does that process look like for folks? Okay, so you've got two choices... or, no, I should rephrase: you've got many choices, in fact, but the most common choices are: we have a Databricks notebook that's in the Dolly GitHub that you can just download for free and run. Now, then you're going to tell me, but Denny, I don't want to use Databricks. That's fair. I would prefer you to, but I understand if you don't, that's fine. Go to Hugging Face; the instructions are all right there on how to use it. In fact, like I was saying, I was actually playing with it so that I could optimize for an M1 Mac, so that the answers could come back faster. My only problem was that when I started testing it, there was an obvious bug in PyTorch. Okay. Because basically, when we told it to go ahead and use the M1, it was giving us back garbage answers. Like, it wasn't even actual answers; it was literally nonsensical characters. And when we used CPU mode, it worked perfectly fine. But then, just as I was about to create a new issue on PyTorch, they fixed it. No, that's a good thing. No, I know, they already had the fix. Oh yeah, the fix, okay. That's it, I get you. You're about to have a good time. Yeah, but basically the idea is... I shouldn't say obviously, but you probably don't want to train with an M1, but you can definitely do inference with it. Sorry, the Q&A... so you've got your data. How do you collect that data, and how do you format it so that Dolly can understand it? No joke: we literally asked people to fill out a Google form. Okay. That's literally it. So there's no other questions? Oh, no, no: they would produce the questions and the answers. They would ask a question, and then they would provide a detailed answer for it. I see. So how do you make it special? Like, how do you make, right,
it wouldn't even be "how do you make it special." For example, let's be very specific, okay: it would say, what are the particular features of great espresso? And then we would talk about, okay, you're required to have a fine grind, you're required to be using a conical burr grinder... there's a religious war between flat burr grinders and conical burrs; I put in conical burr grinder, so yeah, I'm sure the flat burr people are pissed off that that's not the answer they're going to get from Dolly. That's biased. You're putting a lot on this. Absolutely, there's absolutely 100% bias; let's not pretend there isn't, okay? Okay. So it also requires you to actually have coffee beans roasted in a particular way, it also requires you to have the espresso water boiled at a particular temperature. Okay, so you put all of those details down. That's the idea. In other words, it's not just, okay, what's great espresso? You buy it from Espresso Vivace in Seattle. While that's true... and I don't own any stock in them, by the way, but they are easily the best coffee. What's the brand again? Espresso Vivace, in Seattle. Espresso Vivace, yeah. David Schomer is a magician when it comes to espresso. Okay, but the context is, as much as I want to just provide an answer like that, the reality is, no, obviously we can't train on that. We actually need it to have verbosity, to provide context, to provide proof, if you want to put it that way, because there's going to be other people putting in other answers too. Oh, so for example, in this case, my buddy Rob Reed, he's a fellow cyclist, he's also a fellow coffee addict; I know he also put some coffee answers inside there as well. Okay, so between everybody that put coffee answers in there, you're literally getting data from myself, from Rob, and a few other folks from, well, Databricks, right. And how many instructions are in there that you guys put in? The 5,000
employees? Oh, 5,000 employees put in 15,000. 15,000. So it's remarkable, if you think about it; that's remarkably small. Yeah, we were always under the impression when we started this process that we would require hundreds of thousands... Like, how does it know you gave it coffee, in this process? Yeah, no, it was something really different. Like, Dolly 1.0 shocked us, it really shocked us, because we thought we would need to put in a lot more data, we thought we would need to do a lot more training, and then we were like, wow, this is not bad. I mean, it's not perfect, but it's not bad, actually. And so from a business perspective, what's happening is, if you have your own business, now, with your data, you don't need like a million things; you've got 15,000 pieces of information. Now, the great thing... and I'm not telling you to use Dolly, by the way. I mean, obviously, go use it if you want to, but I'm saying use any open source model, I don't care which one. That way you get to keep it and have your data as your IP. So you as a business end up using the data in a good way, where you actually make it advantageous for you, yet also keep the privacy for the users that make up that data at the exact same time. So the move is, you have these... I don't know if this is technically what a foundational model is, but you have these models that are large enough language models, right? Right. And then each company, or each use case, says, okay, now we're going to fine-tune it, is that the right lingo or not, and apply it to us, right? Exactly. And there are all the other models out there already. Like, a lot of people were asking me originally, hey, okay, does that mean you need to use Dolly? I'm like, no, no, no. Dolly was just us proving that it can be done. That's all it was. So there are a lot of really good companies, whether it's Hugging Face or anybody else, that produce solid open source large language models. Yeah, use those too,
because the whole point is that you can use it yourself, run it with smaller amounts of data, have really good answers, and you're paying a hundred bucks, at least in our case we did, to fine-tune and train it, right? Yeah. So we're like, okay, that's actually worth it: your business, you're protecting the privacy of your users, you're actually having relatively solid answers, and you're not basically giving your data away to another service. Because that's the key thing about when you use a service, right: you're basically giving away your data so they can train against it too. Right, right. Now, I know Microsoft and OpenAI, for example... you're calling those two out in a positive way, not negative? Usually... well, I'm a former Microsoft employee, so I'm allowed to be negative if I want to, but this actually may be positive. They actually have introduced concepts saying you can pay more, right, and they'll never actually use your data to train. I don't remember the cost, but it is definitely paying more. Yeah. So your data's not as valuable to them then, so it makes sense, right? Exactly. So that becomes more of a deterrent to acting that way. Exactly. So have you seen the Googler's leaked memo about "we have no moat"? Because everybody talks about that memo. And what's interesting about that whole concept... I know it sounds sideways, but I was about to actually give you another context, and this is, again, Mike Conover; I want to give credit and attribution to the guy who actually said it. What's really interesting about this whole thing, when they talk about moats and everything else, is that, more fundamentally, we could have done this two years ago. We could have taken this concept of basically saying: small amount of data, foundational model, fine-tune it, and actually have good results. All of us were focusing on, I need a bigger model, I need to pump in more data, I need to scrape the entire freaking internet and chuck it all into this model, spend tens of millions of dollars, work
every single GPU until it basically melts, in order to go ahead and train this thing — to the heat death of the universe, exactly. And meanwhile, it's like: or we literally could have taken a foundational model that was okay-to-good, spent a hundred bucks, and bam — we get something good. Yeah. So when they talk about how there's no moat and all this other stuff between open source and not, literally my attitude toward this whole thing is: no, just step backwards for a second. The reality is we could have done this. We all got attracted to the shiny idea of bigger, larger, larger, more — that's all we got attracted to. And so in the end I'm going: I don't care. These companies, the ones that quote-unquote are trying to build a moat around themselves — what they're doing is trying to make sure they have a service in which you will give them your data, and then by definition you will give away your competitive advantage. Right — simple as that. For the folks that don't want to do that, which I think is the vast majority, my attitude is quite simple: then don't do that, and build your own model. Now, how about if I'm the general consumer? I just want to pump out a good blog template for me to work with. Yeah, absolutely, why not? Like, seriously, I'm not trying to say these services aren't worthwhile — quite the opposite. ChatGPT's fun. Very valuable. Yeah, it's extremely valuable. In fact, I've already had it pumping out code for me, just for the giggles. Yeah — so it's going to pump out some slides for you here soon, for tomorrow. Yeah, so take that — 30 minutes, 30 into 12. Oh yeah, that'd be perfect. Yeah — but see, you get my drivel. Yeah, totally. So my Rust code is rusty, and basically I'm using ChatGPT to pump out a bunch of Rust code for me. I'm like, hey, this is great boilerplate — now I've got something to work with, and boom, now I can start writing it, right? Yeah. So what is Data
bricks' play in this chess game? Like, what's your guys' angle? Our angle's quite simple. You've got a ton of data; you need to ETL it, process it in the first place; then you need a platform to run machine learning — or data science, or AI, or whatever freaking wording you want to use, okay? Whether it's LLMs today, deep learning yesterday, or tomorrow image recognition, object recognition — I don't care, okay? The point is that you have a ton of data, you need to be able to process it, and you need to be able to access every single open source system or service. Databricks' play is quite simple: we just make it easy for you to do any of it. Yeah, that's it. That's our only play — let's make it easy. Yeah. Are you for — I guess, then — people owning their own data? I don't know that that's your... So here's the thing: I'm absolutely for both, from a Databricks perspective but also from an open source perspective, right? Yeah. So I'm an open source contributor — I've contributed to Apache Spark and MLflow, and I'm also a maintainer for Delta Lake, okay? So yeah, by definition I'm always going to lean toward open source, which means you should own your data. Data should be your competitive advantage; everything else should be open source, basically, for all intents and purposes. I'm even for things like differential privacy and privacy-preserving histograms to protect your data — and I could go on a diatribe on that, so let's not. But the context is: I'm not saying these services like OpenAI or, you know, Bing or whatever else aren't worthwhile. They are. They're cheap, they're helpful. In fact, training on other systems isn't necessarily a bad thing either. For me it's not about "don't do it" — it's about knowing what you're doing, right? That's it. Yeah — transparency. Exactly. That's my take. If you want to use OpenAI within the Databricks platform, we make it easy — for crying out loud, we added a SQL syntax directly, so you can
literally write Spark SQL — which at this point is basically ANSI SQL compliant — literally write SQL to go ahead and call OpenAI and run an LLM directly against your data. So literally: party hardy, have fun. So our attitude isn't so much "don't use one versus the other"; our attitude is very much: just know what you're doing. Understand when you're using a service, understand when it makes sense for you to build your own model — and we also make it easy for you to build, maintain, train, and infer against that model. That's it. So, I mentioned our transcripts are open source, right? Yeah. Everything we're saying here — a lot of it's going to be transcribed into words. What are ways we can use Dolly 2.0, this open model that you're talking about, this direction? How can we leverage these transcripts for our betterment — as a podcast company, for example? As a podcast company, one of the first things — in fact, I'm actually already doing this, technically, for Delta Lake; we also have podcasts ourselves, okay? So what are we doing? I'm spending time and effort to generate blogs based off of the podcasts. Why? Because it's better for Google SEO search, right? And it's not like I'm trying to just repeat the same thing — I'm trying to summarize. Because, you know, we talked about barbecue in the beginning, right? We talked about coffee. We probably don't need all of those details going from the transcript of the podcast to our blog — you want people to actually understand what we're talking about when it comes to Dolly. Cool: we generate a blog based off of this conversation. It can summarize it, get to the key points — boom, there you go. It simplifies the whole process, so you're not spending an hour trying to figure out how to synthesize the key points out of our conversation.
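One practical wrinkle with summarizing a whole episode is that a transcript usually exceeds a model's context window, so it has to be split and summarized in pieces. The helper below is a generic sketch of that splitting step — the word-based budget is a rough stand-in for real tokenization, and this is not a description of Databricks' actual pipeline:

```python
# Sketch: split a long transcript into chunks that fit a model's context
# window before summarizing each one. Word counts stand in for tokens here;
# a real pipeline would use the model's own tokenizer.

def chunk_transcript(text: str, max_words: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_words words."""
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks

transcript = "We talked about barbecue. We talked about coffee. " * 50
chunks = chunk_transcript(transcript, max_words=40)
# Each chunk's summary could then be fed back to a model and stitched
# into a blog draft.
print(len(chunks), max(len(c.split()) for c in chunks))
```

Summarize each chunk with whichever model you've chosen — a hosted service or your own fine-tuned open model — then ask for a final pass over the concatenated chunk summaries.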
There's still time for you to review and make sure the model isn't giving you garbage. There's still time for a producer, or any other person who is knowledgeable in this field, to validate the statements. Maybe I'm full of, you know, BS, for all I know, right? And then you'd go, "Yeah, I don't know — Denny's full of it, forget it." Most likely about the conical versus flat burr grinder — but again, that's a whole other story. I'll just say that one of your team agrees with me — I'm team conical. There you go, perfect. See? But the context is: we can go ahead and use these systems to simplify. Would it be cheaper and easier if we just used, like, ChatGPT to do it? Yeah, go for it. Would it be worthwhile to do it with your own Dolly model? Absolutely — because you have your own style, right? Yeah. So if you have your own style — if Dolly, or any other open source model (again, I want to be very clear here), goes ahead and gets trained against your transcripts — it will then be able to start writing blogs based off of your style. Right — that's the cool thing about it. Is it cool to actually chain like that, or is it better to go to the foundational model and just add our stuff? Or would it be cooler to start with Dolly, because it has instruction-following, then add our style, and then maybe add something else? My answer is: all of the above — whatever you want — because that's the whole point. Different foundational models will be better at different things, as simple as that. Some models will be better at, for example, conversations; some models will be better for writing purposes. There's, like — nat.dev — I'm forgetting the guy's name... Nat Friedman! That's it. I can't believe I'm drawing a blank on that; he's not nobody. Yeah, he's a small guy, okay — that's Nat Friedman, former CEO of GitHub. Okay, so a slightly important guy. Nat
.dev is an awesome playground, for example, where you can test out a lot of different models already. You're literally just chucking in, like, hey, let me try it with GPT-3, let me try it with Vicuna, whatever else — and literally you will see, with the same question, especially if you use the compare playground section, different answers from the different models. Yeah. So literally, you've got to play a little bit to figure out which model makes sense for you. Yeah. Well, thanks for talking with us, Denny. Glad to, always. Aside from your opinions on coffee and whatnot, you're pretty good. Yeah, no — those are fighting words, I just want to say that, okay? Those are fighting words. Oh, that's good. All right, gentlemen, thank you very much. Yes, thank you. Hey friends, this episode is brought to you by CIQ, the founding sponsor and partner of Rocky Linux — enterprise Linux, the open source community way. And I'm here with Gregory Kurtzer, the founder and CEO of CIQ and the creator of Rocky Linux. So, Greg — I know that a lot of people are still, to some degree, catching up with what went down with CentOS, the Red Hat acquisition, and just the massive shift that required of everyone using CentOS. Give me a glimpse into what happened there. You've seen a number of cases in the open source community where projects were pivoted due to a business agenda or commercial needs. We saw that happen with CentOS. CentOS was one of the biggest enterprise operating systems ever; people were using it all over the place. Enterprise organizations and professional IT teams were all leveraging CentOS. For CentOS to be stripped away from the community and removed as a suitable option to meet their needs created a massive pain point and a gap within the industry. As one of the founders of CentOS, I really took this to heart, and I wanted to ensure that this does not happen again — and that is what we created with Rocky Linux and the RESF. Okay, you mentioned the RESF. What is that, and what
is its relationship to Rocky Linux? The RESF is the Rocky Enterprise Software Foundation, and it is an organization that we created to hold ourselves responsible to what we've promised we're going to do with the community. It is community run, it is community led. We have a board of directors comprised of a number of people with a huge amount of experience with Linux as well as with open source and community, and from this organization we solidify the governance of how we are to manage Rocky Linux and any other projects that come and join in this vision. Sounds great — I love it. So enterprise Linux, the open source way, the community way, has a home at Rocky Linux and the RESF. Check it out and learn more at rocky slash changelog — again, rocky slash changelog. All right: Stella Biderman. Yeah. And you're with — I'm going to also butcher the name of the org — EleutherAI? Eleuther... EleutherAI, yes. Okay, what is EleutherAI? We were just talking with the Databricks folks about Dolly. Right, yes, correct. So that was built on top of an open source language model. Okay. Yes — I trained that. Okay, so you're underneath Dolly. Yes. Okay. So you personally trained it? Yes. Okay, what's the model? It's called Pythia. Pythia — it's a suite of language models, actually, that we put out a couple months ago. Okay. But in general, EleutherAI has trained several of the largest open source language models in the world over the past three years. Okay, very nice. So what do you want to tell the world, then? What do I want to tell the world? Um... honestly, I didn't think that far in advance. Okay, all right. Well then, what should the world know about what you do — in terms of training models that, say, Databricks uses, that are open source, etc.? Honestly, the open source world especially should really know that the AI world really needs help from the open source community writ large. That's actually, broadly speaking, why I'm here at the Linux open source summit. Okay. You
know, we're struggling with issues about maintainability, issues about licensing, issues about regulation, issues about building sustainable ecosystems — things that the open source community writ large has been working on for years, if not decades. Yeah. And a lot of people in the AI world are a little too proud to ask for help from non-AI people, which is definitely a real systemic problem. But if people are excited about foundation models, large language models, whatever you want to call them, and want to get involved — or want to help and don't know that much about AI — there's a ton of open source work that needs to be done, that we need help with, to build a robust and enduring ecosystem. Where is the money coming from? Where's the money coming from — great question. So at EleutherAI we recently formed a nonprofit, and we have donations from a number of companies — most prominently Google, Stability AI, and Hugging Face; CoreWeave is also among our biggest sponsors. We have also been applying for grants, mostly from the US government, to pay for our, I guess, forthcoming research and work. In terms of computing resources — actually, training these really large language models is not that expensive. Is that a secret? I don't know if it's a secret or what, but I think the CS world kind of got used to the idea that anything can be done on a personal laptop, and that that's what constitutes a reasonable amount of money to spend on a paper. And that's great — there's a huge accessibility win in doing that. Yeah. But training these large language models is pricey. You know, it's not something that anyone can do on their own, but it's not ruinously expensive. There are thousands of companies around the world that can afford to do this; there are dozens of universities that can afford to do this; and by and
large, they just haven't been. Okay. So the model that you trained — how much did that cost? So it's part of a suite of models that had, like, 28 in it total, but altogether that was less than eight hundred thousand dollars. The largest model, one training run, would probably be like two hundred thousand dollars — which is more than a laptop. Which is more than a laptop, but it's not a mind-boggling amount of money — less than a Super Bowl commercial. It's true, yeah. So right now the largest — okay, the second largest — open source English language model in the world is called GPT-NeoX. We trained that — I trained that, my organization — and that would have cost us about three hundred and fifty thousand dollars if we hadn't been given the compute for free. But, like, three hundred and fifty thousand dollars for the second largest open source language model in the world — and at the time we released it, it was the largest; later someone else trained a bigger model, with sponsorship from the Russian government. So GPT-3 came out in 2020, and for about two years almost nobody was training and open-sourcing language models. Google was doing it with similar models, but not the same kinds of models that GPT-3 is, and we were doing it — and it was really not that expensive. We got into it on compute that we got for free through a Google research computing program called the TensorFlow Research Cloud, and with that we trained a six-billion-parameter language model — the one that underpins the first version of Dolly that Denny was talking about. That's been extremely widely used, deployed in a whole bunch of different industry and research contexts, and been hugely successful — and it was literally just compute Google gave us for free. Yeah, it ran preemptibly on their research cluster. Basically, the idea of TRC is that they have a research cluster that they don't always use all
of, so other researchers — independent researchers, academics, nonprofits — can apply to run preemptible jobs on their research cluster and just use the compute that they're not using at the time. And using that, we trained this model in like two and a half months. Wow. And it was a really big deal when it came out — it was the largest model of its type in the world by a sizeable margin, about four times the size of the largest open source model of its type in the world. Yeah. And the Pythia models — we trained those on like 128 A100 GPUs for a couple weeks, which is certainly a lot of computing resources, but it's not mind-boggling amounts of compute. There are lots and lots of companies that have that. It's less about it actually being too expensive and more about having the political will to actually go do it. Yeah. Are you focused on training open source models — is that your focus? Our focus is on open source AI research in general. Our area of expertise is large-scale AI, and most of what we do is language models, but we've also worked on training and releasing other kinds of large-scale AI models. We are part of the OpenFold project — so DeepMind created an algorithm for modeling protein structures called AlphaFold; that was a really big deal, and we helped some academics scale up their research, replicate it, and release it open source. We've done some stuff in the text-to-image space, both on our own, and some of our staff have gone on and worked at Stability AI on some of their image models. And we are a big proponent of open source research in general. The reason we decided to start training these large language models was back in the summer of 2020: we thought, you know, this GPT-3 thing is going to be a major player in the future of AI, and it's going to be really essential, if you want to be doing something meaningful in AI, to know how these things work.
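Training on preemptible capacity like TRC only works if a job can be killed at any moment and pick up where it left off. This is a generic sketch of that checkpoint-and-resume pattern — the file layout and toy "training" state are illustrative, not EleutherAI's actual tooling; real jobs save model weights, optimizer state, and data-loader position the same way:

```python
# Sketch: a training loop that survives preemption by checkpointing its
# state to disk and resuming from the last checkpoint on restart.
# The "training" here is a toy accumulator standing in for optimizer steps.
import json
import os
import tempfile

def train(total_steps, ckpt_path, stop_at=None):
    # Resume from the last checkpoint if one exists.
    step, state = 0, 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            saved = json.load(f)
        step, state = saved["step"], saved["state"]
    while step < total_steps:
        if stop_at is not None and step >= stop_at:
            return state            # simulate preemption mid-run
        state += step               # stand-in for one optimizer step
        step += 1
        with open(ckpt_path, "w") as f:
            json.dump({"step": step, "state": state}, f)
    return state

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
train(100, ckpt, stop_at=37)        # job gets preempted at step 37
resumed = train(100, ckpt)          # restart picks up from the checkpoint
print(resumed)
```

Because the checkpoint carries the full state, the preempted-then-resumed run finishes with the same result as an uninterrupted one — which is what makes donated spare capacity usable for multi-month training runs.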
You probably want to be able to experiment with them; you want to have access to them — and back then you couldn't even pay OpenAI to let you use the model. Yeah. They announced that they had it, and that was it. And so we said, well, let's try to train a model like that; we'll learn something along the way. And so we started building open source infrastructure for training large language models. We created a data set called the Pile, which is now kind of the de facto standard for training large language models. We created an evaluation suite for consistently evaluating language models — because everyone runs their evaluations a little differently, and there are huge reproducibility issues — so we built a framework that we could release open source and run on our own models, run on other people's models, and actually have meaningful apples-to-apples comparisons. And we started training large language models. We trained a 2.7-billion-parameter model, which was a little bit bigger than GPT-2 was at the time, and then we started training larger models: six billion parameters was the largest open source GPT-3-style language model in the world; 20 billion parameters was the largest language model of any sort to be released open source in the world. You know, since then there's been a lot more investment and willingness to train and release models. Several companies are now doing it. Mosaic is a company that released a large language model, like, last week, that seems really excellent. There's Meta, which has been training and releasing models — they'll tell you that they're open source releasing models, but that's just not actually correct; they're under non-commercial licenses, and they're not open source, despite their rhetoric to the contrary. But there's a whole bunch of companies. Stability AI is training large language models.
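The evaluation framework Stella describes is EleutherAI's lm-evaluation-harness; the snippet below does not reproduce its API — it's just a toy illustration of the core idea: score every model on the identical frozen prompts with the identical metric, so the numbers are directly comparable.

```python
# Sketch: apples-to-apples evaluation — run every model on the exact same
# fixed prompts with the same scoring rule. The "models" here are stub
# callables standing in for real checkpoints.

EVAL_SET = [  # frozen (prompt, expected answer) pairs shared by all models
    ("2 + 2 =", "4"),
    ("The capital of France is", "Paris"),
    ("The opposite of hot is", "cold"),
]

def evaluate(model):
    """Exact-match accuracy over the shared eval set."""
    correct = sum(1 for prompt, answer in EVAL_SET
                  if model(prompt).strip() == answer)
    return correct / len(EVAL_SET)

# Two stub "models": B gets everything right, A misses one item.
model_a = {"2 + 2 =": "4", "The capital of France is": "Paris",
           "The opposite of hot is": "warm"}.get
model_b = {"2 + 2 =": "4", "The capital of France is": "Paris",
           "The opposite of hot is": "cold"}.get

print(evaluate(model_a), evaluate(model_b))
```

The reproducibility problem the harness solves is exactly what this toy hides: if each lab picks its own prompts, few-shot formats, and scoring rules, two published accuracy numbers for the "same" benchmark aren't comparable at all.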
And now there are a lot more people in this space, doing it and releasing it. And honestly, from my point of view, we got into training large language models mostly because we wanted to study them. We wanted to enable people to do essential research on interpretability, ethics, alignment — understanding how these models work, why these models work, and what they're doing, so we can design better models, and so we can know what the appropriate and inappropriate deployment contexts for them are. And now that there are a lot more people working in this open source training space, we're moving more towards doing the kind of scientific research that we've always wanted to do. So in the past six months we've been doing a lot of work on interpreting language models and understanding why they behave the way they do. My personal area of focus is tracing the behavior of language models back to their actual training data. So the models that Dolly 2.0 is trained on, the Pythia suite — what makes that special is that most language model suites are very ad hoc constructed. I'm calling them suites because you have several similar models of different sizes, right? So the OPT suite by Meta, for example, ranges from 125 million parameters to 175 billion parameters, but they're not actually very consistent between them. Some of them even have different architectures; they have different data order; there's a lot of stuff that limits your ability to do controlled experiments on these models. And so we sat down and said: if we wanted to design, from the ground up, a suite of large language models meant to enable scientific research, what would it look like? What kinds of properties would it have? What kinds of experiments do we think people are going to want to do that we're going to need to enable? We built this list of requirements, and then created a model suite that satisfies it. So it was trained on entirely publicly
available data. Every model in the suite was trained on the same data, in the same order, and we have a whole lot of intermediate checkpoints that are saved. So if you want to know how each model in the suite is performing after 10 billion tokens, you can go and grab those checkpoints after 10 billion tokens. And then you can say, okay, what's the next data point it saw during training after 10 billion tokens — what was the 10-billion-and-first token? — and you can actually use some stuff we've uploaded to the internet to load that data in the same order it was seen by the models. You can study how being exposed to particular training data influences model behavior. So we've been using this, right now primarily, to study memorization — because language models are famously, notoriously, prone to reproducing long exact sequences from their training corpus, and we're interested in understanding what causes memorization: why certain strings get memorized and others don't. Right now I'm wrapping up our first paper on that. We have some more research in the works, looking at the actual models throughout the course of training and at the training data points they see, trying to reverse-engineer what that actual interaction between the model and the data is. And, yeah — a lot of interpretability research right now is focused on final trained models as pre-existing artifacts: you have this trained model, and you want to understand what behaviors it has. But my perspective, as someone who trains these models, is much more focused on where they come from. My overarching goal is: if I, as a person who trains a large language model, want the model to have a particular property — or not have a particular property — what decisions can I make?
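The memorization work Stella mentions involves far more machinery than fits here, but the basic test is easy to state: feed the model a prefix that occurs verbatim in the training corpus and check whether greedy decoding reproduces the next stretch of the corpus exactly. A toy sketch with a stub model — nothing below is taken from the actual paper:

```python
# Sketch: detecting verbatim memorization. A sequence counts as memorized
# if, given a k-token prefix from the training corpus, greedy decoding
# reproduces the next k training tokens exactly. The "model" below is a
# stub lookup table standing in for a real language model.

def is_memorized(model, corpus_tokens, start, k):
    prefix = corpus_tokens[start:start + k]
    expected = corpus_tokens[start + k:start + 2 * k]
    generated = model(prefix, num_tokens=k)
    return generated == expected

corpus = "the quick brown fox jumps over the lazy dog again and again".split()

def stub_model(prefix, num_tokens):
    # Pretend the model has memorized only the first eight corpus tokens.
    memorized = corpus[:8]
    for i in range(len(memorized) - len(prefix)):
        if memorized[i:i + len(prefix)] == prefix:
            return memorized[i + len(prefix):i + len(prefix) + num_tokens]
    return ["<unk>"] * num_tokens

print(is_memorized(stub_model, corpus, start=0, k=3))  # inside the memorized span
print(is_memorized(stub_model, corpus, start=6, k=3))  # outside it
```

With Pythia's saved intermediate checkpoints, the same test can be run at each point in training, which is what lets you ask not just *whether* a string is memorized but *when* it became memorized.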
To actually influence that — to make the model have the properties I want it to have. So if there's data I don't want it to memorize, is there a way I can know ahead of time what's going to be memorized? That's the paper we just released on arXiv, about forecasting what is going to be memorized before you actually train the model. Is that to make it less black-box? Like, today you deploy it and you don't know what it can do — so that you can understand: here's the data, here's how it's trained, and have more clarity about what the box actually contains, versus this black box? Is that why that's important? That is what the field of interpretability is about in general. Okay. And I would say, building on that, that what my research is about in particular is not just opening up that black box, looking inside, and understanding what the model is actually doing, but understanding where it came from, and how we can build boxes that are more transparent from the ground up. Predictable, maybe, even. Yeah, yeah. I mean, that's one of the fears — like Bing: when they put that out there, it threatened a person; there was some sort of threat, essentially. And it's like, you deploy this thing out into the world and you don't understand what it could actually do. Is this to make it more predictable? Absolutely, to some degree — and even designable: to say, forget these things, remember these things. Yeah — the designability is a really big component. I think that's going to become huge in the future, and really it hasn't been studied, primarily because people haven't had the tools. Very few model suites have intermediate checkpoints at all; a lot of publicly released models weren't trained on publicly released data sets, or if they were, they didn't tell you what order they were trained in. And it turns out
that matters a lot — what it saw early in training versus what it saw late in training. So there's really a huge reproducibility issue: if you want to dig in and really understand, data point by data point, how the model is learning to behave, you need to be able to basically fully reproduce the training. Not actually — you're not going to spend a couple hundred thousand dollars — but at least in principle: you need to be able to inspect individual data points, know when they're going to get loaded, understand how it works. And this is something that we've put a huge amount of resources into, both on the training side as well as on the engineering side. It was not easy, but you can actually reproduce our model training exactly. So if you take the code base that we used to train these Pythia models, and you pick a checkpoint, load that checkpoint, and resume training from that checkpoint, you will end up with exactly the same fully trained model that we did. That's important. That is really important — because if you want to understand how to design models, you need to understand how they're changing over the course of training, and that turns out to be really sensitive to a lot of implementation-specific details that tend to not get released. How far in the future do you think — since you're at the training level, the ground level — is this the eureka moment for humanity? How far in the future do you think, and do you have fear, trepidation, hope? Like, where will this take us as humanity? I really don't know. My attitude is that there was a really big paradigm shift in 2020, with the release of GPT-3 and the aggressive focus on scaling; people really changed their attitudes towards how to design language models, how they can be used, and what they can be used for. In a sense, we got really lucky, because it wasn't that dangerous.
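The guarantee Stella describes — resume from any checkpoint and land on the identical final model — requires the checkpoint to capture every source of state, including the randomness. A toy illustration of the principle, with Python's RNG state standing in for real optimizer and data-loader state:

```python
# Sketch: bit-exact resumable "training". Because the checkpoint captures
# ALL state (here: the weights, the step count, and the RNG state),
# resuming reproduces exactly the run that was never interrupted.
import random

def step(weights, rng):
    # Stand-in for one noisy optimizer step.
    return [w + rng.uniform(-0.1, 0.1) for w in weights]

def run(total_steps, checkpoint=None):
    if checkpoint is None:
        rng = random.Random(1234)
        weights, start = [0.0, 0.0], 0
    else:
        weights, start, rng_state = checkpoint
        rng = random.Random()
        rng.setstate(rng_state)
    for _ in range(start, total_steps):
        weights = step(weights, rng)
    return weights, total_steps, rng.getstate()

full, _, _ = run(1000)                  # uninterrupted run
ckpt = run(400)                         # stop at step 400, keep full state
resumed, _, _ = run(1000, checkpoint=ckpt)
print(full == resumed)
```

Drop the RNG state from the checkpoint and the two runs diverge — which is the kind of implementation-specific detail that, as Stella notes, tends not to get released with most models.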
You know, there were a lot of fears about what GPT-3 could do, and by and large it turned out to be pretty safe. There wasn't all that much harm done, and a lot of the fears turned out not to come to fruition. And, looking forward, I think the really important thing to think about is this: we obviously can't predict the next paradigm shift, but we can build tools that allow us to more readily adopt, adapt, and respond to future paradigm shifts in large-scale AI — so that when, one day, there probably will be something that gets developed that is dangerous, we'll be able to be, I guess, ready for that. Yeah. Cool. Well, what are some touch points? People who are interested in what you're up to, want to help out, want to give money, want to read your work — where can people connect with you? So the best place to connect with us is our Discord server. We are a research institute, but we actually operate basically entirely in the public view. We're distributed all over the world, and we do our research in a public Discord. Anyone can join, anyone can drop in, read about what we're getting up to, hang out with us, chat with us about AI. So our Discord server is slash EleutherAI — there's also a link on our website. We'll link it up in the show notes, for sure. Yeah. And we're always happy to take on more volunteers. We have a small professional staff and a large number of volunteers that help out as well. How small? Small — like ten full-time employees. Okay. And if people go to the Discord server, what can they do there? What can they expect — like, you're there, others are there? Yeah, so you can chat about AI. We have a bunch of discussion channels where people talk about cutting-edge trends in artificial intelligence. Honestly, I don't really follow AI publication news anymore, because I just follow my Discord server, and everything that's important shows up for me
there, which is a really nice place to be. But you can talk with us, you can talk with other researchers — we have a large number of researchers at the cutting edge of AI. I can't count the number of times someone's posted a paper and been like, "hey, this is really cool — does anyone know anything about this?" and someone just tags the guy who wrote the paper. That happens all the time. We have people from OpenAI and Anthropic, Meta, DeepMind — all the major labs — who come in and chat about language models, give advice, give perspectives on research, and talk about how things are going. You can also get involved with ongoing research projects. We have a dozen-ish ongoing research projects, ranging from figuring out how to train better language models to training language models in other languages. So if you look at the list of the hundred largest language models in the world, basically all of them are English or Chinese. Yeah. And if you want to spread the benefits of this technology — and the ability to use and understand this technology — to the world writ large... not everyone speaks English and Chinese, and even the people who do often also speak other languages that they care about. So we've trained several Korean language models, and we're currently training, with the plan of releasing, some Indic language models, as well as Romance language models. So on the developing-new-models side, we do research like that. On the interpretability side, we do a lot of different stuff: understanding training dynamics, understanding how to evaluate language models, understanding how to extract the best information from them. We recently started up some work on red-teaming them, trying to understand — you know, there's a lot of stuff out there right now about prompt hacking, about how people are trying to put filters on
language models and they're kind of not really very successful — trying to understand what the dynamics of that are, whether you can build meaningful safeguards around these things or whether they're always going to be subverted. We do a lot of work like that as well. Very cool. Well, thanks for coming on the show, Stella. Yeah, it's a pleasure. It's awesome having this deep dive with you — I love that. Thank you. Great to meet you guys. Yeah. So, if you had told me a few years ago that I'd be going to an open source summit and talking about AI and open source at this level — from Cody, a coding assistant, to Databricks and training models on small data sets, to Stella's work and EleutherAI's work on open AI research — and that all these things would be real, would be touchable, would be usable today, to transform my work, to transform your work, to transform the world around me... I would not have believed it. But it's true. We're here, and this show is awesome, so I hope you enjoyed it. Once again, a big thank you to our friends at GitHub for sponsoring us to go to this conference as part of Maintainer Month. There is a small bonus for our plus-plus subscribers, so stick around for that. If you're not a plus-plus subscriber, it's too easy: slash plus plus. We drop the ads, we obviously give you bonus content, we bring you a little closer to the metal, and the best part: you directly support us. Ten bucks a month, a hundred bucks a year — slash plus plus. That's it. This show's done. Thanks for tuning in. We will see you on Friday.