Apr 30, 2024 · Episode 14

The evolution of data engineering in the age of AI

Colleen Tartow, Field CTO and Head of Strategy at VAST Data, shares her experience from years in data engineering leadership roles.

Show notes

In this episode, Rebecca and Colleen talk about how “data” has evolved over the years, how AI is turning everything we know about data upside down, and how a PhD in Astrophysics got her started on this path.

Timestamps

(0:00) Introductions
(0:44) Colleen’s background
(3:01) About VAST’s approach
(7:18) Unique challenges of data engineering
(10:56) Setting reasonable expectations
(13:34) The importance of process
(16:29) The future of data engineering
(21:00) What Colleen misses about engineering
(22:39) How to set goals for a data organization
(25:58) Colleen’s perspective on people in business
(29:58) Colleen’s unusual path to data

Transcript

Rebecca: Colleen, it is so good to see you. I know nobody else knows how hard we have tried to record this podcast, but we have tried so hard. Welcome.

Colleen: Thank you. It’s great to finally be here.

Rebecca: Thank you for being here. Tell me a little bit about who you are and what you do.

Colleen: I am Field CTO and Head of Strategy at VAST Data. I’ve been in the data field for over 20 years now and I do a lot of different things. I’ve run engineering, I’ve run data engineering, I’ve run analytics organizations. I was a data engineer myself, I was a field engineer. And so I’ve sort of had a career path that has taken me to very many different places over time.

Rebecca: So tell me about this Field CTO thing, ’cause I also have this title and any tips are appreciated.

Colleen: Yeah. I think we both decided it’s a newer title, right? You started seeing Field CTOs maybe five or eight years ago, I think, for the first time. There weren’t really that many of us around until then. It’s sort of a catch-all role in some ways. There’s definitely a strong technology aspect to it. In a lot of ways, I’m the voice of the customer to our R&D org, which is more of a product role. And so there’s some product management to it.

There’s also a large strategy component where you’re doing things like strategizing rollout, strategizing how we take the vision of the company and actually make it happen and get the word out to our customers. And then there’s things like testing, training, implementation that I work through. So, I work closely with R&D, I work very closely with the field.

And then there’s also a marketing and evangelism component to it, where you’re doing a lot of thought leadership and speaking, writing, going to conferences, finding out what the field is doing so that we can make sure we have a competitive advantage. And then there’s all the other stuff too, like supporting sales and working with support and being an escalation point for pre-sales, that kind of thing, which is fun. Every day is different.

Rebecca: Every day is different. And you’re getting to wear a lot of hats, at least I am. What’s the percentage of time that you’re spending in spreadsheets compared to when you were an engineering leader?

Colleen: It’s probably less, but that doesn’t mean that I don’t still love my spreadsheets and that I don’t still come up with spreadsheets for things that people are like, “Oh wow, you made a spreadsheet for this?” I still love my spreadsheets. Pry them out of my cold, dead hands.

Rebecca: That’s okay. Yeah, you were a born engineering leader if you love your spreadsheets. Tell me about VAST. We kind of skipped that part. Tell me about VAST, but then I want to hear what are you trying to teach VAST customers or prospective customers?

Colleen: Yeah, absolutely. VAST is a data platform for AI, and I feel like that is a very amorphous statement. There’s a lot of things that call themselves data platforms out there. But basically, we’re in the business of accelerating time to value for data and increasing ROI on data. And what that means is that we’ve rethought the data stack and made it more applicable to AI because there’s a lot of technology stacks for data out there that are focused on structured data; they’re focused on BI.

And now, AI is the new hotness and everybody’s trying to be like, “Oh, we do AI too.” And if you build this environment that’s optimized for BI, structured data is only five percent of the data that’s out there or something. It’s a small number. And so you’re not going to get the scalability that you need for AI, you’re not gonna get the performance. The costs are gonna be out of control.

And so VAST is in the business of really providing a data platform that can do it all. And so we have the VAST data store, which is an all-flash data lake. You can think of it as an all-flash storage platform for all of your data. Structured, unstructured, large, small. And there’s different protocols, so you can treat it as NFS, you can treat it as S3. And there’s a database as well. And so that’s for your structured data. And what’s cool about that is it’s both an analytical and transactional or operational database, so you can treat it like you would Oracle or SQL Server where you’re throwing ACID-compliant transactions at it, and you’re ingesting directly from IoT or wherever your data is being produced.

But then, without having to build a pipeline to another system that’s built for analytics, you can use that same database as your data warehouse. And so you kind of get the best of both worlds in that respect. And because it’s fast and we’re all about pushing things to the limits, we have exabytes upon exabytes out there in the field right now. So we’re making sure everything is scalable and cost-efficient, and performant at all scales. That’s really essential for AI.

Rebecca: I love a good metric. I love this time to value. Is this a metric? Is this a concept? Are you trying to move this from three minutes to two minutes or never to tomorrow?

Colleen: You know, in a lot of cases, our customers are coming up to us and saying, “This worked well three years ago. It doesn’t work anymore. The scale is different now and it costs too much or it doesn’t perform.” And so we love those problems. We can bring in our VAST system and they’ll be like, “Oh, look, it’s faster. And now it’ll scale with that performance and at a linear cost.”

But then we also have a lot of customers who are either being told by their board because they’re doing AI, whatever that means. But the scale of data and AI is vastly different than the scale for BI. That’s the other 95 percent of the data out there. And so, for that scale, it’s a different ballgame; you have different considerations. I mean, power becomes a consideration. That’s not a consideration when you’re doing BI, but you got to start thinking about that.

So if you’re training foundational models, that’s a really challenging infrastructure problem. And most people aren’t doing that. But, if you are, that’s really challenging and really complex. And so it’s a fun problem to solve. And so, VAST, we like to solve hard problems like that.

Rebecca: I love the things that you can just go buy off the shelf now, like a data platform, it’s just there.

Colleen: Yeah, well, there are a lot of AI cloud service providers coming online, things like CoreWeave or G42 or Lambda Labs, where each tenant gets a VAST platform. So each tenant would underneath be running VAST as their storage, and additional features. And so we really are the storage for AI and then the features for processing AI.

Rebecca: So you’ve had a long career in data and analytics. What are the unique challenges of data engineering versus more typical product engineering or software engineering?

Colleen: Yeah, that’s funny because, maybe eight or ten years ago, I was hiring what now would be called a data engineer and we didn’t really have the term data engineer yet. And so I remember trying to describe to my recruiter what I needed and they were like, “Oh, we’ll hire you a software engineer.” I’m like, “It’s not really though.” And so we ended up coming up with the title ‘software engineer in data’ or something like that. And now that’s a data engineer, but really it’s not about– I mean, they’re both building, right?

And a lot of the processes are the same. I always have my data engineering teams use Jira and do Scrum and all that good stuff. But instead of their output being a product, like a website that does something or a platform that does something, they’re instead actually building pipelines to get from raw data to business value, right?

And that business value can take lots of different forms. It could be as simple as a report, it could be some dashboards, it could be some machine learning, or it could be deep learning and AI. So it’s sort of like there’s a spectrum of use cases on that consumption side, but they’re building the pipelines, they’re building the infrastructure half the time. So, data engineering is a very broad concept, but the key is that you’re actually building out how the data gets to the point where it’s providing value to the business.

Rebecca: In product engineering, you don’t generally want to build something that gets used exactly once. How do you think about that in data land?

Colleen: In data, the data is the product. And, in fact, there’s this movement the last couple of years about treating data as a first-class citizen of the business and really harnessing it as a product. Because your data is a business product, it’s just not directly monetizable all the time the way your actual product is.

So the idea of treating data as a product– I mean, you have some products that are large and some products that are small and there are going to be some that the juice won’t be worth the squeeze to make this complex pipeline for this one little number that this one person wants. But you know, that’s part of the almost product management of data. And a lot of data teams are starting to have product managers just because of this, because data is a product and you have to prioritize what’s more important to the business.

And so, I’ve run data teams where one of our biggest customers was the customer support organization for a B2C app. And it’s like, okay, well, what do they need? And how are they consuming the data? And it turns out it was in their customer support app. And so we had to make essentially a data product, like a data mart, we called it, but it was a data product. And it was essentially a small set of tables that had all of the information they needed for their application. But it was a production application. When we were engineering it, we had dev, test, and prod environments that we would run in to make sure that it got to where it needed to go effectively and reliably.

But then again, there’s a lot of one-off, right? There’s a lot of one-off things. And so often, we’ll hire data analysts or we’ll have data engineers who, their job is more reactive and more small projects. But that’s the same as in engineering, too. There’s always small things and you’re like, “Oh, I have a little extra time in the sprint. Let me do some tech debt.” So it’s kind of the equivalent in a lot of ways.

Rebecca: How do you communicate to the people who want that one number that it’s not worth it? That it’s just not worth getting the number. And how do you set those expectations over time? I have experienced that once somebody important enough knows that there is data, they have a lot of questions. So how do you set those expectations as a leader with your own leadership about what’s actually reasonable for you to be doing?

Colleen: Yeah, I think it’s kind of the same as in product engineering; the only difference is the customers are typically internal. So if you want a dashboard that tells you this one specific thing and it’s going to take someone three weeks to engineer that, I’ll be like, “Listen, three weeks of one engineer’s time is X dollars. You’re asking for this much infrastructure, it’s Y dollars. How much value are we really going to get out of it?” And so I want to be helpful, but it’s also important that we understand the value coming out of it, because if they’re like, “well, we can save a billion dollars in our organization if you give me this number,” then that’s different than “Oh, I just thought it was interesting.”

So you kind of have to make people understand that there’s an effort, and so there’s almost a branding exercise I’ve done for data engineering groups and data teams in organizations where I’ve made sure that people understand what it takes to get the metrics and the numbers. “Here’s what the pipeline looks like.” And not everyone’s going to understand it deeply, but just giving them some sense of why it’s not easy to just pull a number.

There was an old meme, I forget who did it, but I think it was Seth Rosen. It was like, data consumer: “Oh, can you pull me this one number?” And the data engineer said, “Sure, let me just SELECT * FROM some_ideal_clean_and_pristine.table_that_you_think_exists.” That’s what they think is happening, but that shows the fundamental mismatch in expectations and lack of communication between the consumers and the data engineers.

And so I think just being really vocal about this is what it is. Because people understand what’s going on in product engineering for the most part, and they understand why products take a long time to develop, and I think data engineering needs to be the same in that way.

Rebecca: And I’m asking you a lot of– product is on the brain for me always. I probably should have been, I don’t know, I keep thinking “should I have been a product manager?” Maybe I should have gone into marketing. I don’t know, but the product is on the brain. But I want to talk about some of the technical challenges too, because fending off the exec who thinks you can just SELECT * is only one part of the job.

Colleen: Yeah.

Rebecca: How do you actually start to think longer term about enablement? Not just being responsive to these immediate requests, but how have you helped the business make those hard things easier? How do you make time for that?

Colleen: There’s two answers to that for me. One is process. I’ve come into a lot of data teams and there’s no process for understanding or visualizing the backlog of requests. And so that helps both parties to understand, “yeah, I understand you think you need this today, but we’re working on this huge backlog,” or “this thing is clearly more important than anything else.” But just staying really organized on a team like that, a service-type organization, is really important.

The other piece is automation. And I do think it’s important to automate and minimize the number of pipelines, the complexity of the system, the end-to-end lifetime of data so that you can really focus on what’s special to your business, which is the curation of data and the application of logic that only exists for your business.

And so much of data engineering is all the other stuff, like choosing vendors and building out infrastructure and doing all this. And that’s really what I think has been an important focus for me in my career on the vendor side for minimizing the path to value for data. So minimizing the number of pipelines, minimizing the complexity, the number of replications, because that affects governance, making sure that you’re minimizing risk in that pipeline.

I’ve worked at places where, when I got there, they had a Slack bot that every day would report whether the nightly ETL job ran or didn’t run. And if it didn’t run, that meant the data was more than 24 hours out of date. And these are dashboards that are going to the exec team. That’s not acceptable. It’s 2024! I mean, this wasn’t 2024, but I think that there’s a lot that can be done if you keep your eye on the prize of minimizing that path to value, making sure that there are as few hops as possible in the data.

So, building out an infrastructure that supports that and then allowing your data engineers to really focus on the curation of the data, that’ll maximize the number of these requests that they can take from the downstream consumers.

Rebecca: It’s so interesting to see these things turning into, it sounds like a product. I’m imagining how these teams came to be. It was just somebody who was really good at responding to these requests and built some stuff. And of course you never make a Jira board or– why would you do that?

Colleen: Well, it wasn’t their day job, even. It was a side gig. A side hustle.

Rebecca: Right. And they were just the person that you sent a Slack to when you needed some information. What are people doing today that’s going to be, maybe it’s a little silly today, but it’s going to be really silly in five years in data land?

Colleen: The idea of the modern data stack has taken hold over the past five years or so, and it’s sort of the culmination of cloud SaaS tools. Everything’s as a service, they don’t want data engineers to spend any time thinking about infrastructure. And that’s great but now they’re thinking about piping their data all over the place.

And so there’s been this sort of, I don’t want to say revolution, but there’s been a revolt, in a way, in the last six to 12 months where people are starting to realize that there’s nothing that modern about it. And it’s literally just a cloud SaaS version of the same process from 40 years ago, where data came into some transactional system, you copied it over to an analytical system, you did some stuff to it somewhere along the way and then you copied it a few more times into all these different consumption systems. We’re making little gains along the way, but I think we need to rethink that. And so my hope is that in five years, the idea of this completely composable, many-vendor, many-platform modern data stack is just not something people use anymore.

I interviewed someplace a few years ago and they were very adamant. Everyone told me, “Well, we’re a Snowflake shop.” And I was like, “That’s just a tool. What is your value state? Your goal is not to use Snowflake. Your goal is to do something else.” And they were like, “Oh, that’s great. We should hire you.” And I’m like, “Okay, yes, but…”

I think that organizations need to be less tool-focused and allow their data engineers to focus on the value of the data. Being less of “we’re a Snowflake shop” and more of “we are the hub that provides this secondary product, this data, to the entire company. And we have clear owners, and we treat it like a product.” And it’s not about the infrastructure. The infrastructure is there, and it enables all of that. But, especially with AI, data is being woven back into products. It’s part of the fabric of the company now. And so, in order to do AI, it’s all about the data. It’s not about the algorithms as much. It’s not about anything else. It’s about the data.

Rebecca: That’s an interesting change that I hadn’t thought of.

Colleen: Yeah, I think it’s a real sea change and it’s going to be interesting. I’ve loved talking to our customers who are doing things with AI and seeing what they’re doing and watching them go through each of the mental leaps to be like, “oh wait. If we want this in the product, then we need this, then we need this, then we need the data to be there.” And I’m like, “yes.”

Rebecca: “Yes, you do.”

Colleen: Data is the key.

Rebecca: Do you think we’ll see a world where less engineering is required to produce answers?

Colleen: I think it depends on the audience. I’ve been using SQL for so many years that, for me, it’s gonna take me longer to prompt engineer ChatGPT to write it for me than it is for me to just write it myself. And it’s not that I’m distrustful of the result, but I’m the kind of person who, I’m gonna be like, “Wait, does that really do what I need it to do?” And looking for the edge cases.

So I don’t know that for our generation that’s true. But I do think that, as time goes on, there will be people who grew up using ChatGPT. I don’t want to sound super old, but kids these days don’t know a world without the internet. We’re going to have another generation that doesn’t understand a world without these AI helpers. So, in school, they’ll be taught, “This is a tool; part of your toolkit is to use ChatGPT.”

That said, it’s just improving so quickly. I don’t know; there’s just something about writing SQL, too, though. I love it.

Rebecca: I’m glad you like it. I will call you up the next time I need to write some SQL.

Colleen: I’m there. I’m your personal chat, ColleenGPT.

Rebecca: I was going to say, you know who’s really good at writing SQL is ChatGPT. SQL used to be my kryptonite and now it’s just, “Oh, do I have to?”

Colleen: No, I love it. But when I’m writing Python, that’s when I start being like, “Is this right?” “Hey, ChatGPT, can you check me?” Whereas SQL, I got it.

Rebecca: “I got it.” And I am absolutely the opposite. Give me Python any day. Going back to your Field CTO role, what do you miss? Are there things that you miss? Are there things that you don’t miss in this role?

Colleen: So my previous role, I was head of engineering at another startup. And I’ve worked for startups my whole life. So I’ve always done the marketing stuff and the thought leadership, especially now that I’m probably on the back half of my career, I hope.

And so that was never in my job title, but I always did it ’cause I liked it, and I always liked the strategy side of things. I love that that is now my job. That’s actually in my job title now. And taking all that nerdiness and depth in data and analytics that I’ve accumulated over the years and applying it to my current company, which is fun.

But, that said, this is the first role in a long time where I haven’t been part of the R&D or engineering org. And I miss the routines. I miss the release cadences and the organizational leadership of engineering, ’cause I’ve been doing that for so long. And I miss that organizational work, the regularity of some of the things like Scrum meetings. Even if I wasn’t managing teams, I like to pop in, and there’s something about that. Maybe I’m a crazy person, but I always like that. And so I do miss that. And I miss the people too. I love working with engineers and I’ve worked with engineers my whole adult life, basically. And I still do; it’s just in a different capacity. It’s now more the voice of the field.

Rebecca: Yeah. I watch the engineering team from afar.

Colleen: I know. It’s weird, isn’t it?

Rebecca: Yes! So what do goals look like in a team or an organization where it didn’t grow up by being very responsive to whatever somebody was asking for? That was probably the origin story of these organizations. How do you turn that into longer-term planning and setting goals, and having things that you’re trying to accomplish over the longer term?

Colleen: Yeah. I think that’s part of the fun of it, is that data organizations do tend to have somewhat organic evolutions within an organization. That said, the data organizations I’ve worked with, I very much run a data-focused organization for a data organization. So I guess it’s meta-focused on data. But can you have a dashboard about your dashboards? Sure! But what are the important numbers? What are the goals of the organization?

So I always look back to the business goals. What are the OKRs or whatever goals you have for the business, either quarterly, annually, whatever? And then how does this organization support those goals? And does it mean that we need to develop more dashboards? Does it mean that we need to be delivering an AI pipeline? Does it mean that we need to develop an entirely new infrastructure? What does it mean? And then building goals around that. So, working backwards from the business goals. And I think most organizations do that when it comes to– if you followed an OKR process, that’s sort of how it works is you cascade down into the organizations.

And so, in a lot of ways, it’s the same for a data team. The difference is there’s a lot of glue work that happens that’s under the radar that may or may not be other organizational enablement. So, there’s a lot of cross-functional work that you need to do. You need to be in lockstep with other leaders. And so part of what I loved about being a data engineering leader back in the day was that you do get to work with leadership across the company and get to have your hand in a lot of different conversations, which is fun.

Rebecca: So we were talking earlier that you have accumulated some experience leading through change. So I’m curious if you can talk a little bit about some of the maybe more challenging changes that you have had to carry a team through.

Colleen: Yeah. I think one thing that’s always been interesting to me is I love leading teams. I love making a team feel like a team and being like, “Hey, we support each other. We know what our goals are. We know what we’re trying to do this month, this quarter, this year.” And I think one of the biggest challenges is when you have to disrupt that. Whether that’s hiring people, letting people go, adjusting a team, re-orging because you’re growing.

I’ve always worked at startups and, for the most part, they’re growing, and so you’re always hiring people and then your teams hit a tipping point, you need to re-org. And so I think the only constant is change. That’s what they say, and I think it’s true. So I think one of the biggest challenges for me, and probably for most engineering leaders, is the idea of maintaining that team spirit and that camaraderie and growth for the people in the organization while the organization itself is changing.

Rebecca: And as you have climbed the ladder, how has your perspective on, I hate to say versus, but the people versus the business changed? I know that when I was a baby engineering leader, it was all about the people. And as I have become more senior, I’ve been learning there’s the business, too. And these two things are in tension. Have you had an evolution of thinking when it comes to people in business?

Colleen: Yeah, I think I have. I think it’s akin to yours as well. When I first was managing people, my entire focus was on making them happy. I’m a people pleaser at some level and I wanted to make everybody happy. But you can’t make everybody happy all the time, and so you have to prioritize and pick and choose. And, for better or for worse, this is a somewhat capitalistic society and the business needs to win. We’re getting paid for a reason. And so the business definitely needs to be the driver of all these changes.

But, that said, having a huge degree of empathy and humanity is really important to me because, at the end of the day, it’s software, right? Not that it’s not important, but I am not saving lives in my day to day. My software might be used to but I am not. So people need to be first. And I’ve had a lot of leaders who were like, “Oh, you have a family problem? Family first, but also, when are you going to be back?” And okay, let’s think about our priorities. But I genuinely think that we need to remember it’s a job. We’re not a family. It’s not life. It’s a job. And having a broader perspective has helped me as a leader, even as I’ve become more mature in thinking that the business needs to be the main prioritization.

So, that said, my philosophy since I was a baby engineering manager was: happy engineers write good code. And there’s a lot to unpack from that. Like what does happy mean? So, in my mind, that means they enjoy working on their team, they understand what their work is going to be, they feel satisfaction most days, and we can celebrate when things go well, we can learn when things don’t go well, that kind of thing. Engineers can be any kind of engineer, anything from DevOps to data, to infrastructure, to software. And then good code, to me, it doesn’t just mean bug-free or performant. It means valuable to the business.

And that was my philosophy the first time I managed a team, and it’s still my philosophy because I genuinely think that’s how you lead an organization is to try to give the people what they need and enable them, but then also provide leadership and let them know what the expectations are. And so it all goes back to honesty, empathy, and communication.

Rebecca: Yeah. I remember, it wasn’t an ‘aha’ moment for me, but it was a thing that I realized I needed to say out loud that we don’t get paid to write clean code. We don’t get paid for perfect code or maintainable code. We get paid to provide value to the business. And sometimes it’s going to look all sorts of different ways. But I think, as an engineer, that can be a hard thing to let go of.

Colleen: Yeah. I have an academic background; I have a PhD. And so, for me, I have this perfectionist tendency that came out of that, I think, where I treat things like a science experiment and I’m like, “well, if the parameters aren’t perfect, then it’s going to fail.” And so I’ve had to very much rein that in on myself over the years.

If the engineers are writing perfect code, then I’m not doing my job. It doesn’t need to be perfect; it just needs to work. And that doesn’t mean that it can be garbage, either. It needs to be scalable and maintainable, all of those good things, but it does not need to be perfect. It shouldn’t be perfect.

Rebecca: You just brought up a great way for us to close out. You have a PhD, and if I remember, it’s not in computer science.

Colleen: It is not. It is not.

Rebecca: It’s something completely different than that. Maybe not completely different.

Colleen: No, I have a PhD in Astrophysics. And it was really fun and I got it almost twenty years ago now. It was great though, because really what I was doing was using some of the world’s most powerful radio and optical telescopes. I used Hubble and the Very Large Array. And you get a lot of data. And what do you have to do? You have to clean the data and you have to join it with other data sets.

And we didn’t use databases. We used Python for things, but it was like Python v2, and it was very old. And there were a lot of very specific astronomy programs that were all written in Fortran. So I do know F77. What I was really doing was data engineering, managing pipelines of data from – it’s going to date me – tape. You would get a tape of your data. You would get telescope time, you’d go to a telescope, you’d take the observations, and it would produce a tape, and then you’d put it in your backpack and bring it back to school.

And then I would have to be a system admin because we didn’t have those and figure out my Linux distribution and all that. And I’d have to bring the data into my environment and do all this analytics on it and then produce research, and tell the story with the data. And so it’s literally, at its core, it is data engineering and analytics, but it was a fantastic primer for this industry, it turns out. And what’s funny is most of the people that I graduated with are now in data, data science. I mean, we all learned how to model, right? We all learned statistics and all that. So it’s a fantastic prep for a career in data and most of the people that’s what they’re doing.

Rebecca: That’s very practical. You have to learn to do your job and to get the PhD. Well, Colleen, this has been such a treat. Thank you so much. Finally, we did it. Thank you so much for chatting.

Colleen: Thank you. Yeah. It’s been great. And I’m glad we finally got to chat. Literally been probably almost a year since we first had a conversation about doing this, so…

Rebecca: Well, again, it’s been great. Thank you so much.

Colleen: Thank you, Rebecca.

Brought to you by Swarmia
Helping you create better software development organizations, one episode at a time.
© 2023 Swarmia