Solo Data Engineer
NOTE: Originally published on my previous domain blog, blog.alkemist.io, which I’ve retired in case the .io domain goes away. Original publish date was 2023-01-03. Reproduced here with no edits, even though it needs some. :D
For the majority of my time as a data engineer, I’ve been a solo IC. When I transitioned to data engineering from DevOps land, I was the only data engineer for 18 months. And in an ironic twist, I had to train up our next data engineer, as if I knew what I was doing!
I hope this post provides some solace and actionable advice on how to make your time as a solo data engineer effective, tolerable, and perhaps even enjoyable while you wait for reinforcements to arrive.
My first recommendation is: define the borders of what you’re responsible for as clearly as possible. This is much easier said than done, especially on a small team. Chances are that if you’re the only data engineer, you have one or more data scientists/analytics engineers/data builders as peers. You’re going to have to collaborate closely with them, so decide together where your data responsibilities end and theirs begin. Do they expect you to know the ins and outs of the data models and answer ad hoc data questions from end users? Do they want you nowhere near the data models since you’ll muck everything up, but they’re adamant that all data pipelines have to finish by 6AM every morning? Do they expect you to become a SQL performance expert and help them tune queries? All of these, and more, are possible tasks that could fall under your purview. The clearer a picture you have of your responsibilities, the quicker you’ll know which problems are yours to solve and which you should hand off to your peers.
The next bit of advice is: use managed services whenever and wherever possible. This is a thorny and nuanced subject, but judiciously picking what you build yourself and what you outsource to managed services will absolutely save you hundreds of person-hours in headaches. When you start data engineering, there’s a magnetic pull toward wanting to understand every little detail of the stack you’re working with. That usually translates to building a lot of things in a bespoke way if there isn’t another engineer or mentor to tell you to knock it off. The data ecosystem is huge; you can’t deliver solid results if you’re forever tweaking Kubernetes settings or DBT runtime configs. I’ve been there. Early in my data engineering career, my idea of a sustainable ELT process was a “flexible” Python library that could scrape any OAuth2 endpoint and, via decorators, write the transformed JSON response to the “data warehouse” (MySQL!). It makes me cringe/laugh every time I think back on it. The moment I got approval to deploy Fivetran, my life became 100x easier and I could focus on real problems, like moving our artisanal SQLAlchemy transform code (written by yours truly) to DBT. My early data platform was an absolute shit show. It got stuff done, to be sure, but boy did it suck if anything went wrong.
This next piece is a big repeat from all over engineering: KISDA - Keep It Simple, Dumb Ass! Really assess what your org’s and team’s data needs are. Most of the time, early (and even late-stage) data platforms boil down to three essentials:
- A visualization/dashboard layer. Looker, Power BI, Tableau, Metabase, and Superset are all examples of these. It’s the pretty dashboard you can share via links with everyone in the org to make them happy.
- A data warehouse. This is where data is written to and read from. It’s used not only by your visualization layer, but by your analytics engineers and BI analysts too. The dashboard layer is usually too simple or limiting for your pro data users, so they’ll need direct access to connect via their own tools. It’s also the focal point of a lot of your data pipelines.
- Your data pipeline platform. This can be one system or a few systems talking to each other. It usually contains a managed extract-and-load function (think Fivetran/Singer/Stitch) and a transform function (DBT is king and really does it right in its own opinionated way). There’s always a scheduler/runner to ensure everything runs on the right cadence and in a timely fashion (a minimal sketch of that scheduler piece follows this list).
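
To make that last bullet concrete, here’s a minimal sketch of the scheduler/runner piece as an Airflow DAG: the managed EL tool (Fivetran/Stitch) loads raw data on its own schedule, and the DAG just runs the DBT transform and tests once a day. The schedule, project path, and task names are hypothetical, not a prescription.

```python
# Hypothetical sketch of the scheduler/runner: an Airflow DAG that runs the DBT
# transform layer daily, assuming Fivetran/Stitch has already loaded raw data.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_transforms",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 4 * * *",  # start early so models land before that 6AM deadline
    catchup=False,
) as dag:
    # Fivetran/Stitch own extract+load on their side; this DAG only transforms.
    run_dbt = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/analytics/dbt_project && dbt run",
    )

    # Basic sanity checks after the models are built.
    test_dbt = BashOperator(
        task_id="dbt_test",
        bash_command="cd /opt/analytics/dbt_project && dbt test",
    )

    run_dbt >> test_dbt
```

If you’re on DBT Cloud, its built-in job scheduler can replace a DAG like this entirely, which is very much in the spirit of the managed-services advice above.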
There are some further additions to the stack that involve monitoring/alerting to ensure data is up-to-date and reliable within a given time period (a small freshness-check sketch follows the stack list below). Things get more complicated as your data platform gets more sophisticated: you’ll want to track things like data provenance/lineage and have better tooling around advanced DAG-based data pipelines. A simple stack that works well at the time of this writing (1/2023), if you’re building from a blank slate, is:
- Looker
- BigQuery
- Fivetran/Singer/Stitch + DBT Cloud

As a bonus, you can use Google Cloud’s managed Airflow (Cloud Composer), or opt for something like Astronomer.io.
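
On the monitoring/alerting front mentioned above, even a small freshness check goes a long way before you invest in dedicated observability tooling. Here’s a minimal sketch against BigQuery; the project, table, timestamp column, and threshold are all hypothetical, and the “alert” is just an exception you’d wire into Slack/PagerDuty/whatever you already use.

```python
# Hypothetical freshness check: fail loudly if the newest row in a loaded table
# is older than a threshold. Table, column, and threshold are made up.
from google.cloud import bigquery

FRESHNESS_SQL = """
SELECT TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(_loaded_at), MINUTE) AS minutes_stale
FROM `my-project.analytics.orders`
"""

def check_freshness(max_staleness_minutes: int = 120) -> None:
    client = bigquery.Client()  # picks up default GCP credentials
    row = next(iter(client.query(FRESHNESS_SQL).result()))
    minutes_stale = row["minutes_stale"]
    if minutes_stale is None or minutes_stale > max_staleness_minutes:
        # Swap this for a Slack webhook, PagerDuty event, etc.
        raise RuntimeError(f"orders is stale: last load {minutes_stale} minutes ago")
    print(f"orders is fresh: last load {minutes_stale} minutes ago")

if __name__ == "__main__":
    check_freshness()
```

If your transforms already run through DBT, its `dbt source freshness` command gives you much the same check declaratively, without any custom code.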
Google Cloud has an absurdly good data offering that makes most other cobbled-together platforms look like a pile of turds, but sometimes you don’t get a choice of what to use, so YMMV.
The last parting bit of wisdom is… keep a sense of humor about it all. Even basic data engineering can deliver powerful insights to your team and org. You’ll impress everyone when you replace their daily emailed CSVs with a dashboard URL that has freshly updated insights every day, as if by magic! Before you know it, you’ll be the data engineer of legend everyone didn’t realize they needed in their lives!