The Essence of Data Engineering
NOTE: Originally published on my previous domain blog, blog.alkemist.io, which I’ve retired in case the .io domain goes away. Original publish date was 2022-10-31. Reproduced here with no edits, even though it needs some. :D
What is data engineering?
This question is almost as unanswerable as it is pointless to try to answer. But for the purposes of our journey, it’ll be helpful to see how one specific person who holds the title of data engineer defines it.
This definition will steer clear, insomuch as is possible, of the perpetually changing job and role definitions that pervade the data space. No haggling over where data engineering ends, where analytics engineering begins, or the overlap between the two. That’s a subject for a later post.
In essence, the elixir of data engineering breaks down to:
- 1 part science
- 1 part art
- 1 part mystical catalyst
I know some of you are disappointed, but this is how I define the mindset and attitude I take towards tackling most data engineering problems. Let me elaborate a bit more on this. Before I do that though, let’s take one step back and define (as is popular these days) the Why of data engineering.
Thankfully, I subscribe to a very simple definition, and it’s one I see shared across the data engineering community. The Why is: to ensure Data Scientists, Data Analysts, and other data builders can rely on the data they work with. They can assume the data is reliable, clean, and up to date.
Now that we have the Why, let’s swing back to the essence of data engineering… something we can call the How.
You need 1 part science, since it’s crucial to document and approach problems systematically. You have to break them down to their base components to see exactly what you’re working with. Is it a CSV? A JSON file? A database endpoint? Some combination of the above? What analysis tools are you using, and do they show you the raw feed, or do they make assumptions on your behalf? The default LIMIT most database clients apply is a perfect example: the first time you run a COUNT or SUM with GROUP BY, you may not realize your data set contains more rows than the 100/500/whatever LIMIT lets you see.
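To make the LIMIT gotcha concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table, row counts, and the simulated client LIMIT of 100 are all hypothetical stand-ins for whatever GUI client and database you actually use:

```python
import sqlite3

# Hypothetical table with 1,000 rows across 7 categories.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, category TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i, f"cat_{i % 7}") for i in range(1000)],
)

# Simulating a GUI client that silently appends a default LIMIT of 100:
# the preview makes the table look much smaller than it is.
preview = conn.execute("SELECT * FROM events LIMIT 100").fetchall()
print(len(preview))  # 100 -- looks like a small table

# An explicit COUNT reveals the real size of the data set.
(total,) = conn.execute("SELECT COUNT(*) FROM events").fetchone()
print(total)  # 1000 -- ten times what the preview implied

conn.close()
```

The point isn’t the code itself; it’s the habit of always asking whether your tool is showing you the raw picture or a truncated one.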
The art aspect comes in when you have to creatively, yet coherently, bring different data sources together into a single source. Despite the veritable explosion of amazing tools created for exactly this purpose over the past decade… it’s still next to impossible to do it with one product or paradigm. You have to decide how to blend and mold the data sources so they make sense in the context of each other. A good data engineering platform should evoke the Mona Lisa’s mirth, not the bewilderment of a Jackson Pollock.
The mystical catalyst element involves a certain amount of persistence (faith) that there is a relatively small set of tools that will performantly and reliably bring your science and art together in a consistent way. I’m using the word catalyst in its chemistry sense: it has to lower the activation energy of your data pipeline reaction (the combining of the science and the art). It’s fine if your POC data pipeline looks like the Sistine Chapel, but if you can’t create a new Sistine Chapel with one or more different attributes quickly, reliably, and within budget, every day or every few hours, your POC is mostly a vanity exercise. So you need to find the mystical catalyst that makes the whole thing come together into a functional whole.
To me, this mirrors the ancient “natural philosophy” of alchemy. You pretty much know what you’re looking for, and you have a solid understanding of the basics… but the end results will often surprise and delight you. You’ll also come across times when your whole lab explodes. It’s a great time to be in this space, and I hope some of my rantings can entice you to join this path, and maybe entertain you at times too.