Starting in Data Engineering
NOTE: Originally published on my previous domain blog, blog.alkemist.io, which I’ve retired in case the .io domain goes away. Original publish date was 2022-12-04. Reproduced here with no edits, even though it needs some. :D
Getting started in data engineering can be daunting, especially since it can mean different things in different environments. Sometimes it’s generating raw analytics from a single data source like a database: you set up a read-only database replica with some data-visualization tool like Looker on top, and you’re done. Other times you need to figure out how to do real-time data analytics against live incoming data streams. The variations are somewhat domain specific. This post will hopefully equip you with a general enough toolkit to be able to find solutions as an aspiring data engineer.
Let’s start with tooling, since it’s pretty simple. Exactly what you hear as the general wisdom is largely true: learn Python, learn SQL, learn Git.
The nice thing about Python is that it gives you a whole toolkit, not just for data processing and data-pipeline creation, but also for doing random one-off tasks or even building full-fledged API endpoints. The running joke is that Python is the 2nd best language for any given task, and it largely holds true: Python is rarely the most efficient language for a job, but you can POC and deploy code with very low barriers, and you’ll find an integration or library for almost anything you want to do. The starting developer experience around Python can be rough, but on a relatively modern Python version (3.10 or later as of this writing), using something like Homebrew for runtime installation and Poetry for dependency tracking, you’ll be able to get started without too much trouble. As a point of reference for where Python can take you: when Instagram was purchased by Facebook, it was still powered by Python. So unless you’re starting out at Instagram scale, Python will serve you well for a considerable part of your career.
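To give a flavor of the “random one-off task” end of that spectrum, here’s a minimal sketch using only the standard library. The file name (`orders.csv`) and its columns are made up for illustration; the point is how little ceremony a quick data question requires.

```python
# A minimal one-off script: total revenue per day from a CSV.
# "orders.csv" and its columns (order_date, amount) are hypothetical.
import csv
from collections import defaultdict

totals = defaultdict(float)

with open("orders.csv", newline="") as f:
    for row in csv.DictReader(f):
        totals[row["order_date"]] += float(row["amount"])

for day in sorted(totals):
    print(f"{day}: {totals[day]:.2f}")
```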
Next is SQL. Specifically, install PostgreSQL and start loading data sets into it, experimenting in a safe environment with the queries that stress you out (like multi-way joins or window functions). PostgreSQL is, simply put, a beautiful piece of engineering that can be tuned to run well as a data warehouse for moderately sized data. It also has lots of built-in goodies, like JSON parsing against the JSONB data type (binary JSON, which is more efficient to query than plain-text JSON) for quick querying of semi-structured data. You can find it readily available as a Docker container or install it locally (also via Homebrew). A large percentage of data warehouses speak the PostgreSQL wire protocol, so you’ll already be conversant in interacting with larger-scale analytical data stores when you encounter them.
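Since the document recommends both Python and PostgreSQL, here’s a hedged sketch that ties them together: running a window-function query from Python via `psycopg2`. It assumes a local database named `sandbox` and a hypothetical `orders(customer_id, order_date, amount)` table; swap in whatever data set you’ve loaded.

```python
# Querying a local PostgreSQL instance from Python with a window function.
# Assumes psycopg2 is installed, a local "sandbox" database exists, and a
# hypothetical orders(customer_id, order_date, amount) table has been loaded.
import psycopg2

query = """
    SELECT
        customer_id,
        order_date,
        amount,
        SUM(amount) OVER (
            PARTITION BY customer_id
            ORDER BY order_date
        ) AS running_total
    FROM orders
    ORDER BY customer_id, order_date;
"""

with psycopg2.connect("dbname=sandbox") as conn:
    with conn.cursor() as cur:
        cur.execute(query)
        for row in cur.fetchall():
            print(row)
```

The running total per customer is exactly the kind of query that feels intimidating at first and becomes second nature after a few reps in a throwaway local database.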
Every technical person needs to know and understand version control systems, and Git is the most popular and prevalent one right now. Embrace learning at least its basic concepts and commands. You can host your code and projects for free on GitHub. It’s also an excellent place to start building out your data engineering project portfolio!
Now comes the harder part… how to start thinking about approaching data problems. What are common patterns? What tools and systems should you use for a platform? The answers are varied, especially as of this writing, because while there has been some consolidation of data tools and a maturing of “best in class” options, there’s still a broad field of competitors, with new ones popping up every day.
A good starting place is Tobias Macey’s excellent Data Engineering Podcast. Don’t be afraid to listen to earlier episodes, even if the content is slightly out of date; tools like DBT that were covered 4 years ago are still relevant. Patterns and approaches from different tools will give you insights about what to research further, new terms and methodologies will pop into your vernacular, and as you work as a data engineer, more and more of the pieces will fall into place. A book that recently came out, which I haven’t had a chance to delve into yet but which offers general guidance without specific tooling recommendations, is Fundamentals of Data Engineering. I heard the authors being interviewed, and from scanning the book, it holds a lot of promise. I’ll provide a follow-up when I’ve had a chance to read through it more.
One common pattern that comes up a lot is ETL/ELT, which stands for Extract-Transform-Load (ELT, Extract-Load-Transform, is the more modern variant, since it’s usually cheaper and more efficient to Extract and Load data and then Transform it after the fact). In practice this usually means you Extract from an external source (an API, a CSV file, or a database); Load it into your data warehouse or data lake, usually in a raw, unaltered carbon-copy form; and Transform it by either re-arranging the data into a more queryable shape or combining it with other loaded sources into a dataset that can then be used for building data models or for exploration. Thinking about the basics of a data pipeline this way gives you a solid, if basic, system-level approach to a lot of data problems. Be aware that this primarily applies to the batch compute paradigm (another way of saying non-streaming data processing); streaming analytics has its own nuances and patterns separate from ELT.
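To make the ELT flow concrete, here’s a bare-bones sketch in Python: Extract from a hypothetical API, Load the raw JSON untouched into PostgreSQL, then Transform it with SQL inside the warehouse. The endpoint, table names, and field names are all illustrative, not a prescription.

```python
# A bare-bones ELT sketch: Extract -> Load (raw) -> Transform in SQL.
# The API endpoint, tables, and JSON fields below are hypothetical.
import json

import psycopg2
import requests

# Extract: pull raw records from an external source.
records = requests.get("https://api.example.com/v1/orders").json()

with psycopg2.connect("dbname=sandbox") as conn:
    with conn.cursor() as cur:
        # Load: land the data as-is in a raw table (one JSONB blob per record).
        cur.execute(
            "CREATE TABLE IF NOT EXISTS raw_orders "
            "(payload JSONB, loaded_at TIMESTAMPTZ DEFAULT now())"
        )
        for record in records:
            cur.execute(
                "INSERT INTO raw_orders (payload) VALUES (%s)",
                (json.dumps(record),),
            )

        # Transform: reshape the raw JSON into a queryable table.
        cur.execute("DROP TABLE IF EXISTS orders")
        cur.execute(
            """
            CREATE TABLE orders AS
            SELECT
                payload->>'order_id'                  AS order_id,
                (payload->>'amount')::numeric         AS amount,
                (payload->>'ordered_at')::timestamptz AS ordered_at
            FROM raw_orders
            """
        )
```

The design point is that the untouched raw table comes first: if your transformation logic changes later, you can rebuild the downstream tables without re-extracting from the source.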
One final recommendation for starting out: have fun with what you’re doing! And be patient with yourself. Data engineering is a challenging discipline at times, but the tooling and guides out there are getting better every day! If you’re looking for some awesome communities to join, consider the DBT Slack or Locally Optimistic (which is a bit more Analytics Engineering focused, but there’s a good bit of overlap).