If you build the data pipelines that feed your AI pipelines, which in turn train your models, there are a few things worth knowing before they bite you.
Working with perfect data is satisfying, but honestly it rarely happens, and plenty can go wrong. The sad part is that most of these problems are not obvious until you’ve been burned by them. In this article, I’m about to save you some of that pain.
Skipping data validation at ingestion:
Garbage data must not be allowed into your pipelines. If your raw data is messy and you let it in, the mess flows silently through every stage and acts as slow poison for the models downstream. Handle validation right at the ingestion stage: check for missing values, wrong data types, and duplicate records using schema checks, null checks, and range validations at the first step of your pipelines. Tools like Great Expectations or dbt tests make this a lot easier. It can feel annoying and slow things down, but you do not want to discover what slipped through at a later stage.
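As a sketch of what ingestion-time validation can look like, here is a minimal hand-rolled check in plain Python. The schema and the non-negative-amount rule are made-up examples; in practice a tool like Great Expectations or dbt tests would carry this for you.

```python
# Hypothetical schema for an ingested record: field name -> (type, required).
SCHEMA = {"user_id": (int, True), "amount": (float, True), "country": (str, False)}

def validate_record(record: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the record is clean."""
    errors = []
    for field, (ftype, required) in SCHEMA.items():
        value = record.get(field)
        if value is None:
            if required:
                errors.append(f"missing required field: {field}")
            continue
        if not isinstance(value, ftype):
            errors.append(f"{field}: expected {ftype.__name__}, got {type(value).__name__}")
    # Range check: amounts should be non-negative (an assumed business rule).
    if isinstance(record.get("amount"), float) and record["amount"] < 0:
        errors.append("amount: must be non-negative")
    return errors
```

Rejecting or quarantining records with a non-empty error list at the very first step keeps everything downstream clean.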
Ignoring data lineage:
Data lineage is the audit trail of your data. It tells you the source, the transformations applied, and everything else you need when you are wondering, “Where did this number come from?”
Without lineage, debugging a bad model is worse than tracing a rumour back to its origin: a lot of effort, no returns. Tools like OpenLineage, dbt, or Apache Atlas track lineage automatically so you have a map of your data’s journey.
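To make the idea concrete, here is a toy lineage tracker in plain Python. It only illustrates the core concept, recording inputs and outputs per step and walking them backwards; real tools like OpenLineage or Apache Atlas do this automatically and at far greater fidelity.

```python
from datetime import datetime, timezone

class LineageLog:
    """Illustrative lineage tracker: one event per pipeline step."""

    def __init__(self):
        self.events = []

    def record(self, step: str, inputs: list[str], outputs: list[str]):
        self.events.append({
            "step": step,
            "inputs": inputs,
            "outputs": outputs,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def trace(self, dataset: str) -> list[str]:
        """Walk backwards: which steps produced this dataset, directly or upstream?"""
        steps, frontier = [], {dataset}
        for event in reversed(self.events):
            if frontier & set(event["outputs"]):
                steps.append(event["step"])
                frontier |= set(event["inputs"])
        return steps
```

Given a log of `ingest` and `aggregate` steps, `trace("marts.daily_revenue")` walks back through every step that touched that table, which is exactly the “where did this number come from?” question.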
Building pipelines that fail silently:
This is the most dangerous one. Worse than a failed pipeline is a “successful” pipeline that is actually failing. Everything looks fine, but an upstream source quietly changes its schema and, without you knowing, you are silently feeding corrupt data to the ML models downstream. A pipeline that fails outright is far better: at least it alerts you to make the changes needed. So build alerting for schema changes, unexpected nulls, sudden drops in row counts, and data freshness. Let it fail louder, faster, sooner.
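A minimal version of those checks might look like the sketch below; the thresholds, field names, and the choice to return alert strings rather than page someone are all illustrative assumptions.

```python
def check_batch(rows: list[dict], expected_schema: set[str],
                min_rows: int, max_null_frac: float = 0.1) -> list[str]:
    """Return a list of alerts for a freshly ingested batch; empty means healthy."""
    alerts = []
    # Sudden drop in row count.
    if len(rows) < min_rows:
        alerts.append(f"row count dropped: {len(rows)} < {min_rows}")
    if rows:
        # Upstream schema change (added or removed columns).
        seen = set(rows[0].keys())
        if seen != expected_schema:
            alerts.append(f"schema changed: {sorted(seen ^ expected_schema)}")
        # Unexpected nulls above the tolerated fraction.
        for field in expected_schema & seen:
            null_frac = sum(r[field] is None for r in rows) / len(rows)
            if null_frac > max_null_frac:
                alerts.append(f"unexpected nulls in {field}: {null_frac:.0%}")
    return alerts
```

Wire the non-empty result into whatever alerting you have (Slack, PagerDuty, a failed Airflow task) so the pipeline fails loudly instead of succeeding quietly.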
Complicating everything while chasing perfection:
It feels ambitious to plan a grand distributed system with Kafka, Spark, a data lake, a feature store, and a real-time dashboard before you have even written your first transformation, but this is a real trap. That kind of complexity drains your energy like debt you haven’t even taken on yet. Start simple and small instead: build a basic pipeline and scale it when the demand actually arises.
Assuming training data and production data behave the same way:
This leads to inconsistent model behaviour. If your model is trained on one distribution of data but sees different data in production, in its formats or its preprocessing logic, it can easily make bad decisions. Think of it like a student who studied a different subject for the exam. Terrible, I know! The fancy term for this problem is “training-serving skew”.
Use a feature store to ensure the same feature logic runs during training and serving. Document and test the preprocessing steps carefully.
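Short of a full feature store, one simple way to keep the logic identical is a single feature-building function that both the training job and the serving endpoint import. A sketch, with hypothetical field names:

```python
# Shared module, imported by BOTH the training job and the serving endpoint,
# so features are computed identically in each place.
def build_features(raw: dict) -> dict:
    amount = float(raw.get("amount", 0.0))
    return {
        # Coarse order-of-magnitude bucket for the amount (illustrative choice).
        "amount_log_bucket": min(int(amount).bit_length(), 10),
        "is_weekend": raw.get("day_of_week") in ("sat", "sun"),
        # Normalise casing and missing values the same way everywhere.
        "country": (raw.get("country") or "unknown").lower(),
    }
```

Because training and serving both call `build_features`, any change to the logic automatically applies to both sides, which is precisely what a feature store guarantees at larger scale.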
Not versioning the datasets:
Version your datasets with tags, document the changes, and store snapshots. Tools like Data Version Control (DVC) make this easier. Otherwise, when you retrain a model and the performance drops, no one will know what changed in the data.
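To see the core idea without any tooling, here is a toy snapshotting helper that records a content hash per tag; DVC does this for you, plus remote storage and Git integration.

```python
import hashlib
import json
from pathlib import Path

def snapshot_dataset(path: Path, tag: str, registry: Path) -> str:
    """Record a content hash of a dataset file under a human-readable tag."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    versions = json.loads(registry.read_text()) if registry.exists() else {}
    versions[tag] = {"file": path.name, "sha256": digest}
    registry.write_text(json.dumps(versions, indent=2))
    return digest
```

If a retrained model regresses, comparing the hash of today’s training file against the tagged snapshot immediately answers “did the data change?”.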
Skipping data partitioning:
For large tables, if the data is not partitioned by date, region, or some other meaningful key, you can end up accidentally scanning terabytes every time someone runs a query. That makes everything slow and expensive. A treasure hunt would take less time than the query execution (alright, that’s a lame exaggeration, but you know what I mean, right?). Partition by the most common filter column (usually a date field), and add clustering on top if needed. Always think about how the data will be queried before deciding how to store it.
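As an illustration of the layout, here is a stdlib-only sketch that writes rows into Hive-style `key=value` partition directories; real warehouses and table formats handle this natively, so this is only to show what partitioned storage looks like on disk.

```python
import csv
from collections import defaultdict
from pathlib import Path

def write_partitioned(rows: list[dict], out_dir: Path, key: str = "event_date"):
    """Write rows into one CSV per partition value, Hive-style: key=value/part.csv."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[key]].append(row)
    for value, part in buckets.items():
        part_dir = out_dir / f"{key}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "part-000.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=part[0].keys())
            writer.writeheader()
            writer.writerows(part)
```

A query filtering on `event_date` now only needs to read the matching directory instead of scanning everything, which is the whole point of partition pruning.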
Hardcoding credentials:
DON’T DO THAT EVER! Use environment variables, a secrets manager, or your platform’s built-in secrets handling.
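A minimal environment-variable pattern in Python, with a hypothetical `DB_PASSWORD` variable name, failing fast so a missing secret can never silently fall back to a default:

```python
import os

def get_db_password() -> str:
    """Read the secret from the environment instead of hardcoding it."""
    password = os.environ.get("DB_PASSWORD")
    if password is None:
        raise RuntimeError("DB_PASSWORD is not set; refusing to start")
    return password
```

The same pattern extends to a secrets manager: the code asks for a secret by name at runtime, and the value never appears in the repository.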
Skipping monitoring in production:
If you don’t watch for data drift, model degradation, or volume drops, problems surface in production only when it’s too late to fix them. Always monitor data volume, schema changes, model input distributions, and output confidence scores. Set thresholds so you get alerted before things become critical.
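As one deliberately crude illustration of a threshold check, here is a relative mean-shift test on a single feature; production drift detection typically uses proper statistical tests such as PSI or Kolmogorov–Smirnov, and the 20% tolerance here is an arbitrary assumption.

```python
def drift_alerts(train_mean: float, live_values: list[float],
                 rel_tolerance: float = 0.2) -> list[str]:
    """Flag when a feature's live mean drifts beyond a relative tolerance
    of its training-time mean. Also flags an empty batch as a volume drop."""
    alerts = []
    if not live_values:
        alerts.append("volume drop: no live values received")
        return alerts
    live_mean = sum(live_values) / len(live_values)
    if abs(live_mean - train_mean) > rel_tolerance * abs(train_mean):
        alerts.append(f"drift: live mean {live_mean:.2f} vs training mean {train_mean:.2f}")
    return alerts
```

Run a check like this on a schedule over recent inference inputs, and route non-empty results into the same alerting channel as your pipeline failures.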
Ignoring the handshake between data and ML pipelines:
Data pipelines and ML pipelines should not be built as two separate projects; they should be designed collaboratively. The features the model needs and the latency it requires should be understood while the data pipelines are being built. If two different teams own them, collaboration is the key.
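One lightweight way to formalise that handshake is a shared “feature contract” that both teams own: which features the model needs, their types, and how fresh they must be. The names and freshness numbers below are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureContract:
    """One feature the model needs, agreed between the data and ML teams."""
    name: str
    dtype: type
    max_staleness_minutes: int

# The agreed contract (illustrative features and SLAs).
CONTRACT = [
    FeatureContract("avg_order_value", float, 60),
    FeatureContract("orders_last_7d", int, 24 * 60),
]

def validate_payload(payload: dict) -> list[str]:
    """Check a serving payload against the agreed contract."""
    return [
        f"{c.name}: missing or wrong type"
        for c in CONTRACT
        if not isinstance(payload.get(c.name), c.dtype)
    ]
```

Because the contract lives in one shared module, the data team tests their pipeline output against it and the ML team validates serving payloads against it, so neither side can drift without the other noticing.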
You don’t need to know every tool; you just need to know what problem each tool you already use is solving. The people who stay curious, ask questions, and care about data quality can get ahead of a lot of others in the field.
Found this useful? Share it with someone just getting started in data engineering. And if you’ve made any of these mistakes yourself, you’re in good company.