The key to saving infrastructure and data governance from disaster

The hidden risks of poor data governance and how to mitigate them

Oct 1, 2024

I love tagging! Don’t ask. I already wrote about it in a previous post. It was about Data Mesh, ABAC, ontologies, good things! Here, I want to deep dive into their practical impact.

In my experience dealing with complex environments, governance is often applied too late. Without a solid framework, welcome to inefficiencies, overspending, and compliance issues down the road. You know what the simplest, most effective way to get that under control is? Tags and labels. Looks dumb, no?

These might seem trivial at first glance, but once you start applying them correctly (on your infrastructure, your data streams, your devportal, your various SaaS tools, etc.), they become one of the most powerful tools in your arsenal. Let me walk you through how they can make a difference.

Money, money, money! And ownership.

In any large-scale environment, you will lose visibility of who’s using what, when, and why. Teams spin up resources, forget about them, or over-provision just to play it safe. I’ve seen this happen time and time again. Implementing a tagging system is a game changer here because of one word: OWNERSHIP.

For example, by tagging resources based on project, department, or environment (dev, test, production), I can get a clear picture of exactly where my money is going. It allows for real chargebacks, where teams are responsible for their own costs, instead of hiding behind a shared pool of resources. (This is generally quite hard to do and only makes sense at large scale; otherwise it just slows down operations.)
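To make the idea concrete, here is a minimal sketch of a showback report: summing cost per tag value. The resource records, the `monthly_cost` field, and the `owner`/`env` tag keys are all assumptions for illustration, not the output of any particular billing API.

```python
from collections import defaultdict

# Hypothetical billing records: each resource has a monthly cost and a tag map.
resources = [
    {"id": "vm-1", "monthly_cost": 120.0, "tags": {"owner": "payments", "env": "production"}},
    {"id": "vm-2", "monthly_cost": 40.0, "tags": {"owner": "payments", "env": "dev"}},
    {"id": "db-1", "monthly_cost": 300.0, "tags": {"owner": "search", "env": "production"}},
    {"id": "bucket-9", "monthly_cost": 15.5, "tags": {}},  # nobody claimed this one
]

def cost_by_tag(resources, tag_key):
    """Sum monthly cost per value of `tag_key`; untagged resources land in 'UNTAGGED'."""
    totals = defaultdict(float)
    for r in resources:
        totals[r["tags"].get(tag_key, "UNTAGGED")] += r["monthly_cost"]
    return dict(totals)

by_owner = cost_by_tag(resources, "owner")
by_env = cost_by_tag(resources, "env")
```

The `UNTAGGED` bucket is the interesting part: it is the spend nobody owns yet, and shrinking it is usually the first goal of a tagging policy.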

Example from Conduktor, specialized in Data Management for Kafka, showcasing consumption by usage and value per Service/Owner:

Chargeback (or Showback) is used not only to track cloud/data spend across teams but also to hold them accountable for what they’re doing. It completely changes the dynamic when the managers or finance team can point directly to who’s or what’s driving up costs, instead of guessing. It also helps Platform teams that run data infrastructure with a self-service approach for developers.

For example, one project I oversaw had multiple storage services tagged as “archived data”. We found out most of this data hadn’t been accessed in over a year. Simple tagging allowed us to identify this and reduce our storage costs by 15% within a quarter.
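That kind of cleanup can be sketched as a filter on a tag plus a last-access timestamp. This is only an illustration: the `data_class` tag key, the `last_accessed` field, and the sample buckets are hypothetical, and a real platform would expose access times through its own API.

```python
from datetime import datetime, timedelta

# Hypothetical inventory; `last_accessed` would come from your storage platform.
storage = [
    {"id": "bucket-a", "tags": {"data_class": "archived data"}, "last_accessed": datetime(2023, 1, 15)},
    {"id": "bucket-b", "tags": {"data_class": "archived data"}, "last_accessed": datetime(2024, 8, 1)},
    {"id": "bucket-c", "tags": {"data_class": "hot"}, "last_accessed": datetime(2022, 5, 1)},
]

def stale_resources(resources, tag_key, tag_value, now, max_idle=timedelta(days=365)):
    """Return ids of resources carrying the tag that were not accessed within `max_idle`."""
    return [
        r["id"]
        for r in resources
        if r["tags"].get(tag_key) == tag_value and now - r["last_accessed"] > max_idle
    ]

candidates = stale_resources(storage, "data_class", "archived data", now=datetime(2024, 10, 1))
```

Without the tag, `bucket-c` would look just as stale as `bucket-a`; the tag is what tells you which data was meant to be archived in the first place.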

Compliance Auditing and Security

When you’re managing data that falls under regulations like GDPR or HIPAA, knowing where your sensitive data lives (emails, credit cards, and PII, or Personally Identifiable Information, in general) is mandatory. In my experience, this is where tagging really shows its value. Your Head of Security and Compliance will be happy if they can have a clear view of what’s happening at any point in time!

By tagging assets and data that carry sensitive information, it’s possible to tighten access controls and easily demonstrate compliance during audits (e.g. by exporting all the authentication and authorization policies for all assets, grouped by tag).

Streams of data being tagged in Conduktor

In the data world, it’s absolutely necessary to tag the tables, streams, and fields that handle PII. This way, they can be isolated, encrypted, and given stricter access controls… and you can prove it! When the auditors came in, it took us minutes to produce the necessary reports: it’s a simple .csv export.
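As a rough sketch of what such an export could look like, the snippet below filters a catalog on a `pii` tag and writes a CSV report. The catalog structure, the `pii` tag key, and the `encrypted`/`readers` fields are assumptions for illustration; this is not Conduktor's export format.

```python
import csv
import io

# Hypothetical data catalog entries with tags and access metadata.
assets = [
    {"name": "customers", "kind": "table", "tags": {"pii": "true"},
     "encrypted": True, "readers": ["crm-team"]},
    {"name": "orders-stream", "kind": "stream", "tags": {"pii": "true"},
     "encrypted": False, "readers": ["billing", "analytics"]},
    {"name": "page-views", "kind": "stream", "tags": {},
     "encrypted": False, "readers": ["analytics"]},
]

def is_pii(asset):
    """Tag values are plain strings, so the check is a string comparison."""
    return asset["tags"].get("pii") == "true"

def pii_report_csv(assets):
    """Build a CSV report listing every PII-tagged asset and who can read it."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["name", "kind", "encrypted", "readers"])
    for a in assets:
        if is_pii(a):
            writer.writerow([a["name"], a["kind"], a["encrypted"], ";".join(a["readers"])])
    return buf.getvalue()

report = pii_report_csv(assets)
# The same tag also surfaces policy gaps: PII assets that are not encrypted.
unencrypted_pii = [a["name"] for a in assets if is_pii(a) and not a["encrypted"]]
```

The report answers the auditor's question, and the second query answers yours: which tagged assets do not yet meet the policy.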

Automation & Statefulness

One of the best outcomes from using tags effectively is how they enable automation. When resources are properly labeled, it’s far easier to set up workflows that automatically scale systems based on demand. This has saved me and my team countless hours of manual management and custom development.

For example, I worked on a product where customer-facing apps were tagged as “spiky_high_traffic”. Based on these tags, an orchestrator automatically scaled them up during peak times and down during quieter periods. The services were never over-provisioned, and we never compromised on user experience. Other services with similar use cases can reuse the tag, which also makes them easy to identify.
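The decision logic of such an orchestrator can be sketched in a few lines. This is a simplified illustration, not the orchestrator from that project: the `traffic_profile` tag key, the requests-per-replica capacity, and the replica bounds are all assumed values.

```python
import math

def desired_replicas(service, current_rps, rps_per_replica=100, min_replicas=2, max_replicas=20):
    """Scale only services tagged spiky_high_traffic; leave everything else untouched."""
    if service["tags"].get("traffic_profile") != "spiky_high_traffic":
        return service["replicas"]
    needed = math.ceil(current_rps / rps_per_replica)
    # Clamp to the allowed range so quiet periods keep a floor and peaks keep a ceiling.
    return max(min_replicas, min(max_replicas, needed))

# Hypothetical services: only the first one opted in through its tag.
checkout = {"name": "checkout", "replicas": 4, "tags": {"traffic_profile": "spiky_high_traffic"}}
backoffice = {"name": "backoffice", "replicas": 3, "tags": {}}

peak = desired_replicas(checkout, current_rps=750)
quiet = desired_replicas(checkout, current_rps=50)
untouched = desired_replicas(backoffice, current_rps=750)
```

The scaling code knows nothing about individual services. A new service opts in by adding the tag, with no change to the orchestrator.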

One particular aspect I also value about tags/labels is that they are just metadata, basically a Map[String, String] on any item. It means they can convey information like time aspects (see below). They can be interpreted by someone or something (automation, GitOps, scheduler) and can carry state. No need for an external database: each item is a mini-database by itself.

start_at: 09:00
shutdown_at: 18:00
daily_restart: 02:00
expire_after: 30d
auto_archive_at: 2024-12-31
restart_interval: 15m
run_during: 22:00-06:00
…

Who needs a database now?
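Here is a minimal sketch of how a scheduler might interpret a couple of these tags. The semantics are my assumptions: `start_at`/`shutdown_at` are read as a daily running window in HH:MM, and `expire_after` only supports a number of days like `30d`.

```python
from datetime import datetime, timedelta

def parse_hhmm(value):
    """Tag values are strings, so '09:00' has to be parsed into a time."""
    return datetime.strptime(value, "%H:%M").time()

def should_be_running(tags, now):
    """True if `now` falls inside [start_at, shutdown_at); no window means always on."""
    if "start_at" not in tags or "shutdown_at" not in tags:
        return True
    start, stop = parse_hhmm(tags["start_at"]), parse_hhmm(tags["shutdown_at"])
    t = now.time()
    if start <= stop:
        return start <= t < stop
    return t >= start or t < stop  # the window wraps past midnight

def is_expired(tags, created_at, now):
    """Interpret expire_after values of the form '<N>d' (days only in this sketch)."""
    value = tags.get("expire_after")
    if value is None:
        return False
    if not value.endswith("d"):
        raise ValueError(f"unsupported expire_after value: {value!r}")
    return now - created_at > timedelta(days=int(value[:-1]))

tags = {"start_at": "09:00", "shutdown_at": "18:00", "expire_after": "30d"}
running_mid_morning = should_be_running(tags, datetime(2024, 10, 1, 10, 30))
running_at_night = should_be_running(tags, datetime(2024, 10, 1, 20, 0))
expired = is_expired(tags, created_at=datetime(2024, 8, 1), now=datetime(2024, 10, 1))
```

The state lives on the resource itself. The scheduler only needs to list resources and read their tags, so there is no separate schedule table to keep in sync.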

Say yes to tags and labels

From my experience, it’s clear: tags and labels are not optional if you want to manage any infra/data effectively. They give precise control over costs, compliance, and automation, transforming chaos into clarity.

If you want to tag your Kafka data streaming resources, don’t hesitate: Conduktor.