How to Govern Data using Kafka & Avro

Stéphane Derosiaux
10 min read · Dec 17, 2018
Data are like trains: where should they go?

I work in a large company where my team and I deal with JSON data (let’s not talk about CSV and XML, please). We had to find out the definitions and business rules of the models we received from other services. Trust me, it’s not that easy. Knowledge is sparse.

There is ongoing work in a business unit dedicated to the standardization of the models across the company. It’s purely conceptual, without code. Each service implements the models it relies on. Nobody shares code. It’s a share-nothing architecture. While that’s a good technical pattern in general, it’s not a good pattern within a company.

Coming from startups, I’m used to working with shared Avro and Protobuf schemas, hence a rather delicate transition to working with “just” JSON. This made me raise some questions about JSON efficiency and data-sharing policies across our services.

Data are ubiquitous. Data are the main driver in a company. Data evolve. How can we deal with them in a scalable and robust fashion while also being future-proof?

Big means less control

Unlike a small company that can adapt quickly, a large company doesn’t have full control over its systems. It’s a bunch of autonomous teams, external systems, different countries with their own mechanics, and legacy systems to manage.

It’s very hard and time-consuming to put everyone under the same flag, especially when people are spread across several countries, where cultures and habits can differ.

Big companies often look to plug all their resources together. It’s complex to combine everything while offering data coherence, security, and quality. That’s the mission of the architects. That’s the mission of data governance.

Companies could decide to leave each country or entity alone and only build some “pipelines” on top of them to communicate just the bare minimum they need. But that doesn’t remove the main issue: data must be shared.


Data & Services Governance

In a large company, it’s quite reasonable to state that the data are not all stored on the same system (like Hadoop or BigQuery) but scattered across multiple systems. Who controls what’s where? How do I know what exists?

If we go further, we have the same problem in a microservices environment. The data are “hidden” behind facades. If our service needs some user information, it needs to call a User service. How do I find it? How do I register to it? How can I be assured it will always work: does it have SLAs?

This is an Information Governance (IG) problem, and especially a Data Governance problem: the need to ensure the quality, availability, usability, consistency, integrity, and security of the data across a company.

Data Governance can provide an “Enterprise Data Model”. The Data Owners think about the business and structure it so anybody can rely on it (to expose and consume the data). This becomes the standard inside the enterprise: its Ubiquitous Language.

Data Governance is a never-ending strategy because:

  • business evolves
  • technology evolves
  • laws and regulations evolve (GDPR)
  • people come and go.

Multiple actors help define the Data Governance strategy: the Data Owners, the Data Managers, the Data Architects… a complete hierarchy from top to bottom.

Data Discovery

Huge companies full of Data Engineers and Data Scientists have systems that allow for discovery and maintainability of their data. It’s mandatory when you have thousands of datasets. You’d better be well organized.

A company needs to provide a holistic view of its data. It doesn’t want to have silos that don’t share anything or don’t reuse what’s already been done by other teams. It needs a system that provides a unified and consolidated view of the data, like a PIM (Product Information Management) but for any of its resources.

Data have a life-cycle. They are created one day, evolve over time, and are eventually deprecated then removed.

Data have metadata: about their existence, ownership, volume, origins, etc. The same reasoning goes for the services (which expose the aforementioned data).

For instance, Spotify uses an in-house system called System-Z to build and organize their services and deal with their metadata, while allowing other teams to discover them.

Keep it simple, stupid: JSON

JSON is ubiquitous because it’s simple to generate, parse, and work with (i.e. to understand and debug). All languages have libraries to process it. Nobody “owns” JSON. You serialize any payload, the other side is aware of the schema (through emails, Google Docs/Sheets, or just through self-discovery), it deserializes it with its own code, and you’re done. You just need to know what the fields mean (and that’s not always straightforward).

JSON enables a share-nothing architecture. You don’t create any tension between services. You don’t need any centralized code repository. You don’t need the applications to depend on a third party to get the schema of the data you need to read or send (note: https://json-schema.org/). Each service can own its own part.
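For instance, a consumer simply maps the JSON it receives onto its own model, without depending on anything from the producer. Here is a minimal sketch in Scala, assuming the circe library (any JSON library would do); the User model is just an illustration:

import io.circe.generic.auto._ // derives a Decoder[User] at compile time
import io.circe.parser.decode

// The consumer's own view of the data: nothing is shared with the producer.
case class User(id: Long, name: String)

// A payload received from another service; the "age" field is silently ignored
// because our model doesn't know about it.
val json = """{ "id": 42, "name": "Alice", "age": 30 }"""

decode[User](json) match {
  case Right(user) => println(s"Got $user")
  case Left(error) => println(s"Cannot parse the payload: $error")
}

It works, but the “contract” only lives in this case class: if the producer renames name tomorrow, we will only find out at runtime.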


What about centralizing data schemas?

In order to provide more robustness and to offer some centralization and more control, it’s possible to use a schema-based architecture built on Avro.

Avro provides a language-agnostic way of defining data schemas.

The data owners can ensure their schemas are backward and forward compatible to ease the constraints on consumers and producers.

For instance, a consumer relying on the v2 schemas could consume data produced with the v10 schemas (forward compatibility). Or the other way around: a consumer relying on v10 schemas could read messages produced using v1 schemas (backward compatibility). Those are nice features to have; must-have features, really. We don’t need to do a big-bang release (stop everything, flush the queues, restart) each time a field is added or removed.
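To make this concrete, here is a minimal sketch (using the plain Avro library from Scala, with field names of my own invention) of a schema evolution that stays compatible: a field is added with a default value, so readers of the old and new schemas can both cope.

import org.apache.avro.{Schema, SchemaCompatibility}

val v1 = new Schema.Parser().parse(
  """{"type": "record", "name": "User", "fields": [
    |  {"name": "id", "type": "long"},
    |  {"name": "name", "type": "string"}
    |]}""".stripMargin)

// v2 adds a field WITH a default value: data written with v1 can still be read with v2.
val v2 = new Schema.Parser().parse(
  """{"type": "record", "name": "User", "fields": [
    |  {"name": "id", "type": "long"},
    |  {"name": "name", "type": "string"},
    |  {"name": "country", "type": "string", "default": "unknown"}
    |]}""".stripMargin)

// Can a v2 reader read data written with v1? (backward compatibility)
println(SchemaCompatibility.checkReaderWriterCompatibility(v2, v1).getType) // COMPATIBLE
// Can a v1 reader read data written with v2? (forward compatibility)
println(SchemaCompatibility.checkReaderWriterCompatibility(v1, v2).getType) // COMPATIBLE: v1 ignores "country"

Remove the default value from country and the backward check flips to INCOMPATIBLE.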

Backward & Forward compatibility

In JSON, we often have forward compatibility for “free”: if we produce a new field, the old consumers simply ignore it (personally, I think the program should crash when a field is not recognized).

But this is just a happy side effect: nothing is really controlled or validated. This is why we need schemas and a schema-validation process.

Schemas are used when reading and writing data. They must be shared between the services that transmit data to each other (they could also be embedded inside every message, but that defeats efficiency). This adds some coupling in exchange for robustness and common knowledge, but it also raises a lot of questions:

  • Who hosts the schemas?
  • Who is the owner of the schemas? Who can alter them?
  • Who can consume them?
  • How do we control who accesses what? Can we audit it?
  • Can we freely explore the schemas? How?

Whether we use JSON or Avro schemas, we quickly drift toward a data-sharing issue across the whole company and, more globally, toward Data Governance issues.

Producers and consumers linked to a common schemas repository

Sharing data through Kafka using Avro

Apache Kafka is software that transports data across services and applications. It has become common in all sorts of companies over the last few years because it’s very performant, simple to use, answers many use-cases, and encourages “simpler” designs that reduce coupling.

From a single application, Apache Kafka grew into a Streaming Platform, providing more tools to transform our data and connect to external systems: Kafka Streams (to transform a stream of data and republish it into Kafka) and Kafka Connect (to send a stream of data into a datastore such as Elasticsearch).

Apache Avro comes from the Hadoop world where we needed a way to save, transport, and query data efficiently while dealing with versioning.

Both are built upon mechanical sympathy: they take advantage of the underlying hardware (caches, access patterns, pipelines) to offer great performance.

We can combine them to get efficient transport of efficiently encoded data.
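As an illustration, here is a minimal sketch of a producer sending Avro records into Kafka through Confluent’s KafkaAvroSerializer (broker address, registry URL, and topic name are placeholders):

import java.util.Properties
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")          // placeholder
props.put("schema.registry.url", "http://localhost:8081") // placeholder
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer")

val schema: Schema = new Schema.Parser().parse(
  """{"type": "record", "name": "User", "fields": [
    |  {"name": "id", "type": "long"},
    |  {"name": "name", "type": "string"}
    |]}""".stripMargin)

// A GenericRecord carries its schema; the serializer registers it and only ships its ID.
val user: GenericRecord = new GenericData.Record(schema)
user.put("id", 42L)
user.put("name", "Alice")

val producer = new KafkaProducer[String, GenericRecord](props)
producer.send(new ProducerRecord("users", "42", user)) // "users" is a placeholder topic
producer.close()

The consumer side is symmetrical, with the KafkaAvroDeserializer fetching the schema from the registry.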

Centralizing and exposing the data with Kafka using Avro

Avro

Without going into details, Avro is performant because its payload:

  • contains only data: no field names, no noise, just the bare data.
  • is schema-based: the schema describes how to read/write the data. The schema is never written inside the messages (except in a .avro file, where it can be written in the header) but is provided by some external dependency when the time comes to deserialize: the deserializer only gets a schema ID from the message, then fetches the complete schema from the registry and caches it (see the sketch below).
  • is binary: it’s not meant to be read by humans. Fortunately, there are tools to read .avro files or messages passing through Kafka, such as avro-tools.jar or kafka-avro-console-consumer from the Confluent Platform.

Avro also:

  • Supports compression such as Google’s Snappy.
  • Supports projections: it won’t deserialize all columns if we only ask for a specific one.
  • Causes low GC pressure.
  • Is not a columnar storage format like Parquet, which compresses better, supports filter pushdown predicates, and much more, but is better suited to big chunks of data.

More details in this slideshare.
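For the curious, the messages produced by the Confluent serializers follow a tiny wire format: one magic byte, then the 4-byte schema ID, then the raw Avro binary. A minimal sketch of how a deserializer could extract that ID before asking the Schema Registry for the full schema:

import java.nio.ByteBuffer

// Confluent wire format: [magic byte 0x0][4-byte schema ID][Avro binary payload]
def extractSchemaId(message: Array[Byte]): Int = {
  val buffer = ByteBuffer.wrap(message)
  val magicByte = buffer.get()
  require(magicByte == 0x0, s"Unknown magic byte: $magicByte")
  buffer.getInt() // the schema ID to fetch (and cache) from the Schema Registry
}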

Schemas

A schema is not simply a list of field names. It also contains metadata for each field: its description and its type (and aliases). Avro provides specific types: int, string, boolean, enums, arrays, maps, and also null (it must be specified explicitly: strings are not nullable by default).

Like JSON, Avro is language-agnostic. Any language can send/read Avro messages, as long as it has the schema. A schema is a simple text file (often written in JSON!).

A simple User model could be represented this way in Scala:

case class User(id: Long, name: String)

And its Avro schema would look like this:

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"}
  ]
}
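In Scala, we don’t even have to write that JSON by hand: libraries can derive it from the case class. A minimal sketch assuming the avro4s library (the Option[String] field is mine, to show that a nullable field becomes a union with null):

import com.sksamuel.avro4s.AvroSchema

// The optional field becomes the Avro union ["null", "string"] in the derived schema.
case class User(id: Long, name: String, email: Option[String] = None)

val schema = AvroSchema[User]
println(schema.toString(true)) // pretty-prints the generated Avro schema as JSON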

It’s possible to share such schemas across projects by creating a dedicated, versioned git repository. Any project can either use git submodules to import it, or build an artifact and use its dependency manager to import the compiled schemas.

Thanks to this, any project can depend on a schema at compile time (let’s ignore languages such as JavaScript, which do runtime checks only). It means that if the project compiles, we’re almost good to go into production, because we are using the “official” data schemas: we will be compatible with anyone producing or consuming them (if the schema is compatible of course, as we’ll see right below).
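For instance, with sbt, consuming the shared schemas can be as simple as declaring the artifact (the coordinates below are hypothetical):

// build.sbt: depend on the company-wide schemas artifact
libraryDependencies += "com.acme" %% "data-schemas" % "1.4.0"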


Compatibility

As we said: data evolve. Avro’s biggest selling point is its schema evolution management (the second selling point being its tiny payload size and thus fast serde). How lucky we are!

When a schema in the git repository changes, that does not mean it’s backward or forward compatible: git does not have the intelligence to validate that.

This is where the Schema Registry comes into play.

It’s a small application (typically a simple REST app with a database) that exists only to expose existing schemas and validate that the new schemas are compatible.

  • When a data producer starts, it registers the schema it is using (if it’s not yet inside the registry): this is where the registry can reject the request because, if the schema is incompatible, it could break potential consumers. Imagine a field was removed: what if a consumer (not updated yet) depends on it?

Enforcing the compatibility through a third-party component is a must-have. As we know, to err is human.
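Here is a minimal sketch of that conversation with the registry, assuming Confluent’s CachedSchemaRegistryClient (the subject follows the usual <topic>-value convention; method names may vary slightly across client versions):

import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient
import org.apache.avro.Schema

val registry = new CachedSchemaRegistryClient("http://localhost:8081", 100) // placeholder URL

val newSchema: Schema = new Schema.Parser().parse(
  """{"type": "record", "name": "User", "fields": [
    |  {"name": "id", "type": "long"},
    |  {"name": "name", "type": "string"},
    |  {"name": "country", "type": "string", "default": "unknown"}
    |]}""".stripMargin)

// Ask the registry whether this schema is compatible with what is already registered...
if (registry.testCompatibility("users-value", newSchema)) {
  // ...and only then register it (this is what the Avro serializer does on your behalf).
  val schemaId = registry.register("users-value", newSchema)
  println(s"Registered under ID $schemaId")
} else {
  println("Rejected: this schema would break existing consumers")
}

In practice you rarely call the client yourself: the Confluent serializer registers schemas automatically, and the registry’s compatibility mode (BACKWARD, FORWARD, FULL…) decides what gets rejected.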

You can check out my article if you want to grasp the technical details of Avro and the Schema Registry.

Conclusion

This was just a prelude to the huge world of Data Governance and the combination of Kafka and Avro. It’s not a “you should do this” article, but more of a “you could do this” one.

Combining the Kafka Platform and Avro gives us a bright and stable future:

  • A performant and robust way of exchanging data.
  • A total decoupling between producers and consumers: a service doesn’t need to manage who consumes its data.
  • Governance: with the Schema Registry, you can list all the types of data available and their schemas (and their evolutions). Nobody needs to ask you “what is the format of your data? send me an Excel file please”.
  • Future-proof: with Avro, you ensure nobody can inject “bad” data that could break consuming services.

But that also raises tons of questions:

  • Who owns what? Kafka? Its data? The Schema Registry? Its schemas?
  • If you share something, you must define SLAs.
  • Access to and modification of the data and schemas need to be controlled (ACLs).
  • What happens when you really need to break schema compatibility?
