From Classroom Question to Enterprise Pattern: Rethinking AI Retrieval on Governed Data Platforms

This started while I was preparing my Big Data labs for my students. I wanted something closer to reality. Not only pipelines and Spark jobs, but how AI agents would actually interact with a governed data platform.

So I set up the environment the way a serious deployment would have it. A lakehouse governed with AWS Lake Formation, metadata centralized in the AWS Glue Data Catalog, and Spark handling execution. Clean, controlled, auditable.

Then I added the missing piece. An agent that needs to understand intent and retrieve context, not just run queries.

And almost immediately, the same request appeared. “We need a graph database for ontology.” In AWS terms, that means Amazon Neptune.

I see this pattern often, and not only with students. It shows up in real projects too. Someone arrives with the solution already decided.

I always give the same answer. What is the business problem you are trying to solve?

Because “I need Neptune” is not a requirement. It is a conclusion.

When you force the conversation back to the problem, things become clearer. What people actually want is not a graph database. They want semantic understanding, better retrieval, and AI-friendly access to data.

When I tried to integrate Neptune into the lab architecture, the problem became obvious. It sits outside the governance model. Lake Formation permissions do not cover it. Its data is not described in the Glue Data Catalog. It introduces its own access control, its own metadata, its own lifecycle.

At small scale, you can make it work. You can connect systems, write adapters, manage permissions in two places. It works in isolation.

At enterprise level, this becomes an antipattern.

You end up maintaining two different systems for governance. Two different models for RBAC. Two sources of truth for metadata. And eventually, they drift. They always drift.

This is not a technology problem. It is an architectural one.

So I went back again to the original question. What is the minimum needed to solve the real problem without breaking what already works?

The answer was not to add a new system. It was to extend the current one with discipline.

Keep governance where it already is. Keep metadata where it already is. Do not duplicate control planes.

Then introduce a semantic layer aligned with that governance model, and use FAISS, a vector similarity search library, only as a retrieval accelerator.

The key constraint is simple and strict. You only generate embeddings from data that has already passed through Lake Formation. You never use the vector index to decide access. You use it only to rank and retrieve.
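A minimal sketch of that constraint, under stated assumptions. Everything here is a stand-in: `ROWS` and `governed_read` are hypothetical and play the role of a Spark read that Lake Formation has already filtered, `embed` is a toy bag-of-words vector instead of a real embedding model, and a brute-force cosine ranking takes the place of FAISS so the example runs with the standard library alone. The point is only the ordering: the access decision happens before anything is embedded, and the index does nothing except rank.

```python
import math
from collections import Counter

# Hypothetical governed table. In the real platform the role filter below
# is applied by Lake Formation, not by application code.
ROWS = [
    {"id": 1, "text": "quarterly revenue by region", "allowed_roles": {"analyst", "finance"}},
    {"id": 2, "text": "employee salary bands", "allowed_roles": {"finance"}},
    {"id": 3, "text": "regional sales pipeline forecast", "allowed_roles": {"analyst", "finance"}},
]


def governed_read(role):
    """Stand-in for a read whose result is already permission-filtered."""
    return [row for row in ROWS if role in row["allowed_roles"]]


def embed(text):
    """Toy bag-of-words embedding (assumption: a real model would go here)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(role):
    """Embed only the rows the governed read returned for this role."""
    return [(row["id"], embed(row["text"])) for row in governed_read(role)]


def rank(index, query, k=2):
    """Rank what is in the index. No access decision is made here."""
    q = embed(query)
    scored = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [row_id for row_id, _ in scored[:k]]


analyst_index = build_index("analyst")
finance_index = build_index("finance")
print(rank(analyst_index, "salary bands", k=3))  # row 2 can never appear here
print(rank(finance_index, "salary bands", k=1))
```

The analyst's index never contains the salary row, so no query, however well phrased, can surface it. Nothing in `rank` knows about roles, and that is the intent.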

This removes the need to replicate RBAC. It removes the need to synchronize metadata. It keeps everything consistent.

When I translated this into the lab, the flow became natural. The agent maps intent, Glue resolves structure, Lake Formation enforces permissions, Spark retrieves data, embeddings are created, FAISS retrieves similar context, and the response is built.
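The flow can be sketched end to end as a chain of small functions. This is an illustration under assumptions, not the lab code: `CATALOG`, `GRANTS`, and `DATA` are hypothetical in-memory stand-ins for the Glue Data Catalog, Lake Formation column grants, and a Spark read, and a word-overlap (Jaccard) score replaces both the embedding model and FAISS so that it runs with the standard library only.

```python
# Hypothetical stand-ins for the platform services.
CATALOG = {"sales": ["region", "amount", "note"]}      # Glue Data Catalog
GRANTS = {("analyst", "sales"): {"region", "note"}}    # Lake Formation grants
DATA = {                                               # what Spark would read
    "sales": [
        {"region": "north", "amount": 120, "note": "late shipments reduced revenue"},
        {"region": "south", "amount": 340, "note": "new retail partner increased revenue"},
    ]
}


def map_intent(question):
    """Agent step: pick the catalogued table the question mentions."""
    for table in CATALOG:
        if table in question.lower():
            return table
    raise LookupError("no catalogued table matches the question")


def resolve_structure(table):
    """Catalog step: look up the table's columns."""
    return CATALOG[table]


def enforce(role, table, columns):
    """Governance step: keep only the columns this role was granted."""
    allowed = GRANTS.get((role, table), set())
    permitted = [c for c in columns if c in allowed]
    if not permitted:
        raise PermissionError(f"{role} has no access to {table}")
    return permitted


def retrieve(table, columns):
    """Execution step: project the permitted columns."""
    return [{c: row[c] for c in columns} for row in DATA[table]]


def similar(rows, question, k=1):
    """Ranking step: word overlap stands in for embeddings plus FAISS."""
    q = set(question.lower().split())

    def score(row):
        words = set(" ".join(str(v) for v in row.values()).lower().split())
        return len(q & words) / len(q | words)

    return sorted(rows, key=score, reverse=True)[:k]


def answer(role, question):
    """Run the stages in the order the text describes."""
    table = map_intent(question)
    columns = enforce(role, table, resolve_structure(table))
    rows = retrieve(table, columns)
    return {"table": table, "columns": columns, "context": similar(rows, question)}


result = answer("analyst", "why did sales revenue drop with late shipments")
print(result)
```

The ranking stage only ever sees rows that `enforce` and `retrieve` have already shaped. The `amount` column is gone before any similarity is computed, and a role with no grant fails at the governance step, not at retrieval.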

No parallel systems. No duplicated governance.

What started as a teaching exercise exposed a real enterprise issue. The tendency to jump to technology instead of understanding the problem.

Now when I get the request again, “we need Neptune for AI ontology”, the answer is still the same.

What is the business problem you are trying to solve?

Because if the answer is retrieval, context, and AI interaction over governed data, then introducing a separate system that cannot be governed or catalogued is not a solution.

It is technical debt in advance.

I started this trying to design a better lab. I ended up with a pattern I would use in a real enterprise platform without hesitation.

Manuel Hernández Giuliani
