
From Classroom Question to Enterprise Pattern: Rethinking AI Retrieval on Governed Data Platforms


This started while I was preparing Big Data labs for my students. I wanted something closer to reality: not only pipelines and Spark jobs, but how AI agents would actually interact with a governed data platform.

So I set up the environment the way a serious platform should look. A lakehouse governed with AWS Lake Formation, metadata centralized in the AWS Glue Data Catalog, and Spark handling execution. Clean, controlled, auditable.

Then I added the missing piece. An agent that needs to understand intent and retrieve context, not just run queries.

And almost immediately, the same request appeared. “We need a graph database for ontology.” In AWS terms, that means Amazon Neptune.

I see this pattern often, and not only with students. It happens in real projects too. Someone arrives with the solution already decided.

I always give the same answer. What is the business problem you are trying to solve?

Because “I need Neptune” is not a requirement. It is a conclusion.

When you force the conversation back to the problem, things become clearer. What people actually want is not a graph database. They want semantic understanding, better retrieval, and AI-friendly access to data.

When I tried to integrate Neptune into the lab architecture, the problem became obvious. It sits outside the governance model. Lake Formation does not govern it. It is not part of the Glue Data Catalog. It introduces its own access control, its own metadata, its own lifecycle.

At small scale, you can make it work. You can connect systems, write adapters, manage permissions in two places. It works in isolation.

At enterprise level, this becomes an antipattern.

You end up maintaining two different systems for governance. Two different models for RBAC. Two sources of truth for metadata. And eventually, they drift. They always drift.

This is not a technology problem. It is an architectural one.

So I went back again to the original question. What is the minimum needed to solve the real problem without breaking what already works?

The answer was not to add a new system. It was to extend the current one with discipline.

Keep governance where it already is. Keep metadata where it already is. Do not duplicate control planes.

Then introduce a semantic layer aligned with that governance model, and use FAISS only as a retrieval accelerator.

The key constraint is simple and strict. You only generate embeddings from data that has already been read through Lake Formation's permission checks. You never use the vector index to decide access. You use it only to rank and retrieve.
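A minimal sketch of that constraint, under loud assumptions: `governed_read` and the `grants` dictionary are hypothetical stand-ins for a Spark read that Lake Formation has already filtered, `embed` is a toy bag-of-words vector rather than a real embedding model, and `RankOnlyIndex` is a brute-force inner-product search standing in for FAISS so the example runs without dependencies. The point is only the ordering: permissions filter first, the index ranks what is left.

```python
import math
from collections import Counter


def governed_read(principal, table, grants):
    """Hypothetical stand-in for a read that Lake Formation has already filtered."""
    allowed_domains = grants.get(principal, set())
    return [row for row in table if row["domain"] in allowed_domains]


def embed(text):
    """Toy L2-normalized bag-of-words vector (assumption: a real model goes here)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {token: c / norm for token, c in counts.items()}


class RankOnlyIndex:
    """Brute-force inner-product search. It ranks; it never decides access."""

    def __init__(self):
        self._items = []

    def add(self, doc_id, text):
        self._items.append((doc_id, embed(text)))

    def search(self, query, k):
        q = embed(query)
        scored = [
            (sum(w * vec.get(tok, 0.0) for tok, w in q.items()), doc_id)
            for doc_id, vec in self._items
        ]
        scored.sort(reverse=True)
        return [doc_id for _, doc_id in scored[:k]]


rows = [
    {"id": "s1", "domain": "sales", "text": "quarterly revenue by region"},
    {"id": "s2", "domain": "sales", "text": "customer churn by region"},
    {"id": "h1", "domain": "hr", "text": "employee salary bands"},
]
grants = {"analyst": {"sales"}}

# Only rows that survived the permission filter are ever embedded.
index = RankOnlyIndex()
for row in governed_read("analyst", rows, grants):
    index.add(row["id"], row["text"])

hits = index.search("revenue by region", k=2)
```

Because the HR row never reaches the index, no query can surface it for this principal, however similar the text is. There is no permission logic inside the index to keep in sync.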

This removes the need to replicate RBAC. It removes the need to synchronize metadata. It keeps everything consistent.

When I translated this into the lab, the flow became natural. The agent maps intent, Glue resolves structure, Lake Formation enforces permissions, Spark retrieves data, embeddings are created, FAISS retrieves similar context, and the response is built.
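The flow above can be sketched as one function per stage. Everything here is a hypothetical stand-in, not the lab's actual code: `CATALOG` plays the Glue Data Catalog, `GRANTS` plays Lake Formation column permissions, `DATA` plays what Spark would read, and `rank` uses simple token overlap in place of embeddings plus a FAISS similarity search.

```python
# Hypothetical in-memory stand-ins for the platform services.
CATALOG = {"orders": ["order_id", "region", "amount"]}
GRANTS = {("analyst", "orders"): {"order_id", "region"}}
DATA = {
    "orders": [
        {"order_id": 1, "region": "north", "amount": 120},
        {"order_id": 2, "region": "south", "amount": 80},
    ]
}


def map_intent(question):
    """Agent step: pick the catalogued table named in the question."""
    for table in CATALOG:
        if table in question.lower():
            return table
    raise LookupError("no catalogued table matches the question")


def resolve_structure(table):
    """Catalog step: return the table's columns."""
    return CATALOG[table]


def enforce_permissions(principal, table, columns):
    """Governance step: keep only granted columns, or stop the flow."""
    allowed = GRANTS.get((principal, table), set())
    permitted = [c for c in columns if c in allowed]
    if not permitted:
        raise PermissionError(f"{principal} has no access to {table}")
    return permitted


def retrieve(table, columns):
    """Execution step: project the permitted columns only."""
    return [{c: row[c] for c in columns} for row in DATA[table]]


def rank(question, rows, k):
    """Retrieval step: token overlap stands in for embeddings + FAISS."""
    q_tokens = set(question.lower().split())

    def score(row):
        return len(q_tokens & {str(v).lower() for v in row.values()})

    return sorted(rows, key=score, reverse=True)[:k]


def answer(principal, question, k=1):
    table = map_intent(question)
    columns = resolve_structure(table)
    permitted = enforce_permissions(principal, table, columns)
    rows = retrieve(table, permitted)
    context = rank(question, rows, k)
    return {"table": table, "columns": permitted, "context": context}


result = answer("analyst", "orders in the south region")
```

The ordering is the design choice: ranking runs last and only sees rows and columns that governance already released, so a principal without a grant fails at `enforce_permissions` before anything is retrieved or ranked.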

No parallel systems. No duplicated governance.

What started as a teaching exercise exposed a real enterprise issue. The tendency to jump to technology instead of understanding the problem.

Now when I get the request again, “we need Neptune for AI ontology”, the answer is still the same.

What is the business problem you are trying to solve?

Because if the answer is retrieval, context, and AI interaction over governed data, then introducing a separate system that the existing platform cannot govern or catalogue is not a solution.

It is technical debt in advance.

I started this trying to design a better lab. I ended up with a pattern I would use in a real enterprise platform without hesitation.

