GCP AI Assistant
An AI Engineering portfolio project that combines Retrieval-Augmented Generation (RAG) over official GCP documentation with an MCP Server for real-time cloud resource interaction — all running locally via Docker.
Ask questions about Google Cloud in natural language and get answers grounded in the official documentation. No hallucinations, no guessing.
Features
RAG Pipeline
- Scrapes 24 GCP AI/ML service documentation pages automatically
- Chunks documents using header-aware Markdown splitting
- Indexes ~2,594 vectors locally in Qdrant — no embedding API required
- Retrieves semantically relevant chunks and generates grounded answers via Groq/Llama
S3 Pipeline Control
- Stores raw scraped docs in a LocalStack S3 bucket
- Moves processed docs to a separate bucket after indexing
- Idempotent re-runs — skips already processed files
REST API
- FastAPI endpoint POST /ask/ with automatic Swagger docs
- CORS configured for local frontend development
React Frontend
- Anime noir aesthetic inspired by Persona 5 and Cowboy Bebop
- Chat interface with query history
- Served via NGINX in Docker
MCP Server (in progress)
- Intent Router — automatically decides between RAG and real GCP actions
- Cloud Storage — list buckets
- BigQuery — list datasets
- Compute Engine — list VM instances
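The Intent Router is still in progress, so the sketch below is purely hypothetical: it routes action-style questions to MCP tool names and everything else to RAG using keyword matching, which stands in for whatever classifier the real router will use. The tool names and keyword table are illustrative, not taken from the project.

```python
# Hypothetical Intent Router sketch: decide between the RAG pipeline
# (documentation questions) and an MCP tool (live GCP actions).
ACTION_KEYWORDS = {
    "list buckets": "mcp.storage.list_buckets",
    "list datasets": "mcp.bigquery.list_datasets",
    "list instances": "mcp.compute.list_instances",
}

def route(question: str) -> str:
    """Return an MCP tool name for action-style questions, else 'rag'."""
    q = question.lower()
    for phrase, tool in ACTION_KEYWORDS.items():
        if phrase in q:
            return tool
    return "rag"
```

For example, `route("Please list buckets in my project")` routes to the Cloud Storage tool, while `route("What is Vertex AI?")` falls through to RAG.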
Prerequisites
Before you start, make sure you have:
- Docker Desktop (running)
- Python 3.11+ (for the setup pipeline)
- Node.js 20+ (for frontend development only)
- A Groq API key (free tier — no credit card required)
- A LocalStack Auth Token (free tier)
Installation
1. Clone the repository:
git clone https://github.com/your-username/gcp-rag.git
cd gcp-rag
2. Configure environment variables:
cp .env.example .env
Open .env and fill in your API keys:
LOCALSTACK_AUTH_TOKEN=your_localstack_token
GROQ_API_KEY=your_groq_api_key
The remaining variables have sensible defaults and don't need to be changed for local development.
3. Start the infrastructure:
docker compose up -d
This starts 4 containers: LocalStack (S3), Qdrant, FastAPI, and NGINX.
4. Install Python dependencies:
pip install -r requirements.txt
5. Run the scraping and ingestion pipeline:
python main.py --setup
This will:
- Scrape 24 GCP AI/ML documentation endpoints
- Chunk and embed all documents locally (no API calls)
- Index ~2,594 vectors into Qdrant
- Move processed files to the processed S3 bucket
This step takes a few minutes on first run. Subsequent runs skip already-processed files.
6. Open the interface:
Go to http://localhost:80 in your browser.
The FastAPI Swagger docs are available at http://localhost:8000/docs.
Usage
Chat interface:
Open http://localhost:80 and ask anything about GCP:
What is Vertex AI?
How does Dialogflow handle intent detection?
What are the differences between Dialogflow CX and ES?
How do I use Cloud TPU for training?
API directly:
curl -X POST http://localhost:8000/ask/ \
-H "Content-Type: application/json" \
-d '{"question": "How does Vertex AI handle model versioning?"}'
Response:
{
"answer": "Vertex AI Model Registry allows you to..."
}
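The same call can be made from Python with only the standard library; this sketch assumes the default local port from the compose setup:

```python
import json
import urllib.request

API_URL = "http://localhost:8000/ask/"

def build_request(question: str) -> urllib.request.Request:
    """Build the POST request for the /ask/ endpoint."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps({"question": question}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(question: str) -> str:
    """Send the question and return the grounded answer."""
    with urllib.request.urlopen(build_request(question)) as resp:
        return json.load(resp)["answer"]
```

With the containers running, `ask("How does Vertex AI handle model versioning?")` returns the same answer string the chat interface would show.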
Architecture
User
↓
Frontend (React + NGINX :80)
↓
FastAPI (:8000)
↓
RAG Pipeline
├── Retriever → Qdrant (:6333) → top-5 chunks
└── Generator → Groq / Llama 3.3 70B → response
Setup Pipeline (python main.py --setup)
├── Scraping → crawl4ai + BFS → LocalStack S3 raw bucket
├── Chunking → MarkdownHeaderTextSplitter + RecursiveCharacterTextSplitter
├── Embedding → sentence-transformers/all-MiniLM-L6-v2 (local)
└── Indexing → Qdrant (2,594 vectors)
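The retriever box above reduces to one core operation: cosine similarity between the query embedding and each chunk embedding, keeping the top 5. The real pipeline embeds with sentence-transformers and queries Qdrant; this toy sketch uses hand-made 2-d vectors purely to show the shape of that step.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 5) -> list[str]:
    """Rank (text, vector) chunks by similarity to the query; return top-k texts."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

The selected texts are then stuffed into the prompt that the Generator sends to Groq, which is what grounds the answer in the documentation.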
Tech Stack
| Layer | Technology |
|---|---|
| Scraping | crawl4ai + BFSDeepCrawlStrategy |
| Object Storage | LocalStack S3 (raw + processed buckets) |
| Chunking | LangChain MarkdownHeaderTextSplitter |
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 |
| Vector Store | Qdrant |
| LLM | Groq — llama-3.3-70b-versatile |
| Orchestration | LangChain LCEL |
| API | FastAPI + Uvicorn |
| Frontend | React + Vite + NGINX |
| Infrastructure | Docker Compose |
GCP Services Covered
The assistant covers 24 GCP AI/ML services across 8 categories:
| Category | Services |
|---|---|
| ML Platform | Vertex AI, Vertex AI Generative AI |
| Generative AI | Gemini API |
| Conversational AI | Dialogflow CX, Dialogflow ES, Agent Builder, Agent Assist, Contact Center AI |
| Vision | Cloud Vision API, Video Intelligence API, AutoML Vision, Vertex AI Vision |
| Natural Language | Natural Language API, Cloud Translation, Healthcare NL AI |
| Speech | Speech-to-Text, Text-to-Speech |
| Document AI | Document AI |
| ML Infrastructure | Cloud TPU, Deep Learning Containers, Deep Learning VM |
| Data for ML | Timeseries Insights API, Recommendations AI, Vertex AI Search for Retail |
Project Structure
gcp-rag/
├── api/
│ ├── main.py # FastAPI endpoints + CORS
│ └── Dockerfile
├── config/
│ └── config.py # Environment configuration
├── frontend/
│ ├── src/App.jsx # React chat interface
│ └── Dockerfile
├── mcp/
│ ├── tools.py # GCP SDK tools (@tool decorators)
│ └── router.py # Intent Router (RAG vs MCP)
├── rag/
│ ├── chunking.py
│ ├── embedding.py
│ ├── generator.py
│ ├── indexer.py
│ ├── ingestion.py
│ ├── pipeline.py
│ └── retriever.py
├── scraping/
│ └── scraping.py
├── storage/
│ └── bucket_storage.py
├── docker-compose.yml
├── main.py
├── requirements.txt
└── .env.example
Roadmap
- [x] Web scraping pipeline (crawl4ai + BFS + URL filtering)
- [x] LocalStack S3 storage with two-bucket pipeline control
- [x] Markdown chunking with header-aware splitting
- [x] Local vector embeddings (no API rate limits)
- [x] Qdrant vector indexing (2,594 chunks)
- [x] RAG pipeline with LangChain LCEL
- [x] FastAPI REST endpoint
- [x] React frontend with anime noir aesthetic
- [x] Full Docker Compose setup (4 containers)
- [ ] MCP Server — Intent Router
- [ ] MCP Server — Cloud Storage integration
- [ ] MCP Server — BigQuery integration
- [ ] MCP Server — Compute Engine integration
- [ ] RAG evaluation metrics
- [ ] LangSmith observability
Design Decisions
Why LocalStack instead of a real database for raw storage?
Object storage is the industry standard for data lake pipelines. Two buckets (raw/ and processed/) provide clear pipeline state control. In production, this migrates to AWS S3 or GCS with zero code changes.
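The "zero code changes" claim rests on keeping the endpoint URL in configuration: LocalStack and real S3 differ only in where the client points. A sketch of that pattern (the environment variable names here are assumptions, not the project's actual config keys):

```python
import os

def s3_client_kwargs() -> dict:
    """Build client kwargs; endpoint_url is only set for local development."""
    kwargs = {"region_name": os.getenv("AWS_REGION", "us-east-1")}
    endpoint = os.getenv("S3_ENDPOINT_URL")  # e.g. http://localhost:4566 for LocalStack
    if endpoint:
        kwargs["endpoint_url"] = endpoint
    return kwargs  # pass as boto3.client("s3", **kwargs)
```

Unset `S3_ENDPOINT_URL` in production and the same code talks to real AWS S3; point it at a GCS-compatible gateway and the storage layer follows.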
Why local embeddings?
sentence-transformers/all-MiniLM-L6-v2 runs locally with no API calls, no rate limits, and no cost. LangChain's abstraction makes swapping to any other embedding model a one-line change.
Why Groq instead of OpenAI or Gemini?
14,400 requests/day on the free tier, no credit card required, and vendor-agnostic integration — a key skill for AI Engineering roles.
