Architecture · January 10, 2026 · 9 min

Go for the plumbing, Python for the intelligence

We run both in production. Go handles 50,000 req/s at the gateway. Python runs the ML models. Here is why we split them.

We learned this lesson building a clinical decision support system. The first version was a monolithic FastAPI application that handled authentication, rate limiting, request routing, caching, AND the RAG inference pipeline. It worked at demo scale. At 200 concurrent users, the event loop was spending more time on JWT validation and cache lookups than on actual inference.

We split it. Go handles the API gateway -- authentication, rate limiting, request routing, response caching. Python handles the ML pipeline -- embedding generation, vector search, LLM orchestration. The gateway runs at 50,000 requests/second on a single pod. The Python services run inference at their own pace, isolated from the gateway's throughput requirements.

The problem

Python dominates machine learning for good reason: PyTorch, TensorFlow, LlamaIndex, LangChain, scikit-learn, and every major model provider's SDK are Python-first. Fighting this ecosystem is pointless.

But Python is mediocre at everything that isn't ML: high-concurrency HTTP serving, connection pooling, binary protocol handling, and CPU-bound request processing. The GIL is real, asyncio has sharp edges, and a FastAPI service handling 5,000 requests/second is working hard. A Go service handling the same load is barely awake.

The mistake is assuming your entire backend needs to be one language.

Our architecture

Client
  |
  v
Go API Gateway (auth, rate limit, cache, routing)
  |
  +---> Go Data Service (CRUD, PostgreSQL, Redis)
  |
  +---> Python RAG Service (LlamaIndex, Qdrant, LLM)
  |
  +---> Python ML Service (model inference, embeddings)

The Go gateway is the single entry point. It handles:

  • JWT validation -- Go's crypto stdlib validates tokens in ~50 microseconds
  • Rate limiting -- Token bucket per API key, backed by Redis
  • Response caching -- For idempotent queries, cached responses skip the backend entirely
  • Request routing -- gRPC to Python services, REST to Go services
  • Request/response logging -- Structured JSON to stdout, collected by Loki
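
The production limiter is Go with its state in Redis, one bucket per API key. The token-bucket arithmetic itself is small enough to sketch; here is a minimal single-process Python illustration (this TokenBucket class is hypothetical, not the gateway's code):

```python
import time

class TokenBucket:
    """Single-process token bucket. The gateway keeps the equivalent
    (tokens, last-refill-timestamp) pair in Redis per API key."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity;
        # lazy refill on each call means no background timer is needed
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A Redis-backed version runs the same refill-then-decrement step atomically on the server side so concurrent gateway pods can't double-spend tokens.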

The Python services are stateless workers:

  • RAG service -- Takes a query, retrieves documents from Qdrant, generates a response with citations
  • ML service -- Batch embedding generation, model inference endpoints
  • No auth, no rate limiting -- The gateway handles all cross-cutting concerns

Performance in production

Metric                       Go gateway   Python RAG service
p50 latency                  1.5ms        1,200ms (includes LLM)
p99 latency                  8ms          3,500ms
Memory usage                 25MB         180MB
Requests/sec (single pod)    50,000       50 (inference-bound)
Startup time                 200ms        4s
Container image size         12MB         850MB

The numbers tell the story: the gateway is I/O-bound, the ML service is compute-bound. Different performance profiles call for different tools.

Go for the gateway

func (gw *Gateway) handleQuery(w http.ResponseWriter, r *http.Request) {
    // Auth: ~50 microseconds
    claims, err := gw.auth.ValidateToken(r.Header.Get("Authorization"))
    if err != nil {
        http.Error(w, "unauthorized", http.StatusUnauthorized)
        return
    }

    // Rate limit: ~100 microseconds (Redis roundtrip)
    if !gw.limiter.Allow(claims.APIKey) {
        http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
        return
    }

    // Decode the body once so it can be both hashed and forwarded
    var req QueryRequest
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        http.Error(w, "bad request", http.StatusBadRequest)
        return
    }

    // Cache check: ~200 microseconds
    cacheKey := hashQuery(req)
    if cached, ok := gw.cache.Get(cacheKey); ok {
        w.Write(cached)
        return
    }

    // Forward to Python RAG service via gRPC
    resp, err := gw.ragClient.Query(r.Context(), &pb.QueryRequest{
        Question: req.Question,
        Filters:  req.Filters,
    })
    // ...
}

The gateway adds 2-8ms of overhead to every request. For inference requests that take 1-3 seconds, this is invisible. For cached responses, the total roundtrip is under 10ms.
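
The hashQuery helper above is Go, but the requirement is simply a stable digest of the request body. A Python sketch of the same idea, assuming a JSON body where key order should not affect the cache key (this cache_key function is illustrative, not the gateway's implementation):

```python
import hashlib
import json

def cache_key(body: dict) -> str:
    # Canonicalize the JSON so {"a": 1, "b": 2} and {"b": 2, "a": 1}
    # produce the same digest, and therefore the same cache entry
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Canonicalization matters: without it, clients that serialize the same query with different key orders would each take a cache miss.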

Python for the inference

class RAGService(rag_pb2_grpc.RAGServiceServicer):
    def __init__(self):
        self.index = load_qdrant_index()
        self.query_engine = self.index.as_query_engine(
            response_mode="tree_summarize",
            similarity_top_k=8,
        )

    def Query(self, request, context):
        # Synchronous handler: the blocking inference call runs on the
        # gRPC server's thread pool, so there is no event loop to starve
        response = self.query_engine.query(request.question)
        return rag_pb2.QueryResponse(
            answer=str(response),
            citations=[
                rag_pb2.Citation(
                    text=node.text[:500],
                    score=node.score,
                    source=node.metadata.get("file_name", ""),
                )
                for node in response.source_nodes
            ],
        )

The Python service does one thing: inference. It doesn't validate tokens, check rate limits, or manage caching. This isolation means we can scale Python workers independently based on inference queue depth, without over-provisioning for gateway traffic.

What we learned

  • gRPC between Go and Python is worth the setup cost. REST works, but gRPC's code generation, streaming support, and binary serialization cut inter-service latency by 40% compared to JSON over HTTP. We use grpcio in Python and google.golang.org/grpc in Go.
  • Don't put ML dependencies in the gateway image. A Go binary is 12MB. A Python image with PyTorch, LlamaIndex, and model files is 2-4GB. Keeping them separate means the gateway deploys in seconds and the ML service deploys on its own schedule.
  • Python async is not free. FastAPI's async support works well for I/O-bound operations. For CPU-bound inference, you need process-based workers (Gunicorn with sync workers or a separate gRPC server). We run Python inference in synchronous gRPC workers behind a thread pool.
  • The Go gateway is the stability layer. When a Python service crashes or gets overloaded, the gateway returns cached responses or a structured error. Clients never see a raw Python traceback.
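
The synchronous-workers point can be made concrete without grpcio: grpc.server() takes a concurrent.futures executor, and blocking servicer methods run on its threads. A stdlib-only sketch of that worker-pool shape, with a hypothetical handle_query standing in for the blocking RAG call:

```python
from concurrent.futures import ThreadPoolExecutor

def handle_query(question: str) -> str:
    # Hypothetical stand-in for the blocking inference call;
    # in production this is the servicer's Query method
    return f"answer to: {question}"

# grpc.server(ThreadPoolExecutor(max_workers=8)) wires a pool like this
# under a real gRPC server; here we drive it directly
executor = ThreadPoolExecutor(max_workers=8)
questions = ["dosage limits", "contraindications", "interactions"]
futures = [executor.submit(handle_query, q) for q in questions]
answers = [f.result() for f in futures]
executor.shutdown()
```

The pool size is the real tuning knob: it bounds how many in-flight inference calls one worker pod accepts before the gateway's queue starts backing up.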

The tradeoffs

  • Two languages means two build pipelines, two dependency management systems, two sets of idioms. Our team is comfortable with both, but hiring for "Go and Python" is a narrower pool than hiring for one.
  • gRPC adds complexity. Proto files, code generation, and versioning require discipline. We maintain a shared proto/ directory in the infra repo.
  • Testing the full pipeline requires integration tests. Unit tests cover each language independently. Testing the gateway-to-inference flow requires both services running.

Our recommendation

If your backend does both high-throughput request handling and ML inference, split them. Use Go (or Rust, if you prefer) for the gateway layer -- auth, rate limiting, caching, routing. Use Python for everything that touches model inference, embeddings, or the ML ecosystem.

If your application is purely CRUD with no ML component, Go alone is excellent. If it's purely ML with low traffic, Python alone is fine. The split only earns its complexity when you have both throughput requirements and ML requirements in the same system.

CommitX Technology (OPC) Pvt Ltd
© 2025 — Built with open-source tools, obviously.