Microservices have matured significantly since their hype peak. In 2026, building them well means making deliberate choices around communication, observability, and failure modes — not just splitting a monolith into smaller HTTP servers.
This is a practical guide based on what we’ve actually shipped at scale.
The architecture we settled on
After iterating over three major versions of our analytics platform, we landed on this structure:
┌─────────────┐     gRPC     ┌─────────────────┐
│ API Gateway │ ───────────▶ │ Event Ingestor  │
└─────────────┘              └────────┬────────┘
                                      │ Kafka
                             ┌────────▼────────┐
                             │ Stream Processor│
                             └────────┬────────┘
                                      │
                   ┌──────────────────┼──────────────────┐
                   ▼                  ▼                  ▼
             ┌────────────┐    ┌──────────────┐   ┌──────────────┐
             │ TimeSeries │    │ Aggregation  │   │ Notification │
             │   Store    │    │   Service    │   │   Service    │
             └────────────┘    └──────────────┘   └──────────────┘
Each service owns its own data store, deploys independently, and communicates async-first via Kafka with synchronous gRPC fallback for low-latency paths.
Service boundaries that actually hold
The biggest mistake is drawing boundaries around technical concerns (“data layer service”, “auth service”) instead of business capabilities. We model services around bounded contexts.
// Good: owns the full campaign lifecycle
type CampaignService struct {
    repo     CampaignRepository
    eventBus EventPublisher
    scorer   AudienceScorer
    logger   *slog.Logger
}
func (s *CampaignService) Activate(ctx context.Context, id string) error {
    camp, err := s.repo.FindByID(ctx, id)
    if err != nil {
        return fmt.Errorf("campaign.Activate: %w", err)
    }
    if !camp.CanActivate() {
        return ErrCampaignNotReady
    }
    camp.Status = StatusActive
    if err := s.repo.Save(ctx, camp); err != nil {
        return fmt.Errorf("campaign.Activate save: %w", err)
    }
    return s.eventBus.Publish(ctx, CampaignActivatedEvent{ID: id})
}
Notice the domain logic (CanActivate) sits in the domain object — not in the handler or repository.
gRPC for inter-service calls
REST is fine for external APIs. Internally, gRPC gives you typed contracts, streaming, and better performance.
// analytics.proto
syntax = "proto3";

service AnalyticsService {
    rpc IngestEvents(stream RawEvent) returns (IngestResponse);
    rpc QueryMetrics(MetricsRequest) returns (stream MetricPoint);
}

message RawEvent {
    string campaign_id = 1;
    string event_type = 2;
    int64 timestamp = 3;
    map<string, string> dimensions = 4;
}
The Go client, propagating the caller’s context through the stream:
func (c *AnalyticsClient) IngestBatch(ctx context.Context, events []RawEvent) error {
    stream, err := c.client.IngestEvents(ctx)
    if err != nil {
        return fmt.Errorf("opening stream: %w", err)
    }
    for _, ev := range events {
        if err := stream.Send(&pb.RawEvent{
            CampaignId: ev.CampaignID,
            EventType:  ev.Type,
            Timestamp:  ev.TS.UnixMilli(),
            Dimensions: ev.Dims,
        }); err != nil {
            return fmt.Errorf("sending event: %w", err)
        }
    }
    resp, err := stream.CloseAndRecv()
    if err != nil {
        return fmt.Errorf("closing stream: %w", err)
    }
    if resp.Dropped > 0 {
        c.logger.Warn("events dropped", "count", resp.Dropped)
    }
    return nil
}
Distributed tracing with OpenTelemetry
Every request that crosses a service boundary gets a trace. We use the OTel SDK, exporting over OTLP to a Jaeger backend.
// trace here is the SDK package go.opentelemetry.io/otel/sdk/trace,
// not the API package go.opentelemetry.io/otel/trace.
func initTracer(serviceName string) (*trace.TracerProvider, error) {
    exporter, err := otlptracehttp.New(context.Background(),
        otlptracehttp.WithEndpoint(os.Getenv("OTEL_ENDPOINT")),
        otlptracehttp.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }
    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String(serviceName),
            attribute.String("env", os.Getenv("APP_ENV")),
        )),
    )
    otel.SetTracerProvider(tp)
    otel.SetTextMapPropagator(propagation.TraceContext{})
    return tp, nil
}
Instrument your handlers with a single middleware:
func TraceMiddleware(next http.Handler) http.Handler {
    return otelhttp.NewHandler(next, "http.request",
        otelhttp.WithMessageEvents(otelhttp.ReadEvents, otelhttp.WriteEvents),
    )
}
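If you're curious what the TraceContext propagator actually puts on the wire, it's a single `traceparent` header in the W3C format `version-traceid-spanid-flags`. A stdlib-only sketch of parsing one — in real services the propagator handles this; `parseTraceparent` is purely illustrative:

```go
package main

import (
    "fmt"
    "net/http"
    "strings"
)

// parseTraceparent splits a W3C traceparent header into its four
// fields: version, trace-id (32 hex chars), parent span-id (16 hex
// chars), and trace flags. Illustrative; real code should use the
// otel propagator, which also validates the hex digits.
func parseTraceparent(h string) (version, traceID, spanID, flags string, err error) {
    parts := strings.Split(h, "-")
    if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
        return "", "", "", "", fmt.Errorf("malformed traceparent: %q", h)
    }
    return parts[0], parts[1], parts[2], parts[3], nil
}

func main() {
    req, _ := http.NewRequest("GET", "http://example.internal/metrics", nil)
    // The kind of header otel's TraceContext propagator injects on an
    // outgoing call (example IDs from the W3C spec):
    req.Header.Set("traceparent", "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
    _, traceID, spanID, flags, err := parseTraceparent(req.Header.Get("traceparent"))
    fmt.Println(err == nil, traceID[:8], spanID[:8], flags) // true 4bf92f35 00f067aa 01
}
```

Knowing the wire format makes it much easier to debug a broken trace: grep the gateway's access logs for the trace-id and you can tell exactly where propagation stopped.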
Health checks worth having
A /health endpoint that only returns 200 when the service is genuinely ready:
type HealthChecker struct {
    db    *sql.DB
    kafka *kafka.Producer
}

func (h *HealthChecker) Check(w http.ResponseWriter, r *http.Request) {
    checks := map[string]string{}
    ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
    defer cancel()

    if err := h.db.PingContext(ctx); err != nil {
        checks["postgres"] = "unreachable: " + err.Error()
    } else {
        checks["postgres"] = "ok"
    }

    // confluent-kafka-go's Flush returns the number of messages still
    // queued after the timeout, not an error.
    if remaining := h.kafka.Flush(100); remaining > 0 {
        checks["kafka"] = "degraded"
    } else {
        checks["kafka"] = "ok"
    }

    status := http.StatusOK
    for _, v := range checks {
        if v != "ok" {
            status = http.StatusServiceUnavailable
            break
        }
    }

    w.Header().Set("Content-Type", "application/json")
    w.WriteHeader(status)
    json.NewEncoder(w).Encode(map[string]interface{}{
        "checks": checks,
        "uptime": time.Since(startTime).String(),
    })
}
What I’d tell myself from 3 years ago
Start with fewer services than you think you need. A well-structured monolith is easier to extract from than a poorly bounded microservice cluster. We spent six months untangling two services that shared a database and called each other in a cycle.
Observability, testability, and independent deployability matter more than the “micro” part.