Microservices have matured significantly since their hype peak. In 2026, building them well means making deliberate choices around communication, observability, and failure modes — not just splitting a monolith into smaller HTTP servers.
This is a practical guide based on what we’ve actually shipped at scale.
The architecture we settled on
After iterating over three major versions of our analytics platform, we landed on this structure:
┌─────────────┐     gRPC     ┌─────────────────┐
│ API Gateway │ ───────────▶ │ Event Ingestor  │
└─────────────┘              └────────┬────────┘
                                      │ Kafka
                             ┌────────▼────────┐
                             │ Stream Processor│
                             └────────┬────────┘
                                      │
                   ┌──────────────────┼──────────────────┐
                   ▼                  ▼                  ▼
             ┌────────────┐    ┌──────────────┐   ┌──────────────┐
             │ TimeSeries │    │ Aggregation  │   │ Notification │
             │   Store    │    │   Service    │   │   Service    │
             └────────────┘    └──────────────┘   └──────────────┘
Each service owns its own data store, deploys independently, and communicates async-first via Kafka with synchronous gRPC fallback for low-latency paths.
Service boundaries that actually hold
The biggest mistake is drawing boundaries around technical concerns (“data layer service”, “auth service”) instead of business capabilities. We model services around bounded contexts.
// Good: owns the full campaign lifecycle
type CampaignService struct {
    repo     CampaignRepository
    eventBus EventPublisher
    scorer   AudienceScorer
    logger   *slog.Logger
}
func (s *CampaignService) Activate(ctx context.Context, id string) error {
    camp, err := s.repo.FindByID(ctx, id)
    if err != nil {
        return fmt.Errorf("campaign.Activate: %w", err)
    }
    if !camp.CanActivate() {
        return ErrCampaignNotReady
    }
    camp.Status = StatusActive
    if err := s.repo.Save(ctx, camp); err != nil {
        return fmt.Errorf("campaign.Activate save: %w", err)
    }
    return s.eventBus.Publish(ctx, CampaignActivatedEvent{ID: id})
}
Notice the domain logic (CanActivate) sits in the domain object — not in the handler or repository.
gRPC for inter-service calls
REST is fine for external APIs. Internally, gRPC gives you typed contracts, streaming, and better performance.
// analytics.proto
syntax = "proto3";

service AnalyticsService {
    rpc IngestEvents(stream RawEvent) returns (IngestResponse);
    rpc QueryMetrics(MetricsRequest) returns (stream MetricPoint);
}

message RawEvent {
    string campaign_id = 1;
    string event_type = 2;
    int64 timestamp = 3;
    map<string, string> dimensions = 4;
}
The Go client, propagating the caller’s context through the stream:
func (c *AnalyticsClient) IngestBatch(ctx context.Context, events []RawEvent) error {
    stream, err := c.client.IngestEvents(ctx)
    if err != nil {
        return fmt.Errorf("opening stream: %w", err)
    }
    for _, ev := range events {
        if err := stream.Send(&pb.RawEvent{
            CampaignId: ev.CampaignID,
            EventType:  ev.Type,
            Timestamp:  ev.TS.UnixMilli(),
            Dimensions: ev.Dims,
        }); err != nil {
            return fmt.Errorf("sending event: %w", err)
        }
    }
    resp, err := stream.CloseAndRecv()
    if err != nil {
        return fmt.Errorf("closing stream: %w", err)
    }
    if resp.Dropped > 0 {
        c.logger.Warn("events dropped", "count", resp.Dropped)
    }
    return nil
}
Distributed tracing with OpenTelemetry
Every request that crosses a service boundary gets a trace. We use the OTel SDK, exporting over OTLP to a Jaeger backend.
// trace here is the SDK package go.opentelemetry.io/otel/sdk/trace,
// not the API package go.opentelemetry.io/otel/trace.
func initTracer(serviceName string) (*trace.TracerProvider, error) {
    exporter, err := otlptracehttp.New(context.Background(),
        otlptracehttp.WithEndpoint(os.Getenv("OTEL_ENDPOINT")),
        otlptracehttp.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }
    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String(serviceName),
            attribute.String("env", os.Getenv("APP_ENV")),
        )),
    )
    otel.SetTracerProvider(tp)
    otel.SetTextMapPropagator(propagation.TraceContext{})
    return tp, nil
}
Instrument your handlers with a single middleware:
func TraceMiddleware(next http.Handler) http.Handler {
    return otelhttp.NewHandler(next, "http.request",
        otelhttp.WithMessageEvents(otelhttp.ReadEvents, otelhttp.WriteEvents),
    )
}
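If you're curious what the TraceContext propagator actually puts on the wire, it's a single `traceparent` header in the W3C format `version-traceid-spanid-flags`. A stdlib-only sketch of parsing one — in real services the propagator handles this; `parseTraceparent` is purely illustrative:

```go
package main

import (
    "fmt"
    "net/http"
    "strings"
)

// parseTraceparent splits a W3C traceparent header into its four
// fields: version, trace-id (32 hex chars), parent span-id (16 hex
// chars), and trace flags. Illustrative; real code should use the
// otel propagator, which also validates the hex digits.
func parseTraceparent(h string) (version, traceID, spanID, flags string, err error) {
    parts := strings.Split(h, "-")
    if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
        return "", "", "", "", fmt.Errorf("malformed traceparent: %q", h)
    }
    return parts[0], parts[1], parts[2], parts[3], nil
}

func main() {
    req, _ := http.NewRequest("GET", "http://example.internal/metrics", nil)
    // The kind of header otel's TraceContext propagator injects on an
    // outgoing call (example IDs from the W3C spec):
    req.Header.Set("traceparent", "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
    _, traceID, spanID, flags, err := parseTraceparent(req.Header.Get("traceparent"))
    fmt.Println(err == nil, traceID[:8], spanID[:8], flags) // true 4bf92f35 00f067aa 01
}
```

Knowing the wire format makes it much easier to debug a broken trace: grep the gateway's access logs for the trace-id and you can tell exactly where propagation stopped.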
Health checks worth having
A /health endpoint that only returns 200 when the service is genuinely ready:
type HealthChecker struct {
    db    *sql.DB
    kafka *kafka.Producer
}

func (h *HealthChecker) Check(w http.ResponseWriter, r *http.Request) {
    checks := map[string]string{}
    ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
    defer cancel()

    if err := h.db.PingContext(ctx); err != nil {
        checks["postgres"] = "unreachable: " + err.Error()
    } else {
        checks["postgres"] = "ok"
    }

    // confluent-kafka-go's Flush returns the number of messages still
    // queued after the timeout, not an error.
    if remaining := h.kafka.Flush(100); remaining > 0 {
        checks["kafka"] = "degraded"
    } else {
        checks["kafka"] = "ok"
    }

    status := http.StatusOK
    for _, v := range checks {
        if v != "ok" {
            status = http.StatusServiceUnavailable
            break
        }
    }

    w.Header().Set("Content-Type", "application/json")
    w.WriteHeader(status)
    json.NewEncoder(w).Encode(map[string]interface{}{
        "checks": checks,
        "uptime": time.Since(startTime).String(),
    })
}
What I’d tell myself from 3 years ago
Start with fewer services than you think you need. A well-structured monolith is easier to extract from than a poorly bounded microservice cluster. We spent six months untangling two services that shared a database and called each other in a cycle.
Observability, testability, and independent deployability matter more than the “micro” part.