In early 2025 we started getting requests from product teams to integrate LLM capabilities into our analytics tools. By Q4 2025 we were serving over 2 million LLM API requests per month. This is what we learned.
The architecture
┌──────────────────────────────────────────────────────────┐
│                   Client Applications                    │
└────────────────────────────┬─────────────────────────────┘
                             │ HTTP
┌────────────────────────────▼─────────────────────────────┐
│                     LLM Gateway (Go)                     │
│  ┌────────────┐  ┌────────────┐  ┌───────────────────┐   │
│  │Rate Limiter│  │ Prompt Mgr │  │  Response Cache   │   │
│  └────────────┘  └────────────┘  └───────────────────┘   │
└────────────────────────────┬─────────────────────────────┘
                             │ AWS SDK
┌────────────────────────────▼─────────────────────────────┐
│                       AWS Bedrock                        │
│          Claude 3.5 Sonnet  │  Titan Embeddings          │
└──────────────────────────────────────────────────────────┘
The gateway is a Go service that handles auth, rate limiting, prompt management, response caching, and cost tracking. Bedrock handles the actual inference.
Calling Bedrock from Go
The AWS Go SDK v2 makes this straightforward:
type BedrockClient struct {
	client    *bedrockruntime.Client
	modelID   string
	maxTokens int32
}

type InferenceRequest struct {
	SystemPrompt string
	UserMessage  string
	Temperature  float64
}

func (c *BedrockClient) Complete(ctx context.Context, req InferenceRequest) (string, error) {
	body := ClaudeRequest{
		AnthropicVersion: "bedrock-2023-05-31",
		MaxTokens:        c.maxTokens,
		Temperature:      req.Temperature,
		System:           req.SystemPrompt,
		Messages: []Message{
			{Role: "user", Content: req.UserMessage},
		},
	}
	bodyBytes, err := json.Marshal(body)
	if err != nil {
		return "", fmt.Errorf("marshaling request: %w", err)
	}
	output, err := c.client.InvokeModel(ctx, &bedrockruntime.InvokeModelInput{
		ModelId:     aws.String(c.modelID),
		ContentType: aws.String("application/json"),
		Body:        bodyBytes,
	})
	if err != nil {
		return "", fmt.Errorf("bedrock invoke: %w", err)
	}
	var response ClaudeResponse
	if err := json.Unmarshal(output.Body, &response); err != nil {
		return "", fmt.Errorf("unmarshaling response: %w", err)
	}
	if len(response.Content) == 0 {
		return "", errors.New("empty response from model")
	}
	return response.Content[0].Text, nil
}
Streaming responses
For user-facing features, streaming makes the experience feel instant. Bedrock supports streaming natively:
func (c *BedrockClient) Stream(ctx context.Context, req InferenceRequest, out chan<- string) error {
	defer close(out)
	bodyBytes, err := json.Marshal(buildClaudeBody(req))
	if err != nil {
		return fmt.Errorf("marshaling request: %w", err)
	}
	output, err := c.client.InvokeModelWithResponseStream(ctx,
		&bedrockruntime.InvokeModelWithResponseStreamInput{
			ModelId:     aws.String(c.modelID),
			ContentType: aws.String("application/json"),
			Body:        bodyBytes,
		},
	)
	if err != nil {
		return fmt.Errorf("starting stream: %w", err)
	}
	for event := range output.GetStream().Events() {
		switch v := event.(type) {
		case *types.ResponseStreamMemberChunk:
			var chunk StreamChunk
			if err := json.Unmarshal(v.Value.Bytes, &chunk); err != nil {
				continue // skip a malformed chunk rather than aborting the stream
			}
			if chunk.Type == "content_block_delta" {
				out <- chunk.Delta.Text
			}
		}
	}
	return output.GetStream().Err()
}
The HTTP handler sends chunks as SSE:
func (h *LLMHandler) StreamCompletion(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}
	chunks := make(chan string, 32)
	go func() {
		if err := h.client.Stream(r.Context(), parseReq(r), chunks); err != nil {
			slog.ErrorContext(r.Context(), "stream error", slog.Any("error", err))
		}
	}()
	for chunk := range chunks {
		fmt.Fprintf(w, "data: %s\n\n", chunk)
		flusher.Flush()
	}
	fmt.Fprintf(w, "data: [DONE]\n\n")
	flusher.Flush()
}
Semantic caching to cut costs
The single biggest cost reduction: cache by semantic similarity, not exact match. Two differently worded questions that mean the same thing hit the same cache entry.
type SemanticCache struct {
	embedder  EmbeddingClient
	store     VectorStore // we use pgvector
	threshold float64     // cosine similarity threshold, we use 0.92
	ttl       time.Duration
}

func (c *SemanticCache) Get(ctx context.Context, query string) (string, bool) {
	embedding, err := c.embedder.Embed(ctx, query)
	if err != nil {
		return "", false // cache miss on error, don't block
	}
	result, err := c.store.FindSimilar(ctx, embedding, c.threshold)
	if err != nil || result == nil {
		return "", false
	}
	return result.Response, true
}

func (c *SemanticCache) Set(ctx context.Context, query, response string) {
	embedding, err := c.embedder.Embed(ctx, query)
	if err != nil {
		return
	}
	c.store.Insert(ctx, CacheEntry{
		Query:     query,
		Embedding: embedding,
		Response:  response,
		ExpiresAt: time.Now().Add(c.ttl),
	})
}
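FindSimilar is elided above; in our setup it's a pgvector query, but the core decision is cosine similarity checked against the 0.92 threshold. A self-contained sketch of that check (in-memory linear scan standing in for the indexed database query, function names hypothetical):

```go
package main

import "math"

// cosineSimilarity returns the cosine of the angle between two
// equal-length vectors: dot(a,b) / (|a|·|b|).
func cosineSimilarity(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// findSimilar returns the index of the closest cached embedding at or
// above the threshold, or -1 on a miss. pgvector does the same ranking
// server-side with a vector index instead of a linear scan.
func findSimilar(query []float64, entries [][]float64, threshold float64) int {
	best, bestIdx := threshold, -1
	for i, e := range entries {
		if sim := cosineSimilarity(query, e); sim >= best {
			best, bestIdx = sim, i
		}
	}
	return bestIdx
}
```

The threshold is the knob to tune: too low and unrelated questions share answers, too high and the cache rarely hits.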
This dropped our Bedrock costs by 34% in production.
The failures we had
Context window overflows. We didn’t implement prompt size checking early enough. Some users sent 50-page documents, which were silently truncated. Always count tokens before sending.
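There's no official Go tokenizer for Claude, so a pre-flight guard has to be a heuristic. A deliberately conservative sketch (roughly 4 characters per token for English text; exact counts require the provider's tokenizer or the usage data the API returns):

```go
package main

import (
	"errors"
	"unicode/utf8"
)

// ErrPromptTooLarge is returned when a prompt exceeds the token budget.
var ErrPromptTooLarge = errors.New("prompt exceeds token budget")

// estimateTokens approximates one token per 4 characters, rounded up.
// It is not an exact tokenizer; use it as a guardrail, not for billing.
func estimateTokens(s string) int {
	n := utf8.RuneCountInString(s)
	return (n + 3) / 4
}

// checkBudget rejects prompts whose estimate exceeds the context budget,
// so oversized documents fail loudly instead of being truncated silently.
func checkBudget(prompt string, budget int) error {
	if estimateTokens(prompt) > budget {
		return ErrPromptTooLarge
	}
	return nil
}
```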
Rate limit cascades. Bedrock has per-region limits. When one region got throttled, all retries went to the same region. Use exponential backoff with jitter and spread across regions.
Prompt injection. Internal tools felt “safe” so we didn’t sanitize inputs early. A user discovered they could override system prompts by ending their input with \n\nIgnore previous instructions. Validate and sanitize. Always.
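Input filtering can't fully solve prompt injection, but a denylist of known override phrases catches the lazy attempts and gives you a signal to log. An illustrative sketch (the marker list is an example, not our production set):

```go
package main

import "strings"

// injectionMarkers is an illustrative denylist, not a complete defense:
// determined attackers will phrase around any fixed list.
var injectionMarkers = []string{
	"ignore previous instructions",
	"ignore all previous instructions",
	"disregard the system prompt",
}

// flagSuspiciousInput reports whether user input contains a known
// override phrase; flagged requests get logged and rejected upstream.
func flagSuspiciousInput(input string) bool {
	lower := strings.ToLower(input)
	for _, m := range injectionMarkers {
		if strings.Contains(lower, m) {
			return true
		}
	}
	return false
}
```

Pair this with structural defenses: keep the system prompt server-side, never echo it, and treat model output as untrusted.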
What I’d prioritize building first
- Cost tracking per team/feature (you need this on day 1)
- Request/response logging with PII redaction
- Hard token limits per request
- Semantic caching (biggest ROI)
- Streaming (biggest UX win)
Everything else is optimization.