
Building an MCP server for persistent workout memory in Pelaris

The coaching problem with most AI fitness apps is not the model. It is the memory. AI assistants like Claude and ChatGPT do maintain context within a conversation, but context windows have limits. As a conversation grows, the model compacts earlier content to stay within its window. Compact a few times and details get lost, emphasis drifts, and the coaching context degrades. There is no persistent, structured record of your training history that survives across sessions or context resets. The model has no lasting idea that you ran a 90-minute long run on tired legs, that your squat has been stalling for three weeks, or that last Tuesday you skipped the session because your knee flared up.

Pelaris already had all that data. It was sitting in Firebase Firestore, structured, timestamped, and queryable. The problem was not that the data did not exist. The problem was that Claude and ChatGPT had no way to reach it.

Building the MCP server was not about adding memory. It was about exposing the memory we already had.


What MCP actually gives you

Model Context Protocol is Anthropic’s open standard for connecting AI models to external tools and data sources. The pitch is clean: instead of baking tool logic into your prompt or relying on vendor-specific function calling, you define a server that exposes tools, and any compliant client can call them.

What it does not give you is persistence. MCP is stateless by default. The transport layer (StreamableHTTP in SDK v1.12.0) handles individual request-response cycles. There is no session continuity baked in. If you want the model to remember that a user squatted 120kg last Thursday, you have to build that into the server layer yourself. The protocol does not do it for you.

This is the thing most hello-world MCP tutorials skip. They show you how to wire up a tool that returns the current weather or queries a database on demand. That works fine for one-shot lookups. It breaks down the moment you need a coaching model to reason across weeks of training history, because the history has to come from somewhere, and that somewhere has to be your server.

For Pelaris, the answer was Firebase Firestore. The workout data already lived there. The MCP server’s job was to sit in front of it, handle auth, scrub sensitive fields, and give the model a clean tool surface for reading and writing training data.


The tool surface: 21 tools, five domains

The server exposes 21 tools. Grouped by what a coaching model actually needs to reason about:

Analytics & History

  • get_training_overview - comprehensive snapshot: active programs, recent sessions, check-ins, goals
  • get_benchmarks - performance metrics with history, trends, improvement direction
  • get_body_analysis - body composition data with temporal changes

Session Management

  • get_session_details - single session: exercises, sets, targets, actuals, feedback
  • log_workout - new entry or mark planned session as completed, with idempotency via date+sport+duration hash
  • create_planned_session - create workout with exercise targets
  • modify_training_session - reduce volume, increase intensity, swap exercises, reschedule
  • swap_exercise - find alternatives from static mapping or auto-apply

Program Orchestration

  • get_active_program - full program details with phase, weekly structure, session breakdowns
  • generate_weekly_plan - orchestrates 3-stage AI pipeline via HTTP bridge to Cloud Functions
  • manage_program - archive programs, list history, manage active program

Coaching & Insights

  • get_coach_insight - data-driven coaching observations: consistency, fatigue, progress patterns
  • search_training_resources - curated coaching resource library
  • get_weekly_debrief - weekly summary with session completion, highlights, next-week focus
  • record_injury - injury with severity, side, affected exercises; returns coaching guidance
  • record_benchmark - new benchmark or update existing, auto-saves history

User State

  • get_onboarding_status - intake completion, sport selection, device connections
  • update_user_profile - equipment, availability, session duration, experience, units
  • daily_check_in - readiness, soreness, sleep quality, mood (duplicate prevention per day)
  • manage_goals - CRUD: create, update, complete, list goals
  • log_coach_feedback - quality feedback about MCP experience (meta-tool for improvement)
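
The idempotency noted for log_workout above can be sketched as a stable hash. This is a hedged sketch, not the server's actual code; the exact field set (date, sport, duration) is taken from the tool description, and the function name is hypothetical.

```typescript
import { createHash } from "node:crypto";

// Stable idempotency key over date + sport + duration: a retried log_workout
// call resolves to the same entry instead of creating a duplicate.
function workoutIdempotencyKey(date: string, sport: string, durationMin: number): string {
  return createHash("sha256")
    .update(`${date}|${sport}|${durationMin}`)
    .digest("hex");
}
```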

Every tool follows the same internal pattern: validate the input schema via Zod, verify the auth token, check the relevant OAuth scope, run the Firebase operation, scrub the response through the PII layer, return structured JSON. That consistency matters more than it sounds. When you have 21 tools and multiple AI clients calling them, anything inconsistent in the response shape becomes a debugging problem at 11pm.
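
The shared pipeline can be sketched as a wrapper that every tool handler passes through. This is a minimal illustration with hypothetical names: the real server uses Zod schemas for validation, Firebase for the operation, and scrubber.ts for the PII pass, all stubbed or elided here.

```typescript
type Claims = { userId: string; scopes: string[] };

type ToolDef<In, Out> = {
  requiredScope: string;
  validate: (input: unknown) => In;            // throws on invalid input (Zod in the real server)
  run: (input: In, claims: Claims) => Promise<Out>;
};

// Wrap every handler in the same auth → scope → validate → run sequence
function defineTool<In, Out>(def: ToolDef<In, Out>) {
  return async (input: unknown, claims: Claims | null) => {
    if (!claims) return { error: "unauthenticated" };
    if (!claims.scopes.includes(def.requiredScope)) {
      return { error: "insufficient_scope", scope: def.requiredScope };
    }
    const parsed = def.validate(input);        // schema check before any I/O
    const result = await def.run(parsed, claims);
    return { data: result };                   // real server scrubs PII before returning
  };
}

// Example: a read tool gated on training:read (Firestore call stubbed out)
const getBenchmarks = defineTool({
  requiredScope: "training:read",
  validate: (input) => input as { metric?: string },
  run: async (_input, claims) => ({ count: 2, forUser: claims.userId }),
});
```

The payoff of the wrapper is the consistency the paragraph above describes: every response has the same success and error shape regardless of which client called which tool.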

The tool naming convention is also deliberate. get_benchmarks is unambiguous. get_session_details is unambiguous. When the model is deciding which tool to call based on a natural language query like “how has my bench press moved over the last six weeks,” you want the tool name to be the obvious match. Clever naming costs you reliability.


The concurrency bug that served someone else’s data

This section comes before auth and privacy because it is the most important thing I learned building this, and anyone building a multi-user MCP server will hit it.

The server handles multiple concurrent requests. Firebase operations are async. Early in development, we hit a race condition where concurrent tool calls from the same session were occasionally reading stale data because the async context was not being isolated correctly between requests.

The fix was AsyncLocalStorage from Node’s async_hooks module. Each incoming request gets its own storage context, which carries the verified auth token and user ID for that request’s lifetime. Subsequent async operations within the same request chain read from that context rather than from a shared module-level variable.

The implementation in request-context.ts looks like this:

import { AsyncLocalStorage } from "node:async_hooks";

const authStore = new AsyncLocalStorage<McpTokenClaims | null>();

export function runWithAuth<T>(claims: McpTokenClaims | null, fn: () => T): T {
  return authStore.run(claims, fn);
}

export function getRequestAuth(): McpTokenClaims | null {
  return authStore.getStore() ?? null;
}

Each request wraps its execution in runWithAuth(). Any tool handler calls getRequestAuth() to get the correct claims, even deep in nested async chains. No parameter passing, no race conditions.
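
The isolation can be demonstrated with two simulated concurrent requests. This sketch restates the two helpers so it is self-contained; the stubbed handlers stand in for real HTTP request processing.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

type McpTokenClaims = { userId: string };
const authStore = new AsyncLocalStorage<McpTokenClaims | null>();

const runWithAuth = <T>(claims: McpTokenClaims | null, fn: () => T): T =>
  authStore.run(claims, fn);
const getRequestAuth = (): McpTokenClaims | null => authStore.getStore() ?? null;

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// A tool handler deep in an async chain: no claims parameter is threaded through
async function handleTool(delayMs: number): Promise<string> {
  await sleep(delayMs);                        // force the two requests to interleave
  return getRequestAuth()?.userId ?? "anonymous";
}

// Each concurrent "request" reads only its own context, even while interleaved
async function demo(): Promise<[string, string]> {
  return Promise.all([
    runWithAuth({ userId: "user-a" }, () => handleTool(10)),
    runWithAuth({ userId: "user-b" }, () => handleTool(1)),
  ]);
}
```

With a shared module-level variable instead of AsyncLocalStorage, the slower request would read whichever claims were written last; with the storage context, each chain resolves its own user.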

This is a standard Node concurrency pattern, but it is easy to miss when you are building an MCP server quickly. The symptom was subtle: most requests worked fine, occasional requests returned data for the wrong user. In a fitness app, that is a trust failure. This was the highest-priority fix we shipped that month.


OAuth 2.0 PKCE: the part that took longest

Most MCP tutorials authenticate with an API key in a header. That is fine for local tooling. It is not acceptable for a multi-user fitness app where the data is personal health information.

Pelaris uses OAuth 2.0 with PKCE (Proof Key for Code Exchange). The implementation has several non-obvious pieces.

Custom JWT verification. Rather than delegating token verification to Firebase Admin SDK’s standard flow, the server uses a custom HMAC-SHA256 JWT verifier. The reason is issuer discovery. Claude Desktop and ChatGPT resolve the OAuth issuer URL differently. Claude Desktop follows the standard /.well-known/openid-configuration discovery path without issue. ChatGPT’s OAuth implementation expects the issuer in the token to match a specific format, and if your Firebase project ID or region produces an issuer string that does not match what ChatGPT expects, the whole auth flow fails silently.

The fix was an OAuth proxy pattern: a thin intermediary that normalises the issuer claim before the token reaches the MCP server. The server then verifies the normalised token with HMAC-SHA256 rather than relying on the upstream issuer string being consistent across clients. It is an extra moving part, but it removed an entire class of cross-client auth failures.

The verification itself is manual rather than relying on a JWT library:

import crypto from "node:crypto";

const expectedSig = crypto
  .createHmac("sha256", secret)
  .update(`${headerB64}.${payloadB64}`, "utf8")
  .digest("base64url");

// Timing-safe comparison to prevent timing attacks.
// timingSafeEqual throws on length mismatch, so reject unequal lengths first.
const expected = Buffer.from(expectedSig, "utf8");
const provided = Buffer.from(signatureB64, "utf8");
const sigMatch =
  expected.length === provided.length &&
  crypto.timingSafeEqual(expected, provided);

Full control over secret sourcing, timing-safe comparison, and exact validation logic. Worth the extra code.
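
After the signature check, the remaining claim validation is plain logic. A hedged sketch of that step, under stated assumptions: the function name, the expectedIssuer parameter, and the claim set shown are illustrative, and the real verifier checks more than this.

```typescript
type JwtPayload = { iss?: string; exp?: number; scope?: string };

// Validate the decoded payload after the HMAC signature has been verified.
// Returns the granted scopes on success, throws on any failed check.
function validateClaims(
  payload: JwtPayload,
  expectedIssuer: string,
  nowSec = Math.floor(Date.now() / 1000),
): string[] {
  if (payload.exp === undefined || payload.exp <= nowSec) {
    throw new Error("token expired");
  }
  if (payload.iss !== expectedIssuer) {
    throw new Error("unexpected issuer");     // the failure mode the proxy normalises away
  }
  return payload.scope ? payload.scope.split(" ") : [];
}
```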

Scope model. The server uses six scopes: profile:read, training:read, training:write, health:read, health:write, and coach:read. Each tool checks its required scope directly in the handler. No global middleware. Each tool is self-contained.

Dynamic Client Registration. Alongside pre-registered clients for Claude (pelaris-claude) and ChatGPT (pelaris-chatgpt), the server supports Dynamic Client Registration (DCR) for any other MCP-compliant client. This matters as the MCP ecosystem grows. You do not want to manually register every new client that wants to connect.

Token revocation. The server includes isTokenRevoked() which checks Firestore before accepting a token. If a user revokes access from within the Pelaris app, the MCP server honours that immediately rather than waiting for token expiry.

Secret trimming. This sounds trivial. It is not. Firebase service account credentials sometimes arrive with trailing whitespace or newline characters when pulled from environment variables, particularly in certain CI/CD configurations. The JWT verification would fail intermittently and the error messages were not helpful. Adding explicit .trim() calls on credential fields before any crypto operation removed the flakiness entirely. It is the kind of fix that takes four hours to find and one line to implement.
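
The fix can be sketched as a load-time guard (the function and env var names here are hypothetical): normalise credential material once, before it reaches any crypto code.

```typescript
// Trim secrets at load time: some CI/CD systems append a trailing newline
// to injected environment variables, which silently breaks HMAC verification.
function loadSecret(name: string): string {
  const raw = process.env[name];
  if (!raw) throw new Error(`missing required secret: ${name}`);
  return raw.trim();
}
```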


PII scrubbing: two passes before anything leaves the server

Workout data is personal health data. The MCP server runs a two-pass scrub via scrubber.ts before any response goes back to the AI client.

The first pass strips direct identifiers: userId, uid, email, displayName, phone, ownerUid, profileId, and related fields. Any field in the set is replaced with [REDACTED] regardless of content. Recursive through nested objects and arrays.

The second pass applies a field-level allowlist. Rather than trying to detect and remove sensitive fields reactively, the server explicitly defines which fields are permitted in each tool’s response schema. Anything not on the allowlist is dropped. This is more conservative than a blocklist approach, and it means adding a new field to a response requires a deliberate decision, not an assumption that it is safe.
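
The two passes can be sketched as follows. The identifier set is abbreviated and the allowlist handling simplified; the real scrubber.ts defines per-tool allowlists.

```typescript
const DIRECT_IDENTIFIERS = new Set([
  "userId", "uid", "email", "displayName", "phone", "ownerUid", "profileId",
]);

// Pass 1: recursively replace direct identifiers with [REDACTED],
// descending through nested objects and arrays
function stripIdentifiers(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(stripIdentifiers);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([key, v]) =>
        DIRECT_IDENTIFIERS.has(key) ? [key, "[REDACTED]"] : [key, stripIdentifiers(v)],
      ),
    );
  }
  return value;
}

// Pass 2: drop any top-level field not on this tool's allowlist
function applyAllowlist(obj: Record<string, unknown>, allowed: Set<string>) {
  return Object.fromEntries(Object.entries(obj).filter(([key]) => allowed.has(key)));
}

export function scrub(response: Record<string, unknown>, allowed: Set<string>) {
  return applyAllowlist(stripIdentifiers(response) as Record<string, unknown>, allowed);
}
```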

The scrubber is not a compliance solution on its own. It is a reasonable-effort layer that reduces the surface area of what gets sent to third-party model providers. The architecture doc is explicit about this distinction.


Rate limiting: honest about the limitations

The server implements in-memory rate limiting. Read operations are capped at 60 requests per hour per user at the middleware level. Write operations are capped at 50 per hour, enforced at the tool level via checkWriteRateLimit(). Requests that exceed the limit get a structured error response rather than a silent failure. Admin-authenticated requests skip rate limiting entirely.
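
A minimal in-memory sliding-window limiter matching that behaviour looks roughly like this. Names are hypothetical, and the real server splits read enforcement (middleware) from write enforcement (per tool).

```typescript
const WINDOW_MS = 60 * 60 * 1000;
const buckets = new Map<string, number[]>(); // per-user request timestamps

// Returns true if the request is allowed; false means the caller should
// return a structured rate-limit error, not fail silently.
function checkRateLimit(userId: string, limit: number, now = Date.now()): boolean {
  const recent = (buckets.get(userId) ?? []).filter((t) => now - t < WINDOW_MS);
  if (recent.length >= limit) {
    buckets.set(userId, recent);
    return false;
  }
  recent.push(now);
  buckets.set(userId, recent);
  return true;
}
```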

The limitation is real and worth naming: in-memory rate limiting does not work across multiple server instances. If the MCP server scales horizontally, each instance maintains its own bucket state. A user hitting two different instances can effectively double their allowed rate.

For the current deployment, this is an acceptable tradeoff. Pelaris is not at the scale where horizontal MCP server scaling is a live concern. When it is, the fix is moving rate limit state to Firestore or Redis. The in-memory implementation is a conscious decision with a known upgrade path, not an oversight.


Claude vs ChatGPT vs other clients: where they actually diverge

The goal was a single tool surface that works identically across all clients. That is mostly true. The divergence is at the protocol layer, not the tool layer.

Transport. Claude Desktop supports StreamableHTTP and SSE. ChatGPT’s action infrastructure has its own transport expectations. The server handles both, but the connection setup differs. Claude Desktop connects directly via StreamableHTTP. ChatGPT routes through its action manifest, which requires an OpenAPI-compatible schema alongside the MCP tool definitions. For other MCP clients, Dynamic Client Registration handles onboarding without manual configuration on our side. Maintaining the OpenAPI schema in sync with the MCP tool definitions is additional surface area, but it is manageable.

Tool call batching. Claude Desktop will sometimes batch multiple tool calls in a single turn. ChatGPT typically calls tools sequentially. The server handles both, but the batching behaviour means the Firebase read operations need to be genuinely independent. Any tool that assumes a specific prior tool has already run will fail silently when the client batches calls in a different order.

Error handling. Claude Desktop surfaces MCP tool errors to the user in a reasonably readable format. ChatGPT’s error handling is more opaque. A tool failure that produces a clear error message in Claude Desktop might produce a generic “I couldn’t complete that action” in ChatGPT. The practical implication: error messages in tool responses need to be structured for the model to interpret and relay, not readable only by a human.
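
One way to structure that, sketched with hypothetical field names rather than the server's actual schema: give the model a machine-readable code, a relayable message, and an optional next step.

```typescript
// Shape tool errors for the model to interpret and relay to the user,
// rather than relying on the client to surface raw failures readably.
function toolError(code: string, message: string, suggestion?: string) {
  return { error: { code, message, ...(suggestion ? { suggestion } : {}) } };
}

// Example: enough structure for either client to produce a useful reply
const notFound = toolError(
  "session_not_found",
  "No session was logged on the requested date.",
  "Ask the user to confirm the date before retrying.",
);
```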

Auth discovery. Claude discovers auth endpoints via RFC 8414 metadata at /.well-known/oauth-authorization-server. ChatGPT uses /.well-known/openai-apps-challenge for domain verification. Both are handled, but they require the MCP server to be the issuer domain, which is why the OAuth proxy pattern described above is necessary rather than optional.

None of these differences required changes to the underlying tool logic. They required changes to how the server presents itself to each client.


What broke in real use

A build log that only covers what worked is not useful. Here is what actually broke.

The concurrent request that leaked another user’s auth context (described above): async context isolation. Highest severity. Fixed with AsyncLocalStorage.

The AI pipeline bridge that still is not reliable: The generate_weekly_plan tool orchestrates a 3-stage AI pipeline via HTTP bridge to Cloud Functions (Strategy → Overviews → Sessions, running on Vertex AI). The bridge needs debugging. The tool exists, the pipeline exists, the connection between them is not reliable yet.

OAuth issuer discovery: Claude derives the metadata URL from the issuer domain. If the issuer pointed to the Cloud Function domain instead of the MCP server domain, Claude could not discover auth endpoints. Fixed by making the MCP server the issuer and proxying auth calls through to the Cloud Function OAuth server.

Secret trimming: described in the auth section above. One line fix.

Tool response size. Early versions of the overview and session list tools returned full workout records including all sets, reps, load, and notes for each session. For a user with a year of training history and a 30-day date range, the response was large enough to consume significant context window. The fix was a summary-first pattern: default responses return workout metadata (date, type, total volume, duration), with a separate tool call required to fetch the full record for a specific session. The model now retrieves summaries first and drills into detail only when the query requires it.
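
The summary-first projection can be sketched like this (field names assumed from the text): list tools return the compact shape, and the full record with sets and notes requires a separate get_session_details call.

```typescript
type SetEntry = { exercise: string; reps: number; loadKg: number; notes?: string };
type WorkoutRecord = {
  id: string;
  date: string;
  type: string;
  durationMin: number;
  sets: SetEntry[];
};

// Project a full record down to the metadata a list response returns:
// enough for the model to decide whether to drill into detail.
function toSummary(workout: WorkoutRecord) {
  return {
    id: workout.id,
    date: workout.date,
    type: workout.type,
    durationMin: workout.durationMin,
    totalVolumeKg: workout.sets.reduce((sum, s) => sum + s.reps * s.loadKg, 0),
  };
}
```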


Resources and prompts: the coaching voice layer

Tools handle data retrieval and mutation. Resources and prompts handle the coaching experience.

The server exposes two MCP resources:

  • pelaris://coach/personality - full coaching persona definition: tone, anti-patterns, response format, voice examples. Instructs any AI client to adopt the Pelaris Coach voice consistently.
  • pelaris://sports/methodologies - 28 training methodologies across 7 sports with principles, phase structure, and exercise patterns.

These are not tool calls. They are context documents that get included in the model’s working memory for the session.

The prompts layer defines starting points for common coaching interactions:

  • weekly_plan_review - post-week analysis template
  • session_debrief - post-workout coaching template
  • benchmark_check_in - progress review template

Rather than leaving the model to decide how to frame a progression plateau or a missed session, the server provides structured prompt templates that the client can invoke. “Coach me through this week’s plan” triggers a prompt that pulls the current week’s schedule, last week’s completion rate, and the user’s fatigue flags, then frames a coaching conversation around that specific context.
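
A hypothetical sketch of how such a template binds live data into a coaching frame — the type, fields, and wording here are assumptions for illustration, not the server's actual implementation:

```typescript
type WeekContext = {
  schedule: string[];
  completionRate: number;   // 0..1, last week's completed/planned sessions
  fatigueFlags: string[];
};

// Assemble the specific context the prompt template frames the
// coaching conversation around
function weeklyPlanReviewPrompt(ctx: WeekContext): string {
  return [
    `This week's schedule: ${ctx.schedule.join(", ")}.`,
    `Last week's completion rate: ${Math.round(ctx.completionRate * 100)}%.`,
    ctx.fatigueFlags.length > 0
      ? `Fatigue flags: ${ctx.fatigueFlags.join(", ")}.`
      : "No fatigue flags this week.",
    "Frame a coaching conversation around this specific context.",
  ].join("\n");
}
```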

This is where the product experience lives. The tools are plumbing. The resources and prompts are what make the model feel like a coach rather than a database query interface.


Architecture summary

Layer | Implementation | Key decision
Framework | Express.js + MCP SDK v1.12.0 | Stateless HTTP transport
Auth | OAuth 2.0 PKCE + custom HMAC-SHA256 JWT | Proxy pattern for cross-client issuer normalisation
Client registration | Pre-registered (Claude, ChatGPT) + DCR | Third-party clients onboard without manual config
Token revocation | isTokenRevoked() checks Firestore | Immediate revocation, no waiting for expiry
Data store | Firebase Firestore | Existing Pelaris data layer; no new infrastructure
PII scrubbing | Two-pass: identifier strip + field allowlist | Conservative by default; additions require explicit decisions
Concurrency | AsyncLocalStorage per request | Fixes auth context leakage under concurrent load
Rate limiting | In-memory sliding window (60 reads/hr, 50 writes/hr) | Known limitation at scale; Firestore/Redis upgrade path documented
Cross-client | Single tool surface, client-specific transport handling | Tool logic unchanged; presentation layer adapts
Resources | 2 (coach personality, methodologies) | Coaching voice consistency across all AI clients
Prompts | 3 (weekly review, session debrief, benchmark check-in) | User starting points for common coaching interactions
Deploy | Docker → Cloud Run | Multi-stage build, stateless horizontal scaling

What building this taught me about MCP servers

The protocol is further along than most people realise. The tooling is solid. The SDK is stable. The hard problems are not in the spec.

They are in the layers the spec does not cover: auth that works across multiple clients with divergent expectations, concurrency patterns that hold under real load, response shaping that keeps context windows useful rather than bloated, and a resources and prompts layer that actually carries domain expertise rather than just surfacing raw data.

The read/write tool split that most MCP tutorials demonstrate is the easy part. The interesting architecture is what sits underneath: how you scope the data a model can see, how you enforce that at the auth layer, how you design tool responses so the model retrieves summaries first and drills to detail only when the query warrants it. Get those decisions wrong and you end up with a server that technically works but produces a degraded model experience at scale.

A few things I would do differently on the next build:

  • Start with the resources and prompts layer, not the tools. The tools are mechanical. The resources layer is where you encode actual domain expertise, and it shapes every model interaction from the first session.
  • Design the auth flow for multiple clients on day one. Retrofitting cross-client OAuth is significantly more work than building it with that assumption from the start.
  • Rate limit state should live in Firestore or Redis from the beginning. In-memory is fine for a prototype; it creates a migration cost later.
  • Treat tool response size as a first-class constraint. The context window is the resource you are managing. Every tool call that returns more data than the model needs for that query is waste.

MCP is going to become the standard integration layer between enterprise data and AI clients. The companies that figure out how to expose their existing data through well-designed MCP servers, with proper auth, proper scoping, and a coherent resources layer, will have a structural advantage over those trying to solve this with prompt engineering alone.

The Pelaris server is one implementation. The patterns it forced me to work through apply to any domain where you have structured longitudinal data and want an AI model to reason intelligently across it.