Server-side vs Client-side LLM

The two modes#

Server-side LLM (recommended)#

The Pinecall server runs the LLM. You give it a prompt, a model, and (optionally) tool definitions. The server handles STT, runs the LLM, generates TTS — you only handle tool calls.

Realtime speech-to-speech: set llm: "pinecall/gpt-realtime" and the server collapses STT + LLM + TTS into one OpenAI Realtime model — lower latency and native barge-in, same tools and events. See Realtime speech-to-speech.

import { tool } from "@pinecall/sdk";
import { z } from "zod";

const lookupCustomer = tool({
  name: "lookupCustomer",
  description: "Look up a customer by phone",
  schema: z.object({ phone: z.string() }),
  execute: async ({ phone }) => await db.customers.findOne({ phone }),
});

const agent = pc.agent("receptionist", {
  prompt: "You are a helpful receptionist. Be concise.",
  llm: "openai/gpt-5-chat-latest",
  voice: "elevenlabs/sarah",
  stt: "deepgram/flux",
  language: "en",
  tools: [lookupCustomer],
  greeting: "Hello, how can I help?",
});

Client-side LLM (bring your own)#

You run the LLM yourself. The server handles STT → text and text → TTS. You receive the user's text on turn.end, generate a response with whatever LLM you want, and stream it back.

import OpenAI from "openai";
const openai = new OpenAI();

const agent = pc.agent("my-bot", { voice: "cartesia/yumiko", language: "en" });

agent.on("turn.end", async (turn, call) => {
  const stream = call.replyStream(turn);
  const completion = await openai.chat.completions.create({
    llm: "openai/gpt-5-chat-latest",
    messages: [
      { role: "system", content: "You are helpful. Be concise." },
      { role: "user", content: turn.text },
    ],
    stream: true,
  });
  for await (const chunk of completion) {
    if (stream.aborted) break;
    const token = chunk.choices[0]?.delta?.content;
    if (token) stream.write(token);
  }
  stream.end();
});

Which one to choose#

	Server-side	Client-side
LLM choice	OpenAI, Mistral, Google, Anthropic	Any provider, any model, local
You handle conversation history	❌ Server does it	✅ You do it
You see tool calls	✅ Via `llm.toolCall`	✅ You define them
Easier to ship	✅ Yes	Slightly more code
Required for WhatsApp	✅ Yes	❌ No (server-side only)
Latency	Slightly lower (LLM runs near the audio pipeline)	Depends on your provider
Cost	Pinecall passes through provider cost	You pay your provider directly

Pick server-side if: you're using OpenAI, Mistral, Google, or Anthropic, you want the simplest possible code, or you need WhatsApp.

Pick client-side if: you need a specific LLM Pinecall doesn't host (local Ollama, a fine-tuned model), you have an existing LangChain/LlamaIndex pipeline, or you need full control over the prompt-building logic.

You can mix them#

A single Pinecall instance can host multiple agents, each with a different LLM strategy:

// Server-side agent for WhatsApp + phone
const support = pc.agent("support", {
  llm: "openai/gpt-5-chat-latest",
  stt: "deepgram/flux",
  prompt: "...",
  phoneNumber: "+13186330963",
  whatsapp: [{ phoneNumberId: "123", accessToken: "EAA..." }],
});

// Client-side agent using a local Ollama model for a specialized use case
const research = pc.agent("research", { voice: "elevenlabs/george", language: "en" });
research.on("turn.end", async (turn, call) => {
  /* call your own LLM (Ollama, fine-tuned model, ...), stream back */
});

What about hybrid?#

What if you want to use the server-side LLM but inject context or modify history mid-call? You can:

Inject context dynamically — call.addContext("Recent order: #12345 shipped today")
Replace the prompt mid-call — call.setPrompt("Now you're in escalation mode.")
Set template variables — define {{customer_name}} in the prompt, fill it per-call
Modify history — call.addHistory([...]), call.setHistory([...]), call.clearHistory()

See Hot-Reload for the full set of mid-call controls.

What's next#

Hot-reload everything
Tool calling guide
Events reference — see all the events you can hook into