AI Token Metering

Multi-model token metering with separate input/output credits, tiered output pricing, and monthly credit grants that buffer soft-limit overage — plus the pre-call/post-call enforcement split that stops you from paying the model provider for requests your limits would have blocked.


Policy

policy:
  credits:
    # Discrete: one per model direction.
    # overhead_cost is what you pay; price is what you charge.
    sonnet_input:
      description: Claude Sonnet 4 — input tokens
      overhead_cost: 0.000003
      pricing_model: flat
      price: { amount: 0.000004 }
      stof_units: int
      resets: true

    sonnet_output:
      description: Claude Sonnet 4 — output tokens
      overhead_cost: 0.000015
      pricing_model: tiered
      tiers:
        - up_to: 200000
          price: { amount: 0.000022 }
        - up_to: 1000000
          price: { amount: 0.000020 }
        - price: { amount: 0.000018 }
      stof_units: int
      resets: true

    haiku_input:
      description: Claude Haiku 4.5 — input tokens
      overhead_cost: 0.0000008
      pricing_model: flat
      price: { amount: 0.000001 }
      stof_units: int
      resets: true

    haiku_output:
      description: Claude Haiku 4.5 — output tokens
      overhead_cost: 0.000001
      pricing_model: flat
      price: { amount: 0.0000015 }
      stof_units: int
      resets: true

    # Abstract: the customer-facing unit. Maps to model tokens via exchange.
    ai_credit:
      description: AI Credits
      label: AI Credit
      unit: credit

  exchange:
    rune: { value: 1, currency: usd }
    ai_credit: { value: 1.25, currency: rune }
    sonnet_input: { value: 0.000004, currency: ai_credit }
    sonnet_output: { value: 0.000020, currency: ai_credit }
    haiku_input: { value: 0.000001, currency: ai_credit }
    haiku_output: { value: 0.0000015, currency: ai_credit }

  plans:
    starter:
      label: Starter
      period: monthly
      default: true
      entitlements:
        chat_access:
          description: Access to AI chat

        sonnet_input:
          limit: { credit: sonnet_input, mode: hard, value: 500000, resets: true, reset_inc: 1day }
        sonnet_output:
          limit: { credit: sonnet_output, mode: hard, value: 200000, resets: true, reset_inc: 1day }
        haiku_input:
          limit: { credit: haiku_input, mode: hard, value: 2000000, resets: true, reset_inc: 1day }
        haiku_output:
          limit: { credit: haiku_output, mode: hard, value: 1000000, resets: true, reset_inc: 1day }

    growth:
      label: Growth
      period: monthly
      entitlements:
        chat_access:
          description: Access to AI chat

        sonnet_input:
          limit: { credit: sonnet_input, mode: soft, value: 2000000, resets: true, reset_inc: 1day }
        sonnet_output:
          limit: { credit: sonnet_output, mode: soft, value: 800000, resets: true, reset_inc: 1day }
        haiku_input:
          limit: { credit: haiku_input, mode: soft, value: 10000000, resets: true, reset_inc: 1day }
        haiku_output:
          limit: { credit: haiku_output, mode: soft, value: 5000000, resets: true, reset_inc: 1day }

      topups:
        # Included monthly: 50 AI credits buffer soft-limit overage
        monthly_credits:
          description: 50 AI credits included monthly
          credit: ai_credit
          value: 50
          included: true
          resets: true
          reset_inc: 30days
          reset_mode: hard # unused credits don't roll over

        # Purchasable: 200 AI credits, valid 90 days
        credit_pack_200:
          description: 200 AI credits
          credit: ai_credit
          value: 200
          price: { amount: 24.00 }
          expires_after: 90days

Integration

import { Limitr } from '@formata/limitr';
import { readFileSync } from 'fs';

const policy = await Limitr.new(readFileSync('./policy.yaml', 'utf-8'), 'yaml');

// Wire up overage billing before any requests are processed
policy.addHandler('billing', (key: string, value: unknown) => {
  if (key === 'meter-overage') {
    const event = JSON.parse(value as string);
    // event.overage is the amount not covered by the monthly grant
    billing.queueCharge({
      customerId: event.customer.id,
      credit: event.credit.description,
      units: event.overage,
      entitlement: event.entitlement,
    });
  }
});

type Model = 'sonnet' | 'haiku';

async function handleChatRequest(
  customerId: string,
  prompt: string,
  model: Model = 'sonnet'
) {
  const inputEntitlement = `${model}_input`;
  const outputEntitlement = `${model}_output`;

  await policy.ensureCustomer(customerId, 'starter');

  // 1. Feature gate — does this customer have chat access?
  if (!await policy.check(customerId, 'chat_access')) {
    return { error: 'No chat access on this plan', code: 'NO_ACCESS' };
  }

  // 2. Pre-flight check on estimated input tokens.
  //    Use check() — don't consume yet in case the LLM call errors.
  const estimatedInput = estimateTokens(prompt);
  if (!await policy.check(customerId, inputEntitlement, estimatedInput)) {
    const remaining = await policy.remaining(customerId, inputEntitlement);
    return { error: 'Daily token limit reached', code: 'LIMIT_REACHED', remaining };
  }

  // 3. Call the LLM
  const response = await llm.complete({ model, prompt });
  const actualInput = response.usage.input_tokens;
  const actualOutput = response.usage.output_tokens;

  // 4. Meter actual input — allow() enforces and meters in one operation.
  await policy.allow(customerId, inputEntitlement, actualInput);

  // 5. Meter actual output post-call — you don't know the count until the response arrives.
  await policy.allow(customerId, outputEntitlement, actualOutput);

  return { content: response.content, usage: { input: actualInput, output: actualOutput } };
}

Notes

Why separate input and output credits? Input and output have different costs and different margins. A single token credit makes margin tracking inaccurate and prevents applying different pricing models per direction — e.g. tiered output pricing to incentivize concise responses. Separate them.

Why check() before the call, then allow() after? You don't have actual token counts until the LLM responds. check() gates the request without consuming anything, so a rejected request never costs you an LLM call, and a failed LLM call never costs the customer credits. allow() after the call meters the real counts. If you use hard limits, keep what you check close to what you consume: if actual input routinely differs from the estimate, tighten the estimator or meter the difference in a separate allow() call.

Why tiered output pricing on Growth? Output tokens are expensive. Tiered pricing gives customers a rate benefit at lower volumes and protects your margin at high volumes — the same structure most model providers use. The tiered model applies each band's rate only to units within that band (graduated, not retroactive). Use volume if you want a single rate applied retroactively once a threshold is crossed.
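
The graduated arithmetic can be sketched directly. This is an illustration of the math, not the library's internals; the tier values are copied from sonnet_output above:

```typescript
// Graduated tiers: each band's rate applies only to units inside the band.
interface Tier {
  upTo?: number; // absent = unbounded top band
  price: number; // per-unit rate within this band
}

function tieredCost(units: number, tiers: Tier[]): number {
  let cost = 0;
  let bandStart = 0;
  for (const tier of tiers) {
    const bandEnd = tier.upTo ?? Infinity;
    const inBand = Math.min(units, bandEnd) - bandStart;
    if (inBand <= 0) break; // all units already priced
    cost += inBand * tier.price;
    bandStart = bandEnd;
  }
  return cost;
}

// sonnet_output tiers from the policy above
const sonnetOutput: Tier[] = [
  { upTo: 200000, price: 0.000022 },
  { upTo: 1000000, price: 0.000020 },
  { price: 0.000018 },
];
// 300,000 tokens: 200,000 at 0.000022 plus 100,000 at 0.000020 = 6.40
```

Under a volume (retroactive) model, the same 300,000 tokens would all be billed at the second band's rate; the graduated model avoids the price cliff at each threshold.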

The monthly grant on Growth — monthly_credits has included: true, so all Growth customers get 50 AI credits automatically. These are drawn against overage before any meter-overage event fires. At the exchange rate (one Sonnet input token costs 0.000004 AI credits), 50 AI credits cover roughly 12.5 million Sonnet input tokens of overage — a meaningful buffer that reduces billing noise for customers who occasionally spike.
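
The buffer math follows directly from the exchange table, reading each value as the worth of one unit in the named currency (the same convention the rune and ai_credit entries use). A worked sketch, not library code:

```typescript
// How far the 50-credit monthly grant stretches, per the exchange table.
const grantCredits = 50;
const sonnetInputCreditsPerToken = 0.000004; // sonnet_input -> ai_credit
const haikuInputCreditsPerToken = 0.000001;  // haiku_input  -> ai_credit

const sonnetInputTokensCovered = grantCredits / sonnetInputCreditsPerToken;
const haikuInputTokensCovered = grantCredits / haikuInputCreditsPerToken;
// 50 / 0.000004 = 12.5 million Sonnet input tokens
// 50 / 0.000001 = 50 million Haiku input tokens

// In money terms: 1 ai_credit = 1.25 rune, and 1 rune = 1 USD.
const grantUsd = grantCredits * 1.25; // the grant is worth 62.50 USD
```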

Multi-model plan design — giving each model its own pair of entitlements means you can limit Sonnet and Haiku independently. Customers can't burn their Haiku budget on Sonnet. If you want a unified token pool across models, use a single abstract credit for enforcement and map all model credits to it in the exchange table — but you lose per-model visibility and can't apply different limits per model tier.
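
A unified pool would look roughly like this, reusing the limit and exchange shapes from the policy above. The entitlement name token_pool and the 100-credit value are made up for illustration:

```yaml
# One abstract pool enforced for every model: the limit is checked
# against ai_credit, and the existing exchange table converts each
# model credit into ai_credit for metering.
entitlements:
  token_pool:
    limit: { credit: ai_credit, mode: soft, value: 100, resets: true, reset_inc: 1day }
# exchange (unchanged from the policy above):
#   sonnet_input: 0.000004 ai_credit per token
#   haiku_input:  0.000001 ai_credit per token
```

The trade-off stands as described: one limit across models is simpler to explain to customers, but you can no longer cap an expensive model independently of a cheap one.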

Streaming responses — for streaming, meter at stream completion, not at stream start. Read usage.output_tokens from the final chunk and call allow(customerId, outputEntitlement, finalCount) once the stream closes.
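
The streaming pattern can be sketched as follows. The chunk shape is an assumption (most providers attach a usage report to the final chunk), and the meter callback stands in for the policy.allow call above:

```typescript
// Meter streamed output once, after the stream closes, with the real count.
interface StreamChunk {
  content?: string;
  usage?: { output_tokens: number }; // present on the final chunk
}

async function meterStreamedOutput(
  stream: AsyncIterable<StreamChunk>,
  onContent: (text: string) => void,
  meter: (outputTokens: number) => Promise<void>
): Promise<number> {
  let finalCount = 0;
  for await (const chunk of stream) {
    if (chunk.content) onContent(chunk.content); // forward to the client
    if (chunk.usage) finalCount = chunk.usage.output_tokens;
  }
  // Stream closed: the count is now known. Here `meter` would wrap
  // policy.allow(customerId, outputEntitlement, finalCount).
  await meter(finalCount);
  return finalCount;
}
```

Note that metering after the stream closes means a customer can overrun a hard output limit on their final streamed response; if that matters, abort the stream early once a running token estimate crosses the remaining balance.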