AI Token Metering

Multi-model token metering with separate input/output credits, tiered output pricing, and monthly credit grants that buffer soft-limit overage — plus the pre-call/post-call enforcement split that stops you from paying the model provider for requests your limits would have blocked.


Policy

policy:
  credits:
    # Discrete: one per model direction.
    # overhead_cost is what you pay; price is what you charge.
    sonnet_input:
      description: Claude Sonnet 4 — input tokens
      overhead_cost: 0.000003
      pricing_model: flat
      price: { amount: 0.000004 }
      stof_units: int
      resets: true

    sonnet_output:
      description: Claude Sonnet 4 — output tokens
      overhead_cost: 0.000015
      pricing_model: tiered
      tiers:
        - up_to: 200000
          price: { amount: 0.000022 }
        - up_to: 1000000
          price: { amount: 0.000020 }
        - price: { amount: 0.000018 }
      stof_units: int
      resets: true

    haiku_input:
      description: Claude Haiku 4.5 — input tokens
      overhead_cost: 0.0000008
      pricing_model: flat
      price: { amount: 0.000001 }
      stof_units: int
      resets: true

    haiku_output:
      description: Claude Haiku 4.5 — output tokens
      overhead_cost: 0.000001
      pricing_model: flat
      price: { amount: 0.0000015 }
      stof_units: int
      resets: true

    # Abstract: the customer-facing unit. Maps to model tokens via exchange.
    ai_credit:
      description: AI Credits
      label: AI Credit
      unit: credit

  exchange:
    rune: { value: 1, currency: usd }
    ai_credit: { value: 1.25, currency: rune }
    sonnet_input: { value: 0.000004, currency: ai_credit }
    sonnet_output: { value: 0.000020, currency: ai_credit }
    haiku_input: { value: 0.000001, currency: ai_credit }
    haiku_output: { value: 0.0000015, currency: ai_credit }

  plans:
    starter:
      label: Starter
      period: monthly
      default: true
      entitlements:
        chat_access:
          description: Access to AI chat

        sonnet_input:
          limit: { credit: sonnet_input, mode: hard, value: 500000, resets: true, reset_inc: 1day }
        sonnet_output:
          limit: { credit: sonnet_output, mode: hard, value: 200000, resets: true, reset_inc: 1day }
        haiku_input:
          limit: { credit: haiku_input, mode: hard, value: 2000000, resets: true, reset_inc: 1day }
        haiku_output:
          limit: { credit: haiku_output, mode: hard, value: 1000000, resets: true, reset_inc: 1day }

    growth:
      label: Growth
      period: monthly
      entitlements:
        chat_access:
          description: Access to AI chat

        sonnet_input:
          limit: { credit: sonnet_input, mode: soft, value: 2000000, resets: true, reset_inc: 1day }
        sonnet_output:
          limit: { credit: sonnet_output, mode: soft, value: 800000, resets: true, reset_inc: 1day }
        haiku_input:
          limit: { credit: haiku_input, mode: soft, value: 10000000, resets: true, reset_inc: 1day }
        haiku_output:
          limit: { credit: haiku_output, mode: soft, value: 5000000, resets: true, reset_inc: 1day }

      topups:
        # Included monthly: 50 AI credits buffer soft-limit overage
        monthly_credits:
          description: 50 AI credits included monthly
          credit: ai_credit
          value: 50
          included: true
          resets: true
          reset_inc: 30days
          reset_mode: hard # unused credits don't roll over

        # Purchasable: 200 AI credits, valid 90 days
        credit_pack_200:
          description: 200 AI credits
          credit: ai_credit
          value: 200
          price: { amount: 24.00 }
          expires_after: 90days

Integration

import { Limitr } from '@formata/limitr';
import { readFileSync } from 'fs';

const policy = await Limitr.new(readFileSync('./policy.yaml', 'utf-8'), 'yaml');

// Wire up overage billing before any requests are processed
policy.addHandler('billing', (key: string, value: unknown) => {
  if (key === 'meter-overage') {
    const event = JSON.parse(value as string);
    // event.overage is the amount not covered by the monthly grant
    billing.queueCharge({
      customerId: event.customer.id,
      credit: event.credit.description,
      units: event.overage,
      entitlement: event.entitlement,
    });
  }
});

type Model = 'sonnet' | 'haiku';

async function handleChatRequest(
  customerId: string,
  prompt: string,
  model: Model = 'sonnet'
) {
  const inputEntitlement = `${model}_input`;
  const outputEntitlement = `${model}_output`;

  await policy.ensureCustomer(customerId, 'starter');

  // 1. Feature gate — does this customer have chat access?
  if (!await policy.check(customerId, 'chat_access')) {
    return { error: 'No chat access on this plan', code: 'NO_ACCESS' };
  }

  // 2. Pre-flight check on estimated input tokens.
  //    Use check() — don't consume yet in case the LLM call errors.
  const estimatedInput = estimateTokens(prompt);
  if (!await policy.check(customerId, inputEntitlement, estimatedInput)) {
    const remaining = await policy.remaining(customerId, inputEntitlement);
    return { error: 'Daily token limit reached', code: 'LIMIT_REACHED', remaining };
  }

  // 3. Call the LLM
  const response = await llm.complete({ model, prompt });
  const actualInput = response.usage.input_tokens;
  const actualOutput = response.usage.output_tokens;

  // 4. Meter actual input — allow() enforces and meters in one operation.
  await policy.allow(customerId, inputEntitlement, actualInput);

  // 5. Meter actual output post-call — you don't know the count until the response arrives.
  await policy.allow(customerId, outputEntitlement, actualOutput);

  return { content: response.content, usage: { input: actualInput, output: actualOutput } };
}

Notes

Why separate input and output credits? Input and output have different costs and different margins. A single token credit makes margin tracking inaccurate and prevents applying different pricing models per direction — e.g. tiered output pricing to incentivize concise responses. Separate them.

Why check() before the call, then allow() after? You don't have actual token counts until the LLM responds. check() gates the request without consuming anything, so a rejected request never costs you an LLM call, and a failed LLM call never costs the customer credits. allow() after the call meters the real counts. If you use hard limits, keep what you check close to what you consume: if actual input routinely differs from the estimate, tighten the estimator or meter the difference in a separate allow() call.

Why tiered output pricing on Growth? Output tokens are expensive. Tiered pricing gives customers a rate benefit at lower volumes and protects your margin at high volumes — the same structure most model providers use. The tiered model applies each band's rate only to units within that band (graduated, not retroactive). Use volume if you want a single rate applied retroactively once a threshold is crossed.
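
The graduated arithmetic can be sketched directly. This is an illustration of the math, not the library's internals; the tier values are copied from sonnet_output above:

```typescript
// Graduated tiers: each band's rate applies only to units inside the band.
interface Tier {
  upTo?: number; // absent = unbounded top band
  price: number; // per-unit rate within this band
}

function tieredCost(units: number, tiers: Tier[]): number {
  let cost = 0;
  let bandStart = 0;
  for (const tier of tiers) {
    const bandEnd = tier.upTo ?? Infinity;
    const inBand = Math.min(units, bandEnd) - bandStart;
    if (inBand <= 0) break; // all units already priced
    cost += inBand * tier.price;
    bandStart = bandEnd;
  }
  return cost;
}

// sonnet_output tiers from the policy above
const sonnetOutput: Tier[] = [
  { upTo: 200000, price: 0.000022 },
  { upTo: 1000000, price: 0.000020 },
  { price: 0.000018 },
];
// 300,000 tokens: 200,000 at 0.000022 plus 100,000 at 0.000020 = 6.40
```

Under a volume (retroactive) model, the same 300,000 tokens would all be billed at the second band's rate; the graduated model avoids the price cliff at each threshold.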

The monthly grant on Growth — monthly_credits has included: true, so all Growth customers get 50 AI credits automatically. These are drawn against overage before any meter-overage event fires. At the exchange rate (one Sonnet input token costs 0.000004 AI credits), 50 AI credits cover roughly 12.5 million Sonnet input tokens of overage — a meaningful buffer that reduces billing noise for customers who occasionally spike.
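
The buffer math follows directly from the exchange table, reading each value as the worth of one unit in the named currency (the same convention the rune and ai_credit entries use). A worked sketch, not library code:

```typescript
// How far the 50-credit monthly grant stretches, per the exchange table.
const grantCredits = 50;
const sonnetInputCreditsPerToken = 0.000004; // sonnet_input -> ai_credit
const haikuInputCreditsPerToken = 0.000001;  // haiku_input  -> ai_credit

const sonnetInputTokensCovered = grantCredits / sonnetInputCreditsPerToken;
const haikuInputTokensCovered = grantCredits / haikuInputCreditsPerToken;
// 50 / 0.000004 = 12.5 million Sonnet input tokens
// 50 / 0.000001 = 50 million Haiku input tokens

// In money terms: 1 ai_credit = 1.25 rune, and 1 rune = 1 USD.
const grantUsd = grantCredits * 1.25; // the grant is worth 62.50 USD
```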

Multi-model plan design — giving each model its own pair of entitlements means you can limit Sonnet and Haiku independently. Customers can't burn their Haiku budget on Sonnet. If you want a unified token pool across models, use a single abstract credit for enforcement and map all model credits to it in the exchange table — but you lose per-model visibility and can't apply different limits per model tier.
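
A unified pool would look roughly like this, reusing the limit and exchange shapes from the policy above. The entitlement name token_pool and the 100-credit value are made up for illustration:

```yaml
# One abstract pool enforced for every model: the limit is checked
# against ai_credit, and the existing exchange table converts each
# model credit into ai_credit for metering.
entitlements:
  token_pool:
    limit: { credit: ai_credit, mode: soft, value: 100, resets: true, reset_inc: 1day }
# exchange (unchanged from the policy above):
#   sonnet_input: 0.000004 ai_credit per token
#   haiku_input:  0.000001 ai_credit per token
```

The trade-off stands as described: one limit across models is simpler to explain to customers, but you can no longer cap an expensive model independently of a cheap one.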

Streaming responses — for streaming, meter at stream completion, not at stream start. Read usage.output_tokens from the final chunk and call allow(customerId, outputEntitlement, finalCount) once the stream closes.
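
The streaming pattern can be sketched as follows. The chunk shape is an assumption (most providers attach a usage report to the final chunk), and the meter callback stands in for the policy.allow call above:

```typescript
// Meter streamed output once, after the stream closes, with the real count.
interface StreamChunk {
  content?: string;
  usage?: { output_tokens: number }; // present on the final chunk
}

async function meterStreamedOutput(
  stream: AsyncIterable<StreamChunk>,
  onContent: (text: string) => void,
  meter: (outputTokens: number) => Promise<void>
): Promise<number> {
  let finalCount = 0;
  for await (const chunk of stream) {
    if (chunk.content) onContent(chunk.content); // forward to the client
    if (chunk.usage) finalCount = chunk.usage.output_tokens;
  }
  // Stream closed: the count is now known. Here `meter` would wrap
  // policy.allow(customerId, outputEntitlement, finalCount).
  await meter(finalCount);
  return finalCount;
}
```

Note that metering after the stream closes means a customer can overrun a hard output limit on their final streamed response; if that matters, abort the stream early once a running token estimate crosses the remaining balance.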