nervico-teamcloud-architecture · 10 min read

AWS Bedrock for AI: Integration with Agents and Applications

Technical guide to AWS Bedrock: how to integrate foundation models into applications, build AI agents, implement RAG, and manage inference costs in production.

Integrating generative AI into a production application is not about calling an API and displaying the response. It is about managing latency, costs, hallucinations, data security, and availability. And doing it on infrastructure you already have on AWS, without moving data to external providers and without managing GPUs.

AWS Bedrock is the service that solves this problem. It provides access to foundation models from Anthropic (Claude), Meta (Llama), Mistral, Amazon (Titan), and others through a unified API, with no need to provision ML infrastructure. Your data does not leave your AWS account. It is not used to train the models. And scaling is automatic.

This article explains how Bedrock works in practice, how to integrate models into applications, how to build agents that execute actions, how to implement RAG with your company's data, and what it actually costs.

What AWS Bedrock Is

Managed Foundation Model Service

Bedrock is not an AI model. It is a platform that provides access to multiple models through a standard API. You do not need to train models, manage GPUs, or install ML frameworks.

Available models (primary):

| Provider | Model | Context | Strength |
| --- | --- | --- | --- |
| Anthropic | Claude 3.5 Sonnet | 200K tokens | Reasoning, code, analysis |
| Anthropic | Claude 3 Haiku | 200K tokens | Speed, low cost |
| Meta | Llama 3.1 70B | 128K tokens | Open source, customizable |
| Meta | Llama 3.1 8B | 128K tokens | Light tasks, minimal cost |
| Mistral | Mistral Large | 128K tokens | Multilingual, European |
| Amazon | Titan Text Express | 8K tokens | Low cost, simple tasks |

What Bedrock manages for you:

  • Inference infrastructure (GPUs, scaling, availability)
  • Model versioning
  • Security and compliance (data in your VPC, encryption in transit and at rest)
  • Logging and monitoring with CloudWatch
  • Automatic rate limiting and throttling

What Bedrock Does Not Do

  • It does not train custom models from scratch. It allows fine-tuning of some models, but not complete training.
  • It does not guarantee constant latency. Response times range from 1 to 30 seconds depending on the model and prompt length.
  • It does not eliminate hallucinations. Models still generate incorrect information. Bedrock provides guardrails, but verification responsibility is yours.

Basic API Integration

Model Invocation

The most direct way to use Bedrock is the InvokeModel API:

import boto3
import json

bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-east-1'
)

def invoke_claude(prompt, max_tokens=1024):
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ],
        "temperature": 0.3
    })

    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
        body=body,
        contentType="application/json",
        accept="application/json"
    )

    result = json.loads(response['body'].read())
    return result['content'][0]['text']

# Usage example
response = invoke_claude(
    "Analyze this error log and suggest the root cause: "
    "ERROR 2025-10-01 14:23:45 ConnectionPool exhausted, "
    "max connections: 50, active: 50, waiting: 127"
)
print(response)

Streaming for Long Responses

For long responses, streaming reduces perceived latency:

def invoke_claude_streaming(prompt, max_tokens=2048):
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [
            {"role": "user", "content": prompt}
        ],
        "temperature": 0.3
    })

    response = bedrock_runtime.invoke_model_with_response_stream(
        modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
        body=body,
        contentType="application/json",
        accept="application/json"
    )

    full_response = ""
    for event in response['body']:
        chunk = json.loads(event['chunk']['bytes'])
        if chunk['type'] == 'content_block_delta':
            text = chunk['delta'].get('text', '')
            full_response += text
            print(text, end='', flush=True)

    return full_response
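
Separating stream parsing from printing makes the logic easier to test. The following is a minimal sketch, assuming the same Anthropic event shapes used above (`content_block_delta` for text, `message_delta` for the stop reason); the helper name `collect_stream_text` and the sample events are hypothetical:

```python
import json


def collect_stream_text(events):
    """Join text deltas from an InvokeModelWithResponseStream event stream.

    Returns (full_text, stop_reason). stop_reason is None if the stream
    ended without a message_delta event.
    """
    parts = []
    stop_reason = None
    for event in events:
        chunk = event.get("chunk")
        if chunk is None:
            # Events without a chunk (for example, stream-level errors)
            # are skipped here; production code should surface them.
            continue
        payload = json.loads(chunk["bytes"])
        if payload.get("type") == "content_block_delta":
            parts.append(payload["delta"].get("text", ""))
        elif payload.get("type") == "message_delta":
            stop_reason = payload["delta"].get("stop_reason")
    return "".join(parts), stop_reason


def _fake_event(payload):
    # Builds an event in the same shape as response['body'] yields.
    return {"chunk": {"bytes": json.dumps(payload).encode("utf-8")}}


sample_events = [
    _fake_event({"type": "content_block_delta", "delta": {"text": "Pool "}}),
    _fake_event({"type": "content_block_delta", "delta": {"text": "exhausted"}}),
    _fake_event({"type": "message_delta", "delta": {"stop_reason": "end_turn"}}),
]
text, reason = collect_stream_text(sample_events)
print(text, reason)
```

In the streaming function above, the same helper could be called as `collect_stream_text(response['body'])`. A stop reason of `max_tokens` indicates the response was cut off by the token limit.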

Converse API: Unified Interface

The Converse API has been the recommended way to interact with Bedrock since 2024. It provides a consistent interface regardless of the model:

def converse_with_model(messages, model_id, system_prompt=None):
    kwargs = {
        "modelId": model_id,
        "messages": messages,
        "inferenceConfig": {
            "maxTokens": 2048,
            "temperature": 0.3,
            "topP": 0.9
        }
    }

    if system_prompt:
        kwargs["system"] = [{"text": system_prompt}]

    response = bedrock_runtime.converse(**kwargs)
    return response['output']['message']['content'][0]['text']

# Usage with multi-turn conversation
messages = [
    {
        "role": "user",
        "content": [{"text": "I am the CTO of a SaaS startup with 50K users. We are on AWS."}]
    },
    {
        "role": "assistant",
        "content": [{"text": "Understood. I can help with architecture, costs, or operations on AWS."}]
    },
    {
        "role": "user",
        "content": [{"text": "Our RDS database is at 85% CPU. Options?"}]
    }
]

response = converse_with_model(
    messages=messages,
    model_id="anthropic.claude-3-5-sonnet-20241022-v2:0",
    system_prompt="You are a senior AWS architect. Respond with concrete options and estimated costs."
)
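
Because the Converse response has the same shape for every model, token usage and latency can be read with one helper. This is a sketch under the assumption that the response carries the `usage`, `metrics`, and `stopReason` fields; the helper name and the sample response are hypothetical:

```python
def summarize_converse_response(response):
    """Extract text, stop reason, token counts, and latency from a
    Converse API response dictionary."""
    message = response["output"]["message"]
    # A message can hold several content blocks; keep only the text ones.
    text = "".join(block["text"] for block in message["content"] if "text" in block)
    usage = response.get("usage", {})
    return {
        "text": text,
        "stop_reason": response.get("stopReason"),
        "input_tokens": usage.get("inputTokens", 0),
        "output_tokens": usage.get("outputTokens", 0),
        "latency_ms": response.get("metrics", {}).get("latencyMs"),
    }


# Hand-written response in the Converse shape, used only for illustration.
sample_response = {
    "output": {"message": {"role": "assistant", "content": [{"text": "Add a read replica."}]}},
    "stopReason": "end_turn",
    "usage": {"inputTokens": 120, "outputTokens": 8, "totalTokens": 128},
    "metrics": {"latencyMs": 2150},
}
summary = summarize_converse_response(sample_response)
print(summary["input_tokens"], summary["output_tokens"], summary["latency_ms"])
```

Logging these values per request gives the raw data needed for the cost and latency analysis discussed later in the article.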

Bedrock Agents: AI That Executes Actions

What Bedrock Agents Are

Bedrock Agents go beyond generating text. They can reason about a task, decide which actions to execute, and call external APIs to complete the work. The agent receives a natural language instruction, decomposes the task into steps, executes each step by calling available tools, and returns the result.

User: "Find orders for customer ACME from the last month
       and generate a summary with the total billed amount"

Agent:
  1. Calls orders API with filter: customer=ACME, date=last month
  2. Receives the list of orders
  3. Calculates the total billed amount
  4. Generates a natural language summary
  5. Returns the result to the user

Configuring an Agent

A Bedrock Agent is composed of:

  • Instructions: The system prompt that defines the agent's behavior.
  • Action Groups: The tools the agent can use, defined as API schemas (OpenAPI).
  • Knowledge Bases: Document collections the agent can query through RAG.

import boto3

bedrock_agent = boto3.client('bedrock-agent', region_name='us-east-1')

# Create agent
response = bedrock_agent.create_agent(
    agentName='customer-support-agent',
    agentResourceRoleArn='arn:aws:iam::123456789012:role/bedrock-agent-role',  # placeholder 12-digit account ID
    foundationModel='anthropic.claude-3-5-sonnet-20241022-v2:0',
    instruction="""
    You are a technical support agent for a SaaS platform.
    Your goal is to help customers resolve technical issues.

    Rules:
    - Check the customer's history before responding.
    - If the issue requires escalation, create a ticket in the system.
    - Never share information from one customer with another.
    - Always respond in the customer's language.
    """,
    idleSessionTTLInSeconds=1800
)

Action Group with OpenAPI schema:

openapi: 3.0.0
info:
  title: Customer API
  version: 1.0.0
paths:
  /customers/{customerId}/orders:
    get:
      summary: Get customer orders
      operationId: getCustomerOrders
      parameters:
        - name: customerId
          in: path
          required: true
          schema:
            type: string
        - name: dateFrom
          in: query
          schema:
            type: string
            format: date
        - name: dateTo
          in: query
          schema:
            type: string
            format: date
      responses:
        '200':
          description: List of orders
  /tickets:
    post:
      summary: Create support ticket
      operationId: createTicket
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              properties:
                customerId:
                  type: string
                subject:
                  type: string
                priority:
                  type: string
                  enum: [low, medium, high, critical]
                description:
                  type: string

The agent decides when and how to call each endpoint based on the conversation with the user. You do not need to program the decision logic: the foundation model handles it.
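
The endpoints themselves still need an implementation, typically a Lambda function attached to the action group. The following is a minimal sketch of such a handler for the `getCustomerOrders` operation, assuming the event and response format Bedrock Agents uses for OpenAPI-based action groups; `FAKE_ORDERS` is hypothetical data standing in for a real orders service:

```python
import json

# Hypothetical in-memory data standing in for a real orders service.
FAKE_ORDERS = {
    "ACME": [{"id": "o-1", "total": 120.0}, {"id": "o-2", "total": 80.0}],
}


def lambda_handler(event, context):
    """Handle an action group invocation sent by a Bedrock agent."""
    # Path and query parameters arrive as a list of {name, type, value} items.
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}

    is_orders_call = (
        event["apiPath"] == "/customers/{customerId}/orders"
        and event["httpMethod"] == "GET"
    )
    if is_orders_call:
        status = 200
        body = {"orders": FAKE_ORDERS.get(params.get("customerId"), [])}
    else:
        status = 404
        body = {"error": "unsupported operation"}

    # The agent expects the action group, path, and method echoed back,
    # with the payload serialized as a JSON string.
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "apiPath": event["apiPath"],
            "httpMethod": event["httpMethod"],
            "httpStatusCode": status,
            "responseBody": {"application/json": {"body": json.dumps(body)}},
        },
    }


sample_event = {
    "actionGroup": "customer-api",
    "apiPath": "/customers/{customerId}/orders",
    "httpMethod": "GET",
    "parameters": [{"name": "customerId", "type": "string", "value": "ACME"}],
}
result = lambda_handler(sample_event, None)
print(result["response"]["httpStatusCode"])
```

The date filters and the `createTicket` operation are omitted to keep the sketch short; they would follow the same pattern.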

RAG with Bedrock Knowledge Bases

What RAG Is and Why It Matters

RAG (Retrieval-Augmented Generation) is the pattern that allows an AI model to answer questions about your company's data. Instead of relying solely on the knowledge it was trained on, the model searches for relevant information in your documents and uses it as context to generate the response.

Without RAG, the model tends to hallucinate when asked about data it has never seen. With RAG, the model bases its responses on real documents that you control.
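
The retrieve-then-generate flow can be illustrated without any AWS service. This is a toy sketch: it replaces a real embedding model with bag-of-words vectors, so it shows only the shape of the pattern, not production-quality retrieval. All function names and documents are hypothetical:

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': word counts. A real system uses an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, k=1):
    """Return the k documents most similar to the question."""
    query = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(query, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, documents, k=1):
    """Place the retrieved fragments in the prompt as context."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


docs = [
    "Refunds are processed within 14 days of the request.",
    "The API rate limit is 100 requests per minute per key.",
]
print(build_prompt("What is the API rate limit?", docs))
```

The resulting prompt is what gets sent to the model. Bedrock Knowledge Bases performs the equivalent steps with real embeddings and a vector store.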

Knowledge Base Configuration

Bedrock Knowledge Bases manages the complete RAG pipeline:

  1. Ingestion: Loads documents from S3 (PDF, Word, HTML, plain text, CSV).
  2. Chunking: Splits documents into configurable-size fragments.
  3. Embedding: Converts each fragment into a numerical vector using an embedding model.
  4. Storage: Stores vectors in a vector database (OpenSearch Serverless, Aurora PostgreSQL with pgvector, Pinecone).
  5. Query: When the user asks a question, Bedrock finds the most relevant fragments and includes them in the model's prompt.

# Create Knowledge Base
response = bedrock_agent.create_knowledge_base(
    name='product-documentation',
    description='Product technical documentation',
    roleArn='arn:aws:iam::123456789012:role/bedrock-kb-role',  # placeholder 12-digit account ID
    knowledgeBaseConfiguration={
        'type': 'VECTOR',
        'vectorKnowledgeBaseConfiguration': {
            'embeddingModelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0'
        }
    },
    storageConfiguration={
        'type': 'OPENSEARCH_SERVERLESS',
        'opensearchServerlessConfiguration': {
            'collectionArn': 'arn:aws:aoss:us-east-1:123456789012:collection/my-collection',  # placeholder
            'vectorIndexName': 'product-docs-index',
            'fieldMapping': {
                'vectorField': 'embedding',
                'textField': 'text',
                'metadataField': 'metadata'
            }
        }
    }
)
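
Once documents have been ingested, the query step can be run with the RetrieveAndGenerate API on the `bedrock-agent-runtime` client. The following is a sketch, not a definitive implementation: the helper names are hypothetical, and the knowledge base ID and model ARN are values you would supply from your own account:

```python
def build_rag_request(question, knowledge_base_id, model_arn, max_results=5):
    """Build keyword arguments for a RetrieveAndGenerate call."""
    return {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": knowledge_base_id,
                "modelArn": model_arn,
                # Number of fragments retrieved from the vector store.
                "retrievalConfiguration": {
                    "vectorSearchConfiguration": {"numberOfResults": max_results}
                },
            },
        },
    }


def ask_knowledge_base(question, knowledge_base_id, model_arn, region="us-east-1"):
    """Query a knowledge base and return (answer_text, source_locations)."""
    import boto3  # imported here so the builder above works without boto3

    client = boto3.client("bedrock-agent-runtime", region_name=region)
    response = client.retrieve_and_generate(
        **build_rag_request(question, knowledge_base_id, model_arn)
    )
    # Each citation lists the fragments the answer was grounded on.
    sources = [
        ref["location"]
        for citation in response.get("citations", [])
        for ref in citation.get("retrievedReferences", [])
        if "location" in ref
    ]
    return response["output"]["text"], sources


request = build_rag_request(
    "How do I rotate API keys?",
    knowledge_base_id="EXAMPLEKB01",  # placeholder ID
    model_arn="arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0",
)
print(request["retrieveAndGenerateConfiguration"]["type"])
```

Returning the source locations alongside the answer lets the application show users where each response came from.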

Chunking Strategies

How you split documents directly affects response quality:

Fixed-size chunking: Splits into fragments of N tokens (300-500 tokens is a reasonable starting range for most cases). Simple, but it can cut information in the middle of an idea.

Hierarchical chunking: Creates parent chunks (complete sections) and child chunks (paragraphs). Search operates on child chunks, but context includes the parent chunk. Improves response coherence.

Semantic chunking: Splits at points where meaning changes. More complex to implement but produces more coherent fragments.

# Configure data source with hierarchical chunking
response = bedrock_agent.create_data_source(
    knowledgeBaseId='kb-123456',  # placeholder: use the knowledgeBaseId returned by create_knowledge_base
    name='technical-docs',
    dataSourceConfiguration={
        'type': 'S3',
        's3Configuration': {
            'bucketArn': 'arn:aws:s3:::my-docs-bucket',
            'inclusionPrefixes': ['docs/']
        }
    },
    vectorIngestionConfiguration={
        'chunkingConfiguration': {
            'chunkingStrategy': 'HIERARCHICAL',
            'hierarchicalChunkingConfiguration': {
                'levelConfigurations': [
                    {'maxTokens': 1500},  # Parent chunks
                    {'maxTokens': 300}    # Child chunks
                ],
                'overlapTokens': 60
            }
        }
    }
)
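
To see what the simplest strategy does to a document, fixed-size chunking with overlap can be sketched in a few lines. This illustration approximates tokens with whitespace-separated words, which is an assumption; Bedrock counts tokens with the embedding model's tokenizer. The function name is hypothetical:

```python
def chunk_fixed(tokens, max_tokens=300, overlap=60):
    """Split a token list into windows of max_tokens that share
    `overlap` tokens with the previous window."""
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("require max_tokens > 0 and 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


# Words stand in for tokens in this example.
words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_fixed(words, max_tokens=4, overlap=1):
    print(" ".join(chunk))
```

The overlap repeats the tail of each chunk at the head of the next, which reduces the chance that a sentence split across a boundary is lost to retrieval.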

Guardrails: AI Control

Bedrock Guardrails

Guardrails allow you to define policies the model must follow. They filter unwanted content, block sensitive topics, and verify that responses meet your criteria.

Types of guardrails:

  • Content filters: Block violent, sexual, discriminatory, or hateful content.
  • Denied topics: Prevent the model from discussing specific topics (competitors, non-public financial information, political opinions).
  • Word filters: Block specific words or phrases in the response.
  • Sensitive information filters: Detect and redact PII (names, emails, phone numbers, credit card numbers).
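  • Contextual grounding: Verify the response is based on provided context (reduces hallucinations in RAG).

# Guardrails are created with the 'bedrock' control-plane client, not 'bedrock-agent'
bedrock = boto3.client('bedrock', region_name='us-east-1')

# Create guardrail
response = bedrock.create_guardrail(
    name='production-guardrail',
    description='Production guardrail',
    # Required: messages returned when an input or an output is blocked
    blockedInputMessaging='Sorry, I cannot help with that request.',
    blockedOutputsMessaging='Sorry, I cannot provide that response.',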
    contentPolicyConfig={
        'filtersConfig': [
            {'type': 'SEXUAL', 'inputStrength': 'HIGH', 'outputStrength': 'HIGH'},
            {'type': 'VIOLENCE', 'inputStrength': 'HIGH', 'outputStrength': 'HIGH'},
            {'type': 'HATE', 'inputStrength': 'HIGH', 'outputStrength': 'HIGH'}
        ]
    },
    topicPolicyConfig={
        'topicsConfig': [
            {
                'name': 'competitor-information',
                'definition': 'Information about competitor products or services',
                'examples': [
                    'What do you think about [Competitor X]?',
                    'Is your product better than [Competitor Y]?'
                ],
                'type': 'DENY'
            }
        ]
    },
    sensitiveInformationPolicyConfig={
        'piiEntitiesConfig': [
            {'type': 'EMAIL', 'action': 'ANONYMIZE'},
            {'type': 'PHONE', 'action': 'ANONYMIZE'},
            {'type': 'CREDIT_DEBIT_CARD_NUMBER', 'action': 'BLOCK'}
        ]
    }
)
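
A guardrail has no effect until it is attached to an inference call. The following sketch shows one way to do that with the Converse API through its `guardrailConfig` parameter; the helper names and the guardrail ID are hypothetical, and `DRAFT` is assumed to be the working version of a newly created guardrail:

```python
def build_guarded_request(model_id, user_text, guardrail_id, guardrail_version="DRAFT"):
    """Build Converse keyword arguments with a guardrail attached."""
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
        "guardrailConfig": {
            "guardrailIdentifier": guardrail_id,
            "guardrailVersion": guardrail_version,
        },
    }


def guardrail_intervened(response):
    """True when the guardrail blocked or rewrote the exchange."""
    return response.get("stopReason") == "guardrail_intervened"


request = build_guarded_request(
    model_id="anthropic.claude-3-5-sonnet-20241022-v2:0",
    user_text="Is your product better than Competitor Y?",
    guardrail_id="example-guardrail-id",  # placeholder: use guardrailId from create_guardrail
)
# With credentials configured, the call would be:
#   response = bedrock_runtime.converse(**request)
print(request["guardrailConfig"])
```

Checking `guardrail_intervened(response)` lets the application log interventions and tell them apart from normal completions.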

Bedrock Costs

Billing Model

Bedrock charges per token processed. Billing separates input tokens (prompt) and output tokens (response):

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
| --- | --- | --- |
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3 Haiku | $0.25 | $1.25 |
| Llama 3.1 70B | $2.65 | $3.50 |
| Llama 3.1 8B | $0.22 | $0.22 |
| Titan Text Express | $0.20 | $0.60 |

Provisioned Throughput

For predictable workloads, Provisioned Throughput offers reserved capacity:

  • Guarantees a minimum number of tokens per minute.
  • Fixed hourly cost (independent of usage).
  • Useful if you process more than 100,000 requests per day with the same model.

Monthly Cost Example

| Use Case | Model | Requests/Month | Avg Tokens (in/out) | Monthly Cost |
| --- | --- | --- | --- | --- |
| Support chatbot (low) | Haiku | 10,000 | 500/200 | $3.75 |
| Support chatbot (medium) | Sonnet | 50,000 | 1,000/500 | $525 |
| Document analysis | Sonnet | 5,000 | 5,000/2,000 | $225 |
| Enterprise RAG | Sonnet + embedding | 20,000 | 2,000/1,000 | $420 |
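
The monthly figures follow directly from the per-token prices. This short sketch reproduces them, assuming the prices listed in the billing table and ignoring embedding costs; the function and dictionary names are hypothetical:

```python
# (input, output) price in USD per 1M tokens, taken from the billing table.
PRICES = {
    "sonnet": (3.00, 15.00),
    "haiku": (0.25, 1.25),
}


def monthly_cost(model, requests, in_tokens, out_tokens):
    """Estimated monthly inference cost in USD for a steady workload."""
    in_price, out_price = PRICES[model]
    input_cost = requests * in_tokens / 1_000_000 * in_price
    output_cost = requests * out_tokens / 1_000_000 * out_price
    return round(input_cost + output_cost, 2)


print(monthly_cost("haiku", 10_000, 500, 200))      # support chatbot (low)
print(monthly_cost("sonnet", 50_000, 1_000, 500))   # support chatbot (medium)
print(monthly_cost("sonnet", 5_000, 5_000, 2_000))  # document analysis
print(monthly_cost("sonnet", 20_000, 2_000, 1_000)) # enterprise RAG, embeddings excluded
```

Output tokens dominate in every Sonnet row because they cost five times as much as input tokens, so limiting response length often saves more than trimming the prompt.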

Cost optimization:

  1. Use the smallest model that works: For text classification or simple extraction, Haiku is 12 times cheaper than Sonnet and responds considerably faster.
  2. Reduce the prompt: Every unnecessary token in the system prompt multiplies by the number of requests. A 500-token system prompt with 100,000 requests/month costs an additional $150 with Sonnet.
  3. Cache responses for repetitive queries: If 20% of questions repeat, a cache in ElastiCache or DynamoDB reduces inference costs by roughly 20%.
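
The caching idea in the last point can be sketched with an in-memory dictionary. This is an illustration only: a production cache would live in ElastiCache or DynamoDB with an expiry, and the class name and normalization rule here are assumptions:

```python
import hashlib


class ResponseCache:
    """Cache model responses keyed by model ID and normalized prompt."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    @staticmethod
    def _key(model_id, prompt):
        # Lowercase and collapse whitespace so trivial variations share a key.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(f"{model_id}\n{normalized}".encode("utf-8")).hexdigest()

    def get_or_call(self, model_id, prompt, call_model):
        """Return a cached response, or call the model and store the result."""
        key = self._key(model_id, prompt)
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = call_model(prompt)
        return self._store[key]


cache = ResponseCache()
calls = []


def fake_model(prompt):
    # Stand-in for a real Bedrock call; records each invocation.
    calls.append(prompt)
    return f"answer #{len(calls)}"


first = cache.get_or_call("haiku", "How do I reset my password?", fake_model)
second = cache.get_or_call("haiku", "how do I  reset my password?", fake_model)
print(first == second, len(calls), cache.hits)
```

Exact-match caching only helps when questions repeat nearly verbatim, and cached answers should expire when the underlying data changes.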

Production Architecture

Complete Pattern

Client -> API Gateway -> Lambda -> Bedrock Runtime
                                      |
                           +----------+----------+
                           |          |          |
                      Guardrails  Knowledge   Agent
                                   Base      Actions
                                      |          |
                                 OpenSearch   Lambda
                                   (RAG)     (tools)

Components:

  • API Gateway: Entry point with authentication and rate limiting.
  • Lambda: Orchestrates the Bedrock call, manages conversation context, and applies business logic.
  • Bedrock Runtime: Runs model inference.
  • Guardrails: Filters input and output.
  • Knowledge Base: Provides RAG context.
  • Agent Actions: Executes tools (database queries, internal API calls).

Conversation State Management

The model invocation APIs (InvokeModel and Converse) do not maintain state between calls. Your application must manage the conversation history:

import boto3
from datetime import datetime, timezone

dynamodb = boto3.resource('dynamodb')
conversations_table = dynamodb.Table('conversations')

def get_conversation_history(session_id, max_messages=20):
    # Return the most recent messages for a session (empty list if none exist)
    response = conversations_table.get_item(
        Key={'session_id': session_id}
    )
    if 'Item' in response:
        messages = response['Item'].get('messages', [])
        return messages[-max_messages:]
    return []

def save_message(session_id, role, content):
    # datetime.utcnow() is deprecated since Python 3.12; use a timezone-aware UTC timestamp
    now = datetime.now(timezone.utc).isoformat()
    # Append to the message list, creating it on the first write
    conversations_table.update_item(
        Key={'session_id': session_id},
        UpdateExpression='SET messages = list_append(if_not_exists(messages, :empty), :msg), updated_at = :now',
        ExpressionAttributeValues={
            ':msg': [{'role': role, 'content': content, 'timestamp': now}],
            ':empty': [],
            ':now': now
        }
    )
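
The stored history still has to be converted into the message format the Converse API expects. The sketch below assumes items shaped like those written by `save_message` (`role`, `content`, `timestamp`) and that the model expects the conversation to start with a user turn and alternate roles, which a truncated history may violate. The function name is hypothetical:

```python
def to_converse_messages(stored_messages):
    """Convert stored history items into Converse API messages."""
    messages = []
    for item in stored_messages:
        role, text = item["role"], item["content"]
        if role not in ("user", "assistant"):
            continue  # ignore anything that is not a conversation turn
        if messages and messages[-1]["role"] == role:
            # Merge consecutive turns from the same role into one message.
            messages[-1]["content"].append({"text": text})
        else:
            messages.append({"role": role, "content": [{"text": text}]})
    # Truncation can leave an assistant turn first; drop it.
    while messages and messages[0]["role"] != "user":
        messages.pop(0)
    return messages


history = [
    {"role": "assistant", "content": "Anything else?", "timestamp": "2025-10-01T14:00:00+00:00"},
    {"role": "user", "content": "Our RDS database is at 85% CPU.", "timestamp": "2025-10-01T14:01:00+00:00"},
    {"role": "user", "content": "Options?", "timestamp": "2025-10-01T14:01:05+00:00"},
]
print(to_converse_messages(history))
```

The result can be passed as the `messages` argument of a Converse call after appending the new user message.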

Common Mistakes When Integrating AI with Bedrock

Mistake 1: Not Managing Latency

A call to Claude 3.5 Sonnet with a 2,000-token prompt typically takes 3-8 seconds. If your API has a 3-second timeout, requests will fail intermittently. Configure generous timeouts (30-60 seconds) and use streaming to improve the perceived experience.

Mistake 2: System Prompts That Are Too Long

Every system prompt token is processed on every invocation. A 3,000-token system prompt with 50,000 requests/month costs an additional $450 in input tokens alone with Sonnet. Be concise.

Mistake 3: Not Implementing Fallbacks

If Bedrock returns an error (throttling, timeout, model error), your application should not show a generic error. Implement:

  • Retry with exponential backoff for transient errors.
  • Fallback to a lighter model if the primary model is unavailable.
  • Predefined responses for frequently asked questions that do not need AI.
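
The first two points can be combined in one wrapper. This is a minimal sketch, not a definitive implementation: `call_model` is a hypothetical function that invokes one model ID, and errors are treated as transient when they carry a boto3-style `response['Error']['Code']` from the set below:

```python
import random
import time

# Error codes treated as transient in this sketch.
TRANSIENT_CODES = {
    "ThrottlingException",
    "ModelTimeoutException",
    "ServiceUnavailableException",
    "InternalServerException",
}


def is_transient(error):
    """True if the exception carries a retryable AWS error code."""
    code = getattr(error, "response", {}).get("Error", {}).get("Code")
    return code in TRANSIENT_CODES


def call_with_fallback(call_model, model_ids, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Try each model in order, retrying transient errors with
    exponential backoff and jitter before moving to the next model."""
    if not model_ids:
        raise ValueError("model_ids must not be empty")
    last_error = None
    for model_id in model_ids:
        for attempt in range(max_attempts):
            try:
                return call_model(model_id)
            except Exception as error:
                if not is_transient(error):
                    raise  # permanent errors are not retried
                last_error = error
                if attempt < max_attempts - 1:
                    # Delay doubles on each attempt, plus random jitter.
                    sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    raise last_error
```

A call might look like `call_with_fallback(lambda m: converse_with_model(messages, m), [sonnet_id, haiku_id])`, where the two IDs are the primary and the lighter fallback model. boto3 also has built-in retry modes, so this wrapper mainly adds the model fallback.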

Mistake 4: Ignoring Prompt Security

Users may attempt to inject instructions into the prompt to manipulate the model's behavior. Use guardrails, validate user input, and never include sensitive data in the prompt that you do not want the model to potentially repeat.

Conclusion

AWS Bedrock simplifies the integration of generative AI into production applications. It eliminates ML infrastructure management, keeps data within your AWS account, and provides security tools like guardrails and native encryption.

But it is not magic. Models hallucinate, latency is variable, and costs accumulate fast if you do not optimize prompts and models. The key is to start simple (an API call with the right model), add RAG when you need answers based on your own data, and build agents only when the task requires executing actions.

If you are evaluating how to integrate AI into your product or need help designing an AI architecture on AWS, our team has direct experience with Bedrock in production. Request a free audit to evaluate AI opportunities in your application.
