路 nervico-team 路 cloud-architecture 路 10 min read
AWS Bedrock for AI: Integration with Agents and Applications
Technical guide to AWS Bedrock: how to integrate foundation models into applications, build AI agents, implement RAG, and manage inference costs in production.
Integrating generative AI into a production application is not about calling an API and displaying the response. It is about managing latency, costs, hallucinations, data security, and availability. And doing it on infrastructure you already have on AWS, without moving data to external providers and without managing GPUs.
AWS Bedrock is the service that solves this problem. It provides access to foundation models from Anthropic (Claude), Meta (Llama), Mistral, Amazon (Titan), and others through a unified API, with no need to provision ML infrastructure. Your data does not leave your AWS account. It is not used to train the models. And scaling is automatic.
This article explains how Bedrock works in practice, how to integrate models into applications, how to build agents that execute actions, how to implement RAG with your company鈥檚 data, and what it actually costs.
What AWS Bedrock Is
Managed Foundation Model Service
Bedrock is not an AI model. It is a platform that provides access to multiple models through a standard API. You do not need to train models, manage GPUs, or install ML frameworks.
Available models (primary):
| Provider | Model | Context | Strength |
|---|---|---|---|
| Anthropic | Claude 3.5 Sonnet | 200K tokens | Reasoning, code, analysis |
| Anthropic | Claude 3 Haiku | 200K tokens | Speed, low cost |
| Meta | Llama 3.1 70B | 128K tokens | Open source, customizable |
| Meta | Llama 3.1 8B | 128K tokens | Light tasks, minimal cost |
| Mistral | Mistral Large | 128K tokens | Multilingual, European |
| Amazon | Titan Text Express | 8K tokens | Low cost, simple tasks |
What Bedrock manages for you:
- Inference infrastructure (GPUs, scaling, availability)
- Model versioning
- Security and compliance (data in your VPC, encryption in transit and at rest)
- Logging and monitoring with CloudWatch
- Automatic rate limiting and throttling
What Bedrock Does Not Do
- It does not train custom models from scratch. It allows fine-tuning of some models, but not complete training.
- It does not guarantee constant latency. Response times vary between 1-30 seconds depending on the model and prompt length.
- It does not eliminate hallucinations. Models still generate incorrect information. Bedrock provides guardrails, but verification responsibility is yours.
Basic API Integration
Model Invocation
The most direct way to use Bedrock is the InvokeModel API:
import boto3
import json
bedrock_runtime = boto3.client(
service_name='bedrock-runtime',
region_name='us-east-1'
)
def invoke_claude(prompt, max_tokens=1024):
body = json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": max_tokens,
"messages": [
{
"role": "user",
"content": prompt
}
],
"temperature": 0.3
})
response = bedrock_runtime.invoke_model(
modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
body=body,
contentType="application/json",
accept="application/json"
)
result = json.loads(response['body'].read())
return result['content'][0]['text']
# Usage example
response = invoke_claude(
"Analyze this error log and suggest the root cause: "
"ERROR 2025-10-01 14:23:45 ConnectionPool exhausted, "
"max connections: 50, active: 50, waiting: 127"
)
print(response)Streaming for Long Responses
For long responses, streaming reduces perceived latency:
def invoke_claude_streaming(prompt, max_tokens=2048):
body = json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": max_tokens,
"messages": [
{"role": "user", "content": prompt}
],
"temperature": 0.3
})
response = bedrock_runtime.invoke_model_with_response_stream(
modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
body=body,
contentType="application/json",
accept="application/json"
)
full_response = ""
for event in response['body']:
chunk = json.loads(event['chunk']['bytes'])
if chunk['type'] == 'content_block_delta':
text = chunk['delta'].get('text', '')
full_response += text
print(text, end='', flush=True)
return full_responseConverse API: Unified Interface
The Converse API is the recommended way to interact with Bedrock since 2024. It provides a consistent interface regardless of the model:
def converse_with_model(messages, model_id, system_prompt=None):
kwargs = {
"modelId": model_id,
"messages": messages,
"inferenceConfig": {
"maxTokens": 2048,
"temperature": 0.3,
"topP": 0.9
}
}
if system_prompt:
kwargs["system"] = [{"text": system_prompt}]
response = bedrock_runtime.converse(**kwargs)
return response['output']['message']['content'][0]['text']
# Usage with multi-turn conversation
messages = [
{
"role": "user",
"content": [{"text": "I am the CTO of a SaaS startup with 50K users. We are on AWS."}]
},
{
"role": "assistant",
"content": [{"text": "Understood. I can help with architecture, costs, or operations on AWS."}]
},
{
"role": "user",
"content": [{"text": "Our RDS database is at 85% CPU. Options?"}]
}
]
response = converse_with_model(
messages=messages,
model_id="anthropic.claude-3-5-sonnet-20241022-v2:0",
system_prompt="You are a senior AWS architect. Respond with concrete options and estimated costs."
)Bedrock Agents: AI That Executes Actions
What Bedrock Agents Are
Bedrock Agents go beyond generating text. They can reason about a task, decide which actions to execute, and call external APIs to complete the work. The agent receives a natural language instruction, decomposes the task into steps, executes each step by calling available tools, and returns the result.
User: "Find orders for customer ACME from the last month
and generate a summary with the total billed amount"
Agent:
1. Calls orders API with filter: customer=ACME, date=last month
2. Receives the list of orders
3. Calculates the total billed amount
4. Generates a natural language summary
5. Returns the result to the userConfiguring an Agent
A Bedrock Agent is composed of:
- Instructions: The system prompt that defines the agent鈥檚 behavior.
- Action Groups: The tools the agent can use, defined as API schemas (OpenAPI).
- Knowledge Bases: Knowledge bases (RAG) the agent can query.
import boto3
bedrock_agent = boto3.client('bedrock-agent', region_name='us-east-1')
# Create agent
response = bedrock_agent.create_agent(
agentName='customer-support-agent',
agentResourceRoleArn='arn:aws:iam::123456789:role/bedrock-agent-role',
foundationModel='anthropic.claude-3-5-sonnet-20241022-v2:0',
instruction="""
You are a technical support agent for a SaaS platform.
Your goal is to help customers resolve technical issues.
Rules:
- Check the customer's history before responding.
- If the issue requires escalation, create a ticket in the system.
- Never share information from one customer with another.
- Always respond in the customer's language.
""",
idleSessionTTLInSeconds=1800
)Action Group with OpenAPI schema:
openapi: 3.0.0
info:
title: Customer API
version: 1.0.0
paths:
/customers/{customerId}/orders:
get:
summary: Get customer orders
operationId: getCustomerOrders
parameters:
- name: customerId
in: path
required: true
schema:
type: string
- name: dateFrom
in: query
schema:
type: string
format: date
- name: dateTo
in: query
schema:
type: string
format: date
responses:
'200':
description: List of orders
/tickets:
post:
summary: Create support ticket
operationId: createTicket
requestBody:
required: true
content:
application/json:
schema:
type: object
properties:
customerId:
type: string
subject:
type: string
priority:
type: string
enum: [low, medium, high, critical]
description:
type: stringThe agent decides when and how to call each endpoint based on the conversation with the user. You do not need to program the decision logic: the foundation model handles it.
RAG with Bedrock Knowledge Bases
What RAG Is and Why It Matters
RAG (Retrieval-Augmented Generation) is the pattern that allows an AI model to answer questions about your company鈥檚 data. Instead of relying solely on the knowledge it was trained on, the model searches for relevant information in your documents and uses it as context to generate the response.
Without RAG, the model will hallucinate about data it does not know. With RAG, the model bases its responses on real documents that you control.
Knowledge Base Configuration
Bedrock Knowledge Bases manages the complete RAG pipeline:
- Ingestion: Loads documents from S3 (PDF, Word, HTML, plain text, CSV).
- Chunking: Splits documents into configurable-size fragments.
- Embedding: Converts each fragment into a numerical vector using an embedding model.
- Storage: Stores vectors in a vector database (OpenSearch Serverless, Aurora PostgreSQL with pgvector, Pinecone).
- Query: When the user asks a question, Bedrock finds the most relevant fragments and includes them in the model鈥檚 prompt.
# Create Knowledge Base
response = bedrock_agent.create_knowledge_base(
name='product-documentation',
description='Product technical documentation',
roleArn='arn:aws:iam::123456789:role/bedrock-kb-role',
knowledgeBaseConfiguration={
'type': 'VECTOR',
'vectorKnowledgeBaseConfiguration': {
'embeddingModelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0'
}
},
storageConfiguration={
'type': 'OPENSEARCH_SERVERLESS',
'opensearchServerlessConfiguration': {
'collectionArn': 'arn:aws:aoss:us-east-1:123456789:collection/my-collection',
'vectorIndexName': 'product-docs-index',
'fieldMapping': {
'vectorField': 'embedding',
'textField': 'text',
'metadataField': 'metadata'
}
}
}
)Chunking Strategies
How you split documents directly affects response quality:
Fixed-size chunking: Splits into fragments of N tokens (300-500 is the optimal range for most cases). Simple but can cut information in the middle of an idea.
Hierarchical chunking: Creates parent chunks (complete sections) and child chunks (paragraphs). Search operates on child chunks, but context includes the parent chunk. Improves response coherence.
Semantic chunking: Splits at points where meaning changes. More complex to implement but produces more coherent fragments.
# Configure data source with hierarchical chunking
response = bedrock_agent.create_data_source(
knowledgeBaseId='kb-123456',
name='technical-docs',
dataSourceConfiguration={
'type': 'S3',
's3Configuration': {
'bucketArn': 'arn:aws:s3:::my-docs-bucket',
'inclusionPrefixes': ['docs/']
}
},
vectorIngestionConfiguration={
'chunkingConfiguration': {
'chunkingStrategy': 'HIERARCHICAL',
'hierarchicalChunkingConfiguration': {
'levelConfigurations': [
{'maxTokens': 1500}, # Parent chunks
{'maxTokens': 300} # Child chunks
],
'overlapTokens': 60
}
}
}
)Guardrails: AI Control
Bedrock Guardrails
Guardrails allow you to define policies the model must follow. They filter unwanted content, block sensitive topics, and verify that responses meet your criteria.
Types of guardrails:
- Content filters: Block violent, sexual, discriminatory, or hateful content.
- Denied topics: Prevent the model from discussing specific topics (competitors, non-public financial information, political opinions).
- Word filters: Block specific words or phrases in the response.
- Sensitive information filters: Detect and redact PII (names, emails, phone numbers, credit card numbers).
- Contextual grounding: Verify the response is based on provided context (reduces hallucinations in RAG).
# Create guardrail
response = bedrock_agent.create_guardrail(
name='production-guardrail',
description='Production guardrail',
contentPolicyConfig={
'filtersConfig': [
{'type': 'SEXUAL', 'inputStrength': 'HIGH', 'outputStrength': 'HIGH'},
{'type': 'VIOLENCE', 'inputStrength': 'HIGH', 'outputStrength': 'HIGH'},
{'type': 'HATE', 'inputStrength': 'HIGH', 'outputStrength': 'HIGH'}
]
},
topicPolicyConfig={
'topicsConfig': [
{
'name': 'competitor-information',
'definition': 'Information about competitor products or services',
'examples': [
'What do you think about [Competitor X]?',
'Is your product better than [Competitor Y]?'
],
'type': 'DENY'
}
]
},
sensitiveInformationPolicyConfig={
'piiEntitiesConfig': [
{'type': 'EMAIL', 'action': 'ANONYMIZE'},
{'type': 'PHONE', 'action': 'ANONYMIZE'},
{'type': 'CREDIT_DEBIT_CARD_NUMBER', 'action': 'BLOCK'}
]
}
)Bedrock Costs
Billing Model
Bedrock charges per token processed. Billing separates input tokens (prompt) and output tokens (response):
| Model | Input (1M tokens) | Output (1M tokens) |
|---|---|---|
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3 Haiku | $0.25 | $1.25 |
| Llama 3.1 70B | $2.65 | $3.50 |
| Llama 3.1 8B | $0.22 | $0.22 |
| Titan Text Express | $0.20 | $0.60 |
Provisioned Throughput
For predictable workloads, Provisioned Throughput offers reserved capacity:
- Guarantees a minimum number of tokens per minute.
- Fixed hourly cost (independent of usage).
- Useful if you process more than 100,000 requests per day with the same model.
Monthly Cost Example
| Use Case | Model | Requests/Month | Avg Tokens (in/out) | Monthly Cost |
|---|---|---|---|---|
| Support chatbot (low) | Haiku | 10,000 | 500/200 | $3.75 |
| Support chatbot (medium) | Sonnet | 50,000 | 1,000/500 | $525 |
| Document analysis | Sonnet | 5,000 | 5,000/2,000 | $225 |
| Enterprise RAG | Sonnet + embedding | 20,000 | 2,000/1,000 | $420 |
Cost optimization:
- Use the smallest model that works: For text classification or simple extraction, Haiku is 12 times cheaper than Sonnet and responds in milliseconds.
- Reduce the prompt: Every unnecessary token in the system prompt multiplies by the number of requests. A 500-token system prompt with 100,000 requests/month costs an additional $150 with Sonnet.
- Cache responses for repetitive queries: If 20% of questions repeat, a cache in ElastiCache or DynamoDB reduces costs by 20%.
Production Architecture
Complete Pattern
Client -> API Gateway -> Lambda -> Bedrock Runtime
|
+----------+----------+
| | |
Guardrails Knowledge Agent
Base Actions
| |
OpenSearch Lambda
(RAG) (tools)Components:
- API Gateway: Entry point with authentication and rate limiting.
- Lambda: Orchestrates the Bedrock call, manages conversation context, and applies business logic.
- Bedrock Runtime: Runs model inference.
- Guardrails: Filters input and output.
- Knowledge Base: Provides RAG context.
- Agent Actions: Executes tools (database queries, internal API calls).
Conversation State Management
Bedrock does not maintain state between calls. Your application must manage the conversation history:
import boto3
from datetime import datetime
dynamodb = boto3.resource('dynamodb')
conversations_table = dynamodb.Table('conversations')
def get_conversation_history(session_id, max_messages=20):
response = conversations_table.get_item(
Key={'session_id': session_id}
)
if 'Item' in response:
messages = response['Item'].get('messages', [])
return messages[-max_messages:]
return []
def save_message(session_id, role, content):
conversations_table.update_item(
Key={'session_id': session_id},
UpdateExpression='SET messages = list_append(if_not_exists(messages, :empty), :msg), updated_at = :now',
ExpressionAttributeValues={
':msg': [{'role': role, 'content': content, 'timestamp': datetime.utcnow().isoformat()}],
':empty': [],
':now': datetime.utcnow().isoformat()
}
)Common Mistakes When Integrating AI with Bedrock
Mistake 1: Not Managing Latency
A call to Claude 3.5 Sonnet with a 2,000-token prompt takes 3-8 seconds. If your API has a 3-second timeout, requests will fail intermittently. Configure generous timeouts (30-60 seconds) and use streaming to improve perceived experience.
Mistake 2: System Prompts That Are Too Long
Every system prompt token is processed on every invocation. A 3,000-token system prompt with 50,000 requests/month costs an additional $450 in input tokens alone with Sonnet. Be concise.
Mistake 3: Not Implementing Fallbacks
If Bedrock returns an error (throttling, timeout, model error), your application should not show a generic error. Implement:
- Retry with exponential backoff for transient errors.
- Fallback to a lighter model if the primary model is unavailable.
- Predefined responses for frequently asked questions that do not need AI.
Mistake 4: Ignoring Prompt Security
Users may attempt to inject instructions into the prompt to manipulate the model鈥檚 behavior. Use guardrails, validate user input, and never include sensitive data in the prompt that you do not want the model to potentially repeat.
Conclusion
AWS Bedrock simplifies the integration of generative AI into production applications. It eliminates ML infrastructure management, keeps data within your AWS account, and provides security tools like guardrails and native encryption.
But it is not magic. Models hallucinate, latency is variable, and costs accumulate fast if you do not optimize prompts and models. The key is to start simple (an API call with the right model), add RAG when you need answers based on your own data, and build agents only when the task requires executing actions.
If you are evaluating how to integrate AI into your product or need help designing an AI architecture on AWS, our team has direct experience with Bedrock in production. Request a free audit to evaluate AI opportunities in your application.