I’ve only attended re:Invent in person once. I quickly realized that apart from the mountain of swag you get to bring home, there’s very little advantage to being there in person. It’s in Vegas. There are more than fifty thousand attendees. Do I need to say more?

All the sessions are recorded with high quality audio/video and available via both the AWS web site and YouTube. Since then, I’ve cleared space in my diary in the week following re:Invent and settled down to watch the most interesting sessions at 1.5X speed. To save you the trouble of doing the same, here’s my take on the most interesting sessions.

If you’re looking for a comprehensive list of all the new announcements, then head over to the AWS News Blog.

Keynotes

Peter DeSantis - SVP AWS

YouTube

Focus on infrastructure. Skip unless you’re interested in low level details of EC2 instances, networking, Nitro hardware, Lambda infrastructure.

  • New instance types
  • AWS internally using a new network protocol (SRD) as a replacement for TCP. Optimized for the AWS network topology, particularly the ability to use multiple paths within the same network connection. Big improvements for tail latency.
  • EBS now using SRD, big improvement in write latency.
  • The standard ENA network adapter (the interface exposed by all EC2 instances) can now transparently use SRD regardless of the API you use to talk to the adapter. A simple configuration change enables “ENA Express”.
  • AWS Lambda infrastructure internal performance optimization
    • Reducing cold start latency
    • History from original EC2 instance per customer architecture to Firecracker micro-VM
    • Announcing Lambda SnapStart - taking a Firecracker snapshot at the end of initialization so a cold start can restore from the snapshot rather than running initialization again.
    • Effectively removes cold start penalty
    • Some care needed for things that need to be different per instance - such as random number seeds.
    • Using layered snapshot system so that common parts of runtime (e.g. OS, language runtime) can be stored once.
    • Profile how blocks of snapshot are accessed when snapshot boots and use that to do predictive loading for future boots
    • For now, Java (JVM) runtimes only

Adam Selipsky - CEO AWS

YouTube

The high level AWS pitch you’ve heard many times before with a few fun facts and new announcements. I wouldn’t bother with this one.

Fun facts

  • 80-90% of startups run on AWS
  • Scale of data managed by AWS
    • Expedia runs 600 billion AI predictions a year
    • Samsung handles 80 thousand user requests per second
    • Pinterest stores 1 Exabyte of data
    • Fortnite backend has a fleet of tens of thousands of game servers handling 13 million active players
    • Nielsen processes hundreds of billions of ad events per day
    • Dana-Farber spun up an HPC cluster with 5.6 million Xeon cores

New Announcements

  • OpenSearch Serverless in preview
  • Aurora Zero ETL integration with Redshift
  • Redshift integration for Apache Spark
  • Amazon DataZone - automated data catalog management
  • ML powered forecasting in QuickSight Q
  • Container runtime threat detection for GuardDuty
  • Amazon Security Lake - ingest data from 50 partners in OCSF format
  • Inf2 instances for EC2 - dedicated instances for ML inferences
  • Hpc6id instances for data and memory intensive workloads
  • AWS SimSpace Weaver - orchestration for spatial simulations distributed across instances
  • AWS Connect contact center as a service adding ML based capacity forecasting and agent performance management metrics
  • AWS Supply Chain - vertical application for supply chain management
  • AWS Clean Rooms - collaboration on data between low trust partners
  • Amazon Omics - vertical application for analyzing genomic and similar data sets

Werner Vogels - CTO AWS

The World is Asynchronous

YouTube

Entertaining, thought provoking, well worth watching.

  • Lovely “The Matrix” parody intro showing us the horrors of a truly synchronous world
  • The real world is asynchronous - non-blocking parallelism
  • Synchronous programming is a simplification to make things easier, but it’s an illusion - at the lowest hardware level it’s all asynchronous
  • Synchrony leads to tightly coupled systems; asynchrony leads to loosely coupled systems
  • Loosely coupled systems are easier to evolve, more fault tolerant, and have fewer dependencies
  • S3 went from 15 -> 235 microservices
  • Newly published - AWS Distributed Computing Manifesto from 1998, allthingsdistributed.com
  • Workflows let us build applications from loosely coupled components
  • Customers do the unexpected - using STEP for map-reduce pattern
  • Announcing STEP distributed map - up to 10000 parallel invocations
  • Event driven architecture leads to loosely coupled systems. Three main patterns
    • Point to point
    • Publish / Subscribe
    • Event streaming
  • Pushing EventBridge as central event bus that handles all of these patterns
  • “All complex systems that work evolved from a simple system that worked” - Gall’s law
  • Announce AWS Application Composer - visual canvas to make it easy to compose serverless architecture, maintain model of architecture, output CloudFormation
  • Events are composable - analogy with Unix
  • Can we compose AWS services?
  • Announcing Amazon EventBridge Pipes - build advanced integrations in minutes, with built-in filtering and manipulation of events
  • Event driven systems enable global scale
  • DynamoDB handles 10 trillion requests per day
  • Global tables implementation - local reads, local writes active-active architecture
    • Entirely asynchronous approach
    • Uses DynamoDB streams to feed replication service
    • Replication service is multiple replicators each using an SQS queue plus workers
    • Common pattern - change data capture + asynchronous coupling + self-healing replicators (see the sketch after this list)
  • Amazon builders library - 2 new articles released
  • Announcing Amazon CodeCatalyst
    • removing undifferentiated heavy lifting needed to get app off the ground
    • based on project blueprints
    • set up CI/CD pipelines
    • easy to switch between environments
    • team collaboration features
    • integrations with GitHub, Jira, Visual Studio, Slack, …
  • 3D will be as big as video
    • Photogrammetry to create 3D product models
    • ML advances allow you to create model from as few as 12 photos
    • Investing in O3DE open source 3D engine
    • Scanning of environments to put objects in - plug for Matterport
    • 3D city simulation
  • Customer demo - Unreal Engine
    • Matrix Awakens demo - 16 sq km, 4k vehicles, 35K meta-humans, billions of polygons
    • Twinmotion - simulation playback in browser powered by AWS GPU instances
    • RealityScan scanning app for phones using AWS backend
  • AWS Ambit Scenario Designer
    • Open source tools to generate 3d content at scale
    • Generate a 3D city from OpenStreetMap data
  • AWS SimSpace Weaver
    • Spatial subdivision of simulated world into a grid
    • Grid tiles distributed over a set of instances
    • Handles objects that cross boundaries between tiles - handing off responsibility to another instance
  • Quantum Simulation
    • Some processes (e.g. simulating a penicillin molecule) that would need 2^N bits of memory to simulate on a classical computer need only N qubits on a quantum computer
    • “Curiosity with Werner Vogels” - video on current state of quantum computers and future possibilities
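
The global tables pattern above (change data capture feeding self-healing replicators) is easy to approximate in your own designs. Here is a minimal sketch, assuming a Lambda function subscribed to a DynamoDB stream and a hypothetical replication queue that downstream workers drain:

```python
import json
import os

import boto3

# Hypothetical queue drained by a separate replicator worker.
REPLICATION_QUEUE_URL = os.environ["REPLICATION_QUEUE_URL"]

sqs = boto3.client("sqs")


def handler(event, context):
    """Lambda handler subscribed to a DynamoDB stream.

    Forwards each change record to SQS so replication happens asynchronously
    and can be retried independently of the original write path.
    """
    for record in event["Records"]:
        change = {
            "event_name": record["eventName"],  # INSERT / MODIFY / REMOVE
            "keys": record["dynamodb"]["Keys"],
            # NewImage is only present if the stream view type includes it.
            "new_image": record["dynamodb"].get("NewImage"),
            "sequence_number": record["dynamodb"]["SequenceNumber"],
        }
        sqs.send_message(
            QueueUrl=REPLICATION_QUEUE_URL,
            MessageBody=json.dumps(change),
        )
```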

Breakout Sessions

So many breakout sessions. No one could possibly look at them all. This is a random sampling of those with a title interesting enough to get me to press play. To cut the list down further I’m mostly focusing on 300 and 400 level talks.

Building next-gen applications with event-driven architectures

YouTube

Entertaining presenter. Worth a watch.

  • Interesting ideas on the many different forms of coupling
  • Compare and contrast synchronous request-response with asynchronous patterns looking at pros/cons and forms of coupling
  • EventBridge for Orchestration
  • Sparse events vs full-state descriptions pros/cons
  • STEP functions for Choreography
  • Event de-duplication using idempotency
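
The last point is worth making concrete. A common approach is a conditional write keyed on the event id, so a duplicate delivery is detected before any side effects run. A minimal sketch, assuming a hypothetical DynamoDB table named processed-events with partition key event_id:

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

IDEMPOTENCY_TABLE = "processed-events"  # hypothetical table, partition key: event_id


def process_once(event_id: str, payload: dict) -> None:
    """Record the event id with a conditional put; only process the first delivery."""
    try:
        dynamodb.put_item(
            TableName=IDEMPOTENCY_TABLE,
            Item={"event_id": {"S": event_id}},
            ConditionExpression="attribute_not_exists(event_id)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return  # duplicate delivery - already processed
        raise
    handle_business_logic(payload)


def handle_business_logic(payload: dict) -> None:
    ...  # your actual processing
```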

Reliable scalability: How Amazon.com scales in the cloud

YouTube

Interesting to see the evolution of amazon.com. You’ll have seen all the patterns used by present day amazon.com before.

  • History of amazon.com architecture
  • 1995: Single C executable monolith app server, single database
  • 1998: Split data into multiple databases
  • 2000: Service oriented architecture
  • 2022: Prime day - running on AWS. 100m DynamoDB requests/second, 70 million SQS requests/second, 11 trillion EBS requests during the day, total of 288 billion Aurora transactions
  • Current architecture: thousands of microservices with crazy complex dependency graph
  • Examples of architecture related to well architected pillars
  • Example 1: IMDB from monolith to serverless microservices
    • GraphQL queries from the front end
    • Resolvers distribute request across the relevant microservices
    • Implementation is CloudFront with WAF -> load balancer -> Lambda gateway -> Lambda microservices
    • Gateway pulls schema from S3 bucket, polls for updates
    • Architecture matches organizational structure
    • 800K rpm to Lambda
    • Autoscaling provisioned concurrency
    • WAF instantly resolved issues with malicious bots
  • Example 2: Global Ops Robotics cell based architecture
    • Warehouse management system - microservice architecture
    • Services use cell based architecture with data partitioned by fulfillment center
    • Redundant fulfillment centers in each geographic region, so if cell serving one fulfillment center goes down, there are others to fulfill order
    • Important to make sure that each service maps fulfillment centers to cells in a consistent way so that failures remain constrained to small number of centers.
    • Used a central FC assignment service built on DynamoDB
  • Example 3: Amazon relay app moves to multi-region architecture
    • Manages movement of trucks between warehouses
    • App used by truckers
    • Relay mobile gateway = multiple AWS API gateways with DNS routing to gateway
    • Multiple backend modules each using Lambda + DynamoDB
    • Nothing shared between gateway and backend modules - independent deploys
    • Multi-region trigger was SNS region wide outage
    • Used DynamoDB global tables and deployed backend modules + API gateway in each region. Set up Route 53 with a latency-based routing policy. Active/active architecture
    • Test resilience - shift 100% of traffic to one region. Less than 10 minutes to failover, small increase in latency.
  • Example 4: Amazon classification platform uses shuffle sharding
    • Part of AWS product catalog - classifying products
    • Used by 50 programs across Amazon, 10K machine learning models, 100K rules, 100 model updates per day, millions of product updates per hour
    • Model updates grouped into batches of 200 items, multiple ML models run against each item, pool of ECS workers each applying ~30 models
    • Models are pulled on demand from S3 and cached locally, results written to DynamoDB - millions per second
    • Using shuffle sharding to assign models to workers to minimize the impact of a poison model (see the sketch after this list)
    • Each shard of workers has its own SQS queue. A Lambda-based router assigns work to shards, also using back-pressure signals from the queues to balance work
  • Example 5: Amazon Search chaos engineering
    • 84K requests/second during prime day
    • 40 backend services using S3, API gateway, ECS, DynamoDB, SNS, Kinesis
    • Dedicated resilience team
    • Chaos experiments have to stay within SLO error budget - stop if burn rate too high
    • Using fault injection simulator with purpose built chaos orchestrator to make it simple to use across their 40 services
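
Shuffle sharding (Example 4) is hard to picture from a bullet point: each model is deterministically assigned its own small, quasi-random subset of workers, so a poison model only takes down its own shard and most other models share at most a worker or two with it. A toy sketch of the idea, not Amazon's implementation:

```python
import hashlib
import itertools

WORKERS = [f"worker-{i}" for i in range(16)]  # the full worker fleet
SHARD_SIZE = 4                                # workers assigned to each model


def shard_for(model_id: str) -> list[str]:
    """Deterministically pick SHARD_SIZE workers for a model.

    Each model hashes to one of the C(16, 4) = 1820 possible worker
    combinations, so two models are unlikely to land on the same full set.
    """
    combos = list(itertools.combinations(WORKERS, SHARD_SIZE))
    index = int(hashlib.sha256(model_id.encode()).hexdigest(), 16) % len(combos)
    return list(combos[index])


print(shard_for("model-a"))
print(shard_for("model-b"))
```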

SaaS microservices deep dive: Simplifying multi-tenant development

YouTube

Engaging presenter. Interesting to see current best practices for implementing multi-tenant applications.

  • Base architecture - CloudFront -> API Gateway -> independent microservices
  • Tenant Concerns: Tiering, logging, metrics, billing, identity, isolation
    • Front end: authentication, routing, feature flags
    • API Gateway: authorization, throttling, caching
    • Business logic: authorization, metrics, logging, metering
    • Database: access, partitioning, isolation, backup/restore
    • Infrastructure: Provisioning, isolation, maintenance, tenant lifecycle
  • Need tenant id throughout system. Determining tenant context needs to be part of login flow. Include as claim in JWT token.
  • Reusable helpers for getting tenant context, error reporting, metering, tiering, …
  • Partitioning != isolation
    • Partitioning = tagging records by tenant id or having a separate DynamoDB table for each tier
    • Isolation = setting up request-specific IAM credentials based on tenant context
    • Could have tenant-specific IAM roles and then use Cognito identity pool role mapping. Easy to understand but doesn’t scale to lots of tenants
    • The other option is to combine multiple IAM policies - e.g. an identity policy and a session policy, where only the intersection is allowed. The identity policy is static, the session policy is defined at runtime and could use a DynamoDB request condition to restrict queries to a specific tenant (see the sketch after this list).
  • Infrastructure and DevOps
    • Impact of adding a new tenant on DevOps
    • May need infrastructure changes
    • Support for moving tiers, enable/disable, trial->paid
    • Removing tenant if customer leaves
    • Use Lambda layers to have one copy of shared code for multiple lambda functions
    • Use sidecar containers to share background processes with multiple containers
    • Use Lambda extensions for Envoy like interceptor patterns
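
The session-policy approach to tenant isolation mentioned above is worth seeing concretely. A minimal sketch, assuming a hypothetical shared tenant-data table partitioned by tenant id and a pre-existing role for the service to assume; the credentials returned can only touch the caller's tenant because the effective permissions are the intersection of the role's identity policy and the session policy passed at runtime:

```python
import json

import boto3

sts = boto3.client("sts")


def tenant_scoped_credentials(tenant_id: str, role_arn: str) -> dict:
    """Assume the service role with a session policy restricted to one tenant."""
    session_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["dynamodb:Query", "dynamodb:GetItem"],
                "Resource": "arn:aws:dynamodb:*:*:table/tenant-data",  # hypothetical table
                "Condition": {
                    # Only items whose partition key is this tenant's id.
                    "ForAllValues:StringEquals": {"dynamodb:LeadingKeys": [tenant_id]}
                },
            }
        ],
    }
    resp = sts.assume_role(
        RoleArn=role_arn,
        RoleSessionName=f"tenant-{tenant_id}",
        Policy=json.dumps(session_policy),  # session policy, intersected with the role's policy
    )
    return resp["Credentials"]
```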

How to understand and improve resiliency for serverless systems

YouTube

Mostly a plug for resilience hub.

  • resiliency = ability to withstand partial and intermittent failures
  • If serverless is highly available automatically what is left to do?
  • Four capabilities: anticipate, monitor, respond, learn
  • RTO and RPO
  • AWS Resilience Hub - give it your architecture (e.g. via CloudFormation) and it will tell you if you will meet your RTO/RPO goals.
  • Anticipate example: provisioned concurrency for Lambda
  • Resilience hub: separate analysis for impact of regional, AZ, infrastructure and application failures
  • monitoring example: CloudWatch metrics. Important to define your own business/customer metrics as well as the built ins.
  • resilience hub will give recommendations for the cloudwatch metrics you should have in place
  • Responding example: Automated responses to events using EventBridge
  • Resilience hub provides recommendations to help build runbooks
  • Learning example: chaos engineering
  • Resilience hub recommendations for fault injection simulator tests with test templates that can be imported into FIS or run directly from resilience hub

Operating highly available Multi-AZ applications

YouTube

Good overview of the typical journey to HA and then focuses on the AWS pattern of isolating the infrastructure in each AZ so that your fallback response to any problem is taking an entire AZ out of service. Finishes up as a plug for Route 53 Application Recovery Controller Zonal Shift which automates the process for taking an AZ out of service.

  • Start with identifying and mitigating single sources of failure
    • Single server
    • Single deployment
    • Single config change
    • Local network failure
    • Infrastructure failure
    • Power Failure
    • Flood
    • Storage Limits
    • Heap mem exceeded
    • TLS cert expiry
  • Architecture evolves from single server -> multiple servers -> multiple cells -> multiple AZs
  • At most advanced level start looking for possible correlated failures and setup cells differently (e.g. have TLS cert for each cell expire on a different date, or different heap memory limits)
  • Others remain: DB schema, storage, dependencies
  • Example replicated system: Route 53 with 4 name servers per domain
  • Hard failures are easy to detect and resolve (e.g. load balancer health checks)
  • Grey failures are difficult
    • Bug causing repeated crash/restart
    • Bad config change
    • Reduced capacity
    • Intermittent dependency failure
    • Packet loss/latency
  • Grey failures are subjective - how much extra latency is OK
  • General approach - try and turn grey failures into hard failures
    • Strategy 1: Deep health checks - properly exercise app logic in health check. Cheap to do but risk of overreaction and unlikely to catch all grey failures. Builders library has article on implementing health checks.
    • Strategy 2: ELB minimum healthy targets. New ELB feature. Can take an entire AZ out of service when too few healthy targets remain in that AZ to handle its share of the traffic.
    • Strategy 3: Detect and shift away from zone. Turns wide array of grey failures into single hard failure - AZ is out of service.
  • Rest of talk about how to do strategy 3
  • First need AZs to be independent. By default ELB instance in one AZ will distribute load across instances in all AZs. Reasonable out of box default that you need to turn off if implementing this strategy.
    • Means traffic is balanced between AZs by DNS, with each AZ’s load balancer node only sending to targets in its own AZ. Works best for bigger applications (lots of instances per AZ) with equal capacity in each AZ. Needs enough capacity that the system will cope with the loss of an entire AZ.
    • Need to setup monitoring so all your metrics are available on a per AZ basis. Can then compare across AZs, see which is unhealthy.
    • Use canaries - e.g. CloudWatch synthetics. ELB provides per AZ domain names that allows you to point canaries at specific AZs. One canary per AZ and one per region overall.
  • New feature - Route 53 Application Recovery Controller Zonal shift
    • Simple API call to shift load away from one zone temporarily. Includes an expiry time. (See the sketch after this list.)
    • Included in ALB/NLB (if cross-zone turned off). No additional cost.
    • Can also do from console
    • Need to be sure other AZs are healthy and have spare capacity
    • Works by removing AZ from DNS
    • Gives time to diagnose and fix problem
    • Can extend shift or cancel
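
Starting a zonal shift is a single API call. A rough sketch of what that looks like from boto3, assuming a hypothetical NLB with cross-zone load balancing turned off; the client and parameter names are as I recall them from the launch and should be checked against your SDK version:

```python
import boto3

# Zonal shift is exposed through the ARC zonal shift API (assumed client name).
client = boto3.client("arc-zonal-shift")

response = client.start_zonal_shift(
    # Hypothetical load balancer ARN; must be a supported resource type.
    resourceIdentifier="arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/net/my-nlb/abc123",
    awayFrom="use1-az1",   # the Availability Zone ID to shift traffic away from
    expiresIn="30m",       # shifts always expire; extend or cancel as needed
    comment="Elevated 5xx rate in use1-az1",
)
print(response)
```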

Beyond five 9s: Lessons from our highest available data planes

YouTube

Updated version of a talk that is given every year. Lots more details in the builders library. Recommended.

  • Rethinking the nines model
    • Coarse, not great for systems design. Doesn’t fit cell based systems.
    • Single summary statistic - doesn’t give great insight
    • Instead think about what really matters: how long could an interruption last (the RTO)? How often could it happen? That is, the Rate and Expected Duration (RED)
    • Plot rate vs expected duration, minutes per year vs expected duration. Add additional lines for 100% impact, 50% impact, 5% impact, etc.
    • Nines model comes from mech and civil engineering - physical components have different failure characteristics with no recovery
    • Nines model would suggest better resiliency using multiple DNS providers but in reality you’re now dependent on two providers being up.
  • Compartmentalization and blast radius
    • Regional and AZ isolation (as in previous talk)
    • Multiple cells within each AZ
    • Then add shuffle sharding across cells
  • Simplicity and survival of the fittest
    • What is simple? Not the fewest moving parts (a unicycle), but the fewest needed parts (a bike)
    • Systems need to evolve over time, eliminate faults over time, best patterns reused
  • Constant work and minimizing change
    • Load spikes have all kinds of nasty effects
    • Reducing dynamism in systems makes them simpler
    • Running system at maximum work regardless of load can make it more reliable
    • Classic example: instead of pushing changes to Route 53 as they happen, push the full state on a fixed schedule (see the sketch after this list)
  • Testing as time travel
    • “There is no compression algorithm for experience”
    • For testing no substitute for having lots of tests and running them often
    • Automated reasoning to prove correctness
    • Lots and lots of metrics
  • Operational safety
    • Most incidents triggered by changes
    • Safe deployment process is vital - staged deployments at server, AZ, region level. Automated rollback on failure.
    • Cattle vs pets - systems too big to manage by hand.
    • Automation first. Any lesson learned is turned into a tool/automation
  • Technical fearlessness
    • “culture eats strategy for breakfast”
    • Elite teams insist on high standards. No budget for quality, quality is embedded in their approach.
    • No dumb questions, anyone can be elite with the right support, criticism welcomed
    • High standards -> maintainability/operability -> elite teams -> repeat
    • Low standards -> unmaintainable -> unhappy teams -> repeat
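
The constant work idea above is simple but counterintuitive, so here is a toy sketch of the shape of such a pusher (not any specific AWS system). The work done per cycle is the same whether there were zero changes or thousands, so a burst of changes cannot create a load spike in the push path:

```python
import time


def fetch_full_config() -> dict:
    """Read the complete desired state from the source of truth (stubbed here)."""
    return {"version": 42, "records": ["..."]}


def push_config(config: dict) -> None:
    """Overwrite the target with the full config, changed or not (stubbed here)."""
    print(f"pushed config version {config['version']}")


# Push the full state every cycle, whether or not anything changed.
while True:
    push_config(fetch_full_config())
    time.sleep(10)
```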

Scaling on AWS for your first 10 million users

YouTube

Updated version of talk given every year by developer advocates responsible for startups. Not very technical but helpful to see the overall story.

  • Journey for startup has changed drastically
  • Used to start with single EC2 instance, then split into layers, distribute across AZs
  • Frontends are now all rich clients based on frameworks like React and Flutter. Use AWS Amplify to take the code in your repo, transform it as needed and deploy to a hosted platform behind CloudFront.
    • continuous workflows, custom domain setup, feature branch deployments, atomic deployments, global availability
  • On backend most people now start with managed compute and managed database
  • APIs exposed via API gateway, ALB or AWS AppSync (GraphQL)
  • AWS App Runner - build/deploy/run containerized web apps and API servers
  • Database - start with SQL databases (Aurora serverless v2) until you need NoSQL
  • Gives you a single region, multi-AZ, scalable stack to start off with
  • Need to make changes to scale >10K users
  • Frontend: tuning, reducing backend calls, caching content more effectively
    • Need to measure before we can tune - CloudWatch/X-Ray
    • ML driven recommendations - DevOps Guru and CodeGuru
  • Database:
    • Add read replicas for read heavy traffic, add RDS proxy to manage connection from large number of instances.
    • Add caching - ElastiCache
  • Backend:
    • App Runner uses ECS Fargate. Limited to 200 concurrent requests per instance and 25 instances per service = 5000 concurrent requests
    • Can squeeze out more by tuning - looking at slow db queries, more caching.
  • Overall gets you to 100K maybe 1M users
    • Need more scale than App Runner can provide
    • DB writes start to become a bottleneck
    • Next step is to break system into multiple micro-services
    • Before that can look at DB federation. Keep monolithic app server tier but break DB into multiple DBs by function/purpose
    • At some point will need to look at sharding SQL or shifting to NoSQL
    • In backend tier may need to directly manage ECS rather than using app runner, look at different forms of compute, move from sync to async, look at event driven architectures

Architecting secure serverless applications

YouTube

Lots of recommendations for specific things you can do to get security right for serverless applications. QR codes with links to further reading embedded in each slide.

  • Shift responsibility for security to AWS, and shift left to developers in your org
  • Serverless = small pieces, loosely joined
  • Use IAM for access control rather than VPC
  • Use least privilege principles
  • Serverless means AWS handles more infrastructure security responsibilities
  • Lambda: be aware of sandbox reuse during warm start - be careful when storing sensitive data in memory or /tmp
  • shift left: defense in depth, secure each component (e.g. using IAM)
  • Lambda IAM: function policy controls who can invoke function, execution role controls what function can do
  • Start narrow (no privileges) and add stuff. Look carefully at any wildcard * in policy.
  • AWS SAM policy templates cover lots of common least privilege cases
  • New Lambda function ARN condition that can restrict permissions to a specific function ARN rather than anything running with the same role
  • SQS now supports attribute based access controls (using message attributes)
  • Separate secrets from code. Use AWS Secrets Manager, not environment variables. You can then restrict permission to load the secret to just your Lambda. Load the secret on init or on each invoke as appropriate (see the sketch after this list).
  • Validate untrusted event payloads. Validate before processing, before parsing. Use strict typing, add additional constraints to open ended types like strings.
  • ABAC makes management of large systems more scalable
    • Policy that requires function to be tagged on create
  • Permission boundaries allows controlled delegation of policy creation to developers
  • Can control Lambda function access to internet by attaching it to a VPC and then using something like AWS network firewall. Expensive/scaling implications - only do if you really need it.
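
For the secrets recommendation above, a minimal sketch of loading a secret once at function init rather than reading it from an environment variable. The secret name is hypothetical, and the execution role is assumed to have secretsmanager:GetSecretValue on just that secret (loading per invoke instead would pick up rotations sooner):

```python
import json

import boto3

secrets = boto3.client("secretsmanager")

# Loaded once per sandbox at init; warm invocations reuse it.
_DB_CREDENTIALS = json.loads(
    secrets.get_secret_value(SecretId="prod/orders/db")["SecretString"]  # hypothetical secret
)


def handler(event, context):
    user = _DB_CREDENTIALS["username"]
    # ... connect to the database with the loaded credentials ...
    return {"statusCode": 200}
```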

Designing event-driven integrations using Amazon EventBridge

YouTube

I’ve always preferred to use SNS/SQS as an event bus rather than EventBridge because it fits more cleanly into a multi-account architecture with one account per service. Publisher owns SNS topic, sets up IAM permission for subscriber account to access topic, subscriber owns SQS queue and subscribes to topic with whatever filtering rules it wants. Clean and simple. With EventBridge it looks like the publisher has to do everything.

This talk goes deep into how to use EventBridge with multiple accounts, including how you can enable an appropriate division of responsibilities. Still seems like it could be a little messy but there’s enough here for me to take another look at EventBridge.

  • Event bus - account topologies
  • Services will typically have their own accounts, so you have to deal with a multi-account topology
  • Most common pattern, especially when starting out is single event bus (in dedicated account) connected to multiple service accounts.
    • Typically common DevOps team owns the event bus
    • Lots of coordination to manage rules
  • Other option is multi-bus, multi account
    • Each service has its own event bus owned by the service team
    • Service team manages permissions for other services that subscribe to events
  • In both topologies there is actually an event bus in every account. Receiving teams can add their own rules to their event bus.
  • Separation between subscription rules that control what a service is sending to another service and integration rules that control what a service does with incoming events.
  • Limit of 2000 rules per event bus
  • Same content on coupling as “Building next-gen applications with event-driven architectures”
  • An event is
    • A signal that a system’s state has changed
    • An immutable object
    • A contract for integrating systems
  • Event types: notification events, state transfer events, domain events - Martin Fowler definitions
  • Roles and responsibilities
    • One logical publisher for each domain event
    • Consistent boundary between bounded contexts
    • Contract between producers and consumers represented as EventBridge schema
    • Producers responsible for event consistency, conforming to contract, backwards compatibility
    • Producer needs to know who is subscribing to what version of an event
    • Producers should make no assumptions about how consumers are using events
    • Consumers are responsible for specifying events they want to receive, rate at which they process them
  • Implementing in EventBridge
    • Producer defines schemas
    • Producer defines policies for who can publish events and who can create rules
    • Producer creates events and owns archive of published events
    • Consumer creates the rules that route events to the consumer’s destination (see the sketch after this list)
    • Consumer requests replay of archived events if needed
  • Working with sensitive data - could encrypt payload and use policies to control who has access to keys. Better to avoid including PII in events.
  • Need some conventions for rule naming to avoid multiple consumers clashing
    • e.g. subscriber domain + target + event source + event type
    • BUT 64 char limit for rule names
  • Event uniqueness
    • Consumers need to ensure idempotency
    • e.g. with the event id as the idempotency key
    • Lambda Powertools has utilities for recording the response from event processing and returning the duplicate response if invoked with a duplicate event
    • The Powertools approach only works with near-live events as it assumes an idempotency window of 5 minutes (the TTL for recorded responses)
    • If you have a DLQ or replay events from the archive you need an idempotency window much longer than 5 minutes
    • Assume that events will be duplicated and where possible handle in business logic
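
To make the consumer's side of the division of responsibilities concrete, here is a sketch of a consumer creating its own subscription rule with boto3. The bus name, naming convention, event source and queue are all hypothetical, and in a cross-account setup the bus's resource policy must already allow this account to manage rules:

```python
import boto3

events = boto3.client("events")

# Rule name follows a convention like <subscriber>-<target>-<source>-<event type>
# to avoid clashes between consumers (remember the 64 character limit).
RULE_NAME = "billing-invoiceq-orders-OrderPlaced"

events.put_rule(
    Name=RULE_NAME,
    EventBusName="orders-bus",  # hypothetical producer-owned bus
    EventPattern='{"source": ["com.example.orders"], "detail-type": ["OrderPlaced"]}',
    State="ENABLED",
)

events.put_targets(
    Rule=RULE_NAME,
    EventBusName="orders-bus",
    Targets=[
        {
            "Id": "billing-invoice-queue",
            # Hypothetical queue; its queue policy must allow events.amazonaws.com to send.
            "Arn": "arn:aws:sqs:us-east-1:111122223333:billing-invoices",
        }
    ],
)
```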

Best practices for advanced serverless developers

YouTube

A lot of the same content I’ve seen in other talks - even some of the same slides. However, there are enough additional nuggets of wisdom to be worth a look.

  • Platform team enablement
    • CI/CD pipelines
    • Security guardrails - AWS control tower, policies, permission boundaries
    • Reusable infrastructure as code patterns
  • Well architected framework serverless application lens
  • Event State
    • Async patterns to reduce latency, improve resilience
    • Use EventBridge
    • “Events are the language of serverless applications”
    • Balance between including additional state in event or having receiver call back
    • Avoid exposing implementation details
  • Service-full Serverless
    • Instead of event source -> lambda -> destination use direct service integrations where possible to remove the lambda glue logic
    • Rather than putting all your backend logic in a single lambda, split concerns using api gateway, event bus, queues, STEP, etc. with multiple lambdas each focused on a minimal bit of business logic.
    • Orchestration and Choreography again
    • API destinations in EventBridge can directly call any external API
    • Use lambda to transform not to transport
  • Fabulous Functions
    • Asynchronous integration built into lambda for SQS, DynamoDB, Kinesis, … sources
    • Custom runtimes and lambda extensions for more control over lambda init
    • Only load what you need during init - another reason for small focused lambdas
    • Optimize dependencies. Pull in just the specific SDK components you need, can save 100s of ms
    • NodeJS v3 SDK is a 3MB package rather than 8MB for v2. TCP connection reuse is now on by default.
    • Prefer Graviton instances
    • Increase memory allocation if lambda is cpu or network constrained. Lambda power tuner is open source tool to find optimal configuration.
    • Only attach to a VPC if you really need to
    • Use reserved concurrency to ensure you have your share of account limit, prevent overloading downstream dependencies, emergency level to turn off processing
    • Understand quotas and how scale up works
  • Configuration as code
    • Automate provisioning
    • AWS has the SAM and CDK frameworks
    • IAM access analyzer to check for security problems
    • SAM outputs CloudFormation and now optionally Terraform
    • 75+ templates out of the box. Serverless patterns collection has many more.
  • Prototype to production
    • Think about testing quickly rather than testing locally
    • Use mock frameworks locally rather than complete service emulation
    • Test in the cloud as soon as possible
    • SAM accelerate makes it practical to develop in cloud - build what’s changed locally and sync changes with cloud
    • Templates for deploying across multiple environments
    • Use multiple delivery pipelines with shared templates to allow each component to deploy independently
    • Test in production - deploy to a subset of traffic. AppConfig feature flags, CloudWatch Evidently.
    • Canaries - CloudWatch synthetics
    • CloudWatch embedded metrics - auto-extract metrics from log entries (see the sketch after this list)
    • CloudWatch logs insights
    • Lots of observability stuff in lambda power tools
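
The embedded metrics point above deserves an example: you print a structured log line and CloudWatch Logs extracts the metric, with no PutMetricData call in the request path. A minimal sketch of the Embedded Metric Format document shape (the namespace and metric are hypothetical; Lambda Powertools provides a nicer wrapper):

```python
import json
import time


def emit_order_metric(count: int) -> None:
    """Print an Embedded Metric Format document; CloudWatch turns it into a metric."""
    print(json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [{
                "Namespace": "MyApp",
                "Dimensions": [["Service"]],
                "Metrics": [{"Name": "OrdersPlaced", "Unit": "Count"}],
            }],
        },
        "Service": "checkout",
        "OrdersPlaced": count,
    }))


def handler(event, context):
    emit_order_metric(1)
    return {"statusCode": 200}
```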

Building real-world serverless applications with AWS SAM

YouTube

Same presenter as “Building next-gen applications with event-driven architectures” with the same jokes. Good introduction to SAM if you’ve never worked with it before.

  • SAM basics
  • SAM local
  • SAM accelerate
  • SAM deploy (dev environments)
  • SAM pipeline (CI/CD for staging/production)
    • Now supports OIDC (use GitHub without an IAM user)
  • Native esbuild support (Node tree shaking)
  • SAM connectors to describe how data and events flow between source/destination
  • Serverless Land
  • Let SAM name your resources
  • Use SAM delete for cleanup

Build your application easily & efficiently with serverless containers

YouTube

Answers the question should I use Lambda, App Runner or Fargate? Includes interesting details on App Runner internals and how that’s reflected in the cost model.

  • Best tools for low concurrency vs high concurrency applications
  • Lambda best for low concurrency, very spiky load, or high variance in work per request (scales to 0, very responsive, separate VM per request)
  • At higher concurrency need to think about waiting for IO. With lambda you’re paying while you wait. ECS better for IO intensive workload where you have consistent baseline load.
  • Need to understand and set limit on number of concurrent requests per container
  • AWS App runner manages that for you on ECS Fargate
  • App Runner internally uses Envoy proxy load balancer
  • App Runner cost model charges for CPU when there are active requests, but only for memory when no requests (once inactive for 1 minute)
  • For highest concurrency run on Fargate directly
    • Choose what load balancer you want to use (ALB, NLB or API gateway)
    • Choose your own orchestrator (ECS recommended for serverless)
    • Choose your scaling - Auto Scaling or based on CloudWatch metrics
    • Fargate charges for containers * time, not traffic

Advanced serverless workflow patterns and best practices

YouTube Resources

Heavy focus on using STEP functions and thinking of everything in terms of workflows.

  • Step functions first, step functions always when building app
  • How to reduce cost
  • Dive into new distributed map feature
  • Workflow makes it easy to understand what’s happening in serverless system - nice ui for looking at history of requests, failures, etc.
  • REST API implemented as express workflow with branch per endpoint
  • Use express workflow when possible (much cheaper)
  • Need standard workflow if duration > 5 minutes, using wait for callback, need exactly once semantics
  • Use hybrid with standard workflow with nested express workflows to optimize cost
  • For example steps to transform input, call downstream service, transform output
  • Use intrinsic functions to avoid having to call out to a lambda for simple transformations
  • Avoid polling loops in STEP using callback pattern and emit and wait pattern
  • Implementing Saga using STEP
  • Circuit breaker as STEP workflow
  • Map-Reduce as STEP workflow
  • Limit of 40 concurrent executions in the map phase. New distributed map state supports 10K concurrent (see the sketch after this list).
  • In distributed mode can choose whether to run map part as standard or express
  • Ability to pass getter for items into the map part rather than querying them all up front (avoids payload limits)
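
A sketch of what a distributed map state looks like, expressed as a Python dict passed to create_state_machine. The bucket, Lambda and role are hypothetical, and the ASL field names are as I understand them from the launch docs, so double check against the current States Language reference:

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

definition = {
    "Comment": "Distributed map over the objects under an S3 prefix",
    "StartAt": "ProcessAllObjects",
    "States": {
        "ProcessAllObjects": {
            "Type": "Map",
            "MaxConcurrency": 1000,
            # Items are read lazily from S3 rather than passed in the input payload.
            "ItemReader": {
                "Resource": "arn:aws:states:::s3:listObjectsV2",
                "Parameters": {"Bucket": "my-input-bucket", "Prefix": "incoming/"},
            },
            "ItemProcessor": {
                # Each batch of items runs as its own child (express) workflow.
                "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS"},
                "StartAt": "ProcessObject",
                "States": {
                    "ProcessObject": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:111122223333:function:process-object",
                        "End": True,
                    }
                },
            },
            "End": True,
        }
    },
}

# The role needs s3:ListBucket on the input bucket plus permission to start
# the child workflow executions and invoke the Lambda.
sfn.create_state_machine(
    name="distributed-map-demo",
    roleArn="arn:aws:iam::111122223333:role/sfn-demo-role",
    definition=json.dumps(definition),
)
```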

A day in the life of a billion requests

YouTube

Deep dive into how IAM implements authentication at a rate of half a billion requests a second. Nothing that you’ll be able to use day to day but technically interesting.

  • Protocol designed in 2006 before SSL/TLS widely adopted - hence use of explicit signature in request
  • Signature includes timestamp to prevent replay attacks
  • Signatures are implemented using a SHA256-based HMAC. Much faster than public key crypto. Even quantum safe.
  • Developer gets keys from AWS console
  • Problem now is how to validate signature at scale
  • Original solution was authentication request service deployed in each region. ARS called for each request. AWS SigV2. Doesn’t scale.
  • AWS SigV4 uses a chain of nested HMACs over the customer secret + date + region + service, with the output of each HMAC treated as the key for the next (see the sketch after this list)
  • Allows derived keys to be cached at the appropriate level (global IAM, region, service)
  • As keys are distributed more widely they have less and less scope
  • At IAM scale run into problems with distributing and updating long term key derivatives
  • Solution is STS. Short lived token that includes all state needed for authentication. Nothing stored on AWS side when STS issues a token. No stored state needs to be touched to validate token.
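
The nested HMAC derivation mentioned above is short enough to show in full. This follows the documented SigV4 signing key derivation; the credential values and string to sign are placeholders:

```python
import hashlib
import hmac


def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    """SigV4 key derivation: each HMAC output becomes the key for the next level.

    Because the date, region and service are baked in, a derived key can be
    cached and distributed at that level without exposing the long-term
    secret, and its scope shrinks at every step.
    """
    k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date)  # e.g. "20221201"
    k_region = _hmac(k_date, region)                             # e.g. "us-east-1"
    k_service = _hmac(k_region, service)                         # e.g. "dynamodb"
    return _hmac(k_service, "aws4_request")


signing_key = derive_signing_key("wJalr...EXAMPLEKEY", "20221201", "us-east-1", "dynamodb")
signature = hmac.new(signing_key, b"<string to sign>", hashlib.sha256).hexdigest()
print(signature)
```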

Zero-privilege operations: Running services without access to data

YouTube

Describes how AWS works internally to restrict access to customer data.

  • Shift in industry to pervasive encryption with fine grained permissions and access limited to only what is needed.
  • Ideally support and admin personnel have zero access to customer data
  • At AWS, customers have to grant explicit access via normal mechanisms for support
  • EC2 Nitro - developers and operators have no access to production EC2 Nitro Instance memory
  • Consider customer data to be radioactive - never want to see or touch it directly
  • Data hosting: strong isolation, defense in depth
  • Least privilege: grant access only when needed, only while it is needed, prove that it is needed
  • Integrate access auditing into all operational practices - always on accountability
  • All network traffic within and between AWS data centers is encrypted with AES-256 or equivalent. Always on.
  • All storage systems support encryption at rest (guard against physical theft)
  • No anonymous shared accounts - use IAM roles instead
  • Turn on CloudTrail!
  • Internal service to service calls also use IAM and SigV4
  • All policies used for internal service to service calls are published
  • All changes run through IAM Access Analyzer (prove you need access)
  • Access should be highly specific and conditional. IAM “forward access sessions” - “on behalf of” access requires proof that the customer has recently called the service.
  • Contingent authorization - rather than granting a user admin access, give them access to run just specific tools.
  • Hermetic systems - e.g. Nitro, KMS. Operators have no way of accessing. Root of trust down to hardware.
  • Nitro has own cpu/storage with no access to general instance memory
  • Instances never share cores or cache lines
  • Nitro enclaves allow customers to create their own isolated sub-instance
  • AWS Nitro Whitepaper just released
  • Long term research into cryptographic computation

Optimizing performance with CloudFront: Every millisecond matters

YouTube

Quick intro to CloudFront and summary of new CF features wrapped around a rather garbled customer testimonial. One to skip.

  • DDoS attacks handled by AWS Shield - 10K+ attacks per month, 99%+ automatically mitigated
  • Customer MapBox talks about their journey
    • 8+ AWS regions active-active behind CloudFront
    • Route 53 location aware latency based routing with failover
    • Use cache policies to make caching more specific/effective
    • Use ETags
    • Resumable byte range requests for large files
    • Offload work on origin for cache misses with origin shield (per region cache)
    • Enable TLS 1.3, faster and more secure than TLS 1.2.
    • Content compression with Brotli. Faster to compress/decompress, 25% smaller. Set up so the client gets Brotli if they support it, otherwise gzip.
    • Use Lambda@edge to move more logic to the edge
    • Heavy use of analytics derived from CloudFront logs
      • Key metrics: cache utilization, latency, errors, traffic alerts
    • 90% of traffic offloaded to CloudFront
  • CloudFront functions (JavaScript) at edge locations, Lambda@edge at regional locations
  • CloudFront functions architecture - process isolation within pool of worker instances. JS runtime restricted to prevent external access or other unsafe operations
    • Example scenarios: URL redirects, auth token validation
  • New features
    • Server timing headers - W3C standard supported by most browsers. CloudFront adds server timing headers for CF operations.
    • JA3 fingerprinting. Way of identifying caller based on SSL handshake. Can be used to identify threat actors or unusual access patterns. CF can add JA3 fingerprint header to incoming request.
    • CF continuous deployment. Can test new distribution by including specific header, then start moving percentage of traffic to new distribution. Roll back at any time.
    • HTTP/3 (UDP based) supported on CF with fallback to HTTP/2 and HTTP/1.1.

Create real-time, event-driven apps w/Amazon EventBridge & AWS AppSync

YouTube

Focus on using AppSync subscriptions for “real-time” notifications to clients.

  • Intro to GraphQL
  • Intro to AppSync
  • Focus on real-time apps
    • Publish-Subscribe async pattern
    • As it happens
    • Push not polling
    • Milliseconds, not seconds
  • Real-time GraphQL subscriptions
    • Implemented using WebSocket connect to AppSync
    • Relies on mutations happening through AppSync. AppSync applies mutation and then pushes out events.
    • Can support out of band request - e.g. direct modification of DynamoDB. You are responsible for reporting the mutation to AppSync (perhaps using lambda with DynamoDB streams) using a “local resolver” so that AppSync doesn’t try to update the data source itself and just sends out events.
    • Metrics sent to CloudWatch
  • Filtering of events
    • Basic filtering - 5 arguments, simple operators, specified by client in subscription
    • Enhanced filtering - more operators, defined in resolvers, JSON format
  • Subscription invalidation (filter to decide when client should be disconnected)
    • e.g. Disconnect client when a delete conversation mutation is received by a chat app
  • EventBridge + AppSync
    • EventBridge API destination to connect to AppSync
    • Event turns into AppSync mutation - data persistence is optional
    • Fan out event to multiple targets
    • Rich set of rules and filtering capabilities on both sides
  • 10 minute live demo

Discover Cloudscape, an open-source design system for the cloud

YouTube

AWS has open sourced its design system and React component library. Now you can build an app with the same look and feel as the AWS console.

  • From 2017 onwards AWS has been trying to standardize on one design system and set of components for the AWS console. Now over 90% of the console is using Cloudscape.
  • Open sourced in July 2022
  • Usual design system spec as set of design tokens and guidance
  • Spacing and margins based on rule of 4
  • 68 React Typescript components
  • https://cloudscape.design/
  • 27 demos showing how everything fits together in an app
  • Theming via design tokens
  • Light and Dark modes
  • Density modes - comfortable and compact modes
  • Responsive UI
  • Initial use cases for third parties
    • Product that extends AWS management console
    • UI for hybrid cloud management system
    • Data intensive interfaces in general

Sustainability in the cloud with Rust and AWS Graviton

YouTube

Focus on how choice of language stack impacts the resources needed at runtime and hence sustainability.

Fascinating table on energy usage (CPU efficiency) and memory use for common language stacks.

| Language | Energy Usage | Memory Usage |
| -------- | ------------ | ------------ |
| C#       | Low          | Medium       |
| Go       | Medium       | Low          |
| Java     | Low          | Medium       |
| NodeJS   | Low          | High         |
| Python   | High         | Low          |
| Ruby     | Medium       | Medium       |
| Rust     | Low          | Low          |

Presenters didn’t want to talk about how the data was derived.

  • Deep dive on Rust vs Java
  • Static compiled vs JVM
  • Stack allocation by default vs heap allocation for most things
  • Monomorphic by default vs dynamic dispatch
  • Rust focus on zero-cost abstractions
  • Rust downsides
    • Heap allocation not cheap or free
    • Need new programming style to deal with ownership
    • No runtime reflection
    • Very limited dynamic library loading
  • Experiment getting Rust and Java teams to build same app in each language and compare
    • Tests lots of common frameworks for both languages
    • Tested Java with JVM and Graal AOT compiler
  • Initial result - all the Rust versions were much faster, up to 3X
  • Java developer rewrote app to use native IO and got Graal compiled version on a par with fastest Rust version
  • Still huge difference in amount of memory used - 15MB for Rust, 64MB for Graal Java, 240MB for JVM Java
  • Some surprises - reminder to benchmark and measure
  • Ran another experiment using 128MB Lambda on Graviton comparing warm start performance
    • Standard sample available in multiple languages (compare vs Java JVM)
    • P50 latency 2X for Java, P99 similar (limited by database latency)
    • Graviton latency slightly worse than x86 at P50, going to 1.5X worse at P99
    • Rust and Graviton both use less energy
    • Java cost 2X that of Rust, Graviton cheaper than x86

What’s new for frontend web and mobile developers with AWS Amplify

YouTube

Ever wondered what that Amplify thing is all about? This explains it.

  • Amplify makes it easy for front end developers to build a full stack app
  • Amplify Studio, CLI, Hosting, Libraries
  • Long demo of building an app using Amplify
  • Amplify provisions backend resources you need
  • Model data using GraphQL
  • Add annotations to the schema to define auth rules and relationships between entities
  • Use CLI to provision backend and auth infrastructure
  • Also sets up local project for all your code. Use CLI to push changes to backend.
  • Deploy includes GraphQL endpoint based on your schema
  • New: GraphQL in Amplify Studio UI, real-time data (filtering, selective sync), in-app messages (push notifications)
  • Amplify backend somewhere between backend as a service and self managed infrastructure. The deployed infrastructure is accessible by you and can be extended.
  • Three options for building front end
    • Generate react components from Figma designs
    • New: Studio Form Builder UI
    • Use supplied UI components
  • Support for React, Next.js, Flutter and many more
  • Amplify hosting for static/SPA web apps and now hybrid server-side rendered too
  • New: Connect to backend libraries
    • Flutter: Rewritten using Dart so your app ends up with single Dart codebase
    • iOS: Rewritten using Swift. Supports async/await.
    • Android: Rewritten using Kotlin.
    • Authenticator component now available for Flutter and React Native
  • Provisioned Architecture
    • AWS Amplify Hosting with content served from S3 bucket via CloudFront
    • Amazon Cognito for login
    • AWS AppSync GraphQL API with notifications via server side filtering
    • DynamoDB database
    • Lambda glue logic + your server side logic

Building a multi-region serverless application with AWS AppSync

YouTube

Very focused talk on the main implementation choices when building a multi-region AppSync app.

  • Base architecture is standard AppSync setup deployed to two regions with replication between the data stores.
  • Two choices for how to build single gateway endpoint
    • Route 53 routing + API Gateway. Doesn’t support subscription.
    • Route 53 routing + CloudFront with Lambda@Edge. The Lambda resolves DNS to determine the region and then updates the request to specify the origin to use (see the sketch after this list).
  • Replication choices
    • Active-Passive with failover when needed
    • Active-Active with DynamoDB global tables
  • Subscription
    • Without extra work subscription is limited to subscribers in the same region where the mutation happened
    • Need to use DynamoDB Streams to fire a mutation through a local resolver so that subscriptions trigger. Works for both direct and replicated changes to DynamoDB.
  • Real-time messaging with WebSockets pub/sub
    • Publish message to EventBridge
    • Use EventBridge cross region routing to forward to other region
    • Event triggers a local resolver mutation in both regions
  • Managing region failures
    • Route 53 application recovery controller (see earlier talk)
  • API Authorization
    • API keys and cognito user pools are single region
    • IAM and lambda authorizer will work cross region
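
A minimal sketch of the Lambda@Edge option described above, written as an origin-request handler. The per-region endpoints are hypothetical, and the region choice is stubbed out where the talk resolves a Route 53 latency/failover record:

```python
# Lambda@Edge origin-request handler: route the request to a regional
# endpoint by rewriting the custom origin's domain name.

REGIONAL_ENDPOINTS = {  # hypothetical per-region API domains
    "us-east-1": "api-use1.example.com",
    "eu-west-1": "api-euw1.example.com",
}


def pick_region() -> str:
    """Stub: the talk resolves a latency/failover DNS record here."""
    return "us-east-1"


def handler(event, context):
    request = event["Records"][0]["cf"]["request"]
    domain = REGIONAL_ENDPOINTS[pick_region()]

    # Point the custom origin at the chosen region and keep the Host header
    # in sync so the backend accepts the request.
    request["origin"]["custom"]["domainName"] = domain
    request["headers"]["host"] = [{"key": "Host", "value": domain}]
    return request
```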

Accelerate GraphQL API app development & collaboration w/AWS AppSync

YouTube

Focus on what’s new in AppSync

  • Merged APIs. Build time generation of a merged read only (queries and subscriptions) API from multiple independently managed source APIs. Intended for multi-team environments where each team maintains ownership of their own GraphQL API but you also want to expose a single combined graph.
  • NodeJs resolvers. Rather than having to use VTL you can now write resolver functions using JavaScript/TypeScript. Uses custom AppSync JS runtime. Like VTL focus is on transformation of request/response payload. AppSync still talks to data source direct. No additional charge.

What’s new with Amazon DocumentDB

YouTube

Focus on what’s new in DocumentDB

  • New regions and instance types supported
  • Query capabilities: Geospatial indices, aggregation operators, Decimal128 type
  • Management features: storage volumes now sized down when you delete data, fast DB cloning, query auditing with DML, performance insights support
  • Elastic Clusters!
    • Previously topped out at 24xl instance type and 64TB storage
    • Elastic cluster will support millions of read/write per second, petabytes of storage, up to 300K connections
    • Compatible with MongoDB APIs for sharding
    • Managed sharding - scaling operations take minutes
    • Each shard is a regular DocumentDB cluster deployed across multiple AZs
    • Routing layer that routes request using hash sharding
    • Critical that you choose a good shard key