In this AWS Database Cheat Sheet, we look at the database technologies available in AWS, how they differ, and where each can be used. This post covers the following topics:
- RDS, Redshift, Aurora
- Athena, Elasticsearch, Neptune
- DocumentDB, Keyspaces, Timestream
- Quantum Ledger Database (QLDB)
- RDS – SQL, relational
- Aurora – SQL, relational, cloud-optimized, serverless option
- ElastiCache – Redis & Memcached, in-memory database
- AWS Redshift – OLAP (Online Analytical Processing), data warehouse, columnar storage, massively parallel query execution
- DocumentDB – MongoDB-compatible
- Amazon Athena – query data in S3
- Amazon QuickSight – serverless, ML-powered BI service to create interactive dashboards
- Amazon EMR – Elastic MapReduce, Hadoop clusters, HBase, Spark
- AWS Glue – Extract, Transform and Load (ETL)
- Amazon Neptune – graph database (social graphs like Facebook's)
- DynamoDB – serverless, NoSQL
- Amazon QLDB – financial ledger with a central authority
- Amazon Managed Blockchain – no central authority, crypto use cases
- Amazon Keyspaces – Apache Cassandra-compatible
- Amazon Timestream – time series
- Amazon OpenSearch (formerly Amazon Elasticsearch) – searching data
AWS RDS Cheat Sheet
- Relational Database Service, SQL: 1) MySQL 2) PostgreSQL 3) MariaDB 4) Oracle 5) Microsoft SQL Server 6) Aurora
- RDS is a managed service.
- Automated provisioning, OS patching, continuous backup & point-in-time recovery.
- Read replicas for improved read performance.
- Multi-AZ for disaster recovery.
- Scaling capacity – horizontal & vertical.
- Storage is backed by EBS (Elastic Block Store).
- You can't SSH into your RDS instances, but you get a monitoring dashboard.
RDS – Storage Auto Scaling: when RDS detects that you are running out of free database storage, it scales storage automatically. You must set a maximum storage threshold. Useful for unpredictable workloads.
RDS Read Replicas: up to 5 read replicas, within the same AZ, cross-AZ, or cross-region. Replication is ASYNC, so reads are eventually consistent. Read replicas within the same region are free; cross-region replication incurs a fee.
RDS Multi-AZ for Disaster Recovery: SYNC replication, increases availability. One DNS name – automatic app failover to the standby. Multi-AZ replication is free. Read replicas can themselves be set up as Multi-AZ for DR.
RDS Custom: managed Oracle & Microsoft SQL Server databases with OS and database customization. You can configure settings, install patches, and access the underlying EC2 instance. You need to deactivate automation mode to perform customizations.
RDS Backups: 1) Automated backup – daily full backup of the DB during the maintenance window. 2) Manual DB snapshot – manually triggered by the user.
RDS Proxy: fully managed database proxy for RDS. Allows apps to pool and share database connections. Improves database efficiency by reducing the stress on the database. Serverless, auto-scaling, highly available. Reduces RDS & Aurora failover time by up to 66%. Supports RDS (PostgreSQL, MySQL & MariaDB) and Aurora (PostgreSQL and MySQL). RDS Proxy is never publicly accessible; it must be accessed from within the VPC.
Amazon Aurora Cheat Sheet
- Relational Database, SQL
- AWS proprietary database
- PostgreSQL and MySQL are both supported by Aurora (drop-in compatible).
- Aurora is AWS Cloud Optimized
- 5x performance improvement over MySQL in RDS
- 3x performance improvement over Postgres in RDS.
- Aurora costs more than RDS (~20%) but is more efficient.
- Aurora storage grows automatically in increments of 10GB, up to 128 TB.
- Aurora supports up to 15 read replicas, and replication is faster (sub-10 ms replica lag).
- Failover in Aurora is instantaneous.
- Aurora is natively HA (highly available) and has self-healing capability.
Aurora Features: 1) automatic failover 2) backup & recovery 3) isolation & security 4) industry compliance 5) push-button scaling 6) automated patching with zero downtime 7) advanced monitoring 8) routine maintenance 9) Backtrack – restore data to any point in time without using backups.
Aurora HA & Read Replicas: one Aurora instance takes writes, called the master. Automatic failover for the master in less than 30 seconds. Cross-region replication is supported. 1 writer (master) + up to 15 read replicas. 6 copies of your data across 3 AZs.
Aurora Multi-Master: multiple writer masters, immediate failover for a writer node; every node does R/W.
Aurora Read Replicas – Auto Scaling: replicas can be scaled with demand, e.g. based on CPU utilization.
Aurora Custom Endpoints: define endpoints on a subset of instances, e.g. to run analytical queries on specific replicas.
Aurora Serverless: good for infrequent, intermittent, and unpredictable workloads. No capacity planning needed. Automatic database instantiation and auto scaling based on actual usage. Pay per second; can be more cost-effective.
Global Aurora: 1) Aurora cross-region read replicas for DR. 2) Aurora Global Database: 1 primary region for R/W, up to 5 secondary read-only regions. Replication lag is less than 1 second. Up to 16 read replicas per secondary region. Helps decrease latency. Promoting another region for DR has an RTO of < 1 minute.
Aurora Backups: Automated and manual backups same as RDS.
Aurora Database Cloning: creates a new Aurora DB cluster from an existing one, with the same volume and data as the original. Faster than backup and restore; very fast and cost-effective.
Amazon ElastiCache Cheat Sheet
- In-memory database with really high performance and low latency.
- Managed Redis or Memcached
- Helps reduce the load on the database for read-intensive workloads.
- Helps make your application stateless.
- AWS takes care of – OS maintenance, patching, setup, configuration, monitoring, failure recovery & backup.
- It requires heavy application code changes.
ElastiCache – Redis vs Memcached:
Redis – Multi-AZ with auto-failover. Read replicas to scale reads & have HA. Data durability using AOF persistence. Backup and restore feature.
Memcached – multi-node partitioning of data (sharding). No high availability. Non-persistent. No backup and restore. Multi-threaded architecture.
ElastiCache Security:
- All caches in ElastiCache: do not support IAM authentication. IAM policies on ElastiCache are used only for AWS API-level security.
- Redis AUTH: you can set a "password/token" when creating a Redis cluster. This is an extra level of security for the cache.
- Memcached: supports SASL-based authentication.
ElastiCache – Use Cases: gaming leaderboards use Redis sorted sets, which guarantee both uniqueness and element ordering. Each time a new element is added, it is ranked in real time and inserted in the correct order.
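The leaderboard pattern can be sketched in plain Python standing in for Redis's ZADD/ZREVRANGE commands; in a real deployment you would call a Redis client (e.g. redis-py) against your ElastiCache endpoint, and the names below are illustrative:

```python
# Minimal leaderboard sketch mimicking Redis sorted-set semantics (ZADD / ZREVRANGE).
# A dict gives the uniqueness guarantee; sorting gives the ordering guarantee.

def zadd(board: dict, member: str, score: float) -> None:
    """Add or update a member's score (uniqueness guaranteed by the dict key)."""
    board[member] = score

def zrevrange(board: dict, start: int, stop: int) -> list:
    """Return members ranked by score, highest first (inclusive range like Redis)."""
    ranked = sorted(board, key=board.get, reverse=True)
    return ranked[start:stop + 1]

leaderboard = {}
zadd(leaderboard, "alice", 120)
zadd(leaderboard, "bob", 300)
zadd(leaderboard, "carol", 210)
print(zrevrange(leaderboard, 0, 2))  # ['bob', 'carol', 'alice']
```

Redis keeps the set ordered on every insert, so reads of the top N are O(log N) rather than a full sort as in this sketch.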
Patterns for ElastiCache:
Lazy Loading: data is loaded into the cache only on a read miss; cached data can become stale.
Write Through: Adds or updates data in cache when written to database. (No stale data)
Session Store: store temporary session data in the cache, using the TTL feature for expiry.
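The lazy-loading and write-through patterns above can be sketched with plain dicts standing in for the cache and the database (the key names are illustrative):

```python
# Sketch of two caching patterns; `cache` plays ElastiCache, `db` plays RDS.
cache = {}
db = {"user:1": "Alice"}

def get_lazy(key):
    """Lazy loading: read from cache; on a miss, load from the DB and cache it."""
    if key not in cache:
        cache[key] = db[key]   # cache miss -> populate from the DB
    return cache[key]          # later direct DB changes can leave this stale

def put_write_through(key, value):
    """Write through: update the DB and the cache together, so reads are never stale."""
    db[key] = value
    cache[key] = value

print(get_lazy("user:1"))      # cache miss, loads "Alice" from the DB
db["user:1"] = "Alicia"        # direct DB change bypassing the cache...
print(get_lazy("user:1"))      # ...still returns the stale cached "Alice"
```

Write-through avoids staleness at the cost of an extra cache write on every update; lazy loading only caches what is actually read.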
AWS DynamoDB Cheat Sheet
- Fully managed, highly available with replication across multiple AZs.
- NoSQL database with transactional support.
- Scales to massive workloads; it is a distributed database.
- Handles millions of requests per second and trillions of rows.
- Fast & consistent in performance. Single digit millisecond latency.
- Low cost & auto scaling capabilities
- IAM integration – authentication and authorization
- No maintenance, patching.
- Standard and Infrequent Access (IA) table class.
AWS DynamoDB Basics:
- DynamoDB is made of tables.
- Each table has primary key.
- Each table can have infinite number of items (rows).
- Each item has attributes.
- Maximum size of an item is 400KB.
- Data types are scalar types, document data types, and set types.
- DynamoDB has flexible schema.
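The item model above can be sketched in plain Python; the table's key attribute (`user_id`) is a hypothetical example, and the JSON length used below is only an approximation of how DynamoDB counts item size (attribute names plus values):

```python
import json

MAX_ITEM_BYTES = 400 * 1024  # DynamoDB's 400 KB per-item limit

# An item is simply a set of attributes; only the primary key is required,
# the rest of the schema is flexible.
item = {"user_id": "u-42", "name": "Alice", "tags": ["admin", "beta"]}

def item_size_ok(item: dict) -> bool:
    """Rough pre-flight check against the 400 KB item cap."""
    return len(json.dumps(item).encode()) <= MAX_ITEM_BYTES

print(item_size_ok(item))  # True
```

Large blobs (images, documents) are normally stored in S3 with only a pointer kept in the DynamoDB item, precisely because of this limit.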
DynamoDB Read/Write Capacity Modes:
- Provisioned Mode (default): capacity planning is needed. You specify the number of reads/writes per second and pay for provisioned read capacity units (RCU) and write capacity units (WCU). Good for predictable workloads.
- On-Demand Mode: read/write capacity automatically scales up/down with the workload. No capacity planning needed. Pay for what you use; more expensive. Good for unpredictable workloads and sudden spikes.
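The provisioned-mode capacity math follows DynamoDB's published unit sizes: one RCU covers one strongly consistent read per second of an item up to 4 KB (eventually consistent reads cost half), and one WCU covers one write per second of an item up to 1 KB:

```python
import math

def rcu_needed(reads_per_sec: int, item_size_kb: float,
               strongly_consistent: bool = True) -> int:
    """RCUs: each read consumes ceil(item_size / 4 KB) units;
    eventually consistent reads cost half as much."""
    total = reads_per_sec * math.ceil(item_size_kb / 4)
    return total if strongly_consistent else math.ceil(total / 2)

def wcu_needed(writes_per_sec: int, item_size_kb: float) -> int:
    """WCUs: each write consumes ceil(item_size / 1 KB) units."""
    return writes_per_sec * math.ceil(item_size_kb)

print(rcu_needed(10, 6))    # 10 reads/sec of 6 KB items -> 20 RCU
print(wcu_needed(10, 2.5))  # 10 writes/sec of 2.5 KB items -> 30 WCU
```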
DynamoDB Accelerator (DAX): In-memory solution for DynamoDB only. Fully managed, highly available, seamless in-memory cache for DynamoDB. Helps to solve read congestion by caching. Microsecond latency for cached data. 5 minutes TTL by default. Does not require any application logic modification.
DynamoDB – Time To Live (TTL): Automatically deletes items after an expiry timestamp.
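TTL works off a Number attribute holding a Unix epoch timestamp (in seconds); items whose timestamp is in the past become eligible for automatic deletion. A sketch of building such an item (the table key, attribute name, and one-hour lifetime are illustrative):

```python
import time

SESSION_LIFETIME_SECONDS = 3600  # keep each session for one hour (illustrative)

def session_item(session_id, now=None):
    """Build an item whose `expires_at` attribute drives DynamoDB TTL deletion."""
    now = time.time() if now is None else now
    return {
        "session_id": session_id,                            # partition key (hypothetical)
        "expires_at": int(now + SESSION_LIFETIME_SECONDS),   # the TTL attribute
    }

item = session_item("abc123")
print(item["expires_at"] > time.time())  # True: expiry lies in the future
```

Note that TTL deletion is background and best-effort, so expired items can linger briefly and should also be filtered out at read time.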
DynamoDB Global Tables: make a DynamoDB table accessible with low latency in multiple regions. Active-active replication: applications can read and write to the table in any region. DynamoDB Streams must be enabled as a prerequisite.
DynamoDB Stream Processing: an ordered stream of item-level modifications (create/update/delete) in a table. React to changes in real time, e.g. send a welcome email to new users.
DynamoDB Backups: 1) continuous backups using point-in-time recovery (the recovery process creates a new table); 2) on-demand backups: full backups for long-term retention. Can be configured and managed using AWS Backup.
DynamoDB Integration with S3:
Export to S3: must enable PITR. Works for any point in time in the last 35 days. Retain snapshots for auditing, or run ETL on top of the S3 data before importing it back into DynamoDB.
Import from S3: import CSV, JSON, or ION format. Creates a new table. Import errors are logged in CloudWatch Logs.
AWS Redshift Cheat Sheet
- Redshift is based on PostgreSQL but is not used for OLTP.
- It is used for OLAP (Online Analytical Processing) & data warehousing.
- 10x better performance than other data warehouses.
- Scales to PBs of data.
- Columnar storage data.
- Parallel query engine.
- Pay as you go
- Integration with Amazon QuickSight and Tableau.
- Vs AWS Athena: faster queries, joins, and aggregations thanks to indexes.
- Redshift Cluster: 1) leader node: for query planning and result aggregation. 2) compute nodes: perform the queries and send results to the leader.
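The columnar-storage point above is why analytical queries are cheap in Redshift: an aggregate over one column only has to scan that column, not every full row. A small sketch of the two layouts (the table contents are made up):

```python
# Row-oriented vs columnar layout: a row store must read whole rows to sum one
# field, while a column store reads a single contiguous column.
rows = [
    {"id": 1, "region": "us-east-1", "revenue": 100},
    {"id": 2, "region": "eu-west-1", "revenue": 250},
    {"id": 3, "region": "us-east-1", "revenue": 175},
]

# Columnar representation: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# SUM(revenue) touches only the "revenue" column.
total_revenue = sum(columns["revenue"])
print(total_revenue)  # 525
```

Columnar layout also compresses far better, since each column holds values of a single type.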
Redshift – Snapshot:
- Has no “multi-AZ” mode
- Snapshots are PITR backups.
- Snapshots are incremental.
- Can restore snapshot into new cluster.
- Automated – every 8 hours, every 5 GB, or on a schedule. Retention between 1 and 35 days.
- Manual – the snapshot is retained until you delete it.
Redshift DR: you can configure Redshift to automatically copy snapshots of a cluster to another AWS Region.
Loading data into Redshift: 1) Amazon Kinesis Data Firehose into Redshift. 2) From S3 using the COPY command. 3) From an EC2 instance using the JDBC driver.
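The S3 route uses Redshift's COPY statement, which you would run through any SQL client connected to the cluster. A sketch that just assembles the statement (the table name, bucket, and IAM role ARN are hypothetical placeholders):

```python
# Build a Redshift COPY statement for loading CSV data from S3.
# COPY authenticates via an IAM role attached to the cluster.
def build_copy_statement(table: str, s3_path: str, iam_role_arn: str) -> str:
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "FORMAT AS CSV IGNOREHEADER 1;"
    )

sql = build_copy_statement(
    "sales",
    "s3://example-bucket/sales/2024/",
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
print(sql)
```

COPY loads in parallel across compute nodes, which is why it is preferred over many single-row INSERTs.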
Amazon Redshift Spectrum: query data that is already in S3 without loading it. You must have a Redshift cluster available to start the query; the query is then submitted to thousands of Redshift Spectrum nodes.
Amazon DocumentDB Cheat Sheet
- DocumentDB is the AWS implementation of MongoDB, which is NoSQL.
- MongoDB is used to store, query and index JSON data.
- Similar deployment as Amazon Aurora.
- Fully managed, highly available with replication across 3 AZ.
- Storage grows automatically in increments of 10GB, up to 64TB.
- Automatically scales to workloads with millions of requests per second.
Amazon Neptune Cheat Sheet
- Fully managed graph database.
- Social media graphs – e.g. Facebook-style social networks.
- Highly available across 3 AZ.
- Upto 15 read replicas.
- Can store up to billions of relations and query the graph within milliseconds.
- Use cases: knowledge graphs (Wikipedia), fraud detection, and social networking.
Amazon Keyspaces Cheat Sheet
- For Apache Cassandra, which is an open-source NoSQL database.
- A managed Apache Cassandra database service.
- Serverless, scalable, highly available and fully managed by AWS.
- Tables are replicated 3 times across AZs.
- Automatically scales table up/down based on traffic.
- Single digit millisecond latency. 1000s of requests per second.
- Capacity – on-demand or provisioned with auto scaling
- Uses the Cassandra Query Language (CQL).
- Encryption, backups, and PITR up to 35 days.
- Use cases: store IoT device info, time series data.
Amazon Timestream Cheat Sheet
- Fully managed, fast, scalable, serverless time series database.
- Automatically scales up/down to adjust capacity.
- Store and analyze trillions of events per day.
- 1000s of times faster and 1/10th the cost of relational databases.
- Data storage tiering: 1) recent data is kept in memory; 2) historical data is kept in cost-optimized storage.
- Built-in time series analytics functions help identify patterns in your data in real time.
- Use cases: IoT apps, real-time analytics for operational applications.
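The storage tiering above amounts to routing records by age against a configurable memory-store retention window; the 24-hour retention value below is illustrative, since Timestream lets you set it per table:

```python
# Sketch of Timestream's two-tier storage: records younger than the memory-store
# retention live in the fast in-memory tier, older records in magnetic storage.
MEMORY_RETENTION_HOURS = 24  # illustrative; configurable per table in Timestream

def storage_tier(record_age_hours: float) -> str:
    return "memory" if record_age_hours < MEMORY_RETENTION_HOURS else "magnetic"

print(storage_tier(2))   # memory   (recent point, fast queries)
print(storage_tier(72))  # magnetic (historical point, cheap storage)
```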
Amazon QLDB Cheat Sheet
- QLDB – Quantum Ledger Database.
- With a central authority.
- A ledger is a book recording financial transactions.
- Fully managed, highly available, replication across 3 AZs.
- Used to review the history of all changes made to your application data over time.
- Immutable system, i.e. no entry can be updated or removed.
- Cryptographically verifiable.
- 2-3x better performance than common ledger blockchain frameworks.
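The "immutable, cryptographically verifiable" idea can be sketched as a hash chain, which is the principle behind QLDB's journal (this is an illustrative sketch, not QLDB's actual digest format):

```python
import hashlib
import json

# Append-only journal where each entry's hash covers the previous entry's hash,
# so editing any historical entry breaks the whole chain.
def append_entry(journal: list, data: dict) -> None:
    prev_hash = journal[-1]["hash"] if journal else "0" * 64
    payload = json.dumps(data, sort_keys=True) + prev_hash
    journal.append({"data": data,
                    "hash": hashlib.sha256(payload.encode()).hexdigest()})

def verify(journal: list) -> bool:
    """Recompute the chain from the start; any tampering is detected."""
    prev_hash = "0" * 64
    for entry in journal:
        payload = json.dumps(entry["data"], sort_keys=True) + prev_hash
        if hashlib.sha256(payload.encode()).hexdigest() != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True

journal = []
append_entry(journal, {"account": "A", "amount": 100})
append_entry(journal, {"account": "B", "amount": -40})
print(verify(journal))              # True
journal[0]["data"]["amount"] = 999  # tamper with history...
print(verify(journal))              # False: the chain no longer verifies
```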
- Amazon Managed Blockchain
- Without central authority.
- Managed blockchain service to join public network or create your own scalable private network.
- Cryptography – Hyper ledger and Etherum.
Amazon Athena Cheat Sheet
- Serverless query service to analyze data stored in Amazon S3.
- Uses standard SQL to query; built on Presto.
- Supports CSV, JSON, ORC, and Parquet.
- Commonly used with Amazon QuickSight for reporting and dashboards.
- Use cases: business intelligence, analytics, reporting; analyzing & querying VPC Flow Logs, ELB logs, CloudTrail logs, etc.
- Performance improvement: 1) use columnar data for cost savings – Apache Parquet or ORC is recommended; 2) compress data for smaller retrievals; 3) partition datasets in S3 for easier querying; 4) use larger files to minimize overhead.
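The partitioning tip works because Athena prunes Hive-style `key=value` prefixes in S3 when the partition columns appear in the WHERE clause. A sketch that builds such a query for the VPC Flow Logs use case (the table name, partition columns, and bucket layout are hypothetical):

```python
# Athena scans only the S3 partitions matched by the WHERE clause, so filtering
# on partition columns (year/month here) directly cuts scan cost.
def flow_log_query(year: int, month: int) -> str:
    return (
        "SELECT srcaddr, dstaddr, SUM(bytes) AS total_bytes "
        "FROM vpc_flow_logs "                       # hypothetical table
        f"WHERE year = '{year}' AND month = '{month:02d}' "
        "GROUP BY srcaddr, dstaddr "
        "ORDER BY total_bytes DESC LIMIT 10;"
    )

# Matching partitioned layout Athena can prune:
#   s3://example-logs/vpc-flow/year=2024/month=05/...
print(flow_log_query(2024, 5))
```

Since Athena bills per byte scanned, partition pruning plus columnar formats (Parquet/ORC) compound into large cost savings.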
Amazon EMR Cheat Sheet
- Elastic MapReduce.
- Helps create Hadoop clusters (big data) to analyze and process vast amounts of data.
- Apache Spark, HBase, Presto, Flink.
- EMR takes care of all the provisioning and configuration.
- Auto scaling and integration with Spot Instances.
- Use cases: data processing, machine learning, web indexing, and big data.
Master Node: manages the cluster, coordinates work, monitors health; long-running.
Core Node: runs tasks and stores data; long-running.
Task Node (optional): just runs tasks; usually Spot Instances.
EMR Purchasing Options:
On-Demand – reliable, predictable
Reserved (min 1 year) – cost saving
Spot Instances: Cheaper, can be terminated, less reliable.
Amazon QuickSight:
Serverless, machine-learning-powered business intelligence service to create interactive dashboards.
Fast, automatically scalable, embeddable, with per-session pricing.
- Business Analytics
- Building Visualization
- Perform ad-hoc analytics
- Get business insight using data.
- Integrated with RDS, Aurora, Athena, Redshift, S3 …
- In-memory computation using the SPICE engine if data is imported into QuickSight.
AWS Glue Cheat Sheet
ETL – Extract, Transform & Load.
Useful to prepare and transform data for analysis.
Fully managed, serverless service.
1) Convert data into Parquet format.
2) Glue Data Catalog – a catalog of datasets, populated by the AWS Glue Data Crawler.
Glue Job Bookmarks: prevent reprocessing of old data.
Glue Elastic Views: combine and replicate data across multiple data stores using SQL. Leverages virtual tables; no custom code; serverless.
Glue DataBrew: clean and normalize data using pre-built transformations.
Glue Studio: a GUI to create, run, and monitor ETL jobs in Glue.
Glue Streaming ETL: built on Apache Spark Structured Streaming. Compatible with Kinesis Data Streams, MSK, and Kafka.
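The "convert data into Parquet" step is at heart a row-to-columnar transformation; a minimal stdlib sketch of that idea, stopping at the in-memory columnar form where a real Glue job would write Parquet files:

```python
import csv
import io

# Toy input a Glue ETL job might extract from S3 (inlined here for simplicity).
raw = "id,price\n1,9.5\n2,12.0\n"

# Transform: parse CSV rows, then pivot into one list per column, which is the
# shape a columnar format like Parquet stores on disk.
rows = list(csv.DictReader(io.StringIO(raw)))
columnar = {name: [row[name] for row in rows] for name in rows[0]}
print(columnar)  # {'id': ['1', '2'], 'price': ['9.5', '12.0']}
```

In an actual Glue job this transform runs on Spark and the load step writes partitioned Parquet back to S3, registered in the Glue Data Catalog.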
Amazon OpenSearch:
- Successor to Amazon Elasticsearch.
- With OpenSearch, you can search on any field, even with partial matches.
- OpenSearch requires a cluster of instances, i.e. it is not serverless.
- Does not support SQL natively; it has its own query language.
- Security through IAM, Cognito, KMS encryption, and TLS.
- Comes with OpenSearch Dashboards.
- Ingestion from 1) Kinesis Data Firehose 2) AWS IoT 3) CloudWatch Logs.