On November 30th, I had the privilege of (virtually) attending Swami Sivasubramanian’s keynote address at re:Invent 2022. Swami is the Vice President of Data and Machine Learning at AWS, where his team’s mission is, “To make it easy for organizations and developers by providing the best set of capabilities to store and query data (to build scalable data driven apps), analyze and visualize their data (to do analytics) and put their data to work through machine learning.” This space is always ripe for innovation and new service releases, so let’s see what this year has in store!
Swami began his keynote by introducing the three core elements of an organization’s data strategy:
- Build future-proof foundations supported by core data services
- Weave connective tissue across your organization
- Democratize data with tools and education
Each of today’s announcements can be viewed through the lens of one of these core elements, which are increasingly important to all modern organizations. In fact, Swami revealed some incredible numbers around the AWS customer base:
- More than 1.5 million AWS customers use database, analytics, or machine learning AWS services.
- Among the top 1000 AWS customers, 94% use 10 or more different database and analytics services!
With that in mind, it’s fitting that the first announcement would center around one of the most popular analytics services, Amazon Athena!
Amazon Athena for Apache Spark
For today’s first major announcement, Swami talked at length about the popularity of Amazon Athena, a serverless service that makes it easy to analyze data in Amazon S3 using traditional SQL. While Athena is great for simple analysis, customers have increasingly needed to adopt Apache Spark to build more complex, distributed data analytics applications. Unfortunately, that has meant provisioning and maintaining the infrastructure needed to run Apache Spark.
However, taking on the burden of managing Spark infrastructure is no longer necessary with the announcement of Amazon Athena for Apache Spark! This new offering is now generally available and allows users to run interactive analytics workloads in Apache Spark. Best of all, it’s completely serverless and can be spun up in under one second! AWS says its performance is up to 75 times faster than other serverless Spark offerings. Customers can now build robust Spark applications directly from Athena without any of the headaches typically associated with maintaining a Spark cluster. Pricing for this service will be based on compute usage, measured in data processing unit (DPU) hours.
Amazon DocumentDB Elastic Clusters
Swami spoke about the pain points many organizations experience when scaling write operations beyond a single DocumentDB instance. Complex sharding logic is difficult and time-consuming to implement, and customers have been clamoring for something of an “easy button” when it comes to scaling read and write operations. For these customers, the announcement of Amazon DocumentDB Elastic Clusters will come as very welcome news!
DocumentDB Elastic Clusters allow databases to scale elastically in just minutes, with zero impact on application availability or performance. All of the sharding logic is handled on your behalf, and each shard has its own compute and storage resources, all managed automatically by AWS. Elastic Clusters are highly available by default and allow your workloads to scale up to millions of read/write operations per second and petabytes of storage.
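To give a sense of the kind of sharding logic customers would otherwise have to write and maintain themselves, here is a minimal sketch of hash-based shard routing in plain Python. It is an illustration only: the function names, the `customer_id` shard key, and the shard count are all hypothetical, and it makes no claim about how DocumentDB Elastic Clusters distribute data internally.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a shard key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then wrap it
    # into the range [0, num_shards).
    return int.from_bytes(digest[:8], "big") % num_shards


def route(documents, num_shards, key_field="customer_id"):
    """Group documents by the shard their key hashes to."""
    shards = {index: [] for index in range(num_shards)}
    for doc in documents:
        shards[shard_for(str(doc[key_field]), num_shards)].append(doc)
    return shards


# Hypothetical documents keyed by customer_id.
documents = [{"customer_id": f"cust-{n}", "total": n * 10} for n in range(100)]
placement = route(documents, num_shards=4)
```

Even this toy version hints at the harder problems it leaves out, such as rebalancing data when the shard count changes, which is the sort of work a managed offering is meant to take off your plate.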
Geospatial ML Support in Amazon SageMaker
One of the more exciting new announcements today was the preview release of new Geospatial ML support for Amazon SageMaker. Many organizations face challenges when it comes to finding high-quality geospatial data that they can use to train machine learning models. These challenges often extend beyond simply finding this data to importing it, since it is usually quite large, and then visualizing it using tools that are typically somewhat feature-limited.
This release aims to address all of these pain points. SageMaker now features interactive mapping capabilities and makes robust geospatial data readily available. In a very compelling demo of this new release, we saw how tools such as a pre-trained road extraction model can be used in conjunction with POI data from Foursquare to assist first responders and aid workers looking to find passable roads after a significant flooding event.
Amazon Redshift Multi-AZ
Next, Swami announced an important feature update for Amazon Redshift: Multi-AZ support! This support allows you to make your Redshift data warehouse deployments highly available, with guaranteed capacity to fail over automatically if your data warehouse’s primary Availability Zone suffers an outage, all at a fraction of the cost of maintaining separate standby instances. Best of all, no changes to your applications are required. This update is now available in preview and represents an important step in the journey to fully protect an organization’s data from core to perimeter.
Trusted Language Extensions for PostgreSQL
Swami talked about PostgreSQL and how it’s the fastest-growing database platform on AWS, both on Amazon RDS and Amazon Aurora. In particular, database developers love PostgreSQL because of its extensibility model that allows you to build extensions using popular programming languages such as JavaScript and Perl. In fact, AWS already supports several dozen PostgreSQL extensions in both RDS and Aurora.
To further demonstrate AWS’ commitment to both open source and PostgreSQL extensions, Swami announced a brand new open-source project: Trusted Language Extensions for PostgreSQL. The project is licensed under the Apache License 2.0 and will allow developers to safely install and use PostgreSQL extensions written in popular programming languages on both RDS and Aurora right away, without needing to wait for AWS to certify them first.
Amazon GuardDuty RDS Protection
Speaking of Aurora, the next announcement gives Aurora an important security boost: Amazon GuardDuty RDS Protection! This update further enhances the robust, intelligent threat detection capabilities of GuardDuty by allowing you to identify suspicious or malicious activity within your Aurora databases.
Available in preview today, GuardDuty RDS Protection will profile and monitor all access to your Aurora databases. When it identifies suspicious activity, the resulting findings can be sent to the following:
- GuardDuty console
- AWS Security Hub
- Amazon Detective
- Amazon EventBridge
This allows you to seamlessly integrate RDS protection into your existing security applications and workflows.
AWS Glue Data Quality
In addition to data security and availability, it’s critically important that the data in an enterprise’s data lake meet certain data quality standards, as the quality of any data-driven decisions will always directly correlate with the quality of the data itself. To support this, AWS Glue Data Quality offers a new (preview) set of features for AWS Glue that can automatically assess your tables and generate data quality rules. This enables better decision making and, more importantly, significantly reduces the amount of effort required to ensure your data lakes and data warehouses are filled with quality data.
These data quality rules can ensure everything from the presence of data or required length of data within a particular column, to valid date or other identifier ranges and much more, promising to reduce manual efforts from days down to hours!
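To make those rule types a little more concrete, here is a minimal sketch in plain Python of what completeness, length, and range checks look like conceptually. This is not the AWS Glue Data Quality API or its rule language; the helper names, the sample `orders` rows, and the rule labels are all made up for illustration.

```python
from datetime import date


def is_complete(rows, column):
    """True if every row has a non-empty value in the column."""
    return all(row.get(column) not in (None, "") for row in rows)


def has_length(rows, column, length):
    """True if every non-null value in the column has the given length."""
    return all(
        len(str(row[column])) == length
        for row in rows
        if row.get(column) is not None
    )


def in_range(rows, column, low, high):
    """True if every non-null value in the column lies within [low, high]."""
    return all(
        low <= row[column] <= high
        for row in rows
        if row.get(column) is not None
    )


def evaluate(rows, rules):
    """Run each named rule against the rows and report pass/fail."""
    return {name: check(rows) for name, check in rules.items()}


# Hypothetical sample data with two deliberate quality problems.
orders = [
    {"order_id": "A1001", "zip": "60601", "order_date": date(2022, 11, 1)},
    {"order_id": "A1002", "zip": "6060", "order_date": date(2022, 11, 30)},
    {"order_id": None, "zip": "60614", "order_date": date(2022, 12, 1)},
]

rules = {
    "order_id is complete": lambda rows: is_complete(rows, "order_id"),
    "zip has length 5": lambda rows: has_length(rows, "zip", 5),
    "order_date in 2022": lambda rows: in_range(
        rows, "order_date", date(2022, 1, 1), date(2022, 12, 31)
    ),
}

results = evaluate(orders, rules)
```

Here the first two rules fail (one missing `order_id`, one four-character `zip`) while the date range rule passes. Writing and maintaining checks like these by hand for every table is the manual effort the new feature aims to reduce.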
Centralized Access Controls for Redshift Data Sharing
Turns out we aren’t done with enhancements to Redshift just yet! Swami went on to discuss the importance of an end-to-end governance strategy when it comes to an organization’s critical data lakes, data warehouses, and machine learning. To that end, he referenced the 2021 release of row and cell-level permissions within AWS Lake Formation, which then segued into the next announcement: Centralized Access Controls for Redshift Data Sharing.
This feature update, which is now available in preview, allows you to centrally manage access controls for Redshift data using AWS Lake Formation without requiring complicated queries or manual, labor-intensive scripting. This simplified governance makes it easy to manage access and security for your Redshift data down to the individual row and column level.
Amazon SageMaker ML Governance
Governance continues to be an important theme in today’s next release: Amazon SageMaker ML Governance. Much like the centralized access controls for Redshift data sharing, this enhanced suite of governance tools for SageMaker will increase transparency and simplify access controls across your organization’s ML lifecycle through the following:
- Role Manager allows you to define custom permissions for SageMaker users.
- Model Cards allow you to simplify documentation of your ML models throughout their lifecycle.
- Model Dashboard allows you to have a single unified view of all your ML models in a single location.
Together, these tools will provide much more robust governance and auditability across your ML development lifecycle.
Amazon Redshift auto-copy from S3
We’ve already shown Redshift so much love today (and this week if you include Adam Selipsky’s announcements yesterday around Aurora Zero-ETL integration with Redshift along with Redshift integration for Apache Spark), but we aren’t done yet! Swami announced one additional preview feature: Amazon Redshift auto-copy from S3. This feature promises to drastically simplify the process of loading your data from S3 into Redshift.
Instead of manually running COPY statements every time you wish to load data from S3 into Redshift, Amazon Redshift auto-copy from S3 allows you to create Copy Jobs that will continuously and automatically load new objects directly into Redshift as they are added to S3. This allows you to fully automate a simple data ingestion pipeline without requiring any ongoing engineering effort. Very cool stuff!
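As a rough sketch of what creating a Copy Job might look like, the Python helper below assembles the SQL statement as a string. The `JOB CREATE ... AUTO ON` clause reflects my understanding of the preview syntax and should be checked against the current Redshift documentation before use. The table name, bucket, role ARN, and job name are placeholders, and the helper does no escaping or validation of its inputs.

```python
def build_copy_job_sql(table, s3_prefix, iam_role_arn, job_name):
    """Assemble a COPY statement that registers an auto-copy job.

    Assumption: the JOB CREATE ... AUTO ON clause matches the preview
    syntax for Redshift auto-copy; verify against the official docs.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"JOB CREATE {job_name}\n"
        "AUTO ON;"
    )


# Placeholder values for illustration only.
sql = build_copy_job_sql(
    table="public.sales",
    s3_prefix="s3://example-bucket/sales/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleLoadRole",
    job_name="sales_auto_copy",
)
print(sql)
```

You would then run the resulting statement once through your usual SQL client. After that, the idea is that new objects landing under the S3 prefix get loaded without anyone issuing further COPY statements.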
New Data Connectors for AppFlow and Data Sources for SageMaker Data Wrangler
The next couple of announcements centered around expanding the ability to use services like Amazon AppFlow and Amazon SageMaker Data Wrangler to bring together information from different systems and data stores, including on-premises applications, SaaS applications, and AWS services.
Swami began by announcing 22 new data connectors for Amazon AppFlow. These include marketing connectors such as LinkedIn Ads and Google Ads. He then followed this by announcing over 40 new data sources for Amazon SageMaker Data Wrangler (again including LinkedIn Ads and Google Ads, among many others). Together, these connectors and data sources will help organizations build integrated analytics applications and predictive models that can guide and enhance an organization’s decision-making process.
AWS Machine Learning University now provides educator training
Swami’s final announcement centered around the final core element of the data strategy he introduced at the beginning of his keynote: Democratize data with tools and education. Noting the incredible gap that exists between the 54,000 Computer Science graduates our nation’s colleges and universities produce annually and the estimated 1 million AI and ML jobs that our economy will produce by the year 2029, Swami proudly announced that AWS Machine Learning University (MLU) now provides educator training.
AWS MLU was launched in 2018 as a way to give developers self-service access to the same machine learning training AWS uses internally in an effort to educate the next generation of data developers. This update enhances MLU with “train the trainer” resources that help educators, especially those at community colleges, minority-serving institutions, and HBCUs, establish courses, certificates, and degree programs in the areas of data analytics, AI, and ML. Best of all, as part of this effort, faculty and students get free access to instructional materials and a live ML development environment using Amazon SageMaker Studio Lab. This exciting announcement promises to help close the future skills gap as well as provide important diversity in the growing field of data science.
In Closing
I entered today’s keynote full of excitement and anticipation, and Swami did not disappoint. I’ve been thoroughly impressed by the breadth and depth of announcements and new service releases already this week, and it’s only Wednesday! Keep an eye on our blog for more exciting keynote announcements from re:Invent!