Today, we’re announcing the general availability of data preparation authoring in AWS Glue Studio Visual ETL. This new no-code data preparation experience gives business users and data analysts a spreadsheet-style interface for running data integration jobs at scale on AWS Glue for Spark, and makes it easy for data analysts and data scientists to clean and transform data to prepare it for analysis and machine learning (ML). Within this new environment, you can choose from hundreds of prebuilt transformations to automate data preparation tasks, all without writing any code.
Business analysts can now collaborate with data engineers to build data integration jobs. Data engineers can use the Glue Studio visual flow-based view to define data connections and set the order of the data flow, while business analysts can use the data preparation experience to define data transformations and output. In addition, you can import your existing AWS Glue DataBrew data cleaning and preparation recipes into the new AWS Glue data preparation experience, so you can continue to build on them directly in AWS Glue Studio and then scale the recipes to handle petabytes of data at the lower cost of AWS Glue workloads.
Visual ETL prerequisites (environment setup)
Visual ETL requires the AWSGlueConsoleFullAccess managed IAM policy to be attached to the users and roles that will access AWS Glue. This policy grants those users and roles full access to AWS Glue and read access to Amazon Simple Storage Service (Amazon S3) resources.
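As a sketch, attaching the managed policy can also be scripted with the AWS SDK for Python (boto3); the role name used here is a hypothetical example, and the call assumes boto3 is installed and AWS credentials are configured.

```python
# ARN of the AWS-managed policy named in the prerequisites above.
GLUE_POLICY_ARN = "arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess"

def attach_glue_console_policy(role_name: str) -> None:
    """Attach AWSGlueConsoleFullAccess to an IAM role.

    Assumes boto3 and AWS credentials are available; the role name
    (e.g. "GlueStudioAnalystRole") is a hypothetical example.
    """
    import boto3  # deferred import so the snippet loads without boto3 present

    iam = boto3.client("iam")
    iam.attach_role_policy(RoleName=role_name, PolicyArn=GLUE_POLICY_ARN)
```

The same attachment can of course be done once in the IAM console; the script form is useful when provisioning many analyst roles.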
Create a visual ETL flow
Once the appropriate AWS Identity and Access Management (IAM) role permissions are defined, create a visual ETL using AWS Glue Studio.
Extract
Create an Amazon S3 node by selecting an Amazon S3 node from the list of sources.
I previously created an S3 bucket in the same Region as the AWS Glue visual ETL and uploaded a .csv file, visual ETL conference data.csv, containing the data that I will visualize. Select the newly created node and browse to the S3 dataset. Once the file is selected, configure the source node, and the visual interface will preview the data contained in the .csv file.
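The bucket-and-upload step above can also be scripted; this is a minimal sketch assuming boto3 and credentials are available, with a hypothetical bucket name and an existing bucket in the Glue Studio Region.

```python
import os

def dataset_key(local_path: str) -> str:
    # The object key is simply the file name,
    # e.g. "visual ETL conference data.csv".
    return os.path.basename(local_path)

def upload_dataset(local_path: str, bucket: str, region: str) -> str:
    """Upload the .csv to an existing S3 bucket and return its S3 URI.

    Assumes boto3 and AWS credentials are configured; the bucket name
    passed in (e.g. "my-glue-demo-bucket") is a hypothetical example.
    """
    import boto3  # deferred import so the snippet loads without boto3 present

    s3 = boto3.client("s3", region_name=region)
    key = dataset_key(local_path)
    s3.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"
```

The returned URI is what you point the Amazon S3 source node at in Glue Studio.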
It is important to set the role permissions detailed in the previous step to grant AWS Glue read access to the S3 bucket. Failure to do this will result in an error that prevents you from previewing the data.
Transform
After configuring the node, add a Data Preparation Recipe and start a data preview session. This session usually takes about 2-3 minutes to start.
Once the data preview session is ready, select Author recipe to start an authoring session and add transformations after the data frame has loaded. During an authoring session, you can view the data, apply transformation steps, and interactively view the transformed data. You can undo, redo, and reorder steps, and you can view each column’s data type and statistical properties.
You can start applying transformation steps to your data, such as changing values from lowercase to uppercase, changing the sort order, and more, by selecting Add a step. All your data preparation steps are tracked in the recipe.
I wanted to keep track of the conferences that will be held in South Africa, so I created two recipe steps: one to filter for rows where the Location column has a value equal to “South Africa,” and one to filter for rows where the Comments column contains a value.
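At scale these steps run as recipe transformations on AWS Glue for Spark, but the logic itself is simple; here is a plain-Python sketch of the two filter steps, with sample rows invented purely for illustration.

```python
# A minimal sketch of the two recipe filter steps in plain Python.
# Column names (Location, Comments) mirror the walkthrough above;
# the rows themselves are invented sample data.
conferences = [
    {"Location": "South Africa", "Comments": "keynote confirmed"},
    {"Location": "Brazil",       "Comments": "venue TBD"},
    {"Location": "South Africa", "Comments": ""},
]

# Step 1: keep rows where Location equals "South Africa".
step1 = [row for row in conferences if row["Location"] == "South Africa"]

# Step 2: keep rows where Comments contains a value (non-empty).
step2 = [row for row in step1 if row["Comments"]]
```

Because the recipe records each step in order, reordering or undoing a step in the authoring session is equivalent to rearranging these filters.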
Load
Once you’ve prepared your data interactively, you can share your work with data engineers who can augment it with more advanced visual ETL flows and custom code to seamlessly integrate it into their production data pipelines.
Now available
The AWS Glue data preparation experience is now generally available in all AWS commercial Regions where AWS Glue DataBrew is available. To learn more, visit AWS Glue, watch the following video, and read the AWS Big Data blog.
For more information, visit the AWS Glue Developer Guide and submit feedback on AWS re:Post for AWS Glue or through your usual AWS support contacts.
— Veliswa