AWS Certified Big Data - Specialty: Creating Tables in Amazon Athena for Different File Formats

Question

A company plans to host data sets as files uploaded to Amazon S3.

Amazon Athena will be used to create tables based on the files in S3.

The files will be in CSV format.

Which of the following must be used when creating the tables so that Athena can work with different file formats?

Answers

A. Defining a throughput for the table

B. Using a SerDe

C. Using Indexes

D. Using Primary Keys

Explanations

Answer - B.

This is mentioned in the AWS Documentation.

########

Using a SerDe.

A SerDe (Serializer/Deserializer) is a way in which Athena interacts with data in various formats.

It is the SerDe you specify, and not the DDL, that defines the table schema.

In other words, the SerDe can override the DDL configuration that you specify in Athena when you create your table.

To Use a SerDe in Queries.

To use a SerDe when creating a table in Athena, use one of the following methods:

Use DDL statements to describe how to read and write data to the table and do not specify a ROW FORMAT, or specify ROW FORMAT DELIMITED as in the example below. This omits listing the actual SerDe type, and the native LazySimpleSerDe is used by default.

In general, Athena uses the LazySimpleSerDe if you do not specify a ROW FORMAT, or if you specify ROW FORMAT DELIMITED.

    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    ESCAPED BY '\\'
    COLLECTION ITEMS TERMINATED BY '|'
    MAP KEYS TERMINATED BY ':'

Alternatively, name the SerDe explicitly with ROW FORMAT SERDE and pass any options it needs in WITH SERDEPROPERTIES.

########

The other options are more relevant when you work with database services such as Amazon DynamoDB or Amazon RDS.

For more information on using a SerDe, refer to the URL below.

https://docs.aws.amazon.com/athena/latest/ug/serde-about.html
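To show where the ROW FORMAT DELIMITED clause quoted above sits in a complete statement, here is a minimal sketch of a table definition that relies on the default LazySimpleSerDe. The table name, column names, and S3 location are hypothetical placeholders, not taken from the question.

```sql
-- Sketch only: events_delimited, its columns, and the bucket path are assumed names.
-- No SerDe class is named, so Athena falls back to LazySimpleSerDe.
CREATE EXTERNAL TABLE IF NOT EXISTS events_delimited (
  event_id   string,
  tags       array<string>,        -- items separated by '|'
  attributes map<string,string>    -- entries separated by '|', key and value by ':'
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  ESCAPED BY '\\'
  COLLECTION ITEMS TERMINATED BY '|'
  MAP KEYS TERMINATED BY ':'
LOCATION 's3://amzn-s3-demo-bucket/events/';
```

Under these assumptions, a line such as `e1,red|blue,color:red|size:L` would be read as one string, one two-element array, and one two-entry map.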

When you create a table in Amazon Athena, a SerDe (Serializer/Deserializer) tells Athena how to parse the data stored in S3 so that it can be queried. It is essential to use the correct SerDe for the data because Athena does not detect the file format for you when you write the table definition; if the SerDe does not match the files, queries can fail or return incorrect values.

In this scenario, the data sets are stored in CSV format, so a SerDe that reads CSV should be used. One option is the OpenCSV SerDe, which is specified by including the following line in the table creation statement:

```sql
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
```

This line tells Athena to use the OpenCSV SerDe, an open-source SerDe for parsing CSV files, including files whose values are enclosed in quotes.
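For context, the ROW FORMAT SERDE line might appear in a full statement like the sketch below. The table name, columns, and S3 path are hypothetical; the SERDEPROPERTIES shown are the separator, quote, and escape settings that the OpenCSV SerDe accepts, and the header-skipping table property assumes the files start with a header row.

```sql
-- Sketch only: sales_csv, its columns, and the bucket path are assumed names.
CREATE EXTERNAL TABLE IF NOT EXISTS sales_csv (
  order_id string,
  customer string,
  amount   string   -- kept as string here; cast at query time if needed
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar'     = '"',
  'escapeChar'    = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://amzn-s3-demo-bucket/sales/'
TBLPROPERTIES ('skip.header.line.count' = '1');  -- assumes a header row
```

Declaring every column as string is a conservative choice for this sketch; values can be converted in the query, for example with `CAST(amount AS double)`.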

Defining a throughput for the table (option A) is not relevant to this scenario because throughput is a concept related to Amazon DynamoDB, not Athena.

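Using Indexes (option C) does not apply either. Athena does not build database-style indexes on table data; it reads the files in S3 directly, and the amount of data scanned is typically reduced through partitioning or columnar file formats. In any case, an index has no bearing on how a file format is parsed.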

Using Primary Keys (option D) is also not applicable to this scenario because Athena is a query service and does not enforce primary keys. Primary keys are used in databases to enforce uniqueness and provide a quick way to look up specific records. Athena is a serverless query service that lets you run ad hoc SQL queries directly on data stored in S3, without loading the data into a database server first.

Therefore, the correct answer is option B, using a SerDe.