Find BigQuery Datasets with Employee SSN Column

How to Identify Datasets with Employee SSN Column in BigQuery

Question

Your company uses BigQuery for data warehousing.

Over time, many different business units in your company have created 1000+ datasets across hundreds of projects.

Your CIO wants you to examine all datasets to find tables that contain an employee_ssn column.

You want to minimize effort in performing this task.

What should you do?

Answers

Explanations

The most efficient way to find tables that contain an employee_ssn column is option C: write a script that loops through all the projects in your organization and queries the INFORMATION_SCHEMA.COLUMNS view for the employee_ssn column. Here's why:

Option A: Searching for employee_ssn in the Data Catalog search box is not ideal. Results are limited to the entries Data Catalog has indexed and that the person searching has permission to view, so some datasets in the organization may be missed. A plain keyword search on the column name may also return false positives.

Option B: A shell script that uses the bq command-line tool to loop through all the projects in the organization may work, but it is slow and labor-intensive. The script would have to list every dataset and every table, then inspect each table's schema individually, which is inefficient at the scale of 1000+ datasets.
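To illustrate why the per-table approach is costly, here is a minimal Python sketch of the check that would have to run once for every table. It assumes the table's schema has already been exported as JSON, for example with `bq show --schema --format=prettyjson`. The function name `schema_has_column` and the sample schema are hypothetical, and comparing names case-insensitively is a choice made for this sketch.

```python
import json


def schema_has_column(schema_json: str, column_name: str) -> bool:
    """Return True if a table schema (a JSON array of fields) contains column_name.

    Nested RECORD fields are searched recursively. Names are compared
    case-insensitively (a choice made for this sketch).
    """
    target = column_name.lower()

    def walk(fields):
        for field in fields:
            if field.get("name", "").lower() == target:
                return True
            # RECORD fields carry their children under the "fields" key.
            if walk(field.get("fields", [])):
                return True
        return False

    return walk(json.loads(schema_json))


# Hypothetical schema for one table, in the JSON shape a schema export produces.
sample_schema = """
[
  {"name": "employee_id", "type": "INTEGER", "mode": "REQUIRED"},
  {"name": "employee_ssn", "type": "STRING", "mode": "NULLABLE"}
]
"""

has_ssn = schema_has_column(sample_schema, "employee_ssn")
```

Each call covers a single table, so the surrounding shell script would still need to enumerate every project, dataset, and table before it could run this check.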

Option C: Writing a script that loops through all the projects in the organization and queries the INFORMATION_SCHEMA.COLUMNS view for the employee_ssn column is the most efficient solution. The view exposes column metadata, including column names, data types, and the names of the tables and datasets the columns belong to. The view must be qualified with either a dataset or a region. With a region qualifier, one query returns the columns of every table in every dataset of that project and region, so the script needs one query per project and region instead of one per table. Combining the results gives all tables that contain an employee_ssn column across all projects in the organization.
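The loop described in option C could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: the list of project IDs is assumed to be supplied by the caller, the region defaults to `us`, and the helper names (`build_column_search_sql`, `find_tables_with_column`, `run_with_bigquery`) are hypothetical. The query runner is passed in as a function so the loop can be exercised without credentials.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]
QueryRunner = Callable[[str, str, str], Iterable[Row]]


def build_column_search_sql(project_id: str, region: str = "us") -> str:
    """Build a region-qualified INFORMATION_SCHEMA.COLUMNS query for one project."""
    return (
        "SELECT table_catalog, table_schema, table_name "
        f"FROM `{project_id}`.`region-{region}`.INFORMATION_SCHEMA.COLUMNS "
        "WHERE LOWER(column_name) = LOWER(@column_name)"
    )


def find_tables_with_column(
    project_ids: Iterable[str],
    run_query: QueryRunner,
    column_name: str = "employee_ssn",
    region: str = "us",
) -> List[str]:
    """Run one metadata query per project and collect fully qualified table names."""
    matches: List[str] = []
    for project_id in project_ids:
        sql = build_column_search_sql(project_id, region)
        for row in run_query(project_id, sql, column_name):
            matches.append(
                f"{row['table_catalog']}.{row['table_schema']}.{row['table_name']}"
            )
    return matches


def run_with_bigquery(project_id: str, sql: str, column_name: str) -> List[Row]:
    """Query runner backed by the google-cloud-bigquery client (needs credentials)."""
    from google.cloud import bigquery  # imported lazily; only needed for real runs

    client = bigquery.Client(project=project_id)
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("column_name", "STRING", column_name)
        ]
    )
    rows = client.query(sql, job_config=job_config).result()
    return [
        {
            "table_catalog": r["table_catalog"],
            "table_schema": r["table_schema"],
            "table_name": r["table_name"],
        }
        for r in rows
    ]
```

A real run might look like `find_tables_with_column(["proj-a", "proj-b"], run_with_bigquery)`, where the project IDs are placeholders. Datasets stored in other regions would need the loop repeated with a different `region` value.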

Option D: Cloud Dataflow is a good choice for batch and stream processing, but it is unnecessary here. The task is only a metadata lookup of column names in BigQuery datasets, so building and running a pipeline would add effort without any benefit.

Option F: The same as option C, but without a script to automate the process. Someone would have to query each project's INFORMATION_SCHEMA.COLUMNS view by hand to find the employee_ssn column, which is inefficient across hundreds of projects and prone to human error.