Designing and Implementing a Data Science Solution on Azure: Feeding Azure ML with Cosmos DB Data

Cheapest and Most Effective Way to Feed Azure ML with Cosmos DB Data

Question

Your company is storing hundreds of GBs of data in a distributed Cosmos DB.

This huge amount of data contains tons of valuable information about sales transactions and the company is going to make use of it by running machine learning models against it.

Your task is to design how to feed Azure ML processes with Cosmos DB data.

Which option should you choose to get the data to Azure ML in the cheapest and most effective way?

Answers

Explanations

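A. Transfer data to Azure Blob Storage and register Blob Storage as a datastore.

B. Register Cosmos DB as a datastore.

C. Register Cosmos DB as a dataset.

D. Transfer data to Azure SQL Database and register it as a datastore.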

Answer: A.

Option A is CORRECT because the cheapest and most effective way to ingest data from Cosmos DB is to transfer it to Blob Storage (e.g. by using Data Factory) and register the Blob Storage as a datastore.

Option B is incorrect because Cosmos DB currently is not supported as a datastore.

Option C is incorrect because Cosmos DB currently is not supported as a datastore. (If it were supported, it would have to be registered as a datastore, not as a dataset.)

Option D is incorrect because Cosmos DB is a NoSQL data store, so transferring its data to a structured (relational) format would be resource-intensive. In addition, SQL Database is less cost-effective than Blob Storage.
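The point about Option D can be illustrated with a minimal sketch. It assumes two made-up sales documents shaped like schemaless Cosmos DB items. The function names (`to_json_lines`, `flatten`, `to_table`) are hypothetical and not part of any Azure SDK. Exporting to a file format suited to Blob Storage is a plain serialization, while a relational export first has to discover every column and fill in the gaps.

```python
import json

# Hypothetical sales documents, shaped like schemaless Cosmos DB items.
docs = [
    {"id": "1", "amount": 19.99, "customer": {"id": "c7", "country": "DE"}},
    {"id": "2", "amount": 5.0, "coupon": "SPRING"},
]


def to_json_lines(documents):
    """Blob-style export: one JSON document per line, no schema required."""
    return "\n".join(json.dumps(d, sort_keys=True) for d in documents)


def flatten(document, prefix=""):
    """Flatten nested objects into column-like keys such as 'customer_id'."""
    row = {}
    for key, value in document.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=f"{name}_"))
        else:
            row[name] = value
    return row


def to_table(documents):
    """Relational-style export: needs a full pass to discover every column."""
    rows = [flatten(d) for d in documents]
    columns = sorted({col for row in rows for col in row})
    # Fields missing from a document become NULL (None) in the table.
    return columns, [[row.get(col) for col in columns] for row in rows]


json_lines = to_json_lines(docs)
columns, table = to_table(docs)
```

With hundreds of GBs of heterogeneous documents, the column discovery and NULL padding in `to_table` is the kind of extra work that the explanation calls resource-intensive.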

To feed Azure ML processes with data from a distributed Cosmos DB, there are different options available, including:

A. Transfer data to Azure Blob Storage and register Blob Storage as a datastore: This option involves moving the data from Cosmos DB to Azure Blob Storage and then registering the Blob Storage as a datastore. The advantage of this approach is that Blob Storage is a low-cost and scalable storage option for large amounts of data. However, it requires an additional data transfer step, which adds complexity, and the copy in Blob Storage can become out of date relative to the source data.

B. Register Cosmos DB as a datastore: This option would register Cosmos DB directly as a datastore, which would eliminate the need for data transfer. However, Cosmos DB currently is not supported as a datastore in Azure ML, so this option is not available. Even if it were, Cosmos DB can be expensive for large datasets.

C. Register Cosmos DB as a dataset: This option would register the Cosmos DB data directly as a dataset in Azure ML, without data transfer or additional storage costs. However, Cosmos DB currently is not supported as a datastore, and if it were, it would have to be registered as a datastore rather than as a dataset.

D. Transfer data to Azure SQL Database and register it as a datastore: This option involves transferring the data from Cosmos DB to Azure SQL Database before registering it as a datastore. It requires an additional data transfer step and also a conversion of the schemaless Cosmos DB documents into a structured relational format, which is resource-intensive. In addition, Azure SQL Database is less cost-effective than Blob Storage for storing large amounts of data.

Considering the options above, the most cost-effective and efficient way to feed Azure ML processes with data from Cosmos DB is to transfer the data to Azure Blob Storage and register Blob Storage as a datastore (option A). Blob Storage is low-cost and scalable, and the transfer can be done with a service such as Data Factory. The trade-off is the additional transfer step, so it is important to consider the size of the dataset and how often the copy in Blob Storage needs to be refreshed.
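For option A, the registration step might look like the sketch below. It assumes the Azure ML Python SDK v1 (`azureml-core`) and uses its `Datastore.register_azure_blob_container` method. The datastore, container and account names, the `STORAGE_ACCOUNT_KEY` environment variable and both helper functions are hypothetical, and the name check reflects the assumption that datastore names may contain only letters, digits and underscores. The transfer from Cosmos DB to the container (e.g. with Data Factory) is assumed to have happened already.

```python
import os
import re


def blob_datastore_args(datastore_name, container_name, account_name, account_key):
    """Collect and sanity-check the arguments for registering a blob container."""
    # Assumption: datastore names are limited to letters, digits and underscores.
    if not re.fullmatch(r"[A-Za-z0-9_]+", datastore_name):
        raise ValueError(f"invalid datastore name: {datastore_name!r}")
    return {
        "datastore_name": datastore_name,
        "container_name": container_name,
        "account_name": account_name,
        "account_key": account_key,
    }


def register_blob_datastore(workspace, args):
    """Register the blob container that holds the exported Cosmos DB data."""
    # Imported lazily so the helper above can be used without the SDK installed.
    from azureml.core import Datastore

    return Datastore.register_azure_blob_container(workspace=workspace, **args)


# Hypothetical names; the key is read from the environment, not hard-coded.
args = blob_datastore_args(
    datastore_name="sales_blob",
    container_name="cosmos-export",
    account_name="examplestorage",
    account_key=os.environ.get("STORAGE_ACCOUNT_KEY", ""),
)

# With the SDK installed and a workspace config file available:
#   from azureml.core import Workspace
#   datastore = register_blob_datastore(Workspace.from_config(), args)
```

Once registered, the datastore can be referenced by name from Azure ML experiments, so training code does not need to handle the storage credentials itself.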