ResourceExhaustedError: Out Of Memory - Troubleshooting and Solutions

How to Handle ResourceExhaustedError in GPU-powered Computer Vision Model Training

Question

You need to train a computer vision model that predicts the type of government ID present in a given image using a GPU-powered virtual machine on Compute Engine.

You use the following parameters: -> Optimizer: SGD -> Image shape = 224ֳ-224 -> Batch size = 64 -> Epochs = 10 -> Verbose =2 During training you encounter the following error: ResourceExhaustedError: Out Of Memory (OOM) when allocating tensor.

What should you do?

Answers

Explanations

Click on the arrows to vote for the correct answer

A. B. C. D.

B.

https://github.com/tensorflow/tensorflow/issues/136

The error message "ResourceExhaustedError: Out Of Memory (OOM) when allocating tensor" indicates that the GPU memory allocated to the virtual machine has been exhausted during training. This can happen when the model and/or the training parameters require more memory than the available resources on the GPU.

To solve this issue, one can try the following options:

A. Change the optimizer: While changing the optimizer can be beneficial in improving the model's training, it is unlikely to solve the OOM issue. Different optimizers typically do not have a significant impact on memory usage during training.

B. Reduce the batch size: Reducing the batch size can significantly reduce the memory required during training. This is because a smaller batch size means that fewer images are processed at once, reducing the amount of memory needed to store intermediate computations. Therefore, reducing the batch size is a good option to address the OOM error.

C. Change the learning rate: Changing the learning rate can affect the speed of convergence and the accuracy of the model but it is not likely to solve the OOM issue.

D. Reduce the image shape: Reducing the image shape can also reduce the amount of memory required during training. However, it can have a negative impact on the accuracy of the model as it reduces the amount of information available for the model to learn from.

Therefore, the best option to solve the OOM error in this case is to reduce the batch size. This will reduce the memory requirements during training while still allowing the model to learn from a sufficient amount of data.