TensorFlow is a popular machine learning (ML) framework.
A common use case is to import TensorFlow as a module in Python.
It is then up to you as a user to write your particular ML application as a Python script using the
tensorflow Python module functionality.
We provide precompiled optimised installations of TensorFlow in our tree of software modules.
Just like with most software, search for all available versions with
module spider tensorflow.
If you want to run on CUDA-accelerated GPU hardware, make sure to select a version built with CUDA.
It is also possible to run TensorFlow using containers, of which we provide many versions centrally installed.
TensorFlow is heavily optimised for GPU hardware, so we recommend using the CUDA version and running it on the compute nodes equipped with GPUs. How to do this is described in our guide to running jobs.
After loading a TensorFlow module, your environment is configured to start calling TensorFlow from Python.
Here is a small test that prints the TensorFlow version available in your environment:
    [cid@vera ~]$ python -c "import tensorflow as tf; print(tf.__version__)"
    2.7.1
If you intend to run your calculations on GPU hardware it can be useful to check that TensorFlow detects the GPU hardware using its device_lib.list_local_devices() function.
Here is an example from a node equipped with an Nvidia Quadro GPU.
    [cid@vera ~]$ python -c "from tensorflow.python.client import device_lib; device_lib.list_local_devices()"
    ...
    2020-06-01 16:08:00.418439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
    pciBusID: 0000:3b:00.0 name: Quadro P2000 computeCapability: 6.1
    coreClock: 1.4805GHz coreCount: 8 deviceMemorySize: 4.94GiB deviceMemoryBandwidth: 130.53GiB/s
    ...
    2020-06-01 16:08:00.434406: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
    2020-06-01 16:08:00.434436: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
    2020-06-01 16:08:00.926384: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
    2020-06-01 16:08:00.926435: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108] 0
    2020-06-01 16:08:00.926442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0: N
    2020-06-01 16:08:00.927639: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/device:GPU:0 with 4454 MB memory) -> physical GPU (device: 0, name: Quadro P2000, pci bus id: 0000:3b:00.0, compute capability: 6.1)
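The same check can also be done from inside a script. The following is a minimal sketch that uses the public tf.config.list_physical_devices call rather than the device_lib module shown above; that swap is our choice for illustration, not something this guide prescribes. It returns an empty list on a node without GPUs.

```python
import tensorflow as tf

# Ask TensorFlow which GPUs it can see; the list is empty on a CPU-only node.
gpus = tf.config.list_physical_devices("GPU")

if gpus:
    for gpu in gpus:
        print("Found GPU:", gpu.name)
else:
    print("No GPU detected, TensorFlow will run on the CPU")
```

Placing a check like this at the start of a job script makes it visible in the job log whether the job actually had access to a GPU.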
You can find our general data loading tips in the datasets section.
TensorFlow is a complex piece of software, and the stochastic nature of many machine
learning algorithms together with the graph-based functioning of TensorFlow
makes debugging TensorFlow code more challenging than traditional imperative
code. For example, if you have previous debugging experience you are likely
aware that inserting print statements at a few well-considered points in your
code can be a very efficient way to debug. In TensorFlow, however, this common
technique will not work in the general case due to the construction and
execution of graphs, which may ignore your print, or only execute it once
during graph initialisation. Even the common approach of attaching an
interactive (or step) debugger, such as pdb, has limited
effectiveness in debugging TensorFlow code, for the same reasons.
Of course, all hope is not lost - we just have to adjust our tools.
This guide provides recommendations to ease the process of debugging TensorFlow 2 using Python and the bundled tools included in the TensorFlow installation available at C3SE.
- Eager vs. Graph execution
- Printing with tf.print
- Increase log severity
- Checkpoint your code
- Visualise using TensorBoard
- Check for NaN and Inf
- Experimental API debugging
The examples shown here have not been tested in Jupyter notebooks. The code
should run but some output may not be visible in the Jupyter notebook. For
example, tf.print outputs to stderr rather than to the notebook. Keep this in
mind if you are running Jupyter.
Debugging in TensorFlow, experimental APIs, and references to documentation
Debugging is about analysing code that does not behave in an
expected way. Debugging in TensorFlow could be to find numerical issues, for
example NaN or +-Infinity, finding mathematical modelling errors, as well
as trying to answer broader questions such as "Why is my prediction wrong?" or "Why
do the predictions differ between these two models?". This guide is limited
in scope to the lower-level details of your TensorFlow code, but debugging
covers a wide range of issues, and the tools and API for one type of debugging
might not fit the next.
TensorFlow is still under intense development and breaking public API changes
may occur between major versions. Debugging and profiling features both enter
and transition from the
tf.experimental namespace. Always note what version
of TensorFlow you are running, and check what new features you get, and what
changes, by upgrading to the latest version.
Are you running on the GPU?
Are you reading this guide because you are unsure if you are using the GPUs? You can add the following line to the top of your program to find out what devices get mapped to your operations.
    import tensorflow as tf

    # Log which device every operation is placed on
    tf.debugging.set_log_device_placement(True)

    def sqrt(x):
        return tf.math.sqrt(x)

    roots = sqrt(tf.constant([2.0, 4.0, 8.0]))
If we run this code:
    $ python example.py
    [...]
    2021-03-09 09:28:44.485472: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 13592 MB memory) -> physical GPU (device: 3, name: Tesla T4, pci bus id: 0000:af:00.0, compute capability: 7.5)
    2021-03-09 09:28:44.486117: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
    2021-03-09 09:28:44.606224: I tensorflow/core/common_runtime/eager/execute.cc:611] Executing op Sqrt in device /job:localhost/replica:0/task:0/device:GPU:3
You will only get one log line per eager execution (see further down this article for an explanation of eager vs. graph execution), so repeated tf calls will not flood your log files.
Running the same example in graph mode using the
tf.function decorator we
also get the graph placement appended to the end of the log.
    x: (_Arg): /job:localhost/replica:0/task:0/device:CPU:0
    Sqrt: (Sqrt): /job:localhost/replica:0/task:0/device:GPU:3
    Identity: (Identity): /job:localhost/replica:0/task:0/device:GPU:3
    identity_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:3
As seen above the tensor operations were deployed on GPU 3.
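The graph-mode run mentioned above is not listed in full, so here is a minimal sketch of what it could look like. It assumes the only change from the eager example is the tf.function decorator; the device numbers in your log will depend on the node you run on.

```python
import tensorflow as tf

# Log the device placement of every operation (must be set before the ops run).
tf.debugging.set_log_device_placement(True)

@tf.function  # trace the Python function into a graph on first call
def sqrt(x):
    return tf.math.sqrt(x)

roots = sqrt(tf.constant([2.0, 4.0, 8.0]))
```

On a node without GPUs the same log lines appear, but they report a CPU device instead.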
Beware of the optimiser! Optimisation may alter your graphs!
The TensorFlow optimiser Grappler performs optimisations during graph execution which might result in some of your logged operations not being executed.
Read more about Grappler in the official documentation.
If you have discovered, or been notified, that your jobs are not efficient, we recommend you profile your application. Read our profiling guide for tips and documentation of available tools.
Eager vs. graph execution
TensorFlow works by constructing and executing graphs for most long-running
and non-trivial training cases. This makes the well-known debugging
techniques (such as printing or using a step debugger) less useful, as you
are most likely interested in actual values during execution rather than their
symbolic representation, which is what you will see if you use e.g. Python print:

    Tensor("MatMul:0", shape=(2, 2), dtype=float32)

The above symbolic representation of a 2x2 float32 tensor from a
matmul-operation is what a normal Python print shows while the graph is being
constructed. When the actual values are available, the same tensor prints as:

    tf.Tensor(
    [[1337.    1.]
     [   1. 1337.]], shape=(2, 2), dtype=float32)
To summarise, TensorFlow supports two different types of execution: eager and graph. Eager execution is basically operation-by-operation execution, as you would expect from working with any imperative programming language, while graph execution (which may take place in native code) works by constructing (known as tracing) and executing a graph. Eager execution is in general (but not always) slower but easier to debug than graph execution, hence the saying "code with eager execution, run with graphs". As of TensorFlow 2.0 the default execution is eager, but you may find it useful to switch between the two modes during debugging. It only requires a few changes to the code.
Assume the following very simple function:

    import tensorflow as tf

    def sqrt(x):
        return tf.math.sqrt(x)

    x = tf.constant([1.0, 2.0, 3.0])
    x_sqrt = sqrt(x)
If you run this in TensorFlow 2.0 it will run in eager mode. The same code in graph mode would look like:
    import tensorflow as tf

    def sqrt(x):
        return tf.math.sqrt(x)

    x = tf.constant([1.0, 2.0, 3.0])
    tf_sqrt = tf.function(sqrt)  # Wrap the Python function so it runs as a graph
    x_sqrt = tf_sqrt(x)          # Call the wrapped function, not the original
The only difference is that we wrapped sqrt in a tf.function call. It is
also common to see functions wrapped using the @tf.function decorator:

    import tensorflow as tf

    @tf.function
    def sqrt(x):
        return tf.math.sqrt(x)
A TensorFlow function runs the decorated function, when possible, as a TensorFlow graph. This has the side effect that any print statements included in the function only get executed once, while the graph is constructed.
Read more about tf.function in the official documentation.
You can make all supported calls to tf.function run in eager mode by setting
tf.config.run_functions_eagerly(True). In our simple example above there is
little benefit, but for a function that iterates it can be useful to print (or
checkpoint) the intermediate values.
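To make the difference concrete, here is a small sketch that counts how often the Python body of a tf.function actually runs. The function name accumulate and the calls list are made up for this illustration. In graph mode the body runs only while tracing; after tf.config.run_functions_eagerly(True) it runs on every call.

```python
import tensorflow as tf

calls = []  # grows by one every time the Python body really executes

@tf.function
def accumulate(x):
    calls.append("python body ran")  # Python side effect, only seen while tracing
    return x + 1.0

# Graph mode: two calls with the same input signature share one trace.
accumulate(tf.constant(1.0))
accumulate(tf.constant(2.0))
traced_runs = len(calls)

# Eager mode: the Python body now runs on every single call.
tf.config.run_functions_eagerly(True)
accumulate(tf.constant(3.0))
accumulate(tf.constant(4.0))
eager_runs = len(calls) - traced_runs

# Switch back so the rest of the program runs as graphs again.
tf.config.run_functions_eagerly(False)

print("runs in graph mode:", traced_runs, "runs in eager mode:", eager_runs)
```

Remember to switch eager execution off again once you are done debugging, since it can be considerably slower.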
Printing with tf.print
You should substitute Python print with tf.print to print the actual
Tensor values during graph execution. Internally tf.print works by appending
a pair of nodes (StringFormat and PrintV2) to the graph. As in traditional debugging,
adding prints may work if the code is small or you already have gathered
enough information to know roughly where the problem is. Adding tf.print to
every tensor in your code is not effective.
    import math
    import tensorflow as tf

    @tf.function
    def my_func(x):
        for i in range(x.shape[0]):
            intermediate = tf.math.reduce_sum(x) + x[i] + math.pi
            print(intermediate)     # Symbolic representation only (printed during tracing!)
            tf.print(intermediate)  # Prints the actual tensor values

    my_func(tf.constant([1.0, 2.0, 3.0]))
Read more about tf.print in the official documentation.
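Since tf.print writes to stderr by default, it can be handy to redirect it. The sketch below is an illustration with a made-up function running_sum; it sends the output to stdout through the output_stream argument and prints an intermediate value on every loop iteration inside a graph.

```python
import sys
import tensorflow as tf

@tf.function
def running_sum(x):
    total = tf.constant(0.0)
    for value in x:  # AutoGraph converts this loop over a tensor into a graph loop
        total += value
        # Executed on every iteration, with the real value, and sent to stdout.
        tf.print("running total:", total, output_stream=sys.stdout)
    return total

result = running_sum(tf.constant([1.0, 2.0, 3.0]))
```

A plain Python print in the same place would show a symbolic tensor, and only while the function is being traced.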
Check for NaN and Inf
TensorFlow (since version 2.1) provides tf.debugging.enable_check_numerics, which when
activated stops execution as soon as a tensor contains NaN or Inf.
The call introduces numerical checks in your graph, which adds overhead and thereby
extends your runtime and memory requirements. It should only be enabled during debugging.
This contrived example illustrates how NaN and Inf values are caught and reported.
    import tensorflow as tf

    # Numeric checking is disabled (this is also the default)
    tf.debugging.disable_check_numerics()

    @tf.function
    def weighted_normalized(vec):
        eps = tf.constant(0.005)
        res = vec / (tf.math.reduce_sum(vec) + eps)
        return res

    # This is fine.
    vec = tf.constant([0.0, 1.0, 2.0, 3.0])
    wn_vec = weighted_normalized(vec)    # [0 0.166527897 0.333055794 0.499583662]

    # This is not fine... but with numeric checking disabled it will not be caught.
    vec2 = tf.constant([-0.005, 0.0])
    wn_vec2 = weighted_normalized(vec2)  # [-inf nan] <-- NOT OK
If we run the above we get no indication that something is wrong even though
the second tensor contains [-inf nan]. If we instead call
tf.debugging.enable_check_numerics() at the top of the script and re-run,
we get the following traceback.
    Traceback (most recent call last):
      File "test_tf.check_numerics", line 22, in <module>
        wn_vec2 = weighted_normalized(vec2) # [-inf nan] <-- NOT OK
      File "/apps/Alvis/software/MPI/GCC-CUDA/9.3.0-11.0.2/OpenMPI/4.0.3/TensorFlow/2.3.1-Python-3.8.2/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
        result = self._call(*args, **kwds)
      File "/apps/Alvis/software/MPI/GCC-CUDA/9.3.0-11.0.2/OpenMPI/4.0.3/TensorFlow/2.3.1-Python-3.8.2/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 814, in _call
        results = self._stateful_fn(*args, **kwds)
      File "/apps/Alvis/software/MPI/GCC-CUDA/9.3.0-11.0.2/OpenMPI/4.0.3/TensorFlow/2.3.1-Python-3.8.2/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2829, in __call__
        return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
      File "/apps/Alvis/software/MPI/GCC-CUDA/9.3.0-11.0.2/OpenMPI/4.0.3/TensorFlow/2.3.1-Python-3.8.2/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1843, in _filtered_call
        return self._call_flat(
      File "/apps/Alvis/software/MPI/GCC-CUDA/9.3.0-11.0.2/OpenMPI/4.0.3/TensorFlow/2.3.1-Python-3.8.2/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1923, in _call_flat
        return self._build_call_outputs(self._inference_function.call(
      File "/apps/Alvis/software/MPI/GCC-CUDA/9.3.0-11.0.2/OpenMPI/4.0.3/TensorFlow/2.3.1-Python-3.8.2/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 545, in call
        outputs = execute.execute(
      File "/apps/Alvis/software/MPI/GCC-CUDA/9.3.0-11.0.2/OpenMPI/4.0.3/TensorFlow/2.3.1-Python-3.8.2/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 138, in execute_with_callbacks
        tensors = quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
      File "/apps/Alvis/software/MPI/GCC-CUDA/9.3.0-11.0.2/OpenMPI/4.0.3/TensorFlow/2.3.1-Python-3.8.2/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
        tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
    tensorflow.python.framework.errors_impl.InvalidArgumentError:
    !!! Detected Infinity or NaN in output 0 of graph op "RealDiv" (# of outputs: 1) !!!
      dtype: <dtype: 'float32'>
      shape: (2,)
      Input tensors (2):
        0: Tensor("vec:0", shape=(2,), dtype=float32)
        1: Tensor("add:0", shape=(), dtype=float32)
      Graph name: "weighted_normalized"
      Stack trace of op's creation ("->": inferred user code):
        + ... (Omitted 18 frames)
        + ...8/site-packages/tensorflow/python/framework/ops.py (L1949) __init__
        |   self._traceback = tf_stack.extract_stack()
     : Tensor had -Inf and NaN values
         [[node truediv/CheckNumericsV2 (defined at test_tf.check_numerics:10) ]] [Op:__inference_weighted_normalized_43]
    Function call stack:
    weighted_normalized
This clearly shows that we have numerical issues in the graph.
Read more about tf.debugging.enable_check_numerics in the official documentation.
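If you already suspect a particular tensor, a cheaper alternative is to check only that tensor. The sketch below uses tf.debugging.check_numerics, a different call from the global switch discussed above, on a made-up function safe_normalize. It raises an InvalidArgumentError when the checked tensor contains NaN or Inf.

```python
import tensorflow as tf

def safe_normalize(vec):
    res = vec / tf.math.reduce_sum(vec)
    # Passes the tensor through unchanged, or raises if it holds NaN or Inf.
    return tf.debugging.check_numerics(res, message="normalisation failed")

# Sum is 4.0, so the division is well defined.
ok = safe_normalize(tf.constant([1.0, 3.0]))

# Sum is 0.0, so the division produces infinities and the check fires.
try:
    safe_normalize(tf.constant([1.0, -1.0]))
    caught = False
except tf.errors.InvalidArgumentError as err:
    caught = True
    print("Caught numerical problem:", err.message)
```

Because only one tensor is checked, the overhead is limited to that single operation.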
Increase log severity
Logfiles can be a rich source of information, and every now and then they contain just enough to point us in the right direction. TensorFlow, together with Keras, has many types of logs, for instance autograph logs, model loss logs, summary usage logs, training metric logs etc. In general most logging facilities support different levels of severity (sometimes just called levels). TensorFlow supports the following five standard severity levels, listed here from least to most severe:
- DEBUG
- INFO
- WARN
- ERROR
- FATAL
The most verbose setting is the least severe, i.e. the DEBUG level produces the
most output, most of which will likely not be relevant.
The following example gets the TensorFlow logger and sets it to the DEBUG level:

    import logging
    import tensorflow as tf

    tf_logger = tf.get_logger()        # Takes no arguments; returns the logger named "tensorflow"
    tf_logger.setLevel(logging.DEBUG)  # The level constants live in the logging module
Warning: Beware of large logs and save them in a NOBACKUP area or in tmpdir.
Read more about tf.get_logger in the official documentation.
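The logger returned by tf.get_logger() is a standard Python logging.Logger named "tensorflow", so the usual logging tools apply to it. The following sketch needs only the standard library. It sorts the five level names by numeric severity and sends the log to a file in a scratch directory. The temporary directory is a stand-in for a NOBACKUP or tmpdir location, and the file name is made up for this example.

```python
import logging
import os
import tempfile

# The five level names, sorted by their numeric severity (least severe first).
level_names = ["DEBUG", "ERROR", "FATAL", "INFO", "WARN"]
by_severity = sorted(level_names, key=lambda name: getattr(logging, name))
print(by_severity)

# Same logger object that tf.get_logger() returns, fetched by its name.
tf_logger = logging.getLogger("tensorflow")

# Hypothetical log location: a scratch directory rather than your home directory.
log_path = os.path.join(tempfile.mkdtemp(), "tensorflow.log")
file_handler = logging.FileHandler(log_path)
tf_logger.addHandler(file_handler)

# Only WARN and more severe messages pass; DEBUG and INFO are dropped.
tf_logger.setLevel(logging.WARNING)
tf_logger.info("this line is filtered out")
tf_logger.warning("this line reaches the file")
file_handler.close()
```

Raising the level again once you have found the problem keeps the log files small.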
Checkpoint your code
Long-running cluster jobs that crash, for whatever reason, are not only very
annoying but also waste precious compute hours from your allocation. In the
case of TensorFlow, crashes can be challenging to debug, especially if they are
due to numerical instabilities (although there are tricks to discover NaN and
+-Inf, as shown in this guide) that appear far along in training. Checkpointing
allows you to regularly and automatically save the state of your model to disk.
If a crash occurs, you can debug or resume your job at the most recent (and
correct) checkpoint. It is thus recommended, for several reasons, that you
investigate how checkpointing fits your code, and TensorFlow offers a few options.
You should start by looking into tf.train.Checkpoint or the Keras callback
tf.keras.callbacks.ModelCheckpoint. If you checkpoint every hour, you can
begin your debugging at most one hour before the bug (or crash) occurred.
Here is an example of how checkpointing works in TensorFlow on the Alvis system.
Read more about checkpointing in the official documentation.
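As a rough illustration of the idea, the sketch below saves two variables with tf.train.Checkpoint and tf.train.CheckpointManager, simulates a crash by overwriting them, and restores the latest checkpoint. The variable names, the toy update and the temporary directory are assumptions made for this example. A real job would checkpoint its model and optimizer to project storage.

```python
import tempfile
import tensorflow as tf

# Stand-in for a checkpoint directory in your project storage.
ckpt_dir = tempfile.mkdtemp()

step = tf.Variable(0, dtype=tf.int64)
weights = tf.Variable([1.0, 2.0])

ckpt = tf.train.Checkpoint(step=step, weights=weights)
manager = tf.train.CheckpointManager(ckpt, directory=ckpt_dir, max_to_keep=3)

# Toy training loop: update the state and save a checkpoint after every step.
for _ in range(5):
    weights.assign_add([0.5, 0.5])
    step.assign_add(1)
    manager.save(checkpoint_number=int(step.numpy()))

# Simulate a crash that loses the in-memory state.
weights.assign([0.0, 0.0])
step.assign(0)

# Resume from the most recent checkpoint on disk.
ckpt.restore(manager.latest_checkpoint)
print("resumed at step", int(step.numpy()), "with weights", weights.numpy())
```

The max_to_keep argument limits how many checkpoints stay on disk, which helps to keep storage use under control during long runs.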
Visualise using TensorBoard
TensorBoard is a visualisation toolkit that comes bundled with TensorFlow. TensorBoard can be used for many different types of debugging, and not exclusively for TensorFlow. For instance, you can visualise the resulting graph, find traces and profiling information.
TensorBoard is interactive and you can inspect your graph in a browser. For more information about TensorBoard read our TensorBoard guide.
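As a small sketch of how a graph can end up in TensorBoard, the code below traces one call of a tf.function and exports the trace with the tf.summary API. The log directory is a temporary directory and the trace name sqrt_trace is made up for this example. Point tensorboard --logdir at the printed directory to inspect the result.

```python
import os
import tempfile
import tensorflow as tf

# Stand-in for your TensorBoard log directory.
log_dir = tempfile.mkdtemp()
writer = tf.summary.create_file_writer(log_dir)

@tf.function
def sqrt(x):
    return tf.math.sqrt(x)

# Record the graph of the next tf.function call.
tf.summary.trace_on(graph=True)
sqrt(tf.constant([1.0, 4.0]))

# Write the recorded graph to the log directory.
with writer.as_default():
    tf.summary.trace_export(name="sqrt_trace", step=0)
writer.flush()

print("TensorBoard data written to", log_dir)
```

The exported graph should then be available under the Graphs tab in TensorBoard.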
Experimental API debugging
We finish with a recommendation to keep an eye on the experimental API for new features related to debugging. Note that API marked as experimental may change at short notice, and may be dropped in later versions.
tf.debugging.experimental.enable_dump_debug_info as of TensorFlow
2.3 is marked as experimental. How you call it may change between versions.
Please consult the
documentation for tf.debugging
for the current way of using it for the version of TensorFlow you are running.
You can dump a large amount of debug information to a directory using the
tf.debugging.experimental.enable_dump_debug_info() call at the start of your program:

    import tensorflow as tf

    tf.debugging.experimental.enable_dump_debug_info(
        dump_root="logs/dbg",
        tensor_debug_mode="FULL_HEALTH",
        circular_buffer_size=-1)

    # Rest of program
tensor_debug_mode in the above example is set to maximum
verbosity, producing information about each floating-dtype tensor including
the presence of +-Inf and NaN. You can increase the amount of debug information
by increasing the circular_buffer_size. The value -1 disables the circular
buffer, so all data is kept. The debug data also includes events and
annotated source code tracing data. It is meant to be visualised using TensorBoard.
You view the data by launching TensorBoard on the same machine where you dumped
the data. TensorBoard is bundled with TensorFlow. If you have loaded TensorFlow
from the module system you can launch TensorBoard using the tensorboard command:

    $ tensorboard --logdir logs/dbg
Read more about the experimental API in the official documentation.
Machine learning datasets are commonly made up of a multitude of files. In HPC environments this can be less than ideal: reading (and writing) many small files is generally a lot slower than reading (and writing) a single large file.
To get around this you should read directly from archives or, if you are doing some preprocessing and are planning on using your data multiple times, save the preprocessed data into a few TFRecord files instead.
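A minimal sketch of the TFRecord route could look as follows. The feature name value, the three toy records and the temporary file location are assumptions for illustration. In practice each record would hold one preprocessed sample and the file would live in your project storage.

```python
import os
import tempfile
import tensorflow as tf

# Stand-in for a location in your project storage.
record_path = os.path.join(tempfile.mkdtemp(), "preprocessed.tfrecord")

# Write three toy samples into a single TFRecord file.
with tf.io.TFRecordWriter(record_path) as writer:
    for i in range(3):
        example = tf.train.Example(features=tf.train.Features(feature={
            "value": tf.train.Feature(
                float_list=tf.train.FloatList(value=[float(i)])),
        }))
        writer.write(example.SerializeToString())

# Read the samples back as a tf.data pipeline.
feature_spec = {"value": tf.io.FixedLenFeature([1], tf.float32)}
dataset = tf.data.TFRecordDataset(record_path).map(
    lambda raw: tf.io.parse_single_example(raw, feature_spec))

values = [float(item["value"][0]) for item in dataset]
print(values)
```

All samples now live in one file, so reading them back is a single sequential read instead of many small file accesses.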
To read directly from archives you can use the TensorFlow Datasets module
(search for available versions with module spider TensorFlow-Datasets). For example, to read images:
    import tensorflow_datasets as tfds

    # Initialize to get access to method (we do not plan on downloading anything)
    dl_manager = tfds.download.DownloadManager(download_dir='/tmp')

    # Iterate over archive
    path_to_archive = '/mimer/NOBACKUP/Datasets/Lyft-level5/Perception/train.tar'
    image_feature = tfds.core.features.Image()
    for filename, f in dl_manager.iter_archive(path_to_archive):
        if filename.endswith('.jpeg'):
            img = image_feature.decode_example(f.read())
            print(img.shape)