- Fixes bf16 convolution failures on NVIDIA Ampere GPUs.
- Testing is covered by a BF16 convolution test in xla/tests/convolution_test
PiperOrigin-RevId: 413198102
Change-Id: I2d2c6dc4e561718f700bf56a06e4810acc8217e4
Mainly: the train_step was broken: the x/y arguments to the loss were reversed (cross-entropy is not symmetric). Use `keras.Model.train_step` instead; see the sketch after the list below.
Also:
* Set the right dtype on the data to minimize casting.
* Use a tf.data.Dataset for training to minimize index twiddling.
* Upgrade plots.
* Don't reset the model: train -> convert to lite -> continue training.
* Compare the model output before and after converting to lite.
* Compare the interpreter output before and after loading the on-device-trained checkpoint.
* Add colors to the labels on the final plot.
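A minimal sketch of the corrected loss ordering and the tf.data input pipeline, using a toy model and random data (everything below is illustrative, not the example's actual code); delegating to `keras.Model.train_step` via `Model.fit` sidesteps the hand-written step entirely:

```python
import tensorflow as tf

# Toy model and data; shapes, names and hyperparameters are illustrative.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        # Keras losses take (y_true, y_pred); cross-entropy is not symmetric,
        # so swapping the arguments trains against the wrong target.
        loss = loss_fn(y, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# Feeding batches through tf.data with the dtype the model expects avoids
# per-step casts and manual index bookkeeping.
x = tf.random.normal([256, 28, 28])
y = tf.random.uniform([256], maxval=10, dtype=tf.int32)
ds = tf.data.Dataset.from_tensor_slices((x, y)).shuffle(256).batch(32)
for xb, yb in ds:
    train_step(xb, yb)
```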
PiperOrigin-RevId: 413198035
Change-Id: Ie44bee253f20a88f25d98e69ffb7cfe1cdea0f8d
Without this change, with tf.where enabled in autoclustering,
`//third_party/tensorflow/python/kernel_tests/nn_ops:embedding_ops_test_xla_gpu`
fails with a miscompare.
I can't generate a test due to the input complexity (the bug is almost certainly in
value inference), but this will be properly tested once tf.where is compiled in
autoclustering mode.
PiperOrigin-RevId: 413192927
Change-Id: I43d0f6f93e706810ad219d657d01b95c14a7426b
- now we have two Embedding side effects, read and write
- dependencies between EnqueueTPUEmbedding ops with the same device ordinal are now
  properly modeled
- we finally have no Embedding-specific code left in side effect analysis
- introduced a new `TF_MustExecute` trait that prevents an op from being pruned; this
  is useful for side-effecting ops that don't produce any output and have no
  dependencies to/from other ops
- ops that used `TF_TPUEmbeddingSideEffect` only to avoid pruning now use the new
  `TF_MustExecute` trait instead
- in contrast to the old `TF_TPUEmbeddingSideEffect`, `TF_MustExecute` prevents
  pruning independently of reachability (see the new graph pruning test)
PiperOrigin-RevId: 413175982
Change-Id: I7b65c7a0e8a17b8a1683a0e01d1fd0614f7ac95a
Also make mhlo the source of truth for map_mhlo_to_scalar_op.h and split it off from map_lhlo_to_scalar_op.h.
Users of the lhlo to linalg transformation should go from mhlo to linalg and then bufferize at the linalg level.
PiperOrigin-RevId: 413146497
Change-Id: Ieb3bbe568bd298bf0475167dad5e5e975c8f08bf
This patch uses an existing mechanism to filter out the early loop
full-unroll pass in the LLVM pipeline. That pass applies excessive
unrolling to statically bounded low-trip-count loops, which are
very common in XLA. This creates a strong dependency on the SLP
vectorizer to produce all the vector code, since the loops are fully
unrolled before they reach the Loop Vectorizer. Disabling the early
unroll gives the Loop Vectorizer an opportunity to vectorize the
code. A later loop unroll pass, which runs before SLP, will still
unroll the loops the Loop Vectorizer misses.
PiperOrigin-RevId: 413111548
Change-Id: I1ee747ff76257e698f6f1453a385177f85195dc9
This is possible after 413001036, which broke the dependency on XLA GPU
(the component that initializes CUDA and the GPU).
PiperOrigin-RevId: 413058845
Change-Id: Ic53f951c05da8f444c072d67b2b9877948b92eb0
This ensures that all function captures (which include slot variables) are loaded before the functions that use them. Previously, function captures were loaded in alphabetical order of the function names, which can cause problems if functions depend on the output of other functions (resource handles are generated by `create_resource`).
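A hedged sketch of the kind of dependency this ordering matters for: tf.functions that capture a resource handle, illustrated here with a lookup table rather than slot variables (the module, path, and function names are made up for illustration):

```python
import tensorflow as tf

class TableModule(tf.Module):
    """Two tf.functions that both capture the same resource-backed table."""

    def __init__(self):
        super().__init__()
        # The table is a resource; its handle is a capture of both functions.
        self.table = tf.lookup.experimental.MutableHashTable(
            key_dtype=tf.string, value_dtype=tf.int64, default_value=-1)

    @tf.function(input_signature=[tf.TensorSpec([], tf.string),
                                  tf.TensorSpec([], tf.int64)])
    def insert(self, key, value):
        self.table.insert(tf.expand_dims(key, 0), tf.expand_dims(value, 0))
        return self.table.size()

    @tf.function(input_signature=[tf.TensorSpec([], tf.string)])
    def lookup(self, key):
        return self.table.lookup(key)

m = TableModule()
m.insert(tf.constant("k"), tf.constant(7, tf.int64))
tf.saved_model.save(m, "/tmp/resource_module")

# Loading must restore the captured resource before the functions that use
# it, independent of how the function names sort alphabetically.
loaded = tf.saved_model.load("/tmp/resource_module")
print(loaded.lookup(tf.constant("k")).numpy())  # 7
```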
PiperOrigin-RevId: 413006430
Change-Id: I82faecd089eeae7e28920d8cc0ae993e58833b49