PyTorch 0.3.0 releases, ending stochastic functions
The new version comes with several performance improvements, support for ONNX, CUDA 9, and cuDNN 7, and important bug fixes.
PyTorch 0.3.0 has removed stochastic functions, i.e. Variable.reinforce(), citing “limited functionality and broad performance implications.”
The Python package has added a number of performance improvements, new layers, support for ONNX, CUDA 9, cuDNN 7, and “lots of bug fixes” in the new version.
“The motivation for stochastic functions was to avoid bookkeeping of sampled values. In practice, users were still bookkeeping in their code for various reasons. We constructed an alternative, equally effective API, but did not have a reasonable deprecation path to the new API. Hence this removal is a breaking change,” the PyTorch team said.
To replace stochastic functions, they have introduced the torch.distributions package.
So if your previous code looked like this:
probs = policy_network(state)
action = probs.multinomial()
next_state, reward = env.step(action)
action.reinforce(reward)
action.backward()
This could be the new equivalent code:
probs = policy_network(state)
# NOTE: categorical is equivalent to what used to be called multinomial
m = torch.distributions.Categorical(probs)
action = m.sample()
next_state, reward = env.step(action)
loss = m.log_prob(action) * reward
loss.backward()
What is new in PyTorch 0.3.0?
Unreduced losses
Now, some loss functions can compute per-sample losses in a mini-batch
 By default, PyTorch sums losses over the mini-batch and returns a single scalar loss. This was limiting for users.
 Now, a subset of loss functions allow specifying reduce=False to return individual losses for each sample in the mini-batch. Example:
loss = nn.CrossEntropyLoss(..., reduce=False)
 Currently supported losses: MSELoss, NLLLoss, NLLLoss2d, KLDivLoss, CrossEntropyLoss, SmoothL1Loss, L1Loss
 More loss functions will be covered in the next release
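As a fuller illustration, here is a minimal sketch of per-sample losses with reduce=False; the module, shapes and values below are illustrative, not from the release notes:
import torch
import torch.nn as nn
from torch.autograd import Variable

# Keep one loss value per sample instead of a single reduced scalar.
criterion = nn.CrossEntropyLoss(reduce=False)

logits = Variable(torch.randn(4, 10), requires_grad=True)  # mini-batch of 4, 10 classes
targets = Variable(torch.LongTensor([1, 0, 3, 9]))

per_sample = criterion(logits, targets)  # Variable of shape (4,), one loss per sample
loss = per_sample.mean()                 # reduce manually, e.g. after re-weighting samples
loss.backward()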
An inbuilt Profiler in the autograd engine
PyTorch has built a low-level profiler to help you identify bottlenecks in your models.
Let us start with an example:
>>> x = Variable(torch.randn(1, 1), requires_grad=True)
>>> with torch.autograd.profiler.profile() as prof:
...     y = x ** 2
...     y.backward()
>>> # NOTE: some columns were removed for brevity
... print(prof)
  
---------------------------------  ---------------  ---------------
Name                                      CPU time        CUDA time
---------------------------------  ---------------  ---------------
PowConstant                              142.036us          0.000us
N5torch8autograd9GraphRootE               63.524us          0.000us
PowConstantBackward                      184.228us          0.000us
MulConstant                               50.288us          0.000us
PowConstant                               28.439us          0.000us
Mul                                       20.154us          0.000us
N5torch8autograd14AccumulateGradE         13.790us          0.000us
N5torch8autograd5CloneE                    4.088us          0.000us
The profiler works for both CPU and CUDA models. For CUDA models, you have to run your Python program with a special nvprof prefix. For example:
nvprof --profile-from-start off -o trace_name.prof -- python <your arguments>
# in python
>>> with torch.cuda.profiler.profile():
...     model(x)  # Warmup CUDA memory allocator and profiler
...     with torch.autograd.profiler.emit_nvtx():
...         model(x)
Then, you can load trace_name.prof in PyTorch and print a summary profile report.
>>> prof = torch.autograd.profiler.load_nvprof('trace_name.prof')
>>> print(prof)
For additional documentation, you can visit this link.
Higher order gradients
v0.3.0 has added higher-order gradient support for the following layers:
 ConvTranspose, AvgPool1d, AvgPool2d, LPPool2d, AvgPool3d, MaxPool1d, MaxPool2d, AdaptiveMaxPool, AdaptiveAvgPool, FractionalMaxPool2d, MaxUnpool1d, MaxUnpool2d, nn.Upsample, ReplicationPad2d, ReplicationPad3d, ReflectionPad2d
 PReLU, HardTanh, L1Loss, SoftSign, ELU, RReLU, Hardshrink, Softplus, SoftShrink, LogSigmoid, Softmin, GLU
 MSELoss, SmoothL1Loss, KLDivLoss, HingeEmbeddingLoss, SoftMarginLoss, MarginRankingLoss, CrossEntropyLoss
 DataParallel
Optimizers
 optim.SparseAdam: Implements a lazy version of the Adam algorithm suitable for sparse tensors.
(In this variant, only moments that show up in the gradient get updated, and only those portions of the gradient get applied to the parameters.)
 Optimizers now have an add_param_group function that lets you add new parameter groups to an already constructed optimizer.
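As a rough sketch of how add_param_group can be used (the modules and learning rates here are made up for illustration):
import torch
import torch.nn as nn

base = nn.Linear(10, 10)
head = nn.Linear(10, 2)

# Start the optimizer with only the base network's parameters.
optimizer = torch.optim.SGD(base.parameters(), lr=0.1)

# Later, attach the head's parameters as a new group with its own learning rate.
optimizer.add_param_group({'params': head.parameters(), 'lr': 0.01})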
New layers and nn functionality
 Added AdaptiveMaxPool3d and AdaptiveAvgPool3d
 Added LPPool1d
 F.pad now has support for:
 ‘reflection’ and ‘replication’ padding on 1d, 2d, 3d signals (so 3D, 4D and 5D Tensors)
 constant padding on nd signals
 nn.Upsample now works for 1D signals (i.e. B x C x L Tensors) in nearest and linear modes.
 Allow user to not specify certain input dimensions for AdaptivePool*d and infer them at runtime.
For example:
# target output size of 10x7
m = nn.AdaptiveMaxPool2d((None, 7))
 DataParallel container on CPU is now a no-op (instead of erroring out)
New Tensor functions and features
 Introduced torch.erf and torch.erfinv that compute the error function and the inverse error function of each element in the Tensor.
 Adds broadcasting support to bitwise operators
 Added Tensor.put_ and torch.take, similar to numpy.take and numpy.put. The take function allows you to linearly index into a tensor without viewing it as a 1D tensor first; the output has the same shape as the indices. The put function copies values into a tensor, also using linear indices.
 Adds zeros and zeros_like for sparse Tensors.
 1-element Tensors can now be cast to Python scalars. For example: int(torch.Tensor([5])) works now.
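A small sketch of take and put_ with linear indices (the values are chosen only for illustration):
import torch

src = torch.arange(0, 12).view(3, 4)       # 3x4 Tensor holding 0..11
idx = torch.LongTensor([0, 5, 11])

taken = torch.take(src, idx)               # values at linear positions 0, 5, 11; same shape as idx

dst = torch.zeros(3, 4)
dst.put_(idx, torch.Tensor([-1, -2, -3]))  # writes -1, -2, -3 at the same linear positions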
Other additions
 Added torch.cuda.get_device_name and torch.cuda.get_device_capability that do what the names say. Example:
>>> torch.cuda.get_device_name(0)
'Quadro GP100'
>>> torch.cuda.get_device_capability(0)
(6, 0)
 If one sets torch.backends.cudnn.deterministic = True, then the cuDNN convolutions use deterministic algorithms
 torch.cuda.get_rng_state_all and torch.cuda.set_rng_state_all are introduced to let you save / load the state of the random number generator over all GPUs at once
 torch.cuda.empty_cache() frees the cached memory blocks in PyTorch’s caching allocator. This is useful when having long-running IPython notebooks while sharing the GPU with other processes.
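A short sketch combining these additions; it assumes at least one CUDA device is available, and the ordering is purely illustrative:
import torch

torch.backends.cudnn.deterministic = True    # force deterministic cuDNN convolution algorithms

rng_states = torch.cuda.get_rng_state_all()  # save the RNG state of every GPU at once
# ... run code that draws random numbers on the GPUs ...
torch.cuda.set_rng_state_all(rng_states)     # restore all of them

torch.cuda.empty_cache()                     # release cached, unused memory blocks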
API changes
 softmax and log_softmax now take a dim argument that specifies the dimension in which slices are taken for the softmax operation. dim allows negative dimensions as well (dim = -1 will be the last dimension)
 torch.potrf (Cholesky decomposition) is now differentiable and defined on Variable
 Remove all instances of device_id and replace it with device, to make things consistent
 torch.autograd.grad now allows you to specify inputs that are unused in the autograd graph if you use allow_unused=True. This gets useful when using torch.autograd.grad in large graphs with lists of inputs / outputs. For example:
x, y = Variable(...), Variable(...)
torch.autograd.grad(x * 2, [x, y]) # errors
torch.autograd.grad(x * 2, [x, y], allow_unused=True) # works
 pad_packed_sequence now allows a padding_value argument that can be used instead of zero-padding
 Dataset now has a + operator (which uses ConcatDataset). You can do something like MNIST(...) + FashionMNIST(...) for example, and you will get a concatenated dataset containing samples from both.
 torch.distributed.recv allows Tensors to be received from any sender (hence, src is optional). recv returns the rank of the sender.
 adds zero_() to Variable
 Variable.shape returns the size of the Tensor (now made consistent with Tensor)
 torch.version.cuda specifies the CUDA version that PyTorch was compiled with
 Added a missing function random_ for CUDA.
 torch.load and torch.save can now take a pathlib.Path object, which is a standard Python3 typed filepath object
 If you want to load a model’s state_dict into another model (for example to fine-tune a pretrained network), load_state_dict was strict on matching the key names of the parameters. Now PyTorch provides a strict=False option to load_state_dict where it only loads in parameters where the keys match, and ignores the other parameter keys (see the sketch after this list).
 added nn.functional.embedding_bag that is equivalent to nn.EmbeddingBag
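Here is a minimal sketch of strict=False in action; the two Sequential models are purely illustrative and share only their first layers:
import torch
import torch.nn as nn

pretrained = nn.Sequential(nn.Linear(10, 20), nn.ReLU())
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 2))

# Keys '0.weight' and '0.bias' match and are loaded; the classifier's
# '2.weight' and '2.bias' are simply left untouched instead of raising an error.
model.load_state_dict(pretrained.state_dict(), strict=False)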
Performance Improvements
 The overhead of torch functions on Variables was around 10 microseconds. This has been brought down to ~1.5 microseconds by moving most of the core autograd formulas into C++ using the ATen library.
 softmax and log_softmax are now 4x to 256x faster on the GPU after rewriting the GPU kernels
 2.5x to 3x performance improvement of the distributed AllReduce (gloo backend) by enabling GPUDirect
 nn.Embedding’s renorm option is much faster on the GPU. For embedding dimensions of 100k x 128 and a batch size of 1024, it is 33x faster.
 All pointwise ops now use OpenMP and get multi-core CPU benefits
 Added a single-argument version of torch.arange. For example torch.arange(10)
Framework Interoperability
DLPack Interoperability
DLPack Tensors are cross-framework Tensor formats. We now have torch.utils.to_dlpack(x) and torch.utils.from_dlpack(x) to convert between DLPack and torch Tensor formats. The conversion has zero memory copy and hence is very efficient.
Model exporter to ONNX
ONNX is a common model interchange format that can be executed in Caffe2, CoreML, CNTK, MXNet, and TensorFlow at the moment. PyTorch models that are ConvNet-like and RNN-like (static graphs) can now be shipped to the ONNX format.
There is a new module torch.onnx (http://pytorch.org/docs/0.3.0/onnx.html) which provides the API for exporting ONNX models.
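As a rough sketch of the export API; the model and input shape are illustrative, not from the announcement:
import torch
import torch.nn as nn
import torch.onnx
from torch.autograd import Variable

model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU())
dummy_input = Variable(torch.randn(1, 3, 32, 32))  # export works by tracing the model on an example input

torch.onnx.export(model, dummy_input, "model.onnx")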
The operations supported in this release are:
 add, sub (non-zero alpha not supported), mul, div, cat, mm, addmm, neg, tanh, sigmoid, mean, t, transpose, view, split, squeeze
 expand (only when used before a broadcasting ONNX operator; e.g., add)
 prelu (single weight shared among input channels not supported)
 threshold (non-zero threshold/non-zero value not supported)
 Conv, ConvTranspose, BatchNorm, MaxPool, RNN, Dropout, ConstantPadNd, Negate
 elu, leaky_relu, glu, softmax, log_softmax, avg_pool2d
 unfold (experimental support with ATen-Caffe2 integration)
 Embedding (no optional arguments supported)
 RNN
 FeatureDropout (training mode not supported)
 Index (constant integer and tuple indices supported)
Usability Improvements
 More cogent error messages during indexing of Tensors / Variables
 Add proper error message for specifying dimension on a tensor with no dimensions
 better error messages for Conv*d input shape checking
 More user-friendly error messages for LongTensor indexing
 Better error messages and argument checking for Conv*d routines
 Trying to construct a Tensor from a Variable fails more appropriately
 If you are using a PyTorch binary with insufficient CUDA version, then a warning is printed to the user.
 Fixed incoherent error messages in load_state_dict
 Fix error message for type mismatches with sparse tensors
Bug fixes
torch
 Fix CUDA lazy initialization to not trigger on calls to torch.manual_seed (instead, the calls are queued and run when CUDA is initialized)
Tensor
 if x is 2D, x[[0, 3],] was needed to trigger advanced indexing. The trailing comma is no longer needed, and you can do x[[0, 3]]
 x.sort(descending=True) used to incorrectly fail for Tensors. Fixed a bug in the argument checking logic to allow this.
 Tensor constructors with numpy input: torch.DoubleTensor(np.array([0,1,2], dtype=np.float32))
 torch will now copy the contents of the array in a storage of appropriate type.
 If types match, it will share the underlying array (no-copy), with equivalent semantics to initializing a tensor with another tensor.
 On CUDA, torch.cuda.FloatTensor(np.random.rand(10,2).astype(np.float32)) will now work by making a copy.
 ones_like and zeros_like now create Tensors on the same device as the original Tensor
 expand and expand_as allow expanding an empty Tensor to another empty Tensor
 torch.HalfTensor supports numpy() and torch.from_numpy
 Added additional size checking for torch.scatter
 Fixed random_ on CPU (which previously had a max value of 2^32) for DoubleTensor and LongTensor
 Fix ZeroDivisionError: float division by zero when printing certain Tensors
 torch.gels when m > n had a truncation bug on the CPU and returned incorrect results. Fixed.
 Added a check in tensor.numpy() that checks if no positional arguments are passed
 Before a Tensor is moved to CUDA pinned memory, added a check to ensure that it is contiguous
 Fix symeig on CUDA for large matrices. The bug is that not enough space was being allocated for the workspace, causing some undefined behavior.
 Improved the numerical stability of torch.var and torch.std by using Welford’s algorithm
 The Random Number Generator returned uniform samples with inconsistent bounds (inconsistency in cpu implementation and running into a cublas bug). Now, all uniform sampled numbers will return within the bounds [0, 1), across all types and devices
 Fixed torch.svd to not segfault on large CUDA Tensors (fixed an overflow error in the magma bindings)
 Allows empty index Tensor for index_select (instead of erroring out)
 Previously when eigenvector=False, symeig returned some unknown value for the eigenvectors. Now this is corrected.
sparse
 Fix bug with ‘coalesced’ calculation in sparse ‘cadd’
 Fixes .type() not converting indices tensor.
 Fixes sparse tensor coalesce on the GPU in corner cases
autograd
 Fixed crashes when calling backward on a leaf variable with requires_grad=False
 fix bug on Variable type() around non-default GPU input.
 when torch.norm returned 0.0, the gradient was NaN. We now use the subgradient at 0.0, so the gradient is 0.0.
 Fix a correctness issue with advanced indexing and higher-order gradients
 torch.prod’s backward was failing on the GPU due to a type error, fixed.
 Advanced Indexing on Variables now allows the index to be a LongTensor backed Variable
 Variable.cuda() and Tensor.cuda() are consistent in kwargs options
optim
 torch.optim.lr_scheduler is now imported by default.
nn
 Returning a dictionary from a nn.Module’s forward function is now supported (used to throw an error)
 When register_buffer("foo", ...) is called, and self.foo already exists, then instead of silently failing, now raises a KeyError
 Fixed loading of older checkpoints of RNN/LSTM which were missing _data_ptrs attributes.
 nn.Embedding had a hard error when using the max_norm option. This is fixed now.
 When using the max_norm option, the passed-in indices are written upon (by the underlying implementation). To fix this, pass a clone of the indices to the renorm kernel.
 F.affine_grid now can take non-contiguous inputs
 EmbeddingBag can accept both 1D and 2D inputs now.
 Workaround a CuDNN bug where batch sizes greater than 131070 fail in CuDNN BatchNorm
 fix nn.init.orthogonal to correctly return orthonormal vectors when rows < cols
 if BatchNorm has only 1 value per channel in total, raise an error in training mode.
 Make cuDNN bindings respect the current cuda stream (previously raised incoherent error)
 fix grid_sample backward when gradOutput is a zero-strided Tensor
 Fix a segmentation fault when reflection padding is out of Tensor bounds.
 If LogSoftmax has only 1 element, -inf was returned. Now this correctly returns 0.0
 Fix pack_padded_sequence to accept inputs of arbitrary sizes (not just 3D inputs)
 Fixed ELU higher-order gradients when applied in-place
 Prevent numerical issues with poisson_nll_loss when log_input=False by adding a small epsilon
distributed and multigpu
 Allow kwargs-only inputs to DataParallel. This used to fail:
n = nn.DataParallel(Net()); out = n(input=i)
 DistributedDataParallel calculates num_samples correctly in python2
 Fix the case of DistributedDataParallel when 1 GPU per process is used.
 Allow some params to be requires_grad=False in DistributedDataParallel
 Fixed DataParallel to specify GPUs that don’t include GPU0
 DistributedDataParallel’s exit doesn’t error out anymore, the daemon flag is set.
 Fix a bug in DistributedDataParallel in the case when model has no buffers (previously raised incoherent error)
 Fix __get_state__ to be functional in DistributedDataParallel (was returning nothing)
 Fix a deadlock in the NCCL bindings when GIL and CudaFreeMutex were starving each other
Among other fixes, model_zoo.load_url now first attempts to use the requests library if available, and then falls back to urllib.
To download the source code, click here.