cntk.train.distributed module
Distributed learners manage learners in a distributed environment.
class Communicator(*args, **kwargs)
Bases: cntk.cntk_py.DistributedCommunicator
A communicator interface exposing communication primitives that serve as building blocks for distributed training.
- current_worker()
Returns the worker descriptor of the current process.
Returns: descriptor of the current process. Return type: WorkerDescriptor
- static finalize()
Should be called when all communication is finished. No more communication should happen after this call.
- is_main()
Indicates whether the current communicator is instantiated on the main node. The node with rank 0 is considered the main node.
- workers()
Returns the workers in this communicator.
Returns: workers in this communicator. Return type: list of WorkerDescriptor
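A minimal sketch of how these primitives fit together. It assumes the communicator was obtained from a distributed learner's communicator() method (see DistributedLearner below), since Communicator objects are not constructed directly; report_topology is a hypothetical helper name, not part of the API:

    import cntk as C

    def report_topology(dist_learner):
        # Hypothetical helper: inspect the communicator attached to a distributed learner.
        comm = dist_learner.communicator()

        # Only the node with rank 0 is the main node.
        if comm.is_main():
            for w in comm.workers():
                # WorkerDescriptor exposes global_rank and host_id (see below).
                print("worker rank %d on host %s" % (w.global_rank, w.host_id))

        me = comm.current_worker()
        print("this process has rank %d" % me.global_rank)

    # Once all communication has finished (typically at the very end of the script):
    C.train.distributed.Communicator.finalize()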
class DistributedLearner(*args, **kwargs)
Bases: cntk.cntk_py.DistributedLearner
A distributed learner that aggregates data such as gradients and momentums across multiple MPI workers.
- communicator()
Returns the distributed communicator that talks to the other MPI workers.
Returns: the communicator used by this learner. Return type: Communicator
- total_number_of_samples_seen
The number of samples seen by the distributed learner.
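A short sketch of querying a distributed learner; dist_learner is assumed to have been created with one of the factory functions documented below:

    # 'dist_learner' is assumed to come from block_momentum_distributed_learner
    # or data_parallel_distributed_learner (see below).
    comm = dist_learner.communicator()
    num_workers = len(comm.workers())

    # total_number_of_samples_seen is the global sample count across all workers,
    # which makes it a convenient progress counter for logging or checkpointing.
    if comm.is_main():
        print("%d workers, %d samples processed so far"
              % (num_workers, dist_learner.total_number_of_samples_seen))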
class WorkerDescriptor
Bases: cntk.cntk_py.DistributedWorkerDescriptor
Distributed worker descriptor, returned by a Communicator instance.
- global_rank
The global rank of the worker.
- host_id
The host id of the worker.
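In distributed training scripts the descriptor's global_rank is commonly used to let each worker read only its own slice of the data. A sketch under the assumption of an existing Communicator comm, a MinibatchSource reader, and its input_map (the reader and input_map names are hypothetical):

    # 'comm' is a Communicator; 'reader' is a MinibatchSource and 'input_map'
    # its stream-to-variable mapping (both assumed to exist).
    me = comm.current_worker()
    num_workers = len(comm.workers())

    # Each worker fetches only the partition identified by its global rank.
    mb = reader.next_minibatch(64,
                               input_map=input_map,
                               num_data_partitions=num_workers,
                               partition_index=me.global_rank)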
block_momentum_distributed_learner(learner, block_size, block_momentum_as_time_constant=None, use_nestrov_momentum=True, reset_sgd_momentum_after_aggregation=True, block_learning_rate=1.0, distributed_after=0)
Creates a block momentum distributed learner. See [1] for more information.
Block Momentum divides the full dataset into M non-overlapping blocks, and each block is partitioned into N non-overlapping splits.
During training, an unprocessed block is randomly chosen by the trainer and its N partitions are dispatched to the workers.
Parameters: - learner – a local learner (e.g. sgd)
- block_size (int) – size of the partition in samples
- block_momentum_as_time_constant (float) – block momentum as time constant
- use_nestrov_momentum (bool) – use Nesterov momentum
- reset_sgd_momentum_after_aggregation (bool) – reset SGD momentum after aggregation
- block_learning_rate (float) – block learning rate
- distributed_after (int) – number of samples after which distributed training starts
Returns: a distributed learner instance (see the usage sketch below)
See also
[1] K. Chen and Q. Huo. Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering. Proceedings of ICASSP, 2016.
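A minimal usage sketch, assuming an existing model z with criteria loss and metric; all parameter values are illustrative only:

    import cntk as C

    # A local learner to be wrapped ('z', 'loss' and 'metric' are assumed to exist).
    local_learner = C.learners.momentum_sgd(
        z.parameters,
        lr=C.learners.learning_rate_schedule(0.01, C.UnitType.sample),
        momentum=C.learners.momentum_schedule(0.9))

    # Aggregate model updates once per block of 32000 samples.
    dist_learner = C.train.distributed.block_momentum_distributed_learner(
        local_learner,
        block_size=32000,
        block_learning_rate=1.0)

    trainer = C.Trainer(z, (loss, metric), [dist_learner])
    # ... run the usual trainer.train_minibatch(...) loop, then call
    # C.train.distributed.Communicator.finalize() before the process exits.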
data_parallel_distributed_learner(learner, distributed_after=0, num_quantization_bits=32, use_async_buffered_parameter_update=False)
Creates a data parallel distributed learner.
Parameters: - learner – a local learner (e.g. sgd)
- distributed_after (int) – number of samples after which distributed training starts
- num_quantization_bits (int) – number of bits for quantization (1 to 32)
- use_async_buffered_parameter_update (bool) – use asynchronous buffered parameter update; currently must be False
Returns: a distributed learner instance (see the usage sketch below)
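A minimal sketch of data-parallel training, assuming an existing model z with criteria loss and metric, plus a MinibatchSource reader and its input_map (all hypothetical names); gradient quantization below 32 bits may additionally require a CNTK build with 1-bit SGD support:

    import cntk as C
    from cntk.train.distributed import data_parallel_distributed_learner, Communicator

    local_learner = C.learners.sgd(
        z.parameters,
        lr=C.learners.learning_rate_schedule(0.01, C.UnitType.sample))

    dist_learner = data_parallel_distributed_learner(
        local_learner,
        num_quantization_bits=32,   # 32 = no quantization; smaller values quantize gradients
        distributed_after=0)        # aggregate across workers from the first sample on

    trainer = C.Trainer(z, (loss, metric), [dist_learner])

    for _ in range(100):
        mb = reader.next_minibatch(64, input_map=input_map)
        trainer.train_minibatch(mb)

    # Must be called exactly once, after all communication has finished.
    Communicator.finalize()

Such a script is normally launched with mpiexec, one process per worker.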