cntk.train.distributed module

Distributed learners manage learners in a distributed environment.

class Communicator(*args, **kwargs)[source]

Bases: cntk.cntk_py.DistributedCommunicator

A communicator interface exposing communication primitives that serve as building blocks for distributed training.

barrier()[source]

Sync point to make sure all workers reach the same state.

current_worker()[source]

Returns the worker descriptor of the current process.

Returns: descriptor of the current process.
Return type: WorkerDescriptor

static finalize()[source]

Should be called when all communication is finished. No more communication should happen after this call.

is_main()[source]

Indicates whether the current communicator is instantiated on the main node. The node with rank 0 is considered the main node.

static num_workers()[source]

Returns the number of MPI workers.

static rank()[source]

Returns the rank of the current process.

workers()[source]

Returns the workers in this communicator.

Returns: the workers in this communicator.
Return type: list of WorkerDescriptor

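For illustration, a minimal usage sketch of the static members above (the log message is a placeholder, not part of the API):

    from cntk.train.distributed import Communicator

    rank = Communicator.rank()           # 0-based rank of this MPI process
    total = Communicator.num_workers()   # total number of MPI workers
    print('starting worker %d of %d' % (rank, total))

    # ... distributed training happens here ...

    Communicator.finalize()              # call once; no communication after this point
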
class DistributedLearner(*args, **kwargs)[source]

Bases: cntk.cntk_py.DistributedLearner

A distributed learner that communicates data such as gradients and momentums across multiple MPI workers.

communicator()[source]

Returns the distributed communicator that talks to the other MPI workers.

Returns: the distributed communicator.
Return type: Communicator
total_number_of_samples_seen

The number of samples seen by the distributed learner.

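For illustration only, a hedged sketch of querying these members on a learner produced by one of the factory functions documented below; the model z and the learning-rate settings are placeholders:

    import cntk as C

    # 'z' is assumed to be an already-defined model function
    lr = C.learning_rate_schedule(0.01, C.UnitType.minibatch)
    local_learner = C.learners.sgd(z.parameters, lr)
    dist_learner = C.train.distributed.data_parallel_distributed_learner(local_learner)

    comm = dist_learner.communicator()   # Communicator shared with the other MPI workers
    print('worker rank:', comm.current_worker().global_rank)

    # after (or during) training:
    print('samples seen:', dist_learner.total_number_of_samples_seen)
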
class WorkerDescriptor[source]

Bases: cntk.cntk_py.DistributedWorkerDescriptor

Distributed worker descriptor, as returned by a Communicator instance.

global_rank

The global rank of the worker.

host_id

The host id of the worker.

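For illustration, a short sketch of reading these attributes for the current process, using mpi_communicator() (documented at the end of this module):

    from cntk.train.distributed import mpi_communicator

    me = mpi_communicator().current_worker()
    print('global rank:', me.global_rank)
    print('host id:', me.host_id)
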
block_momentum_distributed_learner(learner, block_size, block_momentum_as_time_constant=None, use_nestrov_momentum=True, reset_sgd_momentum_after_aggregation=True, block_learning_rate=1.0, distributed_after=0)[source]

Creates a block momentum distributed learner. See [1] for more information.

Block Momentum divides the full dataset into M non-overlapping blocks, and each block is partitioned into N non-overlapping splits.

During training, an unprocessed block is drawn at random by the trainer and the N splits of this block are dispatched to the workers.

Parameters:
  • learner – a local learner (e.g. sgd)
  • block_size (int) – size of a block, in samples
  • block_momentum_as_time_constant (float) – block momentum as time constant
  • use_nestrov_momentum (bool) – use Nesterov momentum
  • reset_sgd_momentum_after_aggregation (bool) – reset SGD momentum after aggregation
  • block_learning_rate (float) – block learning rate
  • distributed_after (int) – number of samples after which distributed training starts
Returns: a distributed learner instance

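A hedged usage sketch follows; the model z, the criterion functions loss and metric, and all numeric settings are placeholders rather than recommended values:

    import cntk as C
    from cntk.train.distributed import block_momentum_distributed_learner, Communicator

    lr = C.learning_rate_schedule(0.01, C.UnitType.sample)
    mom = C.momentum_schedule(0.9)
    local_learner = C.learners.momentum_sgd(z.parameters, lr, mom)

    learner = block_momentum_distributed_learner(
        local_learner,
        block_size=100000,                    # samples per block
        block_momentum_as_time_constant=4096,
        reset_sgd_momentum_after_aggregation=True)

    trainer = C.Trainer(z, (loss, metric), [learner])
    # ... each worker trains on its own split of the current block ...
    Communicator.finalize()
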
data_parallel_distributed_learner(learner, distributed_after=0, num_quantization_bits=32, use_async_buffered_parameter_update=False)[source]

Creates a data parallel distributed learner.

Parameters:
  • learner – a local learner (e.g. sgd)
  • distributed_after (int) – number of samples after which distributed training starts
  • num_quantization_bits (int) – number of bits for quantization (1 to 32)
  • use_async_buffered_parameter_update (bool) – use async buffered parameter update, currently must be False
Returns: a distributed learner instance

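A hedged usage sketch follows; z, loss, metric and reader are placeholders for a model, criterion functions and a MinibatchSource defined elsewhere:

    import cntk as C
    from cntk.train.distributed import data_parallel_distributed_learner, Communicator

    lr = C.learning_rate_schedule(0.01, C.UnitType.minibatch)
    local_learner = C.learners.sgd(z.parameters, lr)

    learner = data_parallel_distributed_learner(
        local_learner,
        num_quantization_bits=32,         # 32 means no quantization
        distributed_after=0)              # aggregate across workers from the first sample

    trainer = C.Trainer(z, (loss, metric), [learner])

    # each worker reads its own partition of the data, for example:
    mb = reader.next_minibatch(64,
                               num_data_partitions=Communicator.num_workers(),
                               partition_index=Communicator.rank())
    Communicator.finalize()
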
mpi_communicator()[source]

Creates a non-quantized MPI communicator.
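
For illustration, a hedged sketch of inspecting and synchronizing the workers through the returned communicator:

    from cntk.train.distributed import mpi_communicator, Communicator

    comm = mpi_communicator()            # full-precision (non-quantized) communicator
    if comm.is_main():                   # rank 0 only
        for w in comm.workers():
            print('worker rank=%d host=%s' % (w.global_rank, w.host_id))

    comm.barrier()                       # all workers synchronize here
    Communicator.finalize()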