CNTK 208: Training Acoustic Model with Connectionist Temporal Classification (CTC) Criteria

This tutorial assumes familiarity with the CNTK 10x series of tutorials and basic knowledge of data representation in acoustic modelling tasks. It introduces CNTK building blocks that can be used to train deep networks for speech recognition, using the CTC training criterion as an example.

Introduction

The CNTK implementation of CTC is based on the paper by A. Graves et al., “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks”. CTC is a popular training criterion for sequence learning tasks such as speech or handwriting recognition. It requires neither segmentation of the training data nor post-processing of the network outputs to convert them to labels, and thereby significantly simplifies training and decoding while achieving state-of-the-art accuracy.
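To build intuition for this criterion, recall the CTC labeling rule: a per-frame output sequence is reduced to a label sequence by first merging consecutive repeats and then deleting the blank symbol. Below is a minimal plain-Python sketch of this rule (an illustration only, not part of the CNTK API):

def ctc_collapse(frame_labels, blank='-'):
    # Merge consecutive repeats, then drop the blank symbol.
    collapsed = []
    previous = None
    for symbol in frame_labels:
        if symbol != previous:
            collapsed.append(symbol)
        previous = symbol
    return [s for s in collapsed if s != blank]

print(ctc_collapse(list("aa-ab-b")))  # ['a', 'a', 'b']: the blank allows repeated output labels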

CTC training runs on several sequences in parallel, on either GPU or CPU, to achieve maximum hardware utilization.

First, let us import the necessary libraries, including CNTK, and set up the testing environment.

In [1]:
import os
import cntk as C
import numpy as np

# Select the right target device
import cntk.tests.test_utils
cntk.tests.test_utils.set_device_from_pytest_env() # (only needed for our build system)

data_dir = os.path.join("..", "Tests", "EndToEndTests", "Speech", "Data")
print("Current directory {0}".format(os.getcwd()))

if os.path.exists(data_dir):
    if os.path.realpath(data_dir) != os.path.realpath(os.getcwd()):
        os.chdir(data_dir)
        print("Changed to data directory {0}".format(data_dir))
else:
    print("Data directory not available locally. Downloading data.")
    try:
        from urllib.request import urlretrieve
    except ImportError:
        from urllib import urlretrieve
    for d in ['GlobalStats', 'Features']:
        if not os.path.exists(d):
            os.mkdir(d)
    for f in ['glob_0000.scp', 'glob_0000.write.scp', 'glob_0000.mlf', 'state_ctc.list', 'GlobalStats/mean.363', 'GlobalStats/var.363', 'Features/000000000.chunk']:
        if os.path.exists(f):
            print('Already downloaded %s' % f)
        else:
            print('Downloading %s' % f)
            urlretrieve('https://github.com/Microsoft/CNTK/raw/release/2.6/Tests/EndToEndTests/Speech/Data/%s' % f, f)

Current directory D:\users\vadimma\cntk_tut\CNTK\Tutorials
Changed to data directory ..\Tests\EndToEndTests\Speech\Data

Read data

CNTK consumes Acoustic Model (AM) training data in HTK/MLF format and typically expects three input files (illustrated below):

* SCP file with features. An SCP file contains a mapping of utterance ids to the corresponding feature files.
* MLF file with labels. MLF (master label file) is a traditional format for representing transcription alignment to features. Even though the referenced MLF file contains label boundaries, they are not needed during CTC training and are ignored. For more details on feature/label formats, refer to a copy of the HTK book, e.g. here.
* States list file. This file contains the list of all labels (states) in the training set. The blank label required by CTC is located at the end of the file, at index (line) 132, assuming 0-based indexing.
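For orientation, here is the typical shape of the first two files following HTK conventions (the utterance id and label names below are made up for illustration). Each SCP line maps an utterance id to a frame range inside a feature archive; an MLF file starts with the #!MLF!# header and lists, for each utterance, one label per line with optional start/end times in 100 ns units (ignored by CTC), terminated by a period:

glob_0000.scp:

An4/71/71/cen5-fjam-b.mfc=Features/000000000.chunk[0,367]

glob_0000.mlf:

#!MLF!#
"An4/71/71/cen5-fjam-b.lab"
0 100000 ah_s2
100000 200000 ah_s3
.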

CNTK provides flexible and efficient readers, HTKFeatureDeserializer/HTKMLFDeserializer, for acoustic features and labels. These readers follow the convention-over-configuration principle and greatly simplify the training procedure. At the same time, they take care of various optimizations of reading from disk/network and of asynchronous CPU/GPU prefetching, which results in a significant speed-up of model training.

Note: Currently, CTC training expects label and feature inputs of the same dimension, yet the labels don’t have to be aligned. An easy way to generate the label file is to distribute the labels uniformly (equally) across the feature frames, as sketched below. Obviously, some labels will be misaligned with this setup, but the CTC criterion will take care of it during training; see the original publication for reference.
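A minimal sketch of such uniform spreading, assuming the transcript is given as a list of 0-based label ids and the number of feature frames is known (the helper name is hypothetical):

def uniform_frame_labels(transcript, num_frames):
    # Assign each transcript label to an equal share of the frames.
    frames_per_label = num_frames // len(transcript)
    frame_labels = []
    for label_id in transcript:
        frame_labels.extend([label_id] * frames_per_label)
    # Cover any remaining frames with the last label.
    frame_labels.extend([transcript[-1]] * (num_frames - len(frame_labels)))
    return frame_labels

print(uniform_frame_labels([5, 17, 42], 10))  # [5, 5, 5, 17, 17, 17, 42, 42, 42, 42]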

In [2]:
# Type of features/labels and their dimensions are application specific.
# Here we use a rather small feature dimension and label set to keep the train set compact.
feature_dimension = 33
feature = C.sequence.input(feature_dimension)

label_dimension = 133
label = C.sequence.input(label_dimension)

train_feature_filepath = "glob_0000.scp"
train_label_filepath = "glob_0000.mlf"
mapping_filepath = "state_ctc.list"
try:
    # Features are read from the chunks referenced by the SCP file
    train_feature_stream = C.io.HTKFeatureDeserializer(
        C.io.StreamDefs(speech_feature = C.io.StreamDef(shape = feature_dimension, scp = train_feature_filepath)))
    # Labels are read from the MLF file; the trailing True (phoneBoundaries) is required for CTC training
    train_label_stream = C.io.HTKMLFDeserializer(
        mapping_filepath, C.io.StreamDefs(speech_label = C.io.StreamDef(shape = label_dimension, mlf = train_label_filepath)), True)
    # frame_mode = False makes the reader return whole utterances (sequences) instead of single frames
    train_data_reader = C.io.MinibatchSource([train_feature_stream, train_label_stream], frame_mode = False)
    train_input_map = {feature: train_data_reader.streams.speech_feature, label: train_data_reader.streams.speech_label}
except RuntimeError:
    print("ERROR: not able to read features or labels")

Model creation

In this block we first normalize the features and then define a model with LSTM layers. We normalize the input features to zero mean and unit variance by subtracting the mean vector and multiplying by the inverse standard deviation; both are stored in separate files.

In [3]:
# Load the per-dimension mean and inverse standard deviation vectors computed over the training set
feature_mean = np.fromfile(os.path.join("GlobalStats", "mean.363"), dtype=float, count=feature_dimension)
feature_inverse_stddev = np.fromfile(os.path.join("GlobalStats", "var.363"), dtype=float, count=feature_dimension)

# Normalize the input features to zero mean and unit variance
feature_normalized = (feature - feature_mean) * feature_inverse_stddev

# Three stacked recurrent LSTM layers (1024 units each) followed by a dense layer
# that produces one score per label for every frame
with C.default_options(activation=C.sigmoid):
    z = C.layers.Sequential([
        C.layers.For(range(3), lambda: C.layers.Recurrence(C.layers.LSTM(1024))),
        C.layers.Dense(label_dimension)
    ])(feature_normalized)
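The resulting z maps each frame of the normalized feature sequence to label_dimension scores, which the CTC criterion consumes below. As a quick illustrative sanity check:

print(z.shape)  # (133,): one score per label for every frame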

Define training hyperparameters

The CTC criterion (loss) function is implemented as a combination of the labels_to_graph and forward_backward functions. These functions are designed to generalize Viterbi-like forward-backward procedures, which are common in sequential modelling problems such as speech or handwriting recognition. labels_to_graph converts the input label sequence into a graph representation suitable for the particular forward-backward procedure, and the forward_backward function performs the procedure itself. Currently, these functions support only CTC, which is their default configuration.

In [4]:
mbsize = 1024
mbs_per_epoch = 10
max_epochs = 5

# CTC loss: convert the label sequence into a graph and run the forward-backward
# procedure against the network output z; 132 is the index of the CTC blank label
criteria = C.forward_backward(C.labels_to_graph(label), z, blankTokenId=132, delayConstraint=3)
# Error metric: edit distance computed after collapsing repeated tokens and ignoring the blank
err = C.edit_distance_error(z, label, squashInputs=True, tokensToIgnore=[132])
# Learning rate parameter schedule per sample:
# Use 0.01 for the first 3 epochs, followed by 0.001 for the remaining
lr = C.learning_parameter_schedule_per_sample([(3, .01), (1,.001)])
mm = C.momentum_schedule([(1000, 0.9), (0, 0.99)], mbsize)
learner = C.momentum_sgd(z.parameters, lr, mm)
trainer = C.Trainer(z, (criteria, err), learner)
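For intuition on the metric defined above: with squashInputs=True and tokensToIgnore=[132], edit_distance_error collapses consecutive repeated tokens and drops the blank on both sequences before computing the edit distance. Conceptually, the squashing step behaves like this plain-Python sketch (an illustration, not the actual CNTK implementation):

def squash(tokens, ignore=(132,)):
    # Collapse consecutive repeats, then drop ignored tokens (e.g. the CTC blank).
    collapsed = [t for i, t in enumerate(tokens) if i == 0 or t != tokens[i - 1]]
    return [t for t in collapsed if t not in ignore]

print(squash([7, 7, 132, 7, 9, 9]))  # [7, 7, 9]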

Train

In [5]:
C.logging.log_number_of_parameters(z)
progress_printer = C.logging.progress_print.ProgressPrinter(tag='Training', num_epochs = max_epochs)

for epoch in range(max_epochs):
    for mb in range(mbs_per_epoch):
        minibatch = train_data_reader.next_minibatch(mbsize, input_map = train_input_map)
        trainer.train_minibatch(minibatch)
        progress_printer.update_with_trainer(trainer, with_metric = True)

    print('Trained on a total of ' + str(trainer.total_number_of_samples_seen) + ' frames')
    progress_printer.epoch_summary(with_metric = True)

# Uncomment to save the model
# z.save('CTC_' + str(max_epochs) + 'epochs_' + str(mbsize) + 'mbsize_' + str(mbs_per_epoch) + 'mbs.model')
Training 21255301 parameters in 11 parameter tensors.
Trained on a total of 8428 frames
Finished Epoch[1 of 5]: [Training] loss = 3.720116 * 8428, metric = 100.00% * 8428 25.106s (335.7 samples/s);
Trained on a total of 17094 frames
Finished Epoch[2 of 5]: [Training] loss = 3.513460 * 8666, metric = 98.07% * 8666 21.176s (409.2 samples/s);
Trained on a total of 25662 frames
Finished Epoch[3 of 5]: [Training] loss = 3.498874 * 8568, metric = 98.23% * 8568 21.978s (389.8 samples/s);
Trained on a total of 35282 frames
Finished Epoch[4 of 5]: [Training] loss = 3.512962 * 9620, metric = 98.23% * 9620 22.159s (434.1 samples/s);
Trained on a total of 43890 frames
Finished Epoch[5 of 5]: [Training] loss = 3.508142 * 8608, metric = 98.12% * 8608 19.864s (433.3 samples/s);

Evaluate

In [6]:
test_feature_filepath = "glob_0000.write.scp"
test_feature_stream = C.io.HTKFeatureDeserializer(
    C.io.StreamDefs(speech_feature = C.io.StreamDef(shape = feature_dimension, scp = test_feature_filepath)))
# Reuse the label stream defined above, since this demo evaluates against the same MLF labels
test_data_reader = C.io.MinibatchSource([test_feature_stream, train_label_stream], frame_mode = False)
test_input_map = {feature: test_data_reader.streams.speech_feature, label: test_data_reader.streams.speech_label}

num_test_minibatches = 2
test_result = 0.0
for i in range(num_test_minibatches):
    test_minibatch = test_data_reader.next_minibatch(mbsize, input_map = test_input_map)
    eval_error = trainer.test_minibatch(test_minibatch)
    test_result = test_result + eval_error

# Average of evaluation errors of all test minibatches
round(test_result / num_test_minibatches, 2)
Out[6]:
0.99
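The metric above is computed against the uniformly spread reference labels. To obtain actual label sequences from the trained model, a common approach is greedy best-path decoding: take the argmax label for each frame and apply the CTC collapsing rule. A sketch under the assumption that z, feature and test_minibatch are available as defined earlier (an illustration only; beam-search decoders generally produce better transcriptions):

scores = z.eval({feature: test_minibatch[feature]})  # one (num_frames, 133) array per utterance
for utterance in scores:
    best_path = np.argmax(utterance, axis=1)
    decoded, previous = [], None
    for token in best_path:
        if token != previous and token != 132:  # merge repeats, drop the blank (index 132)
            decoded.append(int(token))
        previous = token
    print(decoded)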