data

Data reading and wrangling functionality

Synthetic data

Except for the first and last lines, the code comes from Rubanova’s implementation (comments are mine)


source

make_periodic_dataset

 make_periodic_dataset (timepoints:int, extrap:bool, max_t:float, n:int,
                        noise_weight:float)
Type Details
timepoints int Number of time instants
extrap bool Whether extrapolation is performed
max_t float Maximum value of time instants
n int Number of examples
noise_weight float Standard deviation of the noise to be added
time, observations = make_periodic_dataset(timepoints=100, extrap=True, max_t=5.0, n=200, noise_weight=0.01)
time.shape, observations.shape
(torch.Size([101]), torch.Size([200, 101, 1]))
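
Since the generator itself lives in Rubanova’s code, the following is only a rough sketch of how such a periodic dataset could be produced. The sine-wave parametrisation, the use of timepoints + 1 instants (chosen so the shapes match the example above), and the omission of the extrap flag are all assumptions, not a copy of the actual implementation.

import torch

def periodic_dataset_sketch(timepoints: int, max_t: float, n: int, noise_weight: float):
    # One extra instant so that timepoints=100 yields 101 points, as in the example above (assumption)
    time = torch.linspace(0., max_t, timepoints + 1)
    # Random frequency and phase per example (hypothetical parametrisation)
    freq = torch.rand(n, 1) * 2. + 0.5
    phase = torch.rand(n, 1) * 2. * torch.pi
    clean = torch.sin(2. * torch.pi * freq * time + phase)   # [n, timepoints + 1]
    noisy = clean + noise_weight * torch.randn_like(clean)   # additive Gaussian noise
    return time, noisy.unsqueeze(-1)                         # observations: [n, timepoints + 1, 1]

t, obs = periodic_dataset_sketch(timepoints=100, max_t=5.0, n=200, noise_weight=0.01)
t.shape, obs.shape  # (torch.Size([101]), torch.Size([200, 101, 1]))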

PyTorch

A class defining a (somewhat complex) collate function for a PyTorch DataLoader


source

CollateFunction

 CollateFunction (time:torch.Tensor, n_points_to_subsample=None)

Initialize self. See help(type(self)) for accurate signature.

Type Default Details
time Tensor Time axis [time]
n_points_to_subsample NoneType None Number of points to be “subsampled”

Let us build an object for testing

collate_fn = CollateFunction(time, n_points_to_subsample=50)
collate_fn
Collate function expecting time series of length 101, with the second half to be predicted from the first.
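The actual logic lives in CollateFunction.__call__; the sketch below only illustrates the kind of processing the representation above describes (midpoint split, optional subsampling of the observed part, an all-ones mask). It is written under those assumptions and is not the library’s implementation.

import torch

def collate_sketch(batch, time, n_points_to_subsample=None):
    # batch: list of [time, 1] tensors handed over by the DataLoader
    data = torch.stack(batch)                      # [batch, time, 1]
    half = len(time) // 2                          # first half observed, second half to be predicted
    observed_time, observed_data = time[:half], data[:, :half]
    if n_points_to_subsample is not None:          # keep only a subset of the observed instants
        idx = torch.randperm(half)[:n_points_to_subsample].sort().values
        observed_time, observed_data = observed_time[idx], observed_data[:, idx]
    return {
        'observed_time': observed_time,
        'observed_data': observed_data,
        'to_predict_at_time': time[half:],
        'to_predict_data': data[:, half:],
        'observed_mask': torch.ones_like(observed_data),  # every kept observation is available
    }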

We also need a PyTorch DataLoader

dataloader = torch.utils.data.DataLoader(observations, batch_size = 10, shuffle=False, collate_fn=collate_fn)
dataloader
<torch.utils.data.dataloader.DataLoader>

How many batches does this DataLoader provide?

n_batches = len(dataloader)
n_batches
20
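
This matches the 200 examples being served in batches of 10; a quick check (assuming no examples are dropped):

import math
math.ceil(len(observations) / 10)  # 200 examples / batch size 10 -> 20 batches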

Let us get the first batch

batch_bundle = next(iter(dataloader))
type(batch_bundle)
dict

Notice that, as seen from the prototype of CollateFunction.__call__, the returned object is a dictionary. It contains the following fields:

print(batch_bundle.keys())
dict_keys(['observed_time', 'observed_data', 'to_predict_at_time', 'to_predict_data', 'observed_mask'])
  • observed_time and observed_data are the first part of a time series we want to learn, whereas
  • to_predict_at_time and to_predict_data are the second part of the same time series, which we aim to predict; finally,
  • observed_mask is True for every observation that is available (it only applies to the observed data)

If one thinks of this in terms of a given input, \(x\), and a related output, \(y\), to be predicted, the latter would be to_predict_data and the former would encompass the rest of the fields.
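
Purely as an illustration, a hypothetical model (not part of this library) mapping the observed part to predictions at the requested instants could consume a batch like this; the model interface below is an assumption.

# `model` is a hypothetical forecaster taking (observed_time, observed_data, mask, query_times)
# and returning predictions of shape [batch, len(query_times), 1] -- an assumed interface
def mse_on_batch(model, bundle):
    y_hat = model(bundle['observed_time'], bundle['observed_data'],
                  bundle['observed_mask'], bundle['to_predict_at_time'])
    y = bundle['to_predict_data']
    return ((y_hat - y) ** 2).mean()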

We can check the size of every component

for k, v in batch_bundle.items():
    print(f'Dimensions of {k}: {tuple(v.shape)}')
Dimensions of observed_time: (50,)
Dimensions of observed_data: (10, 50, 1)
Dimensions of to_predict_at_time: (51,)
Dimensions of to_predict_data: (10, 51, 1)
Dimensions of observed_mask: (10, 50, 1)

In this simple example, every observation is available

(batch_bundle['observed_mask'] == 1.).all()
tensor(True)
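
When data are missing, the mask can be used to restrict computations to the available observations. For instance, a masked mean over the observed data (here it coincides with the plain mean, since the mask is all ones):

mask = batch_bundle['observed_mask']
(batch_bundle['observed_data'] * mask).sum() / mask.sum()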

GPU support

If one wants to move this object to another device, this method does so for all the relevant internal state.


source

CollateFunction.to

 CollateFunction.to (device)
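
A possible usage, assuming the method moves the stored time axis (and any other internal tensors) in place, as the description above suggests:

device = 'cuda' if torch.cuda.is_available() else 'cpu'
collate_fn.to(device)  # assumed to move the internal time tensor to `device`
dataloader = torch.utils.data.DataLoader(observations.to(device), batch_size=10, shuffle=False, collate_fn=collate_fn)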