Dataset How-to

KnowIt uses a specific data structure to process time series data. This guide explains how to import and manage new raw data.

1. Importing new datasets using Pandas

To train models on your data, you first need to convert it into a format that KnowIt understands. Compile your data into a pandas.DataFrame that meets the criteria listed below. It can then be imported using the KnowIt.import_dataset(kwarg={'data_import': {'raw_data': <the dataframe>, ...}}) function. You can also provide the data as a path to a pickled dataframe using KnowIt.import_dataset(kwarg={'data_import': {'raw_data': <path to pickle>, ...}}).
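As a rough illustration of the pickled-dataframe route, the sketch below writes a small frame to disk with pandas and builds the argument dictionary described above. The dataset name, the column name, and the file name are made up for the example, and the other keys that the guide elides with "..." are not shown. Only pandas calls are executed; the KnowIt call itself appears as a comment.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data: ten one-second readings of a made-up "signal" component.
time_delta = pd.Timedelta(seconds=1)
index = pd.date_range("2024-01-01", periods=10, freq=time_delta)
df = pd.DataFrame({"signal": [float(i) for i in range(10)]}, index=index)

# Metadata stored in DataFrame.attrs, using the keys named in this guide.
df.attrs.update(
    {"name": "demo_signal", "components": ["signal"], "time_delta": time_delta}
)

with tempfile.TemporaryDirectory() as tmp:
    pickle_path = Path(tmp) / "demo_signal.pickle"
    df.to_pickle(pickle_path)  # write the frame to disk

    # Sanity check: the pickle loads back into an identical frame.
    restored = pd.read_pickle(pickle_path)

    # Argument dictionary in the shape this guide describes. Any further
    # keys (elided as "..." in the guide) are deliberately left out here.
    import_args = {"data_import": {"raw_data": str(pickle_path)}}
    # KnowIt.import_dataset(kwarg=import_args)  # not executed in this sketch
```

In practice the pickle would live at a permanent location rather than in a temporary directory, which is used here only to keep the example self-cleaning.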

The criteria are as follows:

- Must be time indexed (with a pandas.DatetimeIndex, not strings).
- Must contain the following metadata in the DataFrame.attrs dictionary, or alternatively have it passed with the ‘meta’ argument:
  - name (str): The name of the dataset to be constructed.
  - components (list): The components to be stored in the dataset.
  - time_delta (pandas.Timedelta, datetime.timedelta): The time difference between any two consecutive time points.
  - instances (list, Optional): A list of the instance names to be stored in the dataset.
- Must contain no all-NaN columns.
- Must contain column headers corresponding to the components defined in the metadata.
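The criteria above can be sketched in plain pandas as follows. The component names and values are invented for illustration, and check_criteria is a hypothetical helper written for this example, not part of KnowIt. It only mirrors the listed criteria and may not match the checks KnowIt performs internally.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: 48 hourly readings of two made-up components.
time_delta = pd.Timedelta(hours=1)
index = pd.date_range("2024-01-01", periods=48, freq=time_delta)
df = pd.DataFrame(
    {
        "temperature": np.linspace(15.0, 25.0, len(index)),
        "humidity": np.linspace(60.0, 40.0, len(index)),
    },
    index=index,
)

# Metadata stored in DataFrame.attrs, using the keys named in this guide.
df.attrs.update(
    {
        "name": "demo_weather",
        "components": ["temperature", "humidity"],
        "time_delta": time_delta,
    }
)

REQUIRED_META = ("name", "components", "time_delta")


def check_criteria(frame: pd.DataFrame) -> list:
    """Hypothetical pre-flight check; returns a list of problems found."""
    problems = []
    if not isinstance(frame.index, pd.DatetimeIndex):
        problems.append("index is not a pandas.DatetimeIndex")
    missing_meta = [k for k in REQUIRED_META if k not in frame.attrs]
    if missing_meta:
        problems.append(f"missing metadata keys: {missing_meta}")
    components = frame.attrs.get("components", [])
    missing_cols = [c for c in components if c not in frame.columns]
    if missing_cols:
        problems.append(f"components without a column: {missing_cols}")
    all_nan_mask = frame.isna().all().to_numpy()
    all_nan = [str(c) for c in frame.columns[all_nan_mask]]
    if all_nan:
        problems.append(f"all-NaN columns: {all_nan}")
    return problems


problems = check_criteria(df)
print(problems if problems else "frame meets the listed criteria")
```

Running a check like this before importing makes it easier to spot a string index or a forgotten metadata key early.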

If instances are desired, they must be defined in the metadata and a corresponding column header ‘instance’ must be present in the dataframe. This column must contain no NaNs, and indicates which instance each time step (row) corresponds to. If no instances are defined, all time steps will be assumed to belong to a single instance.
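A minimal sketch of a frame with instances might look like the following. The instance names, the "load" component, and the make_instance helper are all made up for the example. The two instances are placed in non-overlapping time ranges so that the index stays unique, which is an assumption of this sketch rather than a documented requirement.

```python
import numpy as np
import pandas as pd

time_delta = pd.Timedelta(minutes=30)


def make_instance(name: str, start: str, periods: int) -> pd.DataFrame:
    """Build one hypothetical instance with a single 'load' component."""
    index = pd.date_range(start, periods=periods, freq=time_delta)
    return pd.DataFrame(
        {"load": np.linspace(0.0, 1.0, periods), "instance": name},
        index=index,
    )


# Stack the rows of both instances into one frame.
df = pd.concat(
    [
        make_instance("site_a", "2024-01-01", 24),
        make_instance("site_b", "2024-01-03", 24),
    ]
)

# Metadata: 'instances' lists the names that appear in the 'instance' column.
df.attrs.update(
    {
        "name": "demo_sites",
        "components": ["load"],
        "time_delta": time_delta,
        "instances": ["site_a", "site_b"],
    }
)

# The 'instance' column has no NaNs and matches the metadata.
assert df["instance"].notna().all()
assert set(df["instance"]) == set(df.attrs["instances"])
```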

Similarly, if a custom data split is to be defined (for future training), it should be given in a separate dataframe column labeled ‘split’. This column should contain only the set indicators 0 (train), 1 (valid), or 2 (eval).
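One way to fill such a column is a chronological split, sketched below with pandas and NumPy. The 70/15/15 proportions, the "price" component, and the dataset name are arbitrary choices for the example. The sketch also assumes that ‘split’ is not listed among the components, which the guide does not state explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100 daily values of a made-up "price" component.
time_delta = pd.Timedelta(days=1)
index = pd.date_range("2024-01-01", periods=100, freq=time_delta)
df = pd.DataFrame({"price": np.arange(100, dtype=float)}, index=index)

# Chronological boundaries, computed with integer arithmetic.
n = len(df)
train_end = n * 70 // 100  # first 70% of rows -> train
valid_end = n * 85 // 100  # next 15% of rows -> valid

split = np.full(n, 2, dtype=int)  # default: 2 (eval)
split[:train_end] = 0             # 0 (train)
split[train_end:valid_end] = 1    # 1 (valid)
df["split"] = split

df.attrs.update(
    {"name": "demo_prices", "components": ["price"], "time_delta": time_delta}
)

# Count the rows assigned to each set indicator.
print(df["split"].value_counts().sort_index())
```

A chronological split keeps the evaluation rows strictly later than the training rows. Any other assignment works the same way as long as the column contains only 0, 1, or 2.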

The resulting data structure will be stored under /custom_datasets in the relevant custom experiment output directory. It can then be used to train a model by passing kwargs={'data': {'name': <data name>, ...}, ...} when calling the KI.train_model function.

2. Useful functions

Use KI.available_datasets() to see what datasets are available after importing.

Use KI.summarize_dataset() to see a summary of your new dataset as it is imported.

3. Default datasets

While newly imported datasets are stored under /custom_datasets in the relevant custom experiment output directory, default datasets are stored under KnowIt/default_datasets.