Walkthrough 1: Logging and Reading Data

Note: All of the code for this walkthrough is available here

In this walkthrough, we’ll be starting with the following simple piece of code, which tries to find the minimum of a quadratic function:

import sys

def f(x):
    return (x - 2.03)**2 + 3

x = ...
tol = ...
step = ...

for _ in range(1000):
    # Take a uniform step in the direction of decrease
    if f(x + step) < f(x - step):
        x += step
    else:
        x -= step

    # If the absolute difference between the directions
    # is less than the tolerance, stop
    if abs(f(x + step) - f(x - step)) < tol:
        break
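For reference, a self-contained variant of this loop can be sketched as follows. The function name `minimize`, its `max_iters` parameter, and the concrete starting values are assumptions chosen for illustration, since the walkthrough leaves `x`, `tol`, and `step` unspecified. The sketch compares the absolute difference against the tolerance, so the loop does not stop while one side is still clearly lower.

```python
def f(x):
    """Quadratic with its minimum at x = 2.03, where f(x) = 3."""
    return (x - 2.03) ** 2 + 3


def minimize(x, step, tol, max_iters=1000):
    """Fixed-step search; `minimize` and `max_iters` are illustrative names."""
    for _ in range(max_iters):
        # Move one step toward whichever side has the lower value
        if f(x + step) < f(x - step):
            x += step
        else:
            x -= step
        # Stop once the values on both sides are nearly level
        if abs(f(x + step) - f(x - step)) < tol:
            break
    return x


# Hypothetical settings, not taken from the walkthrough
final_x = minimize(x=1.0, step=0.01, tol=1e-6)
final_opt = f(final_x)
print(final_x, final_opt)
```

With a fixed step the search can only land on a grid of points spaced `step` apart, so the final `x` is at best within one step of the true minimum.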

Initializing stores

Logging in Cox is done through the Store class, which can be created as follows:

from cox.store import Store
# rest of program here...
store = Store(OUT_DIR)

Upon construction, the Store instance creates a directory inside OUT_DIR whose name is a randomly generated UUID. That directory holds an HDFStore for storing data, some logging files, and a tensorboard directory (named tensorboard). After we run this command, our OUT_DIR directory should therefore look something like this:

$ ls OUT_DIR
7753a944-568d-4cc2-9bb2-9019cc0b3f49
$ ls OUT_DIR/7753a944-568d-4cc2-9bb2-9019cc0b3f49
save        store.h5    tensorboard

The experiment ID string 7753a944-568d-4cc2-9bb2-9019cc0b3f49 was autogenerated. If we wanted to name the experiment something else, we could pass the name as the second parameter: for example, making a store with Store(OUT_DIR, 'exp1') would set the experiment ID to exp1.

Creating tables

The next step is to declare the data we want to store via tables. We can add arbitrary tables according to our needs, but we need to specify their structure ahead of time by passing a schema. In our case, we will start out with a simple metadata table containing the parameters used to run an instance of the program above, along with a table for writing the result:

store.add_table('metadata', {
    'step_size': float,
    'tolerance': float,
    'initial_x': float,
    'out_dir': str
})

store.add_table('result', {
    'final_x': float,
    'final_opt': float
})

Each table corresponds exactly to a Pandas dataframe found in an HDFStore object.

A note on serialization

Cox supports basic object types (like float, int, str, etc.) along with any kind of serializable object (via dill or using PyTorch’s serialization method). To serialize an object, we pass one of cox.store.OBJECT, cox.store.PICKLE, or cox.store.PYTORCH_STATE as the type for that column in the schema dictionary.

In detail: OBJECT stores the object as a serialized string in the table, PICKLE stores the object as a serialized string on disk in a separate file, and PYTORCH_STATE stores the object as a serialized string on disk using torch.save, which is particularly useful for PyTorch objects like model weights. Note that saving large objects using OBJECT is not recommended, as it will adversely affect loading times.
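The difference between in-table and on-disk serialization can be illustrated with a small sketch. It is not Cox's implementation: it uses the standard-library pickle module instead of dill, the `payload` object is made up, and keeping the file path in the row is an assumption of the sketch.

```python
import os
import pickle
import tempfile

payload = {"weights": [0.1, 0.2, 0.3]}  # made-up object to serialize

# In-table style: the serialized bytes live in the row itself
row_inline = {"model": pickle.dumps(payload)}

# On-disk style: the bytes go to a separate file; keeping the file path
# in the row is an assumption of this sketch
save_dir = tempfile.mkdtemp()
path = os.path.join(save_dir, "model.pkl")
with open(path, "wb") as fh:
    pickle.dump(payload, fh)
row_on_disk = {"model": path}

# Reading back: the inline row is decoded directly, while the on-disk
# row needs one extra file read
restored_inline = pickle.loads(row_inline["model"])
with open(row_on_disk["model"], "rb") as fh:
    restored_from_disk = pickle.load(fh)
```

In the inline style every row carries the full serialized object, so reading the table means reading all of those bytes too, which is one way to see why large objects are better kept out of the table.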

Logging

Now that we have a table, we can write rows to it! Logging in Cox is done in a row-by-row manner: at any time, there is a working row that can be appended to/updated; the row can then be flushed (i.e. written to the file), which starts a new (empty) working row. The relevant commands are:

# This updates the working row, but does not write it permanently yet!
store['result'].update_row({
  "final_x": 3.0
})

# This updates it again
store['result'].update_row({
  "final_opt": 3.9409
})

# Write the row permanently, and start a new working row!
store['result'].flush_row()

# A shortcut for appending a row directly
store['metadata'].append_row({
  'step_size': 0.01,
  'tolerance': 1e-6,
  'initial_x': 1.0,
  'out_dir': '/tmp/'
})

Incremental updates with update_row

Subsequent calls to update_row() will edit the same working row. This is useful if different parts of the row are computed in different functions/locations in the code, as it removes the need for passing statistics around all over the place.
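The working-row behavior can be mimicked with a toy class. This is only a model of the semantics described above, not Cox's implementation: `ToyTable`, `record_position`, and `record_value` are hypothetical names, and the toy keeps rows in a plain list instead of an HDFStore.

```python
class ToyTable:
    """Toy stand-in for a table with a working row (not Cox's code)."""

    def __init__(self):
        self.rows = []     # rows that have been flushed
        self.working = {}  # the current, unflushed row

    def update_row(self, data):
        # Merge new values into the working row
        self.working.update(data)

    def flush_row(self):
        # Commit the working row and start a new, empty one
        self.rows.append(self.working)
        self.working = {}

    def append_row(self, data):
        # Shortcut: update and flush in one call
        self.update_row(data)
        self.flush_row()


def record_position(table, x):
    table.update_row({"final_x": x})


def record_value(table, value):
    table.update_row({"final_opt": value})


table = ToyTable()
record_position(table, 3.0)  # one part of the program fills one column
record_value(table, 3.9409)  # another part fills the other
table.flush_row()
print(table.rows)
```

Neither helper needs to know about the other's value; each only needs a handle to the table, which is the convenience the paragraph above describes.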

Reading data

By populating table rows, we are really just adding rows to an underlying HDFStore table. If we want to read the store later, we can simply open another store at the same location, and then read dataframes with simple commands:

# Note that EXP_ID is the directory the store wrote to in OUT_DIR
s = Store(OUT_DIR, EXP_ID)

# Read tables we wrote earlier
metadata = s['metadata'].df
result = s['result'].df

print(result)

Inspecting the result table, we see the expected result in our Pandas dataframe:

     final_x   final_opt
0   3.000000   3.940900

CollectionReader: Reading many experiments at once

Now, in our quadratic example, we aren’t going to try just one set of parameters; we are going to try a number of different values for step_size, tolerance, and initial_x (we haven’t yet discovered convex optimization). To do this, we run the code above once for each desired set of hyperparameters, supplying the same OUT_DIR for all of the runs (recall that Cox will automatically create a different, UUID-named folder inside OUT_DIR for each experiment).
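A plain for loop over a grid is one way to run such a sweep. The sketch below does not use Cox: it collects one dictionary per run in a list, and the grid values, the `run_once` helper, and the use of `uuid.uuid4()` as a stand-in for the per-run folder name are all assumptions made for illustration.

```python
import itertools
import uuid


def f(x):
    return (x - 2.03) ** 2 + 3


def run_once(initial_x, step, tol):
    """One run of the search loop; returns the final x (hypothetical helper)."""
    x = initial_x
    for _ in range(1000):
        x = x + step if f(x + step) < f(x - step) else x - step
        if abs(f(x + step) - f(x - step)) < tol:
            break
    return x


# Hypothetical grid: every combination of step size, tolerance, and start
grid = itertools.product([0.01, 0.1], [1e-6, 1e-3], [0.0, 1.0])

runs = []
for step, tol, initial_x in grid:
    final_x = run_once(initial_x, step, tol)
    runs.append({
        "exp_id": str(uuid.uuid4()),  # stand-in for the per-run folder name
        "step_size": step,
        "tolerance": tol,
        "initial_x": initial_x,
        "final_x": final_x,
        "final_opt": f(final_x),
    })

print(len(runs))
```

In a real sweep, the body of the loop would create a store in the shared OUT_DIR and write the metadata and result rows there instead of appending to a list.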

Imagine that we have done so (using any standard tool, e.g. sbatch in SLURM, sklearn grid search, or even a for loop like in our example file), and that we have a directory full of stores:

$ ls $OUT_DIR
drwxr-xr-x  6 engstrom  0424807a-c9c0-4974-b881-f927fc5ae7c3
...
...
drwxr-xr-x  6 engstrom  e3646fcf-569b-46fc-aba5-1e9734fedbcf
drwxr-xr-x  6 engstrom  f23d6da4-e3f9-48af-aa49-82f5c017e14f

Now, we want to collect all the results from this directory. We can use cox.readers.CollectionReader to read the tables from every experiment into a single concatenated pandas table:

from cox.readers import CollectionReader
reader = CollectionReader(OUT_DIR)
print(reader.df('result'))

This gives us all the result tables concatenated together as a Pandas DataFrame for easy manipulation:

     final_x   final_opt                                exp_id
0   1.000000    4.060900  ed892c4f-069f-4a6d-9775-be8fdfce4713
0   0.000010    7.120859  44ea3334-d2b4-47fe-830c-2d13dc0e7aaa
...
...
0   2.000000    3.000900  f031fc42-8788-4876-8c96-2c1237ceb63d
0 -14.000000  259.960900  73181d27-2928-48ec-9ac6-744837616c4b

pandas has a ton of powerful utilities for searching through and manipulating DataFrames. We recommend looking at their docs for information on how to do this. For convenience, we’ve given a few simple examples below:

df = reader.df('result')
m_df = reader.df('metadata')

# Keep only experiments that have step_size less than 1.0
exp_ids = set(m_df[m_df['step_size'] < 1.0]['exp_id'].tolist())
print(df[df['exp_id'].isin(exp_ids)]) # The filtered DataFrame

# Finding which experiment has the lowest final_opt
exp_id = df[df['final_opt'] == df['final_opt'].min()]['exp_id'].tolist()[0]
print(m_df[m_df['exp_id'] == exp_id]) # Metadata of the best experiment
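An alternative to looking up exp_id values by hand is to join the two tables on exp_id first. The sketch below uses small made-up DataFrames as stand-ins for reader.df('result') and reader.df('metadata'); the experiment IDs and hyperparameter values are hypothetical.

```python
import pandas as pd

# Made-up stand-ins for reader.df('result') and reader.df('metadata')
df = pd.DataFrame({
    "final_x": [2.0, 3.0, 1.0],
    "final_opt": [3.0009, 3.9409, 4.0609],
    "exp_id": ["exp-a", "exp-b", "exp-c"],
})
m_df = pd.DataFrame({
    "step_size": [0.5, 1.0, 2.0],
    "tolerance": [1e-6, 1e-6, 1e-6],
    "initial_x": [0.0, 0.0, 5.0],
    "exp_id": ["exp-a", "exp-b", "exp-c"],
})

# One row per experiment, with hyperparameters next to results
merged = m_df.merge(df, on="exp_id")

# Experiments with step_size below 1.0
small_steps = merged[merged["step_size"] < 1.0]

# Row with the lowest final_opt
best = merged.loc[merged["final_opt"].idxmin()]
print(best["exp_id"], best["step_size"])
```

Merging on a column produces a fresh index, which matters here because the concatenated tables shown earlier repeat the index label 0 for every experiment, so label-based lookups on them directly would be ambiguous.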