Walkthrough 1: Logging and Reading Data
Note: All of the code for this walkthrough is available here
In this walkthrough, we'll be starting with the following simple piece of code, which tries to find the minimum of a quadratic function:
import sys

def f(x):
    return (x - 2.03)**2 + 3

x = ...
tol = ...
step = ...

for _ in range(1000):
    # Take a uniform step in the direction of decrease
    if f(x + step) < f(x - step):
        x += step
    else:
        x -= step

    # If the difference between the directions
    # is less than the tolerance, stop
    if f(x + step) - f(x - step) < tol:
        break
Initializing stores

Logging in Cox is done through the Store class, which can be created as follows:
from cox.store import Store
# rest of program here...
store = Store(OUT_DIR)
Upon construction, the Store instance creates a directory in OUT_DIR named with a randomly generated UUID, containing an HDFStore for storing data, some logging files, and a tensorboard directory (named tensorboard). Therefore, after we run this command, our OUT_DIR directory should look something like this:
$ ls OUT_DIR
7753a944-568d-4cc2-9bb2-9019cc0b3f49
$ ls 7753a944-568d-4cc2-9bb2-9019cc0b3f49
save store.h5 tensorboard
The experiment ID string 7753a944-568d-4cc2-9bb2-9019cc0b3f49 was autogenerated. If we wanted to name the experiment something else, we could pass it as the second parameter; i.e. making a store with Store(OUT_DIR, 'exp1') would make the corresponding experiment ID exp1.
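For example (a minimal sketch; OUT_DIR is assumed to be defined elsewhere, as above):

from cox.store import Store

# Creates OUT_DIR/exp1 rather than a UUID-named directory
store = Store(OUT_DIR, 'exp1')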
Creating tables

The next step is to declare the data we want to store via _tables_. We can add arbitrary tables according to our needs, but we need to specify the structure ahead of time by passing a schema. In our case, we will start out with just a simple metadata table containing the parameters used to run an instance of the program above, along with a table for writing the result:
store.add_table('metadata', {
    'step_size': float,
    'tolerance': float,
    'initial_x': float,
    'out_dir': str
})

store.add_table('result', {
    'final_x': float,
    'final_opt': float
})
Each table corresponds exactly to a Pandas dataframe found in an HDFStore object.
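The dataframe backing a table can be accessed directly through the table's .df attribute (the same attribute we use in the Reading data section below):

# The table's current contents as a Pandas dataframe
print(store['result'].df)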
A note on serialization

Cox supports basic object types (like float, int, str, etc.) along with any kind of serializable object (via dill or using PyTorch's serialization method). In particular, if we want to serialize an object, we can pass one of the following types: cox.store.[OBJECT|PICKLE|PYTORCH_STATE] as the type value that is mapped to in the schema dictionary. cox.store.PYTORCH_STATE is particularly useful for dealing with PyTorch objects like model weights.

In detail: OBJECT corresponds to storing the object as a serialized string in the table, PICKLE corresponds to storing the object as a serialized string on disk in a separate file, and PYTORCH_STATE corresponds to storing the object as a serialized string on disk using torch.save. Note that saving large objects using OBJECT is not recommended, as it will adversely affect loading times.
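As an illustration, a schema mixing basic types with a serialized PyTorch state dict might look like the following sketch (the 'checkpoints' table name and its fields are our own, not part of the walkthrough):

import cox.store

store.add_table('checkpoints', {
    'epoch': int,
    # Stored on disk in a separate file via torch.save
    'model_state': cox.store.PYTORCH_STATE
})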
Logging

Now that we have tables, we can write rows to them! Logging in Cox is done in a row-by-row manner: at any time, there is a working row that can be appended to/updated; the row can then be flushed (i.e. written to the file), which starts a new (empty) working row. The relevant commands are:
# This updates the working row, but does not write it permanently yet!
store['result'].update_row({
    "final_x": 3.0
})

# This updates it again
store['result'].update_row({
    "final_opt": 3.9409
})

# Write the row permanently, and start a new working row!
store['result'].flush_row()

# A shortcut for appending a row directly
store['metadata'].append_row({
    'step_size': 0.01,
    'tolerance': 1e-6,
    'initial_x': 1.0,
    'out_dir': '/tmp/'
})
Incremental updates with update_row

Subsequent calls to update_row() will edit the same working row. This is useful if different parts of the row are computed in different functions/locations in the code, as it removes the need for passing statistics around all over the place.
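For instance, in the following sketch (the helper function names are illustrative, not part of Cox), two separate functions each fill in part of the same row before it is flushed:

def log_final_x(store, x):
    store['result'].update_row({'final_x': x})

def log_final_opt(store, x):
    store['result'].update_row({'final_opt': f(x)})

# Both updates edit the same working row
log_final_x(store, x)
log_final_opt(store, x)
store['result'].flush_row()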
Reading data

By populating table rows, we are really just adding rows to an underlying HDFStore table. If we want to read the store later, we can simply open another store at the same location, and then read dataframes with simple commands:
# Note that EXP_ID is the directory the store wrote to in OUT_DIR
s = Store(OUT_DIR, EXP_ID)
# Read tables we wrote earlier
metadata = s['metadata'].df
result = s['result'].df
print(result)
Inspecting the result table, we see the expected result in our Pandas dataframe:

    final_x  final_opt
0  3.000000   3.940900
CollectionReader: Reading many experiments at once

Now, in our quadratic example, we aren't just going to try one set of parameters; we are going to try a number of different values for step_size, tolerance, and initial_x (we haven't yet discovered convex optimization). To do this, we just run the code above a bunch of times with the desired hyperparameters, supplying the same OUT_DIR for all of the runs (recall that cox will automatically create a different, uuid-named folder inside OUT_DIR for each experiment).
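For concreteness, one such sweep might look like the following sketch, where run_one is a hypothetical wrapper around the optimization loop above that creates a Store in OUT_DIR, logs the metadata, and writes the result (the parameter grids are illustrative):

for step_size in [0.1, 0.01, 0.001]:
    for tolerance in [1e-4, 1e-6]:
        for initial_x in [0.0, 1.0, 10.0]:
            run_one(OUT_DIR, step_size, tolerance, initial_x)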
Imagine that we have done so (using any standard tool, e.g. sbatch in SLURM, sklearn grid search, or even a for loop like in our example file), and that we have a directory full of stores:
$ ls $OUT_DIR
drwxr-xr-x 6 engstrom 0424807a-c9c0-4974-b881-f927fc5ae7c3
...
...
drwxr-xr-x 6 engstrom e3646fcf-569b-46fc-aba5-1e9734fedbcf
drwxr-xr-x 6 engstrom f23d6da4-e3f9-48af-aa49-82f5c017e14f
Now, we want to collect all the results from this directory. We can use cox.readers.CollectionReader to read all the tables together in a concatenated pandas table:
from cox.readers import CollectionReader
reader = CollectionReader(OUT_DIR)
print(reader.df('result'))
This gives us all the result tables concatenated together as a Pandas DataFrame for easy manipulation:
      final_x   final_opt                                exp_id
0    1.000000    4.060900  ed892c4f-069f-4a6d-9775-be8fdfce4713
0    0.000010    7.120859  44ea3334-d2b4-47fe-830c-2d13dc0e7aaa
..        ...         ...                                   ...
0    2.000000    3.000900  f031fc42-8788-4876-8c96-2c1237ceb63d
0  -14.000000  259.960900  73181d27-2928-48ec-9ac6-744837616c4b
pandas has a ton of powerful utilities for searching through and manipulating DataFrames. We recommend looking at the pandas documentation for information on how to do this. For convenience, we've given a few simple examples below:
df = reader.df('result')
m_df = reader.df('metadata')

# Filter to experiments with step_size less than 1.0
exp_ids = set(m_df[m_df['step_size'] < 1.0]['exp_id'].tolist())
print(df[df['exp_id'].isin(exp_ids)])  # The filtered DataFrame

# Find which experiment has the lowest final_opt
exp_id = df[df['final_opt'] == df['final_opt'].min()]['exp_id'].tolist()[0]
print(m_df[m_df['exp_id'] == exp_id])  # Metadata of the best experiment