API
The `program_synthesis_benchmarks` library provides immutable (frozen) sets of the dataset names that it can download and read. There is a set for each version of the PSB suite, plus a set that is the union of all suites.
```python
import program_synthesis_benchmarks as psb

print(psb.PSB1_PROBLEMS)
# frozenset({'collatz-numbers', 'compare-string-lengths', 'count-odds', 'digits', ...

print(psb.PSB2_PROBLEMS)
# frozenset({'basement', 'bouncing-balls', 'bowling', 'camel-case', 'coin-sums', ...

print(psb.ALL_PROBLEMS)
# frozenset({'collatz-numbers', 'compare-string-lengths', 'count-odds', 'digits', ...
```
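Because these constants are frozensets, ordinary set operations apply. Below is a small sketch using stand-in values (the real constants hold the full problem lists; only the union relationship is taken from the library's description):

```python
# Stand-ins for the library constants, built from names shown in the
# printed output above. The real frozensets contain many more entries.
psb1 = frozenset({"collatz-numbers", "count-odds", "digits"})
psb2 = frozenset({"basement", "bouncing-balls", "bowling"})

# ALL_PROBLEMS is documented as the union of the per-suite sets.
all_problems = psb1 | psb2

# Membership checks are useful for validating a name before reading a dataset.
print("digits" in all_problems)   # True
print(psb1 <= all_problems)       # True: each suite is a subset of the union
```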
The rest of the API is composed of two functions documented below.
`download_datasets(local_dir, datasets)`

Downloads the edge and random data files for all of the given datasets to the local directory.

Up to 8 files are downloaded in parallel. If fewer than 8 cores are available, the number of files downloaded in parallel will equal the number of cores.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`local_dir` | `Union[str, Path]` | The directory under which to download dataset files. Will be created if it does not exist. | required |
`datasets` | `Iterable[str]` | The names of the datasets to download. | required |
Source code in program_synthesis_benchmarks/__init__.py
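The parallel-download behaviour described above can be sketched with the standard library. This is an illustrative pattern, not the library's actual implementation; the real download URLs and file layout are internal to `program_synthesis_benchmarks`, so the `download_all` helper and its `url_to_path` mapping are hypothetical:

```python
import os
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.request import urlopen

def download_all(url_to_path, max_workers=8):
    # Hypothetical helper: fetch each URL to its local path, with at most
    # min(8, core count) downloads running in parallel.
    workers = min(max_workers, os.cpu_count() or 1)

    def fetch(item):
        url, path = item
        path = Path(path)
        path.parent.mkdir(parents=True, exist_ok=True)  # create local dirs if missing
        with urlopen(url) as resp:
            path.write_bytes(resp.read())

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() drains the iterator so all downloads finish and any error is raised.
        list(pool.map(fetch, url_to_path.items()))
```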
`read_dataset(dataset, *, cache_dir=None, force_download=False)`

Reads a dataset into a DataFrame.

If `cache_dir` is not `None`, reading the dataset from the corresponding sub-directory is attempted first. If no files are found, a copy of the dataset is downloaded and stored in a dataset-specific sub-directory of the `cache_dir`, and then these files are read into a DataFrame. If `cache_dir=None`, data files will be downloaded to a temporary directory.

If `force_download` is `True`, the data will be downloaded regardless of whether files already exist in the cache directory. This is useful when repairing data files or picking up changes that may have been made to the source data. Files in the cache directory may be overwritten. `force_download` is ignored if `cache_dir=None`.
Please cache your data!

The providers of this (free) data kindly ask that you avoid repeated re-downloads by using a `cache_dir` whenever possible. This data changes extremely rarely. Runtimes will be faster for you (and hosting costs lower for the provider) if you download once and read the data from local storage.
The returned DataFrame will have the following schema:

- `input1`, `input2`, and so on - One column per program input. Datatypes vary by dataset.
- `output` - The expected returned output (aka the label). Datatypes vary by dataset.
- `stdout` - Optional. The expected printed output. String type.
- `edge_case` - A boolean indicator. True if the case is a human-written "edge case".
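The schema above can be worked with using ordinary pandas operations. The frame below is a hand-made stand-in following the documented columns; its values are illustrative, not real dataset rows:

```python
import pandas as pd

# A toy frame with the documented schema (values made up for illustration).
df = pd.DataFrame(
    {
        "input1": [0, 5, 100],
        "output": [0, 8, 350],
        "edge_case": [True, False, False],  # True for human-written edge cases
    }
)

# Split human-written edge cases from randomly generated cases.
edge_cases = df[df["edge_case"]]
random_cases = df[~df["edge_case"]]
print(len(edge_cases), len(random_cases))  # 1 2
```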
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`dataset` | `str` | The name of the dataset to download and/or read into a DataFrame. | required |
`cache_dir` | `Optional[Union[str, Path]]` | The directory of the local filesystem to store the downloaded copy of the data in. | `None` |
`force_download` | `bool` | Forces the download of a fresh copy of the data, regardless of what is already in the cache. | `False` |
Returns:

Type | Description |
---|---|
`DataFrame` | A pandas DataFrame. Each row is a labeled training case. |