Program Synthesis Benchmarks
A Python library for downloading and reading the datasets from the Program Synthesis Benchmark Suite (v1 and v2). Downloaded datasets are stored as Parquet files and read into pandas DataFrames.
This library has no official release yet. The source code is public to enable collaboration on the initial official release. Use at your own risk.
Rationale
The Program Synthesis Benchmark Suite is a set of general programming tasks used to evaluate program synthesis systems. The authors of the PSB suites provide canonical datasets of over 1 million labeled cases per problem for use in inductive program synthesis systems. These datasets have been used to standardize the development and evaluation of many methods.
A unique feature of the PSB suite datasets is their complex schemas. Many problems involve complex data types, such as collection types, and occasionally specify both returned and printed (stdout) outputs. Additionally, a small subset of training cases (so-called "edge cases") are considered critical to the evaluation of synthesized programs and are often handled differently during synthesis. For these reasons, and others implied below, the program_synthesis_benchmarks library is a tool that helps practitioners manage these datasets.
The program_synthesis_benchmarks library provides functionality for downloading and reading data files. It uses Parquet files (read as pandas DataFrames) to take advantage of explicit schemas and complex column types. The library also caches data locally (at a location of your choosing) to avoid repeated downloads.
Prior Art
The predecessor of this library is psb2-python, a library for fetching PSB2 datasets from the same source. psb2-python has a slightly different abstraction from program_synthesis_benchmarks, and a few trade-offs motivated the creation of the new library.
| | psb2-python | program_synthesis_benchmarks |
|---|---|---|
| PSB1 | No | Yes |
| PSB2 | Yes | Yes |
| Representation | 3 "formats" that organize data into nested Python collections | Pandas DataFrame |
| Sampling | Forces sampling and train-test split | Reads entire dataset |
| Storage format | Uncompressed JSON lines files | Parquet |
| Metadata - Data types | | Parquet and DataFrame have explicit schema (dtypes) |
| Metadata - Edge cases | | Includes indicators of human-written "edge cases" versus randomly generated cases |
| Metadata - Return vs Stdout | | Separate columns named output and stdout |
| Concurrent downloads | Serial | Up to 8 files in parallel |
Note: this library is currently using JSON lines files while we work on getting Parquet data files hosted.
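Because program_synthesis_benchmarks reads the entire dataset rather than forcing a sample, users who want a psb2-python-style train/test split can build one themselves with ordinary pandas operations. A minimal sketch, using the read_dataset function shown in Getting Started below; the split sizes here are illustrative, not prescribed by either library:
import pandas as pd
from program_synthesis_benchmarks import read_dataset

df = read_dataset("gcd", cache_dir="./path/to/data")
edge = df[df["edge_case"]]
rest = df[~df["edge_case"]]

# Train on every human-written edge case plus a sample of random cases;
# test on the remaining random cases. The sizes are illustrative.
sampled = rest.sample(n=200, random_state=0)
train = pd.concat([edge, sampled])
test = rest.drop(sampled.index)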
The read-or-download pattern for libraries that provide data for experimentation was popularized by the Penn Machine Learning Benchmarks (PMLB). PMLB focuses on classification and regression tasks, while this library provides datasets for program synthesis.
Getting Started
Install program-synthesis-benchmarks from PyPI using pip. Using a virtual environment is recommended in most cases.
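For example (assuming the package is published to PyPI under the name given above):
pip install program-synthesis-benchmarks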
To get a frozenset of supported datasets, use ALL_PROBLEMS. For just PSB1 or PSB2, use PSB1_PROBLEMS or PSB2_PROBLEMS, respectively.
from program_synthesis_benchmarks import PSB1_PROBLEMS
print(PSB1_PROBLEMS)
# frozenset({'collatz-numbers', 'compare-string-lengths', 'count-odds', 'digits', ...
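Since these are ordinary frozensets, standard set operations apply. For example, problem names can be validated before downloading; this sketch uses only ALL_PROBLEMS and built-in membership tests:
from program_synthesis_benchmarks import ALL_PROBLEMS

requested = ["gcd", "fizz-buzz", "not-a-real-problem"]
# Plain frozenset membership; nothing library-specific beyond ALL_PROBLEMS.
unknown = [name for name in requested if name not in ALL_PROBLEMS]
if unknown:
    raise ValueError(f"Unknown problems: {unknown}")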
To download the Parquet data files for a set of problems, provide a path and a collection of datasets to download_datasets.
from program_synthesis_benchmarks import download_datasets
download_datasets("./path/to/data", ["fizz-buzz", "gcd", "replace-space-with-newline"])
After all downloads are complete, the file system will contain one directory per dataset, each holding two Parquet files: one with the human-written "edge" cases and one with 1 million randomly generated cases.
path
└── to
└── data
├── fizz-buzz
│ ├── fizz-buzz-edge.parquet
│ └── fizz-buzz-random.parquet
├── gcd
│ ├── gcd-edge.parquet
│ └── gcd-random.parquet
└── replace-space-with-newline
├── replace-space-with-newline-edge.parquet
└── replace-space-with-newline-random.parquet
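Because the cached files are plain Parquet, they can also be opened directly with pandas, bypassing the library. For example, to load just the human-written edge cases for gcd (path taken from the layout above):
import pandas as pd

# Read one cached file directly; read_dataset below is the usual route.
edge_cases = pd.read_parquet("./path/to/data/gcd/gcd-edge.parquet")
print(edge_cases)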
To read the data for a specific problem, use read_dataset. If you have previously downloaded the data for that problem, you can specify a cache_dir to read from local storage; otherwise, the dataset will be downloaded.
from program_synthesis_benchmarks import read_dataset
read_dataset("gcd", cache_dir="./path/to/data")
# input1 input2 output edge_case
# 0 1 1 1 True
# 1 4 400000 4 True
# 2 54 24 6 True
# 3 4200 3528 168 True
# 4 820000 63550 2050 True
# ... ... ... ... ...
# 999995 793436 643541 1 False
# 999996 382449 108033 3 False
# 999997 910646 435802 2 False
# 999998 347104 474860 4 False
# 999999 375006 743332 2 False
#
# [1000006 rows x 4 columns]
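The edge_case column distinguishes the 6 human-written cases from the 1,000,000 generated ones (hence the 1,000,006 rows above), and the explicit dtypes mentioned in Prior Art can be inspected on the returned DataFrame. A small sketch using only standard pandas:
from program_synthesis_benchmarks import read_dataset

df = read_dataset("gcd", cache_dir="./path/to/data")
print(df["edge_case"].sum())  # count of human-written edge cases
print(df.dtypes)              # explicit column types carried over from Parquet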
More detailed documentation can be found in the API docs.
Citation
If you use these datasets in a publication, please cite the paper PSB2: The Second Program Synthesis Benchmark Suite.
BibTeX entry for the paper:
@inproceedings{Helmuth:2021:GECCO,
  author            = {Thomas Helmuth and Peter Kelly},
  title             = {{PSB2}: The Second Program Synthesis Benchmark Suite},
  booktitle         = {2021 Genetic and Evolutionary Computation Conference},
  series            = {GECCO '21},
  year              = {2021},
  month             = "10-14 " # jul,
  address           = {Lille, France},
  publisher         = {ACM},
  publisher_address = {New York, NY, USA},
  isbn13            = {978-1-4503-8350-9},
  size              = {10 pages},
  doi               = {10.1145/3449639.3459285},
  url               = {https://dl.acm.org/doi/10.1145/3449639.3459285},
}