Project Definition

Most of the project definition happens in a project folder in the checked-out copy of OpenStudio-BuildStock. However, for this library to work, a separate project YAML file provides the details needed for the batch run. An example file, project_resstock_national.yml, is included in this repo and shown below.

schema_version: 0.2
stock_type: residential
buildstock_directory: ../OpenStudio-BuildStock  # Relative to this file or absolute
project_directory: project_national  # Relative to buildstock_directory
output_directory: ../national_test_outputs
# weather_files_path: ../  # Relative to this file or absolute path to zipped weather files

baseline:
  n_datapoints: 4
  n_buildings_represented: 133172057  # Total number of residential dwelling units in contiguous United States, including unoccupied units, resulting from a census tract level query of ACS 5-yr 2016 (i.e. 2012-2016), using this script:
  sampling_algorithm: quota # The default resstock sampling algorithm - use precomputed if using the precomputed_sample option

upgrades:
  - upgrade_name: Triple-Pane Windows
    options:
      - option: Windows|Low-E, Triple, Non-metal, Air, L-Gain
#        apply_logic:
        costs:
          - value: 45.77
            multiplier: Window Area (ft^2)
        lifetime: 30

timeseries_csv_export:
  reporting_frequency: Hourly
  include_enduse_subcategories: true

# downselect: # Uncomment and specify the logic you want to use to downselect to a subset of the building stock
#   resample: true
#   logic:
#     - Geometry Building Type RECS|Single-Family Detached
#     - Vacancy Status|Occupied

eagle:
  n_jobs: 4
  minutes_per_sim: 30
  account: enduse
  sampling:
    time: 20
  postprocessing:
    time: 60
    n_workers: 3

# aws:
#   # The job_identifier should be unique, start with alpha, not include dashes, and limited to 10 chars or data loss can occur
#   job_identifier: test_proj
#   s3:
#     bucket: resbldg-datasets
#     prefix: testing/user_test
#   emr:
#     slave_instance_count: 1
#   region: us-west-2
#   use_spot: true
#   batch_array_size: 10
#   # To receive email updates on job progress accept the request to receive emails that will be sent from Amazon
#   notifications_email:

# postprocessing:
#   aggregate_timeseries: true
#   aws:
#     region_name: 'us-west-2'
#     s3:
#       bucket: resbldg-datasets
#       prefix: resstock-athena/calibration_runs_new
#     athena:
#       glue_service_role: service-role/AWSGlueServiceRole-default
#       database_name: testing
#       max_crawling_time: 300 #time to wait for the crawler to complete before aborting it

The next few paragraphs will describe each section of the file and what it does.

Reference the project

First we tell it what project we’re running with the following keys:

  • buildstock_directory: The absolute (or relative to this YAML file) path of the OpenStudio-BuildStock repository.

  • project_directory: The relative (to the buildstock_directory) path of the project.

  • schema_version: The version of the project yaml file to use and validate - currently the minimum version is 0.2.

Weather Files

Each batch of simulations depends on a number of weather files. These are provided in a zip file, whose location is specified with one of the following keys:

  • weather_files_url: Where the zip file of weather files can be downloaded from.

  • weather_files_path: Where on this machine to find the zipped weather files. This can be absolute or relative (to this file).

Weather files for Typical Meteorological Years (TMYs) can be obtained from the NREL data catalog.

Historical weather data for Actual Meteorological Years (AMYs) can be purchased in EPW format from various private companies. NREL users of buildstock batch can use NREL-owned AMY datasets by setting weather_files_url to a zip file located on Box.
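
As a minimal sketch, the two ways of pointing to the zipped weather files might look like the following. The URL and the local path are hypothetical placeholders, not real locations; only one of the two keys would normally be set.

```yaml
# Option 1: download the zip file of weather files.
# (hypothetical URL, replace with the location of your own zip file)
weather_files_url: https://example.com/path/to/weather_files.zip

# Option 2: use a zip file already on this machine.
# (hypothetical path, relative to this YAML file or absolute)
# weather_files_path: ../weather/weather_files.zip
```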

Custom Weather Files

You can use your own custom weather files for a specific location in one of two ways:

  • Rename the filename references in your local options_lookup.tsv in the resources folder to match your custom weather file names. For example, in options_lookup.tsv, the Location AL_Birmingham.Muni.AP.722280 is matched to weather_file_name=USA_AL_Birmingham.Muni.AP.722280.epw. To use a different weather file for this location, update the weather_file_name field to match the name of your custom file.

  • Rename your custom .epw weather file to match the references in your local options_lookup.tsv in the resources folder.

Baseline simulations incl. sampling algorithm

Information about baseline simulations is listed under the baseline key.

  • sampling_algorithm: The sampling algorithm to use for this project. The default residential option is quota. The default commercial option is sobol, which is not supported for residential projects. To use a previously computed buildstock.csv file, use the precomputed sampler.

  • n_datapoints: The number of buildings to sample and run for the baseline case if using the sobol or quota sampling algorithms.

  • n_buildings_represented: The number of buildings that this sample is meant to represent.

  • precomputed_sample: Filepath of csv containing pre-defined building options to use in the precomputed sampling routine. This can be absolute or relative (to this file).

  • skip_sims: Include this key to control whether the set of baseline simulations is run. The default (i.e., when this key is not included) is to run all the baseline simulations. No results csv table with baseline characteristics will be provided when the baseline simulations are skipped.

  • measures_to_ignore: ADVANCED FEATURE (USE WITH CAUTION–ADVANCED USERS/WORKFLOW DEVELOPERS ONLY) to optionally not run one or more measures (specified as a list) that are referenced in the options_lookup.tsv but should be skipped during model creation. The measures are referenced by their directory name. This feature is currently only implemented for residential models constructed with the BuildExistingModel measure.

  • custom_gems: VERY ADVANCED FEATURE - NOT SUPPORTED - ONLY ATTEMPT USING SINGULARITY CONTAINERS ON EAGLE. This activates the bundle and bundle_path interfaces in the OpenStudio CLI to allow for custom gem packages (needed to support rapid development of the standards gem). This works extraordinarily well if the singularity image is properly configured, but that’s easier said than done. The la100 branch on the nrel/docker-openstudio repo is a starting place if this is required.

  • osw_template: An optional key allowing for switching of which workflow generator to use within the commercial or residential classes.

  • include_qaqc: An optional flag, currently only configured for commercial, which when set to True runs additional measures that check a number of key (and often incorrectly configured) parts of the simulation inputs and provide additional model QAQC data streams on the output side. Recommended for test runs but not production analyses.
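
As an illustration of the keys above, a baseline block that reuses a previously computed sample might be sketched as follows. The csv path is a hypothetical placeholder, and whether other keys such as n_datapoints are also required with the precomputed sampler may depend on your version of buildstockbatch.

```yaml
baseline:
  sampling_algorithm: precomputed
  # hypothetical path, relative to this YAML file or absolute
  precomputed_sample: ../samples/buildstock.csv
  n_buildings_represented: 133172057
```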

OpenStudio Version Overrides

This is a feature only used by ComStock at the moment. Please refer to the ComStock HPC documentation for additional details on the correct configuration. This is noted here to explain the presence of two keys in the version 0.2 schema: os_version and os_sha.

Residential Simulation Controls

If the key residential_simulation_controls is in the project yaml file, the parameters to the ResidentialSimulationControls measure will be modified from their defaults to what is specified there. The defaults are:

            'timesteps_per_hr': 6,
            'begin_month': 1,
            'begin_day_of_month': 1,
            'end_month': 12,
            'end_day_of_month': 31,
            'calendar_year': 2007
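
A sketch of overriding some of these defaults in the project file is shown below. The values are illustrative only, and this assumes that any parameter not listed keeps its default.

```yaml
residential_simulation_controls:
  timesteps_per_hr: 4    # illustrative value; the default is 6
  calendar_year: 2012    # illustrative value; the default is 2007
```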

Upgrade Scenarios

Under the upgrades key is a list of upgrades to apply with the following properties:

  • upgrade_name: (required) The name that will be in the outputs for this upgrade scenario.

  • options: A list of options to apply as part of this upgrade.

    • option: (required) The option to apply, in the format parameter|option which can be found in options_lookup.tsv in OpenStudio-BuildStock.

    • apply_logic: Logic that defines which buildings to apply the upgrade to. See Filtering Logic for instructions.

    • costs: A list of costs for the upgrade. Multiple costs can be entered and each is multiplied by a cost multiplier, described below.

      • value: A cost for the measure, which will be multiplied by the multiplier.

      • multiplier: The cost above is multiplied by this value, which is a function of the building. Since there can be multiple costs, this permits both fixed and variable costs for upgrades that depend on the properties of the baseline building. The multiplier needs to be from this enumeration list in OpenStudio-BuildStock or from the list in your branch of that repo.

    • lifetime: Lifetime in years of the upgrade.

  • package_apply_logic: (optional) The conditions under which this package of upgrades should be performed. See Filtering Logic.

  • reference_scenario: (optional) The upgrade_name that should act as the reference for this upgrade when calculating savings. The only effect is that reference_scenario shows up as a column in the results csvs alongside the upgrade name; buildstockbatch will not do the savings calculation.
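
The properties above can be combined as in the following sketch. The first scenario reuses the option and cost from the example file; the second scenario, its name, and its logic are hypothetical and only illustrate where package_apply_logic, apply_logic, and reference_scenario would go.

```yaml
upgrades:
  - upgrade_name: Triple-Pane Windows
    options:
      - option: Windows|Low-E, Triple, Non-metal, Air, L-Gain
        costs:
          - value: 45.77
            multiplier: Window Area (ft^2)
        lifetime: 30

  # Hypothetical second scenario, restricted to 1950s homes
  - upgrade_name: Triple-Pane Windows 1950s Only
    package_apply_logic:
      - Vintage|1950s
    reference_scenario: Triple-Pane Windows
    options:
      - option: Windows|Low-E, Triple, Non-metal, Air, L-Gain
        apply_logic:
          - not: Heating Fuel|Propane
        costs:
          - value: 45.77
            multiplier: Window Area (ft^2)
        lifetime: 30
```

With these costs, a building with a hypothetical 200 ft^2 of window area would be assigned an upgrade cost of 45.77 × 200 = 9154.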

Simulation Annual Outputs Options

Include the simulation_output key to optionally include annual totals for end use subcategories (i.e., interior equipment broken out by end use) along with the usual annual simulation results. This argument is passed directly into the SimulationOutputReport measure in OpenStudio-BuildStock. Please refer to the measure argument there to determine what to set it to in your config file. Note that this measure and presence of any arguments may be different depending on which version of OpenStudio-BuildStock you’re using. The best thing you can do is to verify that it works with what is in your branch.
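
A minimal sketch of this key is shown below. It assumes the measure argument is named include_enduse_subcategories, as in the time series settings of the example file; check the SimulationOutputReport measure in your branch to confirm the argument name.

```yaml
simulation_output:
  # argument name is an assumption; verify against the measure in your branch
  include_enduse_subcategories: true
```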

Time Series Export Options

Include the timeseries_csv_export key to include hourly or subhourly results along with the usual annual simulation results. These arguments are passed directly to the TimeseriesCSVExport measure in OpenStudio-BuildStock. Please refer to the measure arguments there to determine what to set them to in your config file. Note that this measure and arguments may be different depending on which version of OpenStudio-BuildStock you’re using. The best thing you can do is to verify that it works with what is in your branch.

Additional Reporting Measures

Include the reporting_measures key along with a list of reporting measure names to apply additional reporting measures (that require no arguments) to the workflow. Any columns reported by these additional measures will be appended to the results csv.
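
For example, the key might be used as in this sketch. The measure names are hypothetical placeholders for reporting measures available in your copy of OpenStudio-BuildStock.

```yaml
reporting_measures:
  - MyFirstReportingMeasure    # hypothetical measure name
  - MySecondReportingMeasure   # hypothetical measure name
```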

Output Directory

output_directory specifies where the outputs of the simulation should be stored.

Down Selecting the Sampling Space

Sometimes it is desirable to run a stock simulation of a subset of what is included in a project. For instance one might want to run the simulation only in one climate region or for certain vintages. However, it can be a considerable effort to create a new project. Adding the downselect key to the project file permits a user to specify filters of what buildings should be simulated.

Downselecting can be performed in one of two ways: with and without resampling. Downselecting with resampling samples twice, once to determine how much smaller the set of sampled buildings becomes when it is filtered down and again with a larger sample so the final set of sampled buildings is at or near the number specified in n_datapoints.

Downselecting without resampling skips that step. In this case the total number of sampled buildings returned will be the number left over after sampling the entire stock and then filtering down to the buildings that meet the criteria. Unlike downselecting with resampling, downselecting without resampling can also be used with a precomputed buildstock.csv. Instead of starting with a fresh set of samples and filtering them based on the downselect logic, the sampler starts with the buildstock.csv provided and filters out buildings using the downselect logic.

The downselect block works as follows:

downselect:
  resample: true
  logic:
    - Heating Fuel|Natural Gas
    - Location Region|CR02

For details on how to specify the filters, see Filtering Logic.
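
As a sketch of downselecting without resampling from an existing sample, the baseline and downselect blocks might be combined as follows. The csv path is a hypothetical placeholder; the logic entries are taken from the commented-out block in the example file.

```yaml
baseline:
  sampling_algorithm: precomputed
  # hypothetical path, relative to this YAML file or absolute
  precomputed_sample: ../samples/buildstock.csv

downselect:
  resample: false   # keep only the rows of buildstock.csv that match the logic
  logic:
    - Geometry Building Type RECS|Single-Family Detached
    - Vacancy Status|Occupied
```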

Filtering Logic

There are several places where logic is applied to filter simulations by the option values. This is done by specifying the parameter|option criteria you want to include or exclude along with the appropriate logical operator. This is done in the YAML syntax as follows:


To include certain parameter option combinations, specify them in a list or by using the and key.

- Vintage|1950s
- Location Region|CR02

and:
  - Vintage|1950s
  - Location Region|CR02

Either form above would include buildings in climate region 2 built in the 1950s. A list, except one inside an or block, is always interpreted as an and block.


or:
  - Vintage|<1950
  - Vintage|1950s
  - Vintage|1960s

This example would include buildings built before 1970.


not: Heating Fuel|Propane

This will select buildings that do not have propane as their heating fuel.

not:
  - Vintage|1950s
  - Location Region|CR02

This will select buildings that are not both Vintage 1950s and in location region CR02. Note that this will still select buildings of 1950s vintage provided they aren’t in region CR02, and buildings in region CR02 provided they aren’t of vintage 1950s. To select only those buildings that are neither of Vintage 1950s nor in region CR02, use any of the following equivalent forms:

- not: Vintage|1950s
- not: Location Region|CR02

and:
  - not: Vintage|1950s
  - not: Location Region|CR02

not:
  or:
    - Vintage|1950s
    - Location Region|CR02

Combining Logic

These constructs can be combined to declare arbitrarily complex logic. Here is an example:

- or:
  - Vintage|<1950
  - Vintage|1950s
  - Vintage|1960s
- not: Geometry Garage|3 Car
- not: Geometry House Size|3500+
- Geometry Stories|1

This will select homes that were built before 1970, don’t have three car garages, are less than 3500 sq.ft., and have only one storey.

Eagle Configuration

Under the eagle key are the configuration options for running the batch job on the Eagle supercomputer.

  • n_jobs: Number of eagle jobs to parallelize the simulation into

  • minutes_per_sim: Maximum allocated simulation time in minutes

  • account: Eagle allocation account to charge the job to

  • sampling: Configuration for the sampling in eagle

    • time: Maximum time in minutes to allocate to sampling job

  • postprocessing: Eagle configuration for the postprocessing step

    • time: Maximum time in minutes to allocate to the postprocessing job

    • n_workers: Number of eagle workers to parallelize the postprocessing job into. Max supported is 32.

    • node_memory_mb: The memory (in MB) to request for the eagle node for postprocessing. The valid values are 85248, 180224 and 751616. Default is 85248.

    • parquet_memory_mb: The size (in MB) of the combined parquet file in memory. Default is 40000.

    • keep_intermediate_files: Set this to true if you want to keep postprocessing intermediate files (for debugging or other exploratory purposes). The intermediate files contain results_job*.json.gz files and individual buildings’ timeseries parquet files. Default is false.
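
Putting the keys above together, an eagle block that also sets the optional postprocessing keys might be sketched as follows. The values for the optional keys are illustrative choices from the documented ranges, not recommendations.

```yaml
eagle:
  n_jobs: 4
  minutes_per_sim: 30
  account: enduse
  sampling:
    time: 20
  postprocessing:
    time: 60
    n_workers: 3
    node_memory_mb: 180224          # one of 85248, 180224, 751616
    parquet_memory_mb: 40000        # the documented default
    keep_intermediate_files: true   # keep intermediate files for debugging
```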

AWS Configuration

The top-level aws key is used to specify options for running the batch job on the AWS Batch service.


Many of these options overlap with options specified in the Postprocessing section. The options here take precedence when running on AWS. In a future version we will break backwards compatibility in the config file and have more consistent options.

  • job_identifier: A unique string that starts with an alphabetical character, is up to 10 characters long, and only has letters, numbers or underscore. This is used to name all the AWS service objects to be created and differentiate it from other jobs.

  • s3: Configuration for project data storage on s3. When running on AWS, this overrides the s3 configuration in the Postprocessing Configuration Options.

    • bucket: The s3 bucket this project will use for simulation output and processed data storage.

    • prefix: The s3 prefix at which the data will be stored.

  • region: The AWS region in which the batch will be run and data stored.

  • use_spot: true or false. Defaults to false if missing. This tells the project to use the Spot Market for data simulations, which typically yields about 60-70% cost savings.

  • spot_bid_percent: Percent of on-demand price you’re willing to pay for your simulations. The batch will wait to run until the price drops below this level.

  • batch_array_size: Number of concurrent simulations to run. Max: 10000.

  • notifications_email: Email to notify you of simulation completion. You’ll receive an email at the beginning where you’ll need to accept the subscription to receive further notification emails.

  • emr: Optional key to specify options for postprocessing using an EMR cluster. Generally the defaults should work fine.

    • master_instance_type: The instance type to use for the EMR master node. Default: m5.xlarge.

    • slave_instance_type: The instance type to use for the EMR worker nodes. Default: r5.4xlarge.

    • slave_instance_count: The number of worker nodes to use. Same as eagle.postprocessing.n_workers. Increase this for a large dataset. Default: 2.

    • dask_worker_vcores: The number of cores for each dask worker. Increase this if your dask workers are running out of memory. Default: 2.

  • job_environment: Specifies the computing requirements for each simulation.

    • vcpus: Number of CPUs needed. default: 1.

    • memory: Amount of RAM needed for each simulation in MiB. Default: 1024. For large multifamily buildings this works better if set to 2048.
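
A sketch of an aws block using the options above is shown below. The job identifier, bucket, prefix, and region are taken from the commented-out block in the example file; the spot_bid_percent value and the email address are hypothetical.

```yaml
aws:
  job_identifier: test_proj   # unique, starts with a letter, at most 10 characters
  s3:
    bucket: resbldg-datasets
    prefix: testing/user_test
  region: us-west-2
  use_spot: true
  spot_bid_percent: 100       # hypothetical value
  batch_array_size: 10
  notifications_email: user@example.com   # hypothetical address
  emr:
    slave_instance_count: 2
  job_environment:
    vcpus: 1
    memory: 2048              # MiB; suggested above for large multifamily buildings
```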


Postprocessing

After a batch of simulations completes, the individual simulation results are aggregated in a postprocessing step so that the BuildStock results can be analyzed, as follows:

  1. The inputs and annual outputs of each simulation are gathered together into one table for each upgrade scenario. In older versions that ran on PAT, this was known as the results.csv. This table is now made available in both csv and parquet format.

  2. Time series results for each simulation are gathered and concatenated into fewer larger parquet files that are better suited for querying using big data analysis tools.

    For ResStock runs with the ResidentialScheduleGenerator, the generated schedules are horizontally concatenated with the time series files before aggregation, making sure the schedule values are properly lined up with the timestamps in the same way that EnergyPlus handles ScheduleFiles.

Uploading to AWS Athena

BuildStock results can optionally be uploaded to AWS for further analysis using Athena. This process requires appropriate access to an AWS account to be configured on your machine. You will need to set this up wherever you use buildstockbatch. If you don’t have keys, consult your AWS administrator to get them set up.

Postprocessing Configuration Options


The region_name and s3 info here are ignored when running buildstock_aws. The configuration is defined in AWS Configuration.

The configuration options for postprocessing and AWS upload are:

  • postprocessing: postprocessing configuration

    • aws: configuration related to uploading to and managing data in Amazon Web Services. For this to work, AWS access must be configured on your machine. Including this key will cause your datasets to be uploaded to AWS; omitting it will cause them not to be uploaded.

      • region_name: The name of the aws region to use for database creation and other services.

      • s3: Configurations for data upload to Amazon S3 data storage service.

        • bucket: The s3 bucket into which the postprocessed data is to be uploaded.

        • prefix: S3 prefix at which the data is to be uploaded. The complete path will become: s3://bucket/prefix/output_directory_name

      • athena: configurations for Amazon Athena database creation. If this section is missing/commented-out, no Athena tables are created.

        • glue_service_role: The data in s3 is catalogued using Amazon Glue data crawler. An IAM role must be present for the user that grants rights to Glue crawler to read s3 and create database catalogue. The name of that IAM role must be provided here. Default is: “service-role/AWSGlueServiceRole-default”. For help, consult the AWS documentation for Glue Service Roles.

        • database_name: The name of the Athena database in which the data is to be placed. All tables in the database will be prefixed with the output directory name.

        • max_crawling_time: The maximum time in seconds to wait for the glue crawler to catalogue the data before aborting it.