Project Definition

A project is defined in two places: a project folder in your checked-out copy of the ResStock or ComStock repo, and a project configuration file that defines how the project will be run. The project file (also known as the “yaml” file) is the primary input for buildstockbatch. Some examples of project files are:

Similar project files can be found in the ComStock repo.

The next few paragraphs will describe each section of the project file and what it does.

Reference the project

First we tell buildstockbatch which project we’re running with the following keys:

buildstock_directory: The absolute (or relative to this YAML file) path of the ResStock or ComStock repository.
project_directory: The relative (to the buildstock_directory) path of the project.
schema_version: The version of the project yaml file to use and validate - currently the minimum version is 0.3.
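
The path rules above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not buildstockbatch's own code: the function name resolve_project_paths and the file and folder names in the usage lines are hypothetical.

```python
from pathlib import Path


def resolve_project_paths(yaml_file, buildstock_directory, project_directory):
    """Return (buildstock_dir, project_dir) as absolute paths.

    buildstock_directory may be absolute or relative to the YAML file;
    project_directory is relative to buildstock_directory.
    """
    yaml_dir = Path(yaml_file).resolve().parent
    # Joining with an absolute path discards yaml_dir, so both forms work.
    buildstock_dir = (yaml_dir / buildstock_directory).resolve()
    project_dir = (buildstock_dir / project_directory).resolve()
    return buildstock_dir, project_dir


# Hypothetical layout: the YAML file sits next to a checked-out repo folder.
buildstock_dir, project_dir = resolve_project_paths(
    "configs/national.yml", "../resstock", "project_national"
)
print(buildstock_dir)
print(project_dir)
```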

Weather Files

Each batch of simulations depends on a number of weather files, which are provided in a zip file. Specify the zip file with one of the following keys:

weather_files_url: Where the zip file of weather files can be downloaded from.
weather_files_path: Where on this machine to find the zipped weather files. This can be absolute or relative (to this file).
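
As a rough sketch of how the two keys could be handled, the helper below picks the weather source from a parsed project file. It assumes that exactly one of the two keys should be set, which the text above suggests but does not state outright; the function name weather_source and the example URL are hypothetical.

```python
from pathlib import Path


def weather_source(cfg, yaml_file):
    """Return ("url", url) or ("path", absolute_path) for the weather zip."""
    has_url = "weather_files_url" in cfg
    has_path = "weather_files_path" in cfg
    # Assumption: the two keys are mutually exclusive.
    if has_url == has_path:
        raise ValueError(
            "set exactly one of weather_files_url or weather_files_path"
        )
    if has_url:
        return ("url", cfg["weather_files_url"])
    # A relative weather_files_path is resolved against the project file.
    yaml_dir = Path(yaml_file).resolve().parent
    return ("path", (yaml_dir / cfg["weather_files_path"]).resolve())


print(weather_source({"weather_files_url": "https://example.com/weather.zip"},
                     "project.yml"))
```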

Weather files for Typical Meteorological Years (TMYs) can be obtained from the NREL data catalog.

Historical weather data for Actual Meteorological Years (AMYs) can be purchased in EPW format from various private companies. NREL users of buildstockbatch can use NREL-owned AMY datasets by setting weather_files_url to a zip file located on Box.

Custom Weather Files

You can use your own custom weather files for a specific location in one of two ways:

Rename the filename references in your local options_lookup.tsv in the resources folder to match your custom weather file names. For example, options_lookup.tsv matches the County option “AL, Autauga County” to weather_station_epw_filepath=../../../G0100010.epw. To change the weather file for this location, update the weather_station_epw_filepath value to the name of your new file.
Rename your custom .epw weather file to match the references in your local options_lookup.tsv in the resources folder.

References

This is a throwaway section where you can define YAML anchors so that they can be used elsewhere in the yaml. Things defined here have no impact on the simulation and are used purely for anchor definitions.

Sampler

The sampler key defines the type of building sampler to be used for the batch of simulations. The purpose of the sampler is to enumerate the buildings and building characteristics to be run as part of the building stock simulation. It has two arguments, type and args.

type tells buildstockbatch which sampler class to use.
args are passed to the sampler to define how it is run.

Different samplers are available for ResStock and ComStock and variations thereof. Details of what samplers are available and their arguments are available at Samplers.

Workflow Generator

The workflow_generator key defines which workflow generator will be used to transform a building description from a set of high level characteristics into an OpenStudio workflow that in turn generates the detailed EnergyPlus model. It has two arguments, type and args.

type tells buildstockbatch which workflow generator class to use.
args are passed to the workflow generator to define how it is run.

Different workflow generators are available for ResStock and ComStock and variations thereof. Details of the available workflow generators and their arguments are at Workflow Generators.

Baseline simulations incl. sampling algorithm

Information about baseline simulations is listed under the baseline key.

n_buildings_represented: The number of buildings that this sample is meant to represent.
skip_sims: Include this key to control whether the set of baseline simulations is run. The default (i.e., when this key is not included) is to run all the baseline simulations. No results csv table with baseline characteristics will be provided when the baseline simulations are skipped.
custom_gems: true or false. ONLY WORKS ON KESTREL AND LOCAL. When true, buildstockbatch will call the OpenStudio CLI commands with the bundle and bundle_path options. These options tell the CLI to load a custom set of gems rather than those included in the OpenStudio CLI. For both Kestrel and local Docker runs, these gems are first specified in the buildstock\resources\Gemfile. For Kestrel, when the apptainer image is built, these gems are added to the image. For local Docker, when the containers are started, the gems specified in the Gemfile are installed into a Docker volume on the local computer. This volume is mounted by each container as models are run, so each run uses the custom gems.

OpenStudio Version

The following two top level keys are required:

os_version: The version of OpenStudio required for the run (e.g. “3.7.0”). BuildStockBatch will verify that a suitable version of OpenStudio is available and return an error if not.
os_sha: The sha hash of the required OpenStudio version (e.g. “06d9d975e1”). This must match the sha of the matching OpenStudio release.

Upgrade Scenarios

Under the upgrades key is a list of upgrades to apply with the following properties:

upgrade_name: (required) The name that will be in the outputs for this upgrade scenario.
options: A list of options to apply as part of this upgrade.
- option: (required) The option to apply, in the format parameter|option which can be found in options_lookup.tsv in ResStock.
- apply_logic: Logic that defines which buildings to apply the upgrade to. See Filtering Logic for instructions.
- costs: A list of costs for the upgrade. Multiple costs can be entered and each is multiplied by a cost multiplier, described below.
  - value: A cost for the measure, which will be multiplied by the multiplier.
  - multiplier: The cost above is multiplied by this value, which is a function of the building. Since there can be multiple costs, this permits both fixed and variable costs for upgrades that depend on the properties of the baseline building. The multiplier must come from the enumeration list in the resstock repo, the enumeration list in the comstock repo, or the list in your branch of that repo.
- lifetime: Lifetime in years of the upgrade.
package_apply_logic: (optional) The conditions under which this package of upgrades should be performed. See Filtering Logic.
reference_scenario: (optional) The upgrade_name that should act as the reference for this upgrade when calculating savings. The only effect is that reference_scenario shows up as a column in the results csvs alongside the upgrade name; buildstockbatch will not do the savings calculation.
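
To make the value and multiplier description concrete, here is a minimal sketch of how several cost entries for one option might combine for a single building. It only illustrates the arithmetic described above. The multiplier names "fixed" and "floor_area_sqft" and the building numbers are hypothetical stand-ins, not entries from the real enumeration lists.

```python
def upgrade_cost(costs, building_multipliers):
    """Sum value * multiplier over all cost entries for one building.

    costs: list of {"value": number, "multiplier": name} entries.
    building_multipliers: the numeric value each multiplier name takes
        for this particular building.
    """
    total = 0.0
    for cost in costs:
        total += cost["value"] * building_multipliers[cost["multiplier"]]
    return total


# Hypothetical multiplier names, chosen only for this illustration.
costs = [
    {"value": 500.0, "multiplier": "fixed"},          # fixed cost per building
    {"value": 2.0, "multiplier": "floor_area_sqft"},  # cost per square foot
]
building = {"fixed": 1.0, "floor_area_sqft": 1500.0}

print(upgrade_cost(costs, building))  # 500*1 + 2*1500 = 3500.0
```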

Output Directory

output_directory: Specifies where the outputs of the simulation should be stored. The last folder in the path will be used as the table name in Athena (if aws configuration is present under postprocessing), so it must be lowercase, start with a letter, and contain only letters, numbers, and underscores. This is an Athena requirement.
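
The naming rule above can be checked before a run is submitted. The sketch below encodes the rule as a regular expression; the function name athena_table_name and the example path are hypothetical, and the check is an illustration rather than buildstockbatch's own validation.

```python
import re
from pathlib import PurePosixPath

# Lowercase, starts with a letter, then letters, digits, or underscores.
ATHENA_NAME = re.compile(r"[a-z][a-z0-9_]*")


def athena_table_name(output_directory):
    """Return the last folder of output_directory if it is a valid name."""
    name = PurePosixPath(output_directory).name
    if not ATHENA_NAME.fullmatch(name):
        raise ValueError(
            f"{name!r} must be lowercase, start with a letter, and contain "
            "only letters, numbers, and underscores"
        )
    return name


print(athena_table_name("/scratch/user/national_run_01"))  # national_run_01
```

The same pattern applies to the Athena database_name described under the postprocessing options.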

Kestrel Configuration

Under the kestrel key is a set of configuration options for running the batch job on the Kestrel supercomputer.

n_jobs: Number of kestrel jobs to parallelize the simulation into
minutes_per_sim: Required. Maximum allocated simulation time in minutes.
account: Required. kestrel allocation account to charge the job to.
sampling: Configuration for the sampling in kestrel
- time: Maximum time in minutes to allocate to sampling job
postprocessing: kestrel configuration for the postprocessing step
- time: Maximum time in minutes to allocate to the postprocessing job
- n_workers: Number of kestrel nodes to parallelize the postprocessing job into. Max supported is 32. Default is 2.
- n_procs: Number of CPUs to use within each kestrel node. Max is 104. Default is 52. Try reducing this if you get an out-of-memory (OOM) error.
- node_memory_mb: The memory (in MB) to request for each kestrel node used for postprocessing. The default is 250000, which is a standard node.
- parquet_memory_mb: The size (in MB) of the combined parquet file in memory. Default is 1000.
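
The required keys and the stated maxima can be checked with a short helper before a job is submitted. This is a sketch based only on the limits listed above; check_kestrel is a hypothetical name and the allocation name in the usage lines is made up.

```python
def check_kestrel(cfg):
    """Return a list of problems found in a parsed kestrel section."""
    problems = []
    # minutes_per_sim and account are marked as required above.
    for key in ("minutes_per_sim", "account"):
        if key not in cfg:
            problems.append(f"missing required key: {key}")
    post = cfg.get("postprocessing", {})
    # Defaults of 2 and 52 are the documented defaults.
    if post.get("n_workers", 2) > 32:
        problems.append("postprocessing.n_workers exceeds the maximum of 32")
    if post.get("n_procs", 52) > 104:
        problems.append("postprocessing.n_procs exceeds the maximum of 104")
    return problems


kestrel_cfg = {
    "n_jobs": 4,
    "minutes_per_sim": 30,
    "account": "myalloc",  # hypothetical allocation name
    "postprocessing": {"time": 60, "n_workers": 64},
}
print(check_kestrel(kestrel_cfg))
```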

AWS Configuration

The top-level aws key is used to specify options for running the batch job on the AWS Batch service.

Note

Many of these options overlap with options specified in the Postprocessing section. The options here take precedence when running on AWS. In a future version we will break backwards compatibility in the config file and have more consistent options.

job_identifier: (required) A unique string that starts with an alphabetical character, is up to 10 characters long, and only has letters, numbers or underscore. This is used to name all the AWS service objects to be created and differentiate it from other jobs.
s3: (required) Configuration for project data storage on s3. When running on AWS, this overrides the s3 configuration in the Postprocessing Configuration Options.
- bucket: The s3 bucket this project will use for simulation output and processed data storage.
- prefix: The s3 prefix at which the data will be stored.
region: (required) The AWS region in which the batch will be run and data stored. Probably “us-west-2” if you’re at NREL.
use_spot: (optional) true or false. Defaults to true if missing. This tells the project to use the Spot Market for data simulations, which typically yields about 60-70% cost savings.
spot_bid_percent: (optional) Percent of on-demand price you’re willing to pay for your simulations. The batch will wait to run until the price drops below this level. Usually leave this one blank.
batch_array_size: (required) Number of concurrent simulations to run. Max: 10,000. Unless this is a small run with fewer than 100,000 simulations, just set this to 10,000.
notifications_email: (required) Email to notify you of simulation completion. You’ll receive an email at the beginning where you’ll need to accept the subscription to receive further notification emails. This doesn’t work right now.
dask: (required) Dask configuration for postprocessing
- n_workers: (required) Number of dask workers to use.
- scheduler_cpu: (optional) One of [1024, 2048, 4096, 8192, 16384]. Default: 2048. CPU to allocate for the scheduler task. 1024 = 1 VCPU. See Fargate Task CPU and memory for allowable combinations of CPU and memory.
- scheduler_memory: (optional) Amount of memory to allocate to the scheduler task. Default: 8192. See Fargate Task CPU and memory for allowable combinations of CPU and memory.
- worker_cpu: (optional) One of [1024, 2048, 4096, 8192, 16384]. Default: 2048. CPU to allocate for the worker tasks. 1024 = 1 VCPU. See Fargate Task CPU and memory for allowable combinations of CPU and memory.
- worker_memory: (optional) Amount of memory to allocate to the worker tasks. Default: 8192. See Fargate Task CPU and memory for allowable combinations of CPU and memory.
job_environment: Specifies the computing requirements for each simulation.
- vcpus: (optional) Number of CPUs needed. Default: 1. This probably doesn’t need to be changed.
- memory: (optional) Amount of RAM needed for each simulation in MiB. Default: 1024. For large multifamily buildings this works better if set to 2048.
tags: (optional) This is a list of key-value pairs to attach as tags to all the AWS objects created in the process of running the simulation. If you are at NREL, please fill out the following tags so we can track and allocate costs: billingId, org, and owner.
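
Several of the constraints above (required keys, the job_identifier format, the batch_array_size maximum, and the allowed CPU values) lend themselves to a quick pre-flight check. The sketch below is illustrative only: check_aws is a hypothetical helper, not part of buildstockbatch, and the example values are made up.

```python
import re

# Starts with a letter, at most 10 characters, letters/digits/underscore.
AWS_JOB_ID = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,9}")
REQUIRED_AWS_KEYS = ("job_identifier", "s3", "region", "batch_array_size",
                     "notifications_email", "dask")
ALLOWED_CPU_VALUES = {1024, 2048, 4096, 8192, 16384}  # 1024 = 1 vCPU


def check_aws(cfg):
    """Return a list of problems found in a parsed aws section."""
    problems = [f"missing required key: {key}"
                for key in REQUIRED_AWS_KEYS if key not in cfg]
    job_id = cfg.get("job_identifier", "")
    if job_id and not AWS_JOB_ID.fullmatch(job_id):
        problems.append("job_identifier has an invalid format")
    if cfg.get("batch_array_size", 0) > 10_000:
        problems.append("batch_array_size exceeds the maximum of 10,000")
    dask = cfg.get("dask", {})
    for key in ("scheduler_cpu", "worker_cpu"):
        if dask.get(key, 2048) not in ALLOWED_CPU_VALUES:
            problems.append(f"dask.{key} is not an allowed CPU value")
    return problems


aws_cfg = {
    "job_identifier": "natl_run1",
    "s3": {"bucket": "my-bucket", "prefix": "runs"},
    "region": "us-west-2",
    "batch_array_size": 10000,
    "notifications_email": "user@example.com",
    "dask": {"n_workers": 8},
}
print(check_aws(aws_cfg))
```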

GCP Configuration

The top-level gcp key is used to specify options for running the batch job on GCP, using GCP Batch and Cloud Run.

Note

When BuildStockBatch is run on GCP, it will only save results to GCP Cloud Storage (using the gcs configuration below); i.e., it currently cannot save to AWS S3 and Athena. Likewise, BuildStockBatch run locally, on Kestrel, or on AWS cannot save to GCP.

job_identifier: A unique string that starts with an alphabetical character, is up to 48 characters long, and only has lowercase letters, numbers and/or hyphens. This is used to name the GCP Batch and Cloud Run jobs to be created and differentiate them from other jobs.
project: The GCP Project ID in which the job will run.
service_account: Optional. The service account email address to use when running jobs on GCP. Default: the Compute Engine default service account of the GCP project.
gcs: Configuration for project data storage on GCP Cloud Storage.
- bucket: The Cloud Storage bucket this project will use for simulation output and processed data storage.
- prefix: The Cloud Storage prefix at which the data will be stored within the bucket.
- upload_chunk_size_mib: Optional. The size of data chunks used when uploading files to GCS, in MiB. If your network environment produces a TimeoutError when uploading project files, reducing this may help. Default: 40 MiB
region: The GCP region in which the job will be run and the region of the Artifact Registry. (e.g. us-central1)
batch_array_size: Number of tasks to divide the simulations into. Tasks with fewer than 100 simulations each are recommended when using spot instances, to minimize lost/repeated work when instances are preempted. Max: 10,000.
parallelism: Optional. Maximum number of tasks that can run in parallel. If not specified, uses GCP’s default behavior (the lesser of batch_array_size and job limits). Parallelism is also limited by Compute Engine quotas and limits (including vCPU quota).
artifact_registry: Configuration for Docker image storage in GCP Artifact Registry.
- repository: The name of the GCP Artifact Repository in which Docker images are stored. This will be combined with the project and region to build the full URL to the repository.
job_environment: Optional. Specifies the computing requirements for each simulation.
- vcpus: Optional. Number of CPUs to allocate for running each simulation. Default: 1.
- memory_mib: Optional. Amount of RAM to allocate for each simulation in MiB. Default: 1024
- boot_disk_mib: Optional. Extra boot disk size in MiB for each task. This affects the size of the boot disk of the machine(s) running simulations, which is the disk the simulations use (see the Batch OS environment docs for details). This will likely need to be set to at least 2,048 if more than 8 simulations will be run in parallel on the same machine (i.e., when vCPUs per machine_type ÷ vCPUs per sim > 8). Default: None (which should result in a 30 GB boot disk according to those docs).
- machine_type: Optional. GCP Compute Engine machine type to use. If omitted, GCP Batch will choose a machine type based on the requested vCPUs and memory. If set, the machine type should have at least as many resources as requested for each simulation above. If it is large enough, multiple simulations will be run in parallel on the same machine. Typically this is a type from the E2 series. Usually safe to leave unset.
- use_spot: Optional. Whether to use Spot VMs for data simulations, which can reduce costs by up to 91%. Default: false
- minutes_per_sim: Optional. Maximum time per simulation. Default works well for ResStock, but this should be increased for ComStock. Default: 3 minutes
postprocessing_environment: Optional. Specifies the Cloud Run computing environment for postprocessing.
- cpus: Optional. Number of CPUs to use. Use up to 8 for large jobs. Default: 2.
- memory_mib: Optional. Amount of RAM needed in MiB. At least 2048 MiB per CPU is recommended. Use up to 32768 MiB for large jobs. Default: 4096 MiB.
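
Three of the rules of thumb above involve a little arithmetic: the job_identifier format, the number of simulations per task, and the boot disk condition. The sketch below works through them. It assumes simulations are split evenly across tasks, which the text does not state; all function names and example numbers are hypothetical.

```python
import math
import re

# Starts with a lowercase letter, at most 48 characters,
# lowercase letters, digits, or hyphens.
GCP_JOB_ID = re.compile(r"[a-z][a-z0-9-]{0,47}")


def is_valid_gcp_job_identifier(job_identifier):
    return GCP_JOB_ID.fullmatch(job_identifier) is not None


def sims_per_task(n_simulations, batch_array_size):
    """Largest number of simulations in one task, assuming an even split."""
    return math.ceil(n_simulations / batch_array_size)


def needs_larger_boot_disk(machine_vcpus, vcpus_per_sim=1):
    """True when more than 8 simulations would run on one machine."""
    return machine_vcpus / vcpus_per_sim > 8


# 550,000 simulations over 10,000 tasks gives 55 per task, which is under
# the 100-per-task guideline for spot instances.
print(sims_per_task(550_000, 10_000))
# A 16-vCPU machine with 1 vCPU per simulation runs 16 at once, so
# boot_disk_mib would likely need to be at least 2,048.
print(needs_larger_boot_disk(16))
```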

Postprocessing

After a batch of simulations completes, the individual simulation results are aggregated in a postprocessing step so that BuildStock results can be analyzed. The aggregation works as follows:

The inputs and annual outputs of each simulation are gathered together into one table for each upgrade scenario. In older versions that ran on PAT, this was known as the results.csv. This table is now made available in both csv and parquet format.
Time series results for each simulation are gathered and concatenated into fewer larger parquet files that are better suited for querying using big data analysis tools.

For ResStock runs with the ResidentialScheduleGenerator, the generated schedules are horizontally concatenated with the time series files before aggregation, making sure the schedule values are properly lined up with the timestamps in the same way that EnergyPlus handles ScheduleFiles.

Uploading to AWS Athena

BuildStock results can optionally be uploaded to AWS for further analysis using Athena. This process requires appropriate access to an AWS account to be configured on your machine. You will need to set this up wherever you use buildstockbatch. If you don’t have keys, consult your AWS administrator to get them set up. The appropriate keys are already installed on Kestrel, so no action is required. If you run on AWS, this step is already done since the simulation outputs are already on S3.

Postprocessing Configuration Options

Warning

The region_name and s3 info here are ignored when running buildstock_aws. The configuration is defined in AWS Configuration.

The configuration options for postprocessing and AWS upload are:

postprocessing: postprocessing configuration
- keep_individual_timeseries: (optional, bool) For some use cases it is useful to keep the timeseries output for each simulation as a separate parquet file. Setting this option to true allows that. Default is false.
- partition_columns: (optional, list) Allows partitioning the output data based on some columns. The columns must match the parameters found in options_lookup.tsv. This allows for efficient Athena queries. Only recommended for moderate or large runs (ndatapoints > 10K).
- publish_annual_results: (optional, bool) When set to true, additional processed annual results will be generated in both CSV and Parquet formats. For residential projects, this functionality uses the resstockpostproc module’s publishing functions to transform the data. The processed results are stored in a results_csvs_pub directory and a pub_annual subdirectory within the parquet directory. Default is false.
- aws: (optional) Configuration related to uploading to and managing data in Amazon Web Services. For this to work, please configure aws. Including this key will cause your datasets to be uploaded to AWS; omitting it will cause them not to be uploaded.
  - region_name: The name of the aws region to use for database creation and other services.
  - s3: Configuration for data upload to the Amazon S3 data storage service.
    - bucket: The s3 bucket into which the postprocessed data is to be uploaded.
    - prefix: S3 prefix at which the data is to be uploaded. The complete path will become: s3://bucket/prefix/output_directory_name
  - athena: Configuration for Amazon Athena database creation. If this section is missing/commented-out, no Athena tables are created.
    - glue_service_role: The data in s3 is catalogued using Amazon Glue data crawler. An IAM role must be present for the user that grants rights to Glue crawler to read s3 and create database catalogue. The name of that IAM role must be provided here. Default is: “service-role/AWSGlueServiceRole-default”. For help, consult the AWS documentation for Glue Service Roles.
    - database_name: The name of the Athena database in which the data is to be placed. All tables in the database will be prefixed with the output directory name. The database name must be lowercase, start with a letter, and contain only letters, numbers, and underscores. This is an Athena requirement.
    - max_crawling_time: The maximum time in seconds to wait for the glue crawler to catalogue the data before aborting it.
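
The upload location described for the prefix key (s3://bucket/prefix/output_directory_name) can be assembled as follows. This is a small sketch: s3_upload_path is a hypothetical helper, the bucket and paths are made up, and stripping stray slashes from the prefix is an assumption made here for tidiness rather than documented behavior.

```python
from pathlib import PurePosixPath


def s3_upload_path(bucket, prefix, output_directory):
    """Build s3://bucket/prefix/output_directory_name."""
    # Only the last folder of output_directory is used in the path.
    output_name = PurePosixPath(output_directory).name
    # Assumption: leading and trailing slashes in the prefix are dropped.
    return f"s3://{bucket}/{prefix.strip('/')}/{output_name}"


print(s3_upload_path("my-bucket", "runs/2024", "/scratch/user/national_run_01"))
# s3://my-bucket/runs/2024/national_run_01
```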