The Project.yaml file

The project.yaml file has several sections, defining needed by the workflow. The file includes comments that can be referred to for detailed information about the workflow.

workflow: This is metadata about the workflow and can be ignored but should be included.
project: This section defines information and settings for the project.
datasets: This section points the workflow to hic files that define the workflow.
tracks: This section points the workflow to track data that can be painted on the 3D structures that are created.
annotations: This section points to annotation files that can be used to select regions in the 4D Genome Browser. Either .gff or .csv files can be used.
bookmarks: This section defines features and locations of interest that can be quickly selected in the 4D Genome Browser

The workflow expects the data files defined in the project.yaml file to exist, be well-formed, and contain data that can be cross-referenced per the expectations of the tools.

Workflow section

The workflow section contains metadata about the workflow, most importantly the version string. This section is not required, and the workflow will work if this is not present.

workflow:
    version: "1.5.6"

Project section

project:
    name:               "your project name"
    chromosome:         "chr22"
    interval:           200000
    count_threshold:    2.0
    bond_coeff:         55
    blackout:
        - [1, 85]

This section contains parameters that can be tuned to control the behavior of the workflow.

name: a descriptive string for the project that is only used in this file. Can be used to retain any information the user would like
chromosome: the chromosome to be viewed. This is expected to be present in the .hic data files provided in the datasets section.
interval: the length of genetic material that is represented by each bead that is passed to the LAMMPS simulation, and which is shown in the final visualization. The default value of 200,000 means that the input .hic data will be sampled at a 200KB resolution, and the number of beads passed to the LAMMPS simulation (and represented in the 3D structure and visualization) is:

\[(num.\ pairs\ in\ project\ chromosome)/project\ interval\]

count_threshold: Parameter used for LAMMPS. A threshold used in computing values for input to the LAMMPS simulation. (Cullen: details)
bond_coeff: Parameter used for LAMMPS. FENE bond coefficient used in the LAMMPS simulation. If the LAMMPS run fails with a “bad FENE bond” error, try increasing this value.
blackout: A list of bead ID numbers that can be hidden in the final visualization. These are determined by the user, but generally are used to hide long ‘tails’ of material that do not coalesce in the final 3D structure due to a variety of factors.

NOTE bead IDs start at 1. This needs to be spelled out somewhere.

Datasets Section

This defines the datasets that are to be compared in the final browser. The final visualization will show a comparative visualization between the first (left window) and second (right window) datasets in the list.

datasets:
     - name: "some name"
       data: file/relative/to/project/directory
     - name: "some name"
       data: file/relative/to/project/directory

datasets: a list of values describing the two required datasets.
- name: a descriptive name for the dataset. Appears as a title for the 3D structure view in the browser.
- data: .hic file for a dataset. This must be contained in the project directory.

Tracks Section

This defines track data that can be painted on the final 3D structure.

tracks:
    - name: "name of the track"
      file: filename.csv
      columns:
        - name: "name of the column"
          file: (optional) filename.csv
        - name: "name of the column"
          file: (optional) filename.csv

tracks: a list of values defining track data for the datasets
- name: a descriptive name for the dataset. This will appear in the pulldown menu to select a track in the browser.
- file: a csv file in the project directory. This is the default file that is searched for the columns below, unless the value is overridden by another file value.
- columns: a list of values defining the files for the datasets
  
  name: a string that is the name of a column in the source csv file.
  
  file (optional): the csv file to search for the name of this column.

Bookmarks section

This defines data about bookmarks for the 4D Genome Browser UI. The bookmarks can be either locations or features, and are defined as in these examples.

locations a list of pairs of values. The first value is the start of the location, and the second value is the end of the location.
features a list of strings, each of which is the name of an annotation.

bookmarks:
    locations:
        - [start, end]
        - [start, end]
        ...
    features:
        - namestring
        - "name string"

Annotations section

This defines data about annotations that are available for selection in the 4D Genome Browser UI. The user can define both gff and csv sources for annotations. See the section on the features.csv file in the section on file formats.

annotations:
    genes:
        file: "chr22.gff"
        description: "Your description or citation here"
    features:
        file: "features.csv"
        description: "Your description or citation here"