The project.yaml file
=====================

The ``project.yaml`` file has several sections that define the settings and
data needed by the workflow. The file includes comments that provide more
detailed information about the workflow.

- **workflow**: This is metadata about the workflow. It is optional, but
  including it is recommended.
- **project**: This section defines information and settings for the project.
- **datasets**: This section points the workflow to the ``.hic`` files for
  the datasets to be compared.
- **tracks**: This section points the workflow to ``track`` data that
  can be painted on the 3D structures that are created.
- **annotations**: This section points to annotation files that can be used
  to select regions in the 4D Genome Browser. Either ``.gff`` or ``.csv``
  files can be used.
- **bookmarks**: This section defines features and locations of interest
  that can be quickly selected in the 4D Genome Browser.

The workflow expects the data files defined in the ``project.yaml`` file to
exist, be well-formed, and contain data that can be cross-referenced per
the expectations of the tools.
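
As a rough orientation, the top-level layout of the file can be sketched as
follows. This is an abbreviated outline, not a complete or validated file: the
values are the placeholders used in the examples in the sections below, and
each section's full set of keys is described there.

.. code-block:: yaml

    # Abbreviated sketch of the top-level sections; values are placeholders.
    workflow:
        version: "1.5.6"
    project:
        name: "your project name"
        chromosome: "chr22"
        interval: 200000
    datasets:
        - name: "some name"
          data: file/relative/to/project/directory
        - name: "some name"
          data: file/relative/to/project/directory
    tracks:
        - name: "name of the track"
          file: filename.csv
          columns:
            - name: "name of the column"
    annotations:
        genes:
            file: "chr22.gff"
            description: "Your description or citation here"
    bookmarks:
        features:
            - "name string"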

Workflow section
----------------

The workflow section contains metadata about the workflow, most importantly
the version string. This section is optional; the workflow runs without it.

.. code-block:: yaml

    workflow:
        version: "1.5.6"

Project section
---------------

.. code-block:: yaml

    project:
        name:               "your project name"
        chromosome:         "chr22"
        interval:           200000
        count_threshold:    2.0
        bond_coeff:         55
        blackout:
            - [1, 85]
       

This section contains parameters that can be tuned to control the behavior
of the workflow.

- **name**: a descriptive string for the project. It is used only in this
  file and can hold any information the user would like to retain.
- **chromosome**: the chromosome to be viewed. This is expected to be present
  in the ``.hic`` data files provided in the ``datasets`` section.
- **interval**: the length of genetic material, in base pairs, represented by
  each *bead* that is passed to the ``LAMMPS`` simulation and shown in
  the final visualization. The default value of 200,000 means that the
  input ``.hic`` data will be sampled at a 200 kb resolution. The number of
  *beads* passed to the ``LAMMPS`` simulation (and represented in the 3D
  structure and visualization) is:

  .. math::

     \frac{\text{number of base pairs in the project chromosome}}{\text{project interval}}

- **count_threshold**: a threshold used in computing the input values for
  the ``LAMMPS`` simulation. (Cullen: details)
- **bond_coeff**: the FENE bond coefficient used in the ``LAMMPS``
  simulation. If the ``LAMMPS`` run fails with a "bad FENE bond" error,
  try increasing this value.
- **blackout**: A list of *bead* ID numbers that can be hidden in the 
  final visualization. These are determined by the user, but generally
  are used to hide long 'tails' of material that do not coalesce in the 
  final 3D structure due to a variety of factors.

**NOTE**: bead IDs start at 1.
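
As a worked example of the bead-count formula, assume a chromosome of about
50.8 million base pairs, which is roughly the length of human chr22 in the
GRCh38 assembly. This length is an assumption for illustration; the actual
value depends on the assembly behind your ``.hic`` files. With the default
interval of 200,000:

.. math::

   \frac{50{,}800{,}000}{200{,}000} = 254

Under these assumptions the structure would contain roughly 254 beads. How a
partial final bin is handled is not specified here, so the exact count may
differ by one.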

Datasets section
----------------

This section defines the two datasets that are compared in the browser.
The final visualization shows the first dataset in the list in the left
window and the second dataset in the right window.

.. code-block:: yaml

    datasets:
        - name: "some name"
          data: file/relative/to/project/directory
        - name: "some name"
          data: file/relative/to/project/directory

- **datasets**: a list of values describing the two required datasets.

  - **name**: a descriptive name for the dataset. It appears as the title of
    the 3D structure view in the browser.
  - **data**: the ``.hic`` file for the dataset. This file must be in the
    project directory.

Tracks section
--------------

This defines track data that can be painted on the final 3D structure.

.. code-block:: yaml

    tracks:
        - name: "name of the track"
          file: filename.csv
          columns:
            - name: "name of the column"
              file: filename.csv  # optional
            - name: "name of the column"
              file: filename.csv  # optional

- **tracks**: a list of values defining track data for the datasets.

  - **name**: a descriptive name for the track. This will appear
    in the pulldown menu used to select a track in the browser.
  - **file**: a csv file in the project directory. This is the default
    file that is searched for the columns below, unless a column
    overrides it with its own ``file`` value.
  - **columns**: a list of values defining the columns to read for the track.

    - **name**: a string that is the name of a column in the source csv file.
    - **file (optional)**: the csv file to search for this column, used
      instead of the track's default file.
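
The per-column file override described above can be sketched as follows. The
track, file, and column names here are hypothetical and only illustrate the
layout; substitute the names used in your own csv files.

.. code-block:: yaml

    tracks:
        - name: "example track"              # hypothetical track name
          file: tracks_default.csv           # hypothetical default file for this track
          columns:
            - name: "sample_A"               # looked up in tracks_default.csv
            - name: "sample_B"
              file: tracks_alternate.csv     # hypothetical file used for this column only

In this sketch the first column relies on the track-level ``file`` value, and
the second column supplies its own ``file`` value to replace it.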


Bookmarks section
-----------------

This section defines bookmarks for the 4D Genome Browser UI. Bookmarks can
be either **locations** or **features**, and are defined as in the example
below.

- **locations**: a list of pairs of values. The first value is the start of
  the location, and the second value is the end of the location.
- **features**: a list of strings, each of which is the name of an annotation.

.. code-block:: yaml

    bookmarks:
        locations:
            - [start, end]
            - [start, end]
            ...
        features:
            - namestring
            - "name string"

Annotations section
-------------------

This section defines the annotations that are available for selection in the
4D Genome Browser UI. The user can define both ``.gff`` and ``.csv`` sources
for annotations. See the description of the ``features.csv`` file in the
section on file formats.

.. code-block:: yaml

    annotations:
        genes:
            file: "chr22.gff"
            description: "Your description or citation here"
        features:
            file: "features.csv"
            description: "Your description or citation here"