How to Save Task Outputs

There are two ways to save the output of a task to an S3 location.

  • To save a task's output to a GBDX S3 location, use the "Persist" flag and set the "persistLocation."

  • To save a task's output to a personal S3 location, use the "SaveToS3" task. This task requires temporary AWS credentials.

This documentation explains how to use each of these methods to save a task's output files to the appropriate location.

Last Updated: June 25, 2019

Table of Contents

How to save task outputs to a GBDX S3 location
How to Save Task Outputs to a Personal S3 Location

How to save task outputs to a GBDX S3 location

Use the persist flag with persistLocation to save a task's output to a directory or subdirectory within a GBDX S3 location. A GBDX S3 location is the GBDX S3 bucket name and the prefix.

To save a non-string task output to a GBDX S3 location, add the persist flag to the task's output port in the workflow definition. The output port must be a directory-type output port. When a task has multiple output ports, set the persist flag on each port whose output should be saved.

To specify the location where the non-string output file will be saved, use persistLocation. If no location is specified, the file output will be saved to the default location (explained below). The persistLocation is a relative path to a directory or subdirectory within a GBDX S3 location.

  • persist: To save the output from a directory type output port, the output port descriptor "persist": true must be set. If this descriptor is not provided, or its value is set to false, the output file will not be saved and will be lost.

  • persistLocation: Specify a directory name to save the file to using persistLocation. If no name is specified, the file will be saved to the default location. Only the directory and subdirectory names should be specified; full path locations are not supported. This method does not support saving output files to a personal S3 location. See How to save task outputs to a personal S3 location.

[Diagram: a workflow with two tasks, each with three outputs; "Persist" flags are shown on the output ports whose data should be saved.]

Note: Output ports have types, which are set in the task definition: "string" and non-string. Non-string outputs can be files or directories. String-type outputs cannot be saved to an S3 location.

Step-by-Step

Note: gbdxtools users, see Saving Output Data to S3.

  1. Define the tasks to run in the workflow definition.

  2. For each task, review the output ports:
    A. If the output from a port will only be used as input to the subsequent task, do nothing.
    B. If the output from a non-string type port needs to be saved to an S3 location, move on to Step 3.

  3. To save the output from a non-string type port, add "persist": true to the output port descriptor.

  4. To specify the name of the directory the output files should be saved to, use "persistLocation" and add the directory name (do not specify the full path). If a location is not specified, the output will be saved to the default location. (A gbdxtools sketch of these steps follows.)
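For gbdxtools users, the same steps can be expressed without editing the workflow JSON by hand. The sketch below assumes the savedata helper described in Saving Output Data to S3; the task and port names are taken from the example later in this document.

PYTHON

from gbdxtools import Interface
gbdx = Interface()

# Build a task; its non-string output port is named "outputfile".
task = gbdx.Task('test-string-to-file', inputstring='for the demo!!!')

wf = gbdx.Workflow([task])

# savedata marks the port with "persist": true; location maps to persistLocation.
wf.savedata(task.outputs.outputfile, location='for_demo')

wf.execute()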

How to Set the "Persist" flag on a non-string type output port

To save the output from a task, set the "persist" flag as an output port descriptor for any non-string output that should be saved.
This example shows an output port with the descriptors "name" and "persist".

     "outputs": [{
                "name": "outputfile",
                "persist": true

If "persist": true is not set, the output file will not be saved, and cannot be retrieved later.

Default Output file Location

If "persist":true is set for an output, but persistLocation is not specified, the output will be saved to the default location.

By default, the task output will be saved to the following location:
s3://gbd-customer-data/<account id>/workflow_output/<workflow id>/<task name>/<output port name>/<file>

How to Specify the Output file location

You can specify the directory path where the non-string output should be saved by adding persistLocation to the output port descriptors along with "persist": true. Specifying your own directory structure can make the output files easier to find. This method creates a directory within a GBDX S3 location; you cannot save data to a personal S3 location using this method.

In this example, the output port type is not string. (Output port types are set in the task definition, not the workflow definition).

"outputs": [{
                "name": "outputfile",
                "persist": true
                "persistLocation": "specify_name"

Do not include the full path to the S3 location. The directory you specify will automatically be prepended with the GBDX S3 location, which is:
s3://gbd-customer-data/<account id>/

The output will be saved to this location:
s3://gbd-customer-data/<account id>/<specify_name>/
Subdirectories should be separated by a single forward slash. Double forward slashes will return a 400 error.

For example, if a persistLocation of "Task_1/output_dir" is set, and the user's account ID is 734875da-2059-42lz-ad90-03e4o5198fz6,
the output will be saved to this full-path location:
s3://gbd-customer-data/734875da-2059-42lz-ad90-03e4o5198fz6/Task_1/output_dir/<file_name>

To see the full GBDX S3 location path after the workflow has been submitted, make a request to the Workflow Status endpoint with the workflow ID.
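For example, the status can be fetched with a simple HTTP request. This is a sketch only; it assumes the standard GBDX workflows endpoint and a valid GBDX bearer token.

PYTHON

import requests

workflow_id = '4458592653599519360'   # returned when the workflow was submitted
token = '<your-gbdx-token>'

# Sketch: query the Workflow Status endpoint (assumed URL) for a submitted workflow.
resp = requests.get(
    'https://geobigdata.io/workflows/v1/workflows/{}'.format(workflow_id),
    headers={'Authorization': 'Bearer {}'.format(token)})
resp.raise_for_status()

# Each output port in the response carries its resolved persistLocation.
for task in resp.json().get('tasks', []):
    for port in task.get('outputs', []):
        print(task['name'], port['name'], port.get('persistLocation'))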

Business Rules

  1. "Persist" and "persistLocation" should be set on each directory type output port of a task if that port's output should be saved. This is also true for multiplex output ports. If the "persist" flag is not set on a directory type output port, the output from that port will not be saved.

  2. If "persist" = true, but no "persistLocation" is set, the default location, will be used. The default location is:

s3://gbd-customer-data/<account id>/workflow_output/<workflow id>/<task name>/<output port name>/<file>
  1. "persistLocation" should not be used alone. The file will only be saved to a specified location if the persist flag is set to "true" and persistLocation is specified.

  4. If the same persistLocation is used for several outputs, data in the folder will accumulate. If there are conflicts, the data will be overwritten with the latest file. If two tasks run in parallel with the same outputs and the same persist location, the behavior is undefined.

Example

This example shows the persist and persistLocation descriptors on a task's output port. This is part of a workflow definition.

{
   "name": "test_persist",
   "tasks": [{
       "name": "task_1",
       "taskType": "test-string-to-file",
       "inputs": [{
           "name": "inputstring",
           "value": "for the demo!!!"
       }],
       "outputs": [{
           "name": "outputfile",
           "persist": true,
           "persistLocation": "for_demo"
       }]
   }]
}

Example Workflow response with persistLocation

The JSON response from the workflow request is shown below.

The persist location shown in the response is the account ID followed by the persistLocation name. The full path would be:

     s3://gbd-customer-data/7b216bd3-6523-4ca5-aa3b-1d8a5994f052/for_demo/filename

JSON

{
 "tasks": [
   {
     "inputs": [
       {
         "source": null,
         "dataflow_channel": null,
         "type": "string",
         "name": "inputstring",
         "value": "for the demo!!!"
       }
     ],
     "outputs": [
       {
         "persistLocation": "7b216bd3-6523-4ca5-aa3b-1d8a5994f052/for_demo",
         "dataflow_channel": null,
         "name": "outputfile",
         "value": null,
         "source": null,
         "type": "textfile"
       }
     ],
     "start_time": null,
     "taskType": "test-string-to-file",
     "id": "4458592653599309096",
     "name": "task_1",
     "note": null,
     "callback": null,
     "state": {
       "state": "pending",
       "event": "submitted"
     },
     "run_parameters": {
       "mounts": [
         {
           "read_only": false,
           "local": "/mnt/glusterfs",
           "container": "/mnt/glusterfs"
         },
         {
           "read_only": false,
           "local": "$task_data_dir",
           "container": "/mnt/work"
         }
       ],
       "image": "tdgp/test-string-to-file",
       "command": "",
       "devices": []
     },
     "timeout": 7200,
     "instance_info": {
       "domain": "default"
     }
   }
 ],
 "completed_time": null,
 "callback": null,
 "state": {
   "state": "pending",
   "event": "submitted"
 },
 "submitted_time": "2016-11-03T21:27:42.632147+00:00",
 "owner": "workflow owner's name",
 "id": "4458592653599519360"
}

Access the saved Output Files

Use your preferred method to access and download the output files from the S3 location. See the S3 Access Course for more information.
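As one option, and assuming you already have credentials that are authorized for the GBDX customer bucket (for example, temporary credentials from the S3 storage service), a boto3 sketch like the following lists and downloads the persisted files. The prefix is a placeholder built from your account ID and persistLocation.

PYTHON

import boto3

# Sketch: download persisted output files with boto3. Assumes authorized
# credentials for the GBDX customer bucket are already configured.
s3 = boto3.client('s3')

bucket = 'gbd-customer-data'
prefix = '<account id>/for_demo/'   # placeholder: account ID + persistLocation

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get('Contents', []):
        filename = obj['Key'].split('/')[-1]
        if filename:                 # skip directory placeholder keys
            s3.download_file(bucket, obj['Key'], filename)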

Error Conditions

  1. If "persist":true is not set for an output port descriptor, the file will not be saved.

  2. If a persistLocation is set but persist is not set to true, a 400 error will be returned.

  3. If invalid characters are used in the "persistLocation" name, a 400 error will be returned.

How to Save Task Outputs to a Personal S3 Location

To save a task's output files to a personal S3 location, run the "SaveToS3" task at the end of your workflow. This task requires temporary AWS credentials.

The Docker image for this task is auto-built on dockerhub here: https://hub.docker.com/r/tdgp/gbdx-task-stagedatatos3/

Quickstart Example

For more information on the Python-based tool suite, see GBDXtools.

PYTHON

from gbdxtools import Interface
gbdx = Interface()

savedata = gbdx.Task('SaveToS3')
savedata.inputs.destination = 's3://your-bucket/somewhere_nice'
savedata.inputs.access_key_id = '<your-s3-access-key>'
savedata.inputs.secret_key = '<your-s3-secret-key>'
savedata.inputs.session_token = '<your-session-token>'
savedata.inputs.data = 'some-input-data-from-s3-or-another-task-output'

wf = gbdx.Workflow([savedata])
wf.execute()
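
After wf.execute() submits the workflow, you can check on it from the same session. A minimal sketch, assuming the standard gbdxtools Workflow attributes:

PYTHON

# Sketch: monitor the workflow submitted above (names unchanged from the quickstart).
print(wf.id)        # workflow ID assigned at submission
print(wf.status)    # e.g. {'state': 'pending', 'event': 'submitted'}
print(wf.complete)  # True once the workflow has finished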

Example Getting Temporary Credentials

This example shows running a workflow and then using temporary AWS credentials to save data to a personal S3 location.

PYTHON

# First we'll run atmospheric compensation on Landsat8 data
from gbdxtools import Interface
gbdx = Interface()

acomp = gbdx.Task('AComp', data='s3://landsat-pds/L8/033/032/LC80330322015035LGN00')

# Now we'll save the result to a personal S3 location.  First we need to generate temporary AWS credentials
# (this assumes you have an AWS account and your IAM credentials are appropriately accessible via boto:
# env vars or aws config file)
import boto3
client = boto3.client('sts')
response = client.get_session_token(DurationSeconds=86400)
access_key_id = response['Credentials']['AccessKeyId']
secret_key = response['Credentials']['SecretAccessKey']
session_token = response['Credentials']['SessionToken']

# Save the data to your personal s3 location using the SaveToS3 task:
savetask = gbdx.Task('SaveToS3')
savetask.inputs.data = acomp.outputs.data.value
savetask.inputs.destination = "s3://your-bucket/your-path/"
savetask.inputs.access_key_id = access_key_id
savetask.inputs.secret_key = secret_key
savetask.inputs.session_token = session_token

workflow = gbdx.Workflow([acomp, savetask])
workflow.execute()

"SaveToS3" Task Definition

JSON


{
    "containerDescriptors": [
        {
            "command": "",
            "type": "DOCKER",
            "properties": {
                "image": "tdgp/gbdx-task-stagedatatos3:latest"
            }
        }
    ],
    "description": "Save data to a personal S3 location.  You must generate temporary AWS credentials & supply them as inputs.",
    "inputPortDescriptors": [
        {
            "required": true,
            "type": "directory",
            "description": "The source directory",
            "name": "data"
        },
        {
            "required": true,
            "type": "string",
            "description": "AWS Access Key ID that is authorized to push to the s3 location specified in 'directory'.",
            "name": "access_key_id"
        },
        {
            "required": true,
            "type": "string",
            "description": "AWS Secret Access Key.",
            "name": "secret_key"
        },
        {
            "required": true,
            "type": "string",
            "description": "AWS Session Token.  Required as part of temporary credentials.",
            "name": "session_token"
        },
        {
            "required": true,
            "type": "string",
            "description": "full S3 URL where the data will be written.",
            "name": "destination"
        }
    ],
    "version": "0.0.7",
    "outputPortDescriptors": [],
    "taskOwnerEmail": "[email protected]",
    "properties": {
        "isPublic": true,
        "timeout": 7200
    },
    "name": "SaveToS3"
}