
How to Save Task Outputs

There are two ways to save the output of a task to an S3 location.

  • To save a task's output to the GBDX customer S3 bucket, set the "persist" flag and, optionally, the "persistLocation".

  • To save a task's output to a personal S3 bucket, use the new "SaveToS3" task. This task requires temporary AWS credentials.

This documentation explains how to use each of these methods to save a task's output files to the appropriate location.

Table of Contents

How to Save Task Outputs to the GBDX Customer S3 Bucket
How to Save Task Outputs to a Personal S3 Bucket

How to Save Task Outputs to the GBDX Customer S3 Bucket

Use the persist flag with persistLocation to save a task's output to a directory or subdirectory within the GBDX customer S3 bucket.

To save a non-string task output to the GBDX customer S3 location, add the persist flag to the task's directory-type output port in the workflow definition. When a task has multiple output ports, set the persist flag on each port whose output should be saved.

To specify the location where the non-string output file will be saved, use persistLocation. If no location is specified, the file output will be saved to the default location (explained below). The persistLocation is a relative path to a directory or subdirectory within the GBDX customer S3 bucket.

  • persist: To save the output from a directory-type output port, the output port descriptor "persist": true must be set. If this descriptor is not provided, or its value is false, the output file will not be saved and will be lost.

  • persistLocation: Specifies the S3 location the output file will be staged to. If no location is specified, the file will be saved to the default location. Specify only the directory and subdirectory names; full path locations are not supported. This method does not support saving output files to a personal S3 location; see How to Save Task Outputs to a Personal S3 Bucket.

This diagram represents a workflow with two tasks, each with three outputs:
[Diagram: Persist flag]

Note: Output ports have types, which are set in the task definition. Non-string outputs (files or directories) can be saved to an S3 location; string-type outputs cannot.
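For illustration, port types are declared in the task definition's port descriptors, as in this hypothetical fragment (compare the "SaveToS3" Task Definition later on this page):

     "outputPortDescriptors": [
         { "name": "outputfile", "type": "directory" },
         { "name": "status", "type": "string" }
     ]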

Step-by-Step

Note: gbdxtools users, see Saving Output Data to S3.

  1. Define the tasks to run in the Workflow Definition.

  2. For each task, review the output ports:
    A. If the output from a port will only be used as input to a subsequent task, do nothing.
    B. If the output from a non-string type port needs to be saved to an S3 location, move on to Step 3.

  3. To save the output from a non-string type port, add "persist": true to the output port descriptor.

  4. To save the output to a specific location within the GBDX customer S3 bucket, add "persistLocation" and specify the directory name (do not specify the full path). If a location is not specified, the output will be saved to the default location.
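Applied together, these steps might look like the following sketch. The task type and port names are hypothetical; the first output port only feeds a downstream task, so it carries no descriptors, while the second is persisted to a named directory:

 {
   "name": "persist_steps_example",
   "tasks": [{
       "name": "task_1",
       "taskType": "some-directory-producing-task",
       "outputs": [
           { "name": "intermediate" },
           {
               "name": "outputfile",
               "persist": true,
               "persistLocation": "Task_1/output_dir"
           }
       ]
   }]
 }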

How to Set the "Persist" Flag on a Non-String Type Output Port

To save the output from a task, set the "persist" flag as an output port descriptor for any non-string output that should be saved.
This example shows an output port with the descriptors "name" and "persist".

     "outputs": [{
                "name": "outputfile",
                "persist": true

If "persist": true is not set, the output file will not be saved, and cannot be retrieved later.

Default Output File Location

If "persist":true is set for an output, but persistLocation is not specified, the output will be saved to the default location.

By default, the task output will be saved to the following location:
s3://gbd-customer-data/<account id>/workflow_output/<workflow id>/<task name>/<output port name>/<file>.
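For example, using the account ID and workflow ID from the example response later on this page, with task name task_1 and output port outputfile, the default location would be:
s3://gbd-customer-data/7b216bd3-6523-4ca5-aa3b-1d8a5994f052/workflow_output/4458592653599519360/task_1/outputfile/<file>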

How to Specify the Output File Location

You can specify the directory path where a non-string output should be saved by adding persistLocation to the output port descriptors alongside "persist": true. A deeper directory structure can make output files easier to find. The location must be within the GBDX customer S3 bucket; you cannot save data to a personal S3 bucket using this method.

To specify the output location, add persistLocation to the output port with the persist flag.

In this example, the output port type is not string. (Output port types are set in the task definition, not the workflow definition).

"outputs": [{
                "name": "outputfile",
                "persist": true
                "persistLocation": "specify_name"

Do not include the full path to the S3 location. The directory you specify will automatically be prepended with:
s3://gbd-customer-data/<account id>

The output will be saved to this location:
s3://gbd-customer-data/<account id>/<specify_name>/
Subdirectories should be separated by a single forward slash. Double forward slashes will return a 400 error.

For example, if a persistLocation of "Task_1/output_dir" is set and the user's account ID is 734875da-2059-42lz-ad90-03e4o5198fz6,
the output will be saved to this full-path location:
s3://gbd-customer-data/734875da-2059-42lz-ad90-03e4o5198fz6/Task_1/output_dir/<file_name>

To see the full S3 location path after the workflow has been submitted, make a request to the Workflow Status endpoint with the workflow ID.
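A minimal sketch in Python (assuming a valid GBDX token and the standard Workflows API base URL):

import requests

# Fetch the workflow record. Each task's outputs include the
# resolved persistLocation, prefixed with the account ID.
workflow_id = "4458592653599519360"  # example ID; use your own workflow ID
response = requests.get(
    "https://geobigdata.io/workflows/v1/workflows/" + workflow_id,
    headers={"Authorization": "Bearer <token>"}
)
print(response.json())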

Business Rules

  1. "Persist" and "persistLocation" should be set on each directory type output port of a task if that port's output should be saved. This is also true for multiplex output ports. If the "persist" flag is not set on a directory type output port, the output from that port will not be saved.

  2. If "persist" = true, but no "persistLocation" is set, the default location, will be used. The default location is:
    s3://gbd-customer-data/<account id>/workflow_output/<workflow id>/<task name>/<output port name>/<file>

  3. "persistLocation" should not be used alone. The file will only be saved to a specified location if the persistflag is set to "true" andpersistLocation` is specified.

  4. If the same persistLocation is used for several outputs, data will accumulate in that folder. If file names conflict, earlier files will be overwritten by the latest one. If two tasks run in parallel with the same outputs and the same persist location, the behavior is undefined.

Example

This example shows the persist and persistLocation descriptors on a task's output port. This is part of a workflow definition.

 {
   "name": "test_persist",
   "tasks": [{
       "name": "task_1",
       "taskType": "test-string-to-file",
       "inputs": [{
           "name": "inputstring",
           "value": "for the demo!!!"
       }],
       "outputs": [{
           "name": "outputfile",
           "persist": true,
           "persistLocation": "for_demo"
       }]
   }]
}

Example Workflow Response with persistLocation

This is the JSON response from the workflow request. The persist location shown in the response is <account id>/<persistLocation name>; the full path would be:

 s3://gbd-customer-data/7b216bd3-6523-4ca5-aa3b-1d8a5994f052/for_demo/<filename>

{
 "tasks": [
   {
     "inputs": [
       {
         "source": null,
         "dataflow_channel": null,
         "type": "string",
         "name": "inputstring",
         "value": "for the demo!!!"
       }
     ],
     "outputs": [
       {
         "persistLocation": "7b216bd3-6523-4ca5-aa3b-1d8a5994f052/for_demo",
         "dataflow_channel": null,
         "name": "outputfile",
         "value": null,
         "source": null,
         "type": "textfile"
       }
     ],
     "start_time": null,
     "taskType": "test-string-to-file",
     "id": "4458592653599309096",
     "name": "task_1",
     "note": null,
     "callback": null,
     "state": {
       "state": "pending",
       "event": "submitted"
     },
     "run_parameters": {
       "mounts": [
         {
           "read_only": false,
           "local": "/mnt/glusterfs",
           "container": "/mnt/glusterfs"
         },
         {
           "read_only": false,
           "local": "$task_data_dir",
           "container": "/mnt/work"
         }
       ],
       "image": "tdgp/test-string-to-file",
       "command": "",
       "devices": []
     },
     "timeout": 7200,
     "instance_info": {
       "domain": "default"
     }
   }
 ],
 "completed_time": null,
 "callback": null,
 "state": {
   "state": "pending",
   "event": "submitted"
 },
 "submitted_time": "2016-11-03T21:27:42.632147+00:00",
 "owner": "workflow owner's name",
 "id": "4458592653599519360"
}

Access the Saved Output Files

Use your preferred method to access and download the output files from the S3 location. Options include:

  • S3 Browser: Log in with your GBDX username (typically your email address) and password.

  • S3 Storage Service Course

  • gbdxtools (see the sketch below)
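For example, with gbdxtools you can pull a persisted directory down locally. A minimal sketch, assuming the gbdx.s3.download helper and the "for_demo" persistLocation from the example above:

from gbdxtools import Interface
gbdx = Interface()

# Download everything under <account id>/for_demo in the GBDX customer
# bucket; the account ID prefix is added automatically.
gbdx.s3.download(location='for_demo', local_dir='./for_demo')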

Error Conditions

  1. If "persist":true is not set for an output port descriptor, the file will not be saved.

  2. If a persistLocation is set, but persist is not set to true, a 400 error will be returned.

  3. If invalid characters are used in the "persistLocation" name, a 400 error will be returned.

How to Save Task Outputs to a Personal S3 Bucket

To save a task's output files to a personal S3 bucket, run the "SaveToS3" task at the end of your workflow. This task requires temporary AWS credentials.

The Docker image for this task is auto-built on dockerhub here: https://hub.docker.com/r/tdgp/gbdx-task-stagedatatos3/

Quickstart Example

For more information on the Python-based tool suite "gbdxtools", see gbdxtools.

from gbdxtools import Interface
gbdx = Interface()

savedata = gbdx.Task('SaveToS3')
savedata.inputs.destination = 's3://your-bucket/somewhere_nice'
savedata.inputs.access_key_id = '<your-s3-access-key>'
savedata.inputs.secret_key = '<your-s3-secret-key>'
savedata.inputs.session_token = '<your-session-token>'
savedata.inputs.data = 'some-input-data-from-s3-or-another-task-output'

wf = gbdx.Workflow([savedata])
wf.execute()

Example Getting Temporary Credentials

# First we'll run atmospheric compensation on Landsat8 data
from gbdxtools import Interface
gbdx = Interface()

acomp = gbdx.Task('AComp', data='s3://landsat-pds/L8/033/032/LC80330322015035LGN00')

# Now we'll save the result to our own S3 bucket.  First we need to generate temporary AWS credentials
# (this assumes you have an AWS account and your IAM credentials are appropriately accessible via boto:
# env vars or aws config file)
import boto3
client = boto3.client('sts')
response = client.get_session_token(DurationSeconds=86400)
access_key_id = response['Credentials']['AccessKeyId']
secret_key = response['Credentials']['SecretAccessKey']
session_token = response['Credentials']['SessionToken']

# Save the data to your s3 bucket using the SaveToS3 task:
savetask = gbdx.Task('SaveToS3')
savetask.inputs.data = acomp.outputs.data.value
savetask.inputs.destination = "s3://your-bucket/your-path/"
savetask.inputs.access_key_id = access_key_id
savetask.inputs.secret_key = secret_key
savetask.inputs.session_token = session_token

workflow = gbdx.Workflow([acomp, savetask])
workflow.execute()

"SaveToS3" Task Definition

{
    "containerDescriptors": [
        {
            "command": "",
            "type": "DOCKER",
            "properties": {
                "image": "tdgp/gbdx-task-stagedatatos3:latest"
            }
        }
    ],
    "description": "Stage data from a directory into an S3 bucket.  You must generate temporary AWS credentials & supply them as inputs.",
    "inputPortDescriptors": [
        {
            "required": true,
            "type": "directory",
            "description": "The source directory",
            "name": "data"
        },
        {
            "required": true,
            "type": "string",
            "description": "AWS Access Key ID that is authorized to push to the s3 location specified in 'directory'.",
            "name": "access_key_id"
        },
        {
            "required": true,
            "type": "string",
            "description": "AWS Secret Access Key.",
            "name": "secret_key"
        },
        {
            "required": true,
            "type": "string",
            "description": "AWS Session Token.  Required as part of temporary credentials.",
            "name": "session_token"
        },
        {
            "required": true,
            "type": "string",
            "description": "full S3 URL where the data will be written.",
            "name": "destination"
        }
    ],
    "version": "0.0.7",
    "outputPortDescriptors": [],
    "taskOwnerEmail": "nricklin@digitalglobe.com",
    "properties": {
        "isPublic": true,
        "timeout": 7200
    },
    "name": "SaveToS3"
}
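For reference, the task can also be added to a raw workflow definition without gbdxtools. The sketch below is hypothetical: it assumes a preceding task named task_1 with a directory-type output port named outputfile in the same workflow, wired to the data input using the task:port source convention shown in the example response above:

 {
   "name": "save_to_personal_bucket",
   "tasks": [{
       "name": "save_task",
       "taskType": "SaveToS3",
       "inputs": [
           { "name": "data", "source": "task_1:outputfile" },
           { "name": "destination", "value": "s3://your-bucket/your-path/" },
           { "name": "access_key_id", "value": "<temporary-access-key-id>" },
           { "name": "secret_key", "value": "<temporary-secret-key>" },
           { "name": "session_token", "value": "<temporary-session-token>" }
       ]
   }]
 }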
