{"_id":"561d53a09463520d00cd11ef","user":"55fae9d4825d5f19001fa379","version":{"_id":"55faeacad0e22017005b8268","project":"55faeacad0e22017005b8265","__v":33,"createdAt":"2015-09-17T16:31:06.800Z","releaseDate":"2015-09-17T16:31:06.800Z","categories":["55faeacbd0e22017005b8269","55faf550764f50210095078e","55faf5b5626c341700fd9e96","55faf8a7825d5f19001fa386","560052f91503430d007cc88f","560054f73aa0520d00da0b1a","56005aaf6932a00d00ba7c62","56005c273aa0520d00da0b3f","5601ae7681a9670d006d164d","5601ae926811d00d00ceb487","5601aeb064866b1900f4768d","5601aee850ee460d0002224c","5601afa02499c119000faf19","5601afd381a9670d006d1652","561d4c78281aec0d00eb27b6","561d588d8ca8b90d00210219","563a5f934cc3621900ac278c","5665c5763889610d0008a29e","566710a36819320d000c2e93","56ddf6df8a5ae10e008e3926","56e1c96b2506700e00de6e83","56e1ccc4e416450e00b9e48c","56e1ccdfe63f910e00e59870","56e1cd10bc46be0e002af26a","56e1cd21e416450e00b9e48e","56e3139a51857d0e008e77be","573b4f62ef164e2900a2b881","57c9d1335fd8ca0e006308ed","57e2bd9d1e7b7220000d7fa5","57f2b992ac30911900c7c2b6","58adb5c275df0f1b001ed59b","58c81b5c6dc7140f003c3c46","595412446ed4d9001b3e7b37"],"is_deprecated":false,"is_hidden":false,"is_beta":false,"is_stable":true,"codename":"v1","version_clean":"1.0.0","version":"1"},"category":{"_id":"5601aee850ee460d0002224c","__v":20,"project":"55faeacad0e22017005b8265","version":"55faeacad0e22017005b8268","pages":["56023786930fe1170074bd2c","561d53a09463520d00cd11ef","561d546d31d9630d001eb5d1","561d54af31d9630d001eb5d3","561d54e56386060d00e0601e","561d554d9463520d00cd11f2","564246059f4ed50d008be1af","5643712a0d9748190079defb","564372751ecf381700343c1e","5643742008894c0d00031ed3","5643747a0d9748190079df01","564375c988f3a60d00ac86b0","56437d0f0d9748190079df13","56437e83f49bfa0d002f560a","56437f7d0d9748190079df15","5643810508894c0d00031ef5","5643826f88f3a60d00ac86cb","564382de88f3a60d00ac86ce","56e07ba14685db1700d94873","56e08c9b903c7a29001d5352"],"sync":{"url":"","isSync":false},"reference":false,"createdAt":"2015-09-22T19:41:28.703Z","from_sync":false,"order":8,"slug":"tasks-and-workflows-guide","title":"Tasks and Workflows Guide"},"project":"55faeacad0e22017005b8265","parentDoc":null,"__v":14,"updates":[],"next":{"pages":[],"description":""},"createdAt":"2015-10-13T18:55:28.829Z","link_external":false,"link_url":"","githubsync":"","sync_unique":"","hidden":false,"api":{"results":{"codes":[]},"settings":"","auth":"required","params":[],"url":""},"isReference":false,"order":1,"body":"[block:api-header]\n{\n  \"type\": \"basic\",\n  \"title\": \"Task and Workflow Overview\"\n}\n[/block]\nThe Task and Workflow Course explains how to integrate a task in to the GBDX workflow system. \n\nThe following topics will be covered:\n\n* Workflow System Overview\n* What is a task?\n* Setting up a task Docker\n* Workflow Definition\n* Using the StageToS3 Task\n\nTo learn more about the Workflow API, see the [Workflow API Course](doc:workflow-api-course) \n\n##I. Definitions\n\n*Term*| *Definition*\n--- | ---- \nTask | A single data processing algorithm that contains well-defined input and output data requirements. Tasks are run from Docker containers that are available via Docker Hub. \nTask Docker | Tasks are run from Docker containers. Docker containers must be set up so that the task runs with a set of input data and will produce a set of output data without interruption. The task Docker must adhere to certain constraints in order to work with the workflow system. 
\nTask Registry | The task registry is a catalog of available tasks. All tasks must be registered in the task registry. \nTask Definition | The JSON document that defines the task. The task definition is used to submit the task to the task registry.\nWorkflow |  The workflow system chains tasks together into workflows that can be run on the platform to process input data. This is where the specified tasks are ordered and the inputs and outputs are used. \nWorkflow Definition |The JSON document that defines the execution of a workflow. The workflow definition is submitted to the endpoint to run the workflow.\nAWS | Amazon Web Services \nS3 | AWS online file storage web service and the location for processed image data.\nOAuth2 token | An OAuth token is required to submit a request to any GBDX endpoint. See the [Authentication Course](doc:authentication-course) for more information.\n \n\n\n##II. Workflow System Overview\nThe GBDX Workflow system chains tasks together into workflows. A task is an atomic process that performs a \nspecific action. Tasks have at least one input and can have one or more outputs. Tasks are run from Docker containers.\n\nWhen a workflow is run, it will iterate through the tasks and run all tasks with inputs in a ready state. \n\nThe sequence of a workflow is that:\n\n1. An instance of a workflow definition is submitted to the workflow endpoint. This starts the workflow.\n2. The workflow endpoint will validate the workflow against a JSON schema.\n3. The workflow endpoint will then compare the given tasks in the workflow to the registered tasks in the task registry.  It is the task registry that contains the authoritative definition of the tasks inputs and outputs as well as the run parameters.  \n4. After validation, the ready tasks are launched.  It is at this time that the task Dockers are pulled and then run.  \n5. The workflow is complete when all tasks have been run or an error is encountered. It will have a \"success\" or \"fail\" status.\n\n##III. Tasks Overview\n\nA task is an atomic process that is performed as a step in a workflow.  \nFor example, the \"AOP_Strip_Processor\" task runs orthorectification on the input image.Other processes such as atmospheric compensation (AComp), pan sharpening, and dynamic range adjustment (DRA) can be turned on in the task definition.  Platform tasks are run from Docker \ncontainers that are available via Docker Hub.  \n\nThe task is described in the workflow system's task registry. In the task registry description: \n\n1. The task is named.\n2. The task is described.\n3. The task's inputs and outputs are specified.\n4. Details of the Docker container are given.  \n\nWorkflows are composed of tasks that are specified in the workflow definition document.  \nIt is in this workflow definition document where the tasks are ordered and the specific inputs and outputs are used.\n\nSee the [Workflow API Course](doc:workflow-api-course)  for more information.\n\n###A. Task Dockers\n\nThis documentation assumes the user has a working knowledge of Docker and setting up Docker containers. \nIf you're not familiar with Docker, the [Docker website](http://www.docker.com/) offers [documentation](http://docs.docker.com/) and interactive tutorials. \n\nA task is run from a task Docker container.  This Docker container will be set up so that the task runs with a set of input data, and it will produce a set of output data without interruption.  For the task Docker to work with the platform workflow, certain constraints must be adhered to.  
\n\n####Constraint #1: Location of input and output data.\nThe first constraint is for the location of the input and output data.  The workflow system makes a distinction between \"string\" data and “directory” data.  \"String\" data is generally used to pass parameters into a task.  “Directory” data is for all file-based data.  \n\n* String data: a file named ```/mnt/work/input/ports.json``` is created and it contains name/value pairs for each input string port.  \n* Directory data ports:  a data directory is automatically mounted inside the Docker container at ```/mnt/work``` with “input” and “output” subdirectories.  \n\nInput data will be mounted into the input directory before the task Docker is run.  One subdirectory for each input port will be created in the input directory.  For example, if the task has two ports named “in1” and “in2”, two subdirectories named “in1” and “in2” will be created in the input directory at locations ```/mnt/work/input/in1``` and ```/mnt/work/input/in2```.  If in the workflow definition document the task’s input port is given a “value” of an S3 location, then the contents of that s3 location are copied to the input port directory.  \n\nIf in the workflow definition document the task’s input port is given a “source” of an output port from a previous task, then the results from the previous task are copied to the input port directory.\n\nFor any input string ports, the file ```/mnt/work/input/ports.json``` will be created inside the Docker container.  The contents of the ports.json file will be a simple JSON containing name/value pairs where the name is the input port name and the value is the input value.\n\nFor task outputs it is the responsibility of the Docker task to create the output data directories.  \nFor example, the directory ```/mnt/work/output``` will already exist inside the container, but ```/mnt/work/output/out1``` must be created and output data placed inside it.\n\nFor any output string ports, ```/mnt/work/output/ports.json``` must be created in the output directory by the Docker task.  The contents of the ports.json file will be a simple json containing name value pairs where the name is the output port name and the value is the output value.\n\n#### Constraint #2: Task Docker must write a status.json file to ```/mnt/work/status.json```\nThe other constraint on a task Docker is that it needs to write a status.json file upon completion.  \nThe status.json file must have a status name/value pair where the name is “status” and the value is \"success\" for a successful completion.  \nTo report an error the “status” value is any value other then “success” and a reason name/value pair is used where the name is “reason” and the value is a string that gives the reason for the failed state.\n\n\nExample ```/mnt/work/status.json``` file for a successful task:\n```json\n{\n  \"status\": \"success\",\n  \"reason\": \"because everything worked!\"\n}\n```\n\nExample ```/mnt/work/status.json``` file for a failed task:\n```json\n{\n  \"status\": \"failed\",\n  \"reason\": \"because nothing worked!\"\n}\n```\n\n\n#### Integrating a Docker task into the Workflow System.\n\nTo recap, this is how the Docker task integrates with the workflow system:\n###### 1. 
A data directory is auto-mounted inside the Docker container at ```/mnt/work/```, containing all input string data and directory data.\n* The data directory has an “input” subdirectory filled in before the run command is given\n\n  * Each “directory” data port is copied to a subdirectory in the “input” folder.\n  * Each “string” data port is added as a name/value pair to the ports.json file in the “input” folder.\n\n* The data directory has an empty “output” subdirectory created.\n\nFor example, on startup the Docker container directory structure might look like this:\n```\n/\n└── mnt\n    └── work\n        ├── input\n        │   ├── in1\n        │   │   ├── file1.tif\n        │   │   └── file2.tif\n        │   ├── in2\n        │   │   ├── file3.tif\n        │   │   └── file4.tif\n        │   └── ports.json\n        └── output\n```\n\n###### 3. The Docker task runs and must write output.\n* In the passed-in volume’s “output” directory, one subdirectory needs to be created for each output port defined for the task.  The files for that output port will be created in the subdirectory.\n* If any output “string” ports exist, the ports.json needs to be created in the output directory.  For each string output port, a name/value pair needs to be created.\n* status.json needs to be created in the output directory and filled in.  \n* On success, the “status”: \"success” name/value pair needs to be added.  \n* On error, the “status”: \"failed” and “reason”: \"reason message” should be written. \n\nThe resulting directory and file structure might look like this:\n```\n/\n└── mnt\n    └── work\n        ├── input\n        │   ├── in1\n        │   │   ├── file1.tif\n        │   │   └── file2.tif\n        │   ├── in2\n        │   │   ├── file3.tif\n        │   │   └── file4.tif\n        │   └── ports.json\n        ├── output\n        │   ├── out1\n        │   │   ├── outputfile1\n        │   │   └── outputfile2\n        │   ├── out2\n        │   │   └── outputfile3\n        │   └── ports.json\n        └── status.json\n```\n\n###B.  The Task Registry\nThe Task Registry is used to store the definitions of the tasks that can be used in a workflow.  These definitions publish the interface used for the inputs and outputs for the task. The tasks are defined in a JSON document containing the following:\n\nname:The name of the task.  Note: this value goes into the workflow definition as the “taskType”\n\ndescription: A human-readable description of the task.\n\ninputPortDescriptors: A list/array of input ports containing the following:\n\nName |Description\n--- | --- \nname| the name of the port \ntype | the data type of the port \ndescription |the human readable description of the port \nrequired  |true/false binary value.  True if the port must be specified for the task to run. \n\n\noutputPortDescriptors:A list/array of output ports containing the following:\n\nName |Description\nname| the name of the port \ntype | the data type of the port \ndescription |the human readable description of the port \n\ncontainerDescriptors: A list/array of Docker Hub containers for the task.  Each container descriptor contains the following:\n\nName |Description\n--- | --- \ntype | The type of container. Currently only “DOCKER” is supported.\nproperties | Dependent on type, for “DOCKER” containers the following: image - the full name of the Docker image on Docker Hub, ex. tdgp/AOP_Strip_Processor.\n\nAdditional Properties:\nisPublic\nauthorizationRequired\n\n\n###C.  
Container Descriptor\nThe container descriptor is part of a task definition within the task registry.  It tells the platform’s workflow worker machine how to run the task Docker container.  \n\nThis is an example launch command:\n\n\"containerDescriptors\":\n```json\n[  \n    {\n        \"type\": \"DOCKER\",\n        \"properties\": {\n            \"image\": \"tdgp/test-container\",\n        }\n    }\n]\n```\n\n###D.  Registering a Task with the task registry\nA task can be registered with the task registry by sending a POST of the task description to the task registry task endpoint.\n \n/workflows/v1/tasks\n \nAll registered tasks can be seen by doing a GET on that same task endpoint.  Specific task definitions can be retrieved by doing a GET on the task endpoint adding the task name to the URL. For example:  workflows/v1/tasks/AOP_Strip_Processor.\n\n\n###E.  Workflow Definition\nA Workflow is specified using a JSON document.  The workflow document conforms to the platform’s workflow definition language.  The requirements for a workflow document are: \n\n1. The JSON must contain a Name that references the workflow.\n2. The JSON must contain an array/list of tasks to run.  \n\nEach task in the tasks list contains the following:\n\n*1. taskType:*The task name defined in the task registry.  Note: taskType does not have to be unique in the workflow definition, the same task can be included multiple times in the workflow. \n\n*2. name:*The user supplied name for this instance of the task.  Task name must be unique in the workflow definition.\n\n3. inputs:A list of inputs for the task. These must correspond with the input ports for the task as defined in the task registry.  The input will contain the following:\n\nName| Description\n--- | --- \nname |the name of the input port. This value must correspond with the input port name as defined in the task registry. \nvalue | a hard-coded value for the input port.  The value can be used to specify the workflow'ss input datasets.  It also works for specifying run parameters for the task. \nsource | the output port to be used as the input for the task.  The source is used to connect the output from one task to the input of the next task.  The source is specified using the previous task's name and that task's output port name in the form of “prev-task-name:prev-port-name”.  \n\n__4. outputs:__ A list of outputs for the task. These must correspond with the output ports for the task as defined in the task registry.  The output will contain the following:\n\n__Name__ | __Description__ \n--- | --- \nname | the name of the output port, this value must correspond with the output port name as defined in the task registry.\n\n###F. Launching a workflow on the platform\nThe workflow can be launched by sending a POST with the workflow definition to the workflow endpoint.\n/workflows/v1/workflows  \n\nThe POST will return a JSON response that includes the workflow ID.  \n\nThe status of the running workflow can be retrieved by doing a GET on the workflow status endpoint. \n/workflows/v1/workflows/ workflow_id\n\nTo see examples of these actions, see [Get a Workflow's status events](doc:get-a-workflows-status-events) \n\nA token is required to access any GBDX endpoint. See [GBDX Authentication documentation](http://docs.gbdauthenticationauthorization.apiary.io/#) for more information. \n\n##IV. GBDX Platform Tasks\n\n### Using the StageToS3 Task\nThe purpose of the StageToS3 task is to copy the processed data to an accessible location on Amazon's S3. 
StageToS3 is typically the final task performed.\n\n#### StageToS3 Inputs\n\nThe StageToS3 task has two inputs:\n\n__Input__ | __Description__ \n--- | --- \ndata | StageToS3 uses a previous task's output port as the \"source\" for this input port. This is the data directory that will be copied to Amazon's S3. |\ndestination | This is the full URL to the S3 location to copy the data to. \\StageToS3 Inputs\n\n#### StageToS3 Outputs\nThe StageToS3 task has no output ports.\n\n#### StageToS3 Task Definition\nThis is the JSON document for the StageToS3 task definition. This is used in the task registry.\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"        {\\n           \\\"inputPortDescriptors\\\":[\\n              {\\n                 \\\"required\\\":true,\\n                 \\\"description\\\":\\\"full S3 URL where the data will be written.\\\",\\n                 \\\"name\\\":\\\"destination\\\",\\n                 \\\"type\\\":\\\"string\\\"\\n              },\\n              {\\n                 \\\"required\\\":true,\\n                 \\\"description\\\":\\\"The source directory\\\",\\n                 \\\"name\\\":\\\"data\\\",\\n                 \\\"type\\\":\\\"directory\\\"\\n              }\\n           ],\\n           \\\"containerDescriptors\\\":[\\n              {\\n                 \\\"type\\\":\\\"DOCKER\\\",\\n                 \\\"properties\\\":{\\n                    \\\"image\\\":\\\"tdgp/stagetos3\\\"\\n                 },\\n                 \\\"description\\\":\\\"Stage data from a directory into an S3 bucket.\\\",\\n                 \\\"name\\\":\\\"StageDataToS3\\\",\\n                 \\\"properties\\\":null\\n              }\\n           ]\\n        }        \\n\",\n      \"language\": \"json\"\n    }\n  ]\n}\n[/block]\n##V. Example Pan Sharpening Task\n\nLet's use an example of a pan sharpening task. We'll call this task \"pansharpen\". For this task we will:\n\n * List the inputs and outputs \n * Create the Docker Container \n * Create the task registry task definition \n * Register the task in the task registry \n * Create a workflow that uses the \"pansharpen\" task\n\n### Inputs and Output Ports\nThis pansharpen task has three inputs:\n\nInput| Description\n--- | --- \npan | the input pan image \nmulti | the input multispectral image \nresample | the string resampling method \n\nThis task has one output:\n\nOutput| Description\n--- | ---\nresult | the resulting pan-sharpened image \n\n### Creating the Docker Container\nNext, we create a Docker container with our pan-sharpening system installed. To do this, we create a script called  /usr/local/bin/run_pan_sharpen.sh inside the Docker container to get the input and call the pan-sharpening.  \n\nThis script will look for the input data in a fixed location, /mnt/work/input, with the contents of /mnt/work/input/pan to be used as the pan image, and with the contents of /mnt/work/input/multi to be used as the multispectral image.  \n\nThe run_pan_sharpen.sh can parse the resample method from the /mnt/work/input/ports.json file. It will contain a member named “resample” with a value that can be passed to the pan-sharpening system. 
\n\nThe directory structure of the data directory passed into the \"pansharpen\" Docker should be:\n\n* /mnt/work/input/pan/pan_image.tiff (+ any other ansilary files)\n* /mnt/work/input/multi/multi_image.tiff (+ any other ansilary files)\n* /mnt/work/input/ports.json (containing the “resample” name/value pair)\n* /mnt/work/output (empty directory)\n\nWhat/where the Docker will create the output data:\n\n* /mnt/work/output/result/result.tif\n* /mnt/work/status.json (containing the task run status)\n\nAfter building and testing the Docker container we can push it to our Docker Hub location as tdgp/pansharpen.\n\n### Creating the Task Registry Task Definition\nThe next step is to register the task in the task registry. To do this, we create a task definition JSON document. \n\n Now the task needs to be registered with the task registry.  A task definition JSON document is created.  When creating the task registry:\n * Be sure to use the same strings for the input/output port names as you use for the subdirectories within the Docker container.  \n * The runner script /usr/local/bin/run_pan_sharpen.sh EITHER must be included in the ```command``` field of containerDescriptors, or ```command``` must be left blank and the default Docker command will be run (defined via the ```CMD``` directive in the Dockerfile).\n\nHere's what the task definition JSON document should look like:\n\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \" {\\n           \\\"inputPortDescriptors\\\":[\\n              {\\n                 \\\"required\\\":true,\\n                 \\\"description\\\":\\\"The input pan image\\\",\\n                 \\\"name\\\":\\\"pan\\\",\\n                 \\\"type\\\":\\\"image\\\"\\n              },\\n              {\\n                 \\\"required\\\":true,\\n                 \\\"description\\\":\\\" The input multispectral image \\\",\\n                 \\\"name\\\":\\\"multi\\\",\\n                 \\\"type\\\":\\\"image\\\"\\n              },\\n              {\\n                 \\\"required\\\":true,\\n                 \\\"description\\\":\\\" Resample method, can be nearest, bilinear or cubic\\\",\\n                 \\\"name\\\":\\\"resample\\\",\\n                 \\\"type\\\":\\\"string\\\"\\n              }\\n           ],\\n           \\\"outputPortDescriptors\\\":[\\n              {\\n                 \\\"description\\\":\\\"The result pan sharpened image.\\\",\\n                 \\\"name\\\":\\\"result\\\",\\n                 \\\"type\\\":\\\"image\\\"\\n              }\\n           ],\\n           \\\"containerDescriptors\\\":[\\n              {\\n                 \\\"type\\\":\\\"DOCKER\\\",\\n                 \\\"properties\\\":{\\n                    \\\"image\\\":\\\"tdgp/pansharpen:latest\\\"\\n                 },\\n                 \\\"command\\\": \\\"/usr/local/bin/run_pan_sharpen.sh\\\",\\n                 \\\"description\\\":\\\"Pansharpen a multispectral image.\\\",\\n                 \\\"name\\\":\\\"pansharpen\\\",\\n                 \\\"properties\\\":null\\n              }\\n           ]\\n        }\\n\",\n      \"language\": \"json\"\n    }\n  ]\n}\n[/block]\n       \nAfter creating the task descriptor JSON document, it can be submitted to the task registry by sending a POST to the task registry endpoint, https://geobigdata.io/workflows/v1/tasks.\n\nA GET can be made from that endpoint to see all available tasks. \n\nYour OAuth token is required to submit a request to one of these endpoints. 
See the [GBDX Authentication/Authorization document](http://docs.gbdauthenticationauthorization.apiary.io/#) for more information. \n\n\n### Creating a Workflow that uses the Pansharpen Task\n\nWe're ready to create a workflow that uses the pansharpen task. We'll name the workflow \"pansharpen_stagetoS3\" because it runs the pansharpen task and then the stagetoS3 task.\n\nWhen creating the workflow definition:\n\n* The input and output ports should have names that align with the Docker data directories and the task registry definition. \n* Values of the images in the folder can be specified in the input ports.\n\nThe platform's workflow system will copy the contents of the folders to the data directory passed to the Docker. \n\nThis is the example workflow definition JSON document for the workflow \"pansharpen_stagetoS3\":\n\n\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"        { \\n           \\\"name\\\":\\\"pansharpen_stagetos3\\\",\\n           \\\"tasks\\\":[ \\n              { \\n                 \\\"name\\\":\\\"PanSharpen\\\",\\n                 \\\"outputs\\\":[ \\n                    { \\n                       \\\"name\\\":\\\"result\\\"\\n                    }\\n                 ],\\n                 \\\"inputs\\\":[ \\n                    { \\n                       \\\"name\\\":\\\"pan\\\",\\n                       \\\"value\\\":\\\"http://bucket.s3.amazonaws.com/test_pan_image\\\"\\n                    },\\n                    { \\n                       \\\"name\\\":\\\"multi \\\",\\n                       \\\"value\\\":\\\"http://bucket.s3.amazonaws.com/test_multi_image\\\"\\n                    },\\n                    { \\n                       \\\"name\\\":\\\" resample\\\",\\n                       \\\"value\\\":\\\"nearest\\\"\\n                    }\\n                 ],\\n                 \\\"taskType\\\":\\\"pansharpen\\\"\\n              },\\n              { \\n                 \\\"name\\\":\\\"StageToS3\\\",\\n                 \\\"inputs\\\":[ \\n                    { \\n                       \\\"name\\\":\\\"data\\\",\\n                       \\\"source\\\":\\\" PanSharpen:result\\\"\\n                    },\\n                    { \\n                       \\\"name\\\":\\\"destination\\\",\\n                       \\\"value\\\":\\\"http://bucket.s3.amazonaws.com/output\\\"\\n                    }\\n                 ],\\n                 \\\"taskType\\\":\\\"StageDataToS3\\\"\\n              }\\n           ]\\n        }\",\n      \"language\": \"json\"\n    }\n  ]\n}\n[/block]","excerpt":"Use the Task and Workflow guide to set up tasks and run them in workflows.","slug":"task-and-workflow-course","type":"basic","title":"Task and Workflow Course"}

# Task and Workflow Course

Use the Task and Workflow guide to set up tasks and run them in workflows.

## Task and Workflow Overview

The Task and Workflow Course explains how to integrate a task into the GBDX workflow system.

The following topics are covered:

* Workflow System Overview
* What is a task?
* Setting up a task Docker
* Workflow Definition
* Using the StageToS3 Task

To learn more about the Workflow API, see the [Workflow API Course](doc:workflow-api-course).

## I. Definitions

*Term* | *Definition*
--- | ---
Task | A single data processing algorithm with well-defined input and output data requirements. Tasks are run from Docker containers that are available via Docker Hub.
Task Docker | The Docker container a task runs from. The container must be set up so that the task runs with a set of input data and produces a set of output data without interruption. The task Docker must adhere to certain constraints to work with the workflow system.
Task Registry | A catalog of available tasks. All tasks must be registered in the task registry.
Task Definition | The JSON document that defines a task. The task definition is used to submit the task to the task registry.
Workflow | A chain of tasks that can be run on the platform to process input data. The workflow is where the specified tasks are ordered and their inputs and outputs are connected.
Workflow Definition | The JSON document that defines the execution of a workflow. The workflow definition is submitted to the workflow endpoint to run the workflow.
AWS | Amazon Web Services.
S3 | The AWS online file storage web service and the location for processed image data.
OAuth2 token | An OAuth token is required to submit a request to any GBDX endpoint. See the [Authentication Course](doc:authentication-course) for more information.

## II. Workflow System Overview

The GBDX Workflow system chains tasks together into workflows. A task is an atomic process that performs a specific action. Tasks have at least one input and can have one or more outputs. Tasks are run from Docker containers.

When a workflow is run, it iterates through the tasks and runs every task whose inputs are in a ready state.

The sequence of a workflow is:

1. An instance of a workflow definition is submitted to the workflow endpoint. This starts the workflow.
2. The workflow endpoint validates the workflow against a JSON schema.
3. The workflow endpoint compares the tasks in the workflow to the registered tasks in the task registry. The task registry contains the authoritative definition of each task's inputs, outputs, and run parameters.
4. After validation, the ready tasks are launched. At this point the task Dockers are pulled and run.
5. The workflow is complete when all tasks have been run or an error is encountered. It finishes with a "success" or "fail" status.

## III. Tasks Overview

A task is an atomic process that is performed as a step in a workflow. For example, the "AOP_Strip_Processor" task runs orthorectification on the input image. Other processes such as atmospheric compensation (AComp), pan sharpening, and dynamic range adjustment (DRA) can be turned on in the task definition. Platform tasks are run from Docker containers that are available via Docker Hub.

The task is described in the workflow system's task registry. In the task registry description:

1. The task is named.
2. The task is described.
3. The task's inputs and outputs are specified.
4. Details of the Docker container are given.
Workflows are composed of tasks that are specified in the workflow definition document. It is in this document that the tasks are ordered and the specific inputs and outputs are connected.

See the [Workflow API Course](doc:workflow-api-course) for more information.

### A. Task Dockers

This documentation assumes the user has a working knowledge of Docker and setting up Docker containers. If you're not familiar with Docker, the [Docker website](http://www.docker.com/) offers [documentation](http://docs.docker.com/) and interactive tutorials.

A task is run from a task Docker container. This container must be set up so that the task runs with a set of input data and produces a set of output data without interruption. For the task Docker to work with the platform workflow system, certain constraints must be adhered to.

#### Constraint #1: Location of input and output data

The first constraint is the location of the input and output data. The workflow system makes a distinction between "string" data and "directory" data. "String" data is generally used to pass parameters into a task. "Directory" data is for all file-based data.

* String data: a file named ```/mnt/work/input/ports.json``` is created that contains a name/value pair for each input string port.
* Directory data: a data directory is automatically mounted inside the Docker container at ```/mnt/work``` with "input" and "output" subdirectories.

Input data is mounted into the input directory before the task Docker is run. One subdirectory is created in the input directory for each input port. For example, if the task has two ports named "in1" and "in2", two subdirectories named "in1" and "in2" are created at ```/mnt/work/input/in1``` and ```/mnt/work/input/in2```. If the task's input port is given a "value" of an S3 location in the workflow definition document, the contents of that S3 location are copied to the input port directory.

If the task's input port is given a "source" of an output port from a previous task in the workflow definition document, the results from the previous task are copied to the input port directory.

For any input string ports, the file ```/mnt/work/input/ports.json``` is created inside the Docker container. The contents of ports.json are a simple JSON object of name/value pairs, where the name is the input port name and the value is the input value.

For task outputs, it is the responsibility of the Docker task to create the output data directories. For example, the directory ```/mnt/work/output``` will already exist inside the container, but ```/mnt/work/output/out1``` must be created by the task and the output data placed inside it.

For any output string ports, the Docker task must create ```/mnt/work/output/ports.json``` in the output directory. The contents of that ports.json are a simple JSON object of name/value pairs, where the name is the output port name and the value is the output value.
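The string-port mechanics amount to a few lines of code inside the task. Below is a minimal Python sketch, assuming hypothetical port names ("threshold" in, "out1" and "summary" out); it only illustrates the ports.json contract described above, not any real GBDX task.

```python
import json
import os

INPUT_DIR = "/mnt/work/input"
OUTPUT_DIR = "/mnt/work/output"

# Read the input string ports written by the workflow system.
with open(os.path.join(INPUT_DIR, "ports.json")) as f:
    in_ports = json.load(f)
threshold = in_ports.get("threshold", "0.5")   # hypothetical string port

# Create one subdirectory per output directory port and write results into it.
out_dir = os.path.join(OUTPUT_DIR, "out1")     # "out1" is a hypothetical port name
os.makedirs(out_dir, exist_ok=True)
with open(os.path.join(out_dir, "result.txt"), "w") as f:
    f.write("processed with threshold %s\n" % threshold)

# Write the output string ports expected by the workflow system.
with open(os.path.join(OUTPUT_DIR, "ports.json"), "w") as f:
    json.dump({"summary": "1 file written"}, f)
```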
#### Constraint #2: Task Docker must write a status.json file to ```/mnt/work/status.json```

The other constraint on a task Docker is that it must write a status.json file upon completion. The status.json file must have a status name/value pair, where the name is "status" and the value is "success" for a successful completion. To report an error, the "status" value is any value other than "success", and a reason name/value pair is included, where the name is "reason" and the value is a string giving the reason for the failure.

Example ```/mnt/work/status.json``` file for a successful task:
```json
{
  "status": "success",
  "reason": "because everything worked!"
}
```

Example ```/mnt/work/status.json``` file for a failed task:
```json
{
  "status": "failed",
  "reason": "because nothing worked!"
}
```

#### Integrating a Docker task into the Workflow System

To recap, this is how the Docker task integrates with the workflow system:

###### 1. A data directory is auto-mounted inside the Docker container at ```/mnt/work/```, containing all input string data and directory data.

* The data directory has an "input" subdirectory filled in before the run command is given.
  * Each "directory" data port is copied to a subdirectory in the "input" folder.
  * Each "string" data port is added as a name/value pair to the ports.json file in the "input" folder.
* The data directory has an empty "output" subdirectory created.

For example, on startup the Docker container directory structure might look like this:
```
/
└── mnt
    └── work
        ├── input
        │   ├── in1
        │   │   ├── file1.tif
        │   │   └── file2.tif
        │   ├── in2
        │   │   ├── file3.tif
        │   │   └── file4.tif
        │   └── ports.json
        └── output
```

###### 2. The Docker task runs and must write its output.

* In the mounted volume's "output" directory, one subdirectory must be created for each output port defined for the task. The files for that output port are written to the subdirectory.
* If any output "string" ports exist, ports.json must be created in the output directory with a name/value pair for each string output port.
* ```/mnt/work/status.json``` must be created and filled in.
  * On success, the "status": "success" name/value pair must be added.
  * On error, "status": "failed" and "reason": "reason message" should be written.

The resulting directory and file structure might look like this:
```
/
└── mnt
    └── work
        ├── input
        │   ├── in1
        │   │   ├── file1.tif
        │   │   └── file2.tif
        │   ├── in2
        │   │   ├── file3.tif
        │   │   └── file4.tif
        │   └── ports.json
        ├── output
        │   ├── out1
        │   │   ├── outputfile1
        │   │   └── outputfile2
        │   ├── out2
        │   │   └── outputfile3
        │   └── ports.json
        └── status.json
```

### B. The Task Registry

The Task Registry stores the definitions of the tasks that can be used in a workflow. These definitions publish the interface used for a task's inputs and outputs. A task is defined in a JSON document containing the following:

name: The name of the task. Note: this value goes into the workflow definition as the "taskType".

description: A human-readable description of the task.

inputPortDescriptors: A list/array of input ports, each containing the following:

Name | Description
--- | ---
name | the name of the port
type | the data type of the port
description | the human-readable description of the port
required | true/false. True if the port must be specified for the task to run.

outputPortDescriptors: A list/array of output ports, each containing the following:

Name | Description
--- | ---
name | the name of the port
type | the data type of the port
description | the human-readable description of the port

containerDescriptors: A list/array of Docker Hub containers for the task. Each container descriptor contains the following:

Name | Description
--- | ---
type | The type of container. Currently only "DOCKER" is supported.
properties | Dependent on type. For "DOCKER" containers: image - the full name of the Docker image on Docker Hub, e.g. tdgp/AOP_Strip_Processor.

Additional properties: isPublic, authorizationRequired.
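Putting those fields together, a task definition is a small JSON document. The following is a minimal sketch built as a Python dict; the task name "my-task", its port names, and the image "myorg/my-task" are hypothetical, and the document would be registered as described in section D below.

```python
import json

# Minimal task definition mirroring the fields described above.
# "my-task", its ports, and "myorg/my-task" are hypothetical examples.
task_definition = {
    "name": "my-task",
    "description": "Example task that thresholds an input image.",
    "inputPortDescriptors": [
        {"name": "image", "type": "directory",
         "description": "Input image directory", "required": True},
        {"name": "threshold", "type": "string",
         "description": "Threshold value", "required": False},
    ],
    "outputPortDescriptors": [
        {"name": "result", "type": "directory",
         "description": "Thresholded output image"},
    ],
    "containerDescriptors": [
        {"type": "DOCKER", "properties": {"image": "myorg/my-task"}},
    ],
}

print(json.dumps(task_definition, indent=2))
```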
### C. Container Descriptor

The container descriptor is part of a task definition within the task registry. It tells the platform's workflow worker machine how to run the task Docker container.

This is an example "containerDescriptors" entry:

```json
[
    {
        "type": "DOCKER",
        "properties": {
            "image": "tdgp/test-container"
        }
    }
]
```

### D. Registering a Task with the Task Registry

A task is registered by sending a POST of the task definition to the task registry task endpoint:

```/workflows/v1/tasks```

All registered tasks can be listed by doing a GET on the same endpoint. A specific task definition can be retrieved by doing a GET with the task name appended to the URL, for example ```/workflows/v1/tasks/AOP_Strip_Processor```.

### E. Workflow Definition

A workflow is specified using a JSON document that conforms to the platform's workflow definition language. The requirements for a workflow document are:

1. The JSON must contain a name that identifies the workflow.
2. The JSON must contain an array/list of tasks to run.

Each task in the tasks list contains the following:

__1. taskType:__ The task name defined in the task registry. Note: taskType does not have to be unique in the workflow definition; the same task can be included multiple times in the workflow.

__2. name:__ The user-supplied name for this instance of the task. The task name must be unique in the workflow definition.

__3. inputs:__ A list of inputs for the task. These must correspond with the input ports for the task as defined in the task registry. Each input contains the following:

Name | Description
--- | ---
name | the name of the input port. This value must correspond with the input port name as defined in the task registry.
value | a hard-coded value for the input port. The value can be used to specify the workflow's input datasets. It also works for specifying run parameters for the task.
source | the output port to be used as the input for the task. The source connects the output from one task to the input of the next, and is specified using the previous task's name and that task's output port name in the form "prev-task-name:prev-port-name".

__4. outputs:__ A list of outputs for the task. These must correspond with the output ports for the task as defined in the task registry. Each output contains the following:

__Name__ | __Description__
--- | ---
name | the name of the output port. This value must correspond with the output port name as defined in the task registry.

### F. Launching a workflow on the platform

A workflow is launched by sending a POST with the workflow definition to the workflow endpoint:

```/workflows/v1/workflows```

The POST returns a JSON response that includes the workflow ID.

The status of a running workflow can be retrieved by doing a GET on the workflow status endpoint:

```/workflows/v1/workflows/<workflow_id>```

To see examples of these actions, see [Get a Workflow's status events](doc:get-a-workflows-status-events).

A token is required to access any GBDX endpoint. See the [GBDX Authentication documentation](http://docs.gbdauthenticationauthorization.apiary.io/#) for more information.
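As a concrete illustration of section F, here is a minimal Python sketch that launches a workflow and then checks its state. The ```geobigdata.io``` host comes from the registration example later in this course; the Bearer authorization header and the ```"id"``` field in the response are assumptions, so treat this as a sketch rather than a definitive client.

```python
import requests

GBDX_API = "https://geobigdata.io"
TOKEN = "<your OAuth2 token>"  # obtained as described in the Authentication Course
HEADERS = {"Authorization": "Bearer " + TOKEN}  # standard OAuth2 bearer scheme (assumption)

# A workflow definition shaped like the JSON documents shown in this course.
workflow_definition = {"name": "my_workflow", "tasks": []}  # placeholder definition

# Launch the workflow (section F): POST the definition to the workflow endpoint.
resp = requests.post(GBDX_API + "/workflows/v1/workflows",
                     headers=HEADERS, json=workflow_definition)
resp.raise_for_status()
workflow_id = resp.json()["id"]  # field name assumed; the response includes the workflow ID
print("launched workflow", workflow_id)

# Check the workflow's status by GETting the workflow status endpoint.
status = requests.get(GBDX_API + "/workflows/v1/workflows/" + str(workflow_id),
                      headers=HEADERS)
print(status.json())
```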
## IV. GBDX Platform Tasks

### Using the StageToS3 Task

The purpose of the StageToS3 task is to copy processed data to an accessible location on Amazon S3. StageToS3 is typically the final task in a workflow.

#### StageToS3 Inputs

The StageToS3 task has two inputs:

__Input__ | __Description__
--- | ---
data | StageToS3 uses a previous task's output port as the "source" for this input port. This is the data directory that will be copied to Amazon S3.
destination | The full URL of the S3 location the data will be copied to.

#### StageToS3 Outputs

The StageToS3 task has no output ports.

#### StageToS3 Task Definition

This is the JSON document for the StageToS3 task definition. This is what is submitted to the task registry.

```json
{
    "name": "StageDataToS3",
    "description": "Stage data from a directory into an S3 bucket.",
    "inputPortDescriptors": [
        {
            "required": true,
            "description": "Full S3 URL where the data will be written.",
            "name": "destination",
            "type": "string"
        },
        {
            "required": true,
            "description": "The source directory",
            "name": "data",
            "type": "directory"
        }
    ],
    "containerDescriptors": [
        {
            "type": "DOCKER",
            "properties": {
                "image": "tdgp/stagetos3"
            }
        }
    ]
}
```

## V. Example Pan Sharpening Task

Let's work through an example pan sharpening task. We'll call this task "pansharpen". For this task we will:

* List the inputs and outputs
* Create the Docker container
* Create the task registry task definition
* Register the task in the task registry
* Create a workflow that uses the "pansharpen" task

### Inputs and Output Ports

The pansharpen task has three inputs:

Input | Description
--- | ---
pan | the input pan image
multi | the input multispectral image
resample | the resampling method (string)

This task has one output:

Output | Description
--- | ---
result | the resulting pan-sharpened image

### Creating the Docker Container

Next, we create a Docker container with our pan-sharpening software installed. Inside the container we create a script, /usr/local/bin/run_pan_sharpen.sh, that gathers the input and calls the pan-sharpening software.

The script looks for the input data in a fixed location, /mnt/work/input: the contents of /mnt/work/input/pan are used as the pan image and the contents of /mnt/work/input/multi are used as the multispectral image.

run_pan_sharpen.sh can parse the resample method from the /mnt/work/input/ports.json file, which contains a member named "resample" whose value can be passed to the pan-sharpening software.

The directory structure of the data directory passed into the "pansharpen" Docker should be:

* /mnt/work/input/pan/pan_image.tiff (plus any other ancillary files)
* /mnt/work/input/multi/multi_image.tiff (plus any other ancillary files)
* /mnt/work/input/ports.json (containing the "resample" name/value pair)
* /mnt/work/output (empty directory)

The Docker task creates the output data at:

* /mnt/work/output/result/result.tif
* /mnt/work/status.json (containing the task run status)

After building and testing the Docker container, we push it to our Docker Hub location as tdgp/pansharpen.
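The actual runner in this example is the shell script ```/usr/local/bin/run_pan_sharpen.sh```; the Python sketch below only illustrates the contract that script must satisfy (read the inputs and ports.json, write the ```result``` output port, and report status.json). The ```pan_sharpen``` command it shells out to is a placeholder for whatever pan-sharpening software is installed in the container.

```python
import glob
import json
import os
import subprocess

WORK = "/mnt/work"

def main():
    status = {"status": "success", "reason": "pansharpening complete"}
    try:
        # Locate the input images mounted by the workflow system.
        pan = glob.glob(os.path.join(WORK, "input", "pan", "*.tif*"))[0]
        multi = glob.glob(os.path.join(WORK, "input", "multi", "*.tif*"))[0]

        # Read the "resample" string port from the input ports.json.
        with open(os.path.join(WORK, "input", "ports.json")) as f:
            resample = json.load(f).get("resample", "nearest")

        # Create the "result" output port directory, as the workflow system requires.
        out_dir = os.path.join(WORK, "output", "result")
        os.makedirs(out_dir, exist_ok=True)

        # "pan_sharpen" is a placeholder for the installed pan-sharpening command.
        subprocess.check_call(["pan_sharpen", "--resample", resample,
                               pan, multi, os.path.join(out_dir, "result.tif")])
    except Exception as exc:
        status = {"status": "failed", "reason": str(exc)}

    # Report the task result to the workflow system.
    with open(os.path.join(WORK, "status.json"), "w") as f:
        json.dump(status, f)

if __name__ == "__main__":
    main()
```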
### Creating the Task Registry Task Definition

The next step is to register the task in the task registry. To do this, we create a task definition JSON document. When creating the task definition:

* Use the same strings for the input/output port names as you use for the subdirectories within the Docker container.
* The runner script /usr/local/bin/run_pan_sharpen.sh must EITHER be included in the ```command``` field of containerDescriptors, OR ```command``` must be left blank, in which case the default Docker command is run (defined via the ```CMD``` directive in the Dockerfile).

Here's what the task definition JSON document should look like:

```json
{
    "name": "pansharpen",
    "description": "Pansharpen a multispectral image.",
    "inputPortDescriptors": [
        {
            "required": true,
            "description": "The input pan image",
            "name": "pan",
            "type": "image"
        },
        {
            "required": true,
            "description": "The input multispectral image",
            "name": "multi",
            "type": "image"
        },
        {
            "required": true,
            "description": "Resample method, can be nearest, bilinear or cubic",
            "name": "resample",
            "type": "string"
        }
    ],
    "outputPortDescriptors": [
        {
            "description": "The resulting pan-sharpened image.",
            "name": "result",
            "type": "image"
        }
    ],
    "containerDescriptors": [
        {
            "type": "DOCKER",
            "properties": {
                "image": "tdgp/pansharpen:latest"
            },
            "command": "/usr/local/bin/run_pan_sharpen.sh"
        }
    ]
}
```

After creating the task definition JSON document, submit it to the task registry by sending a POST to the task registry endpoint, https://geobigdata.io/workflows/v1/tasks.

A GET on that same endpoint lists all available tasks.

Your OAuth token is required to submit a request to either of these endpoints. See the [GBDX Authentication/Authorization document](http://docs.gbdauthenticationauthorization.apiary.io/#) for more information.

### Creating a Workflow that uses the Pansharpen Task

We're ready to create a workflow that uses the pansharpen task. We'll name the workflow "pansharpen_stagetos3" because it runs the pansharpen task and then the StageDataToS3 task.

When creating the workflow definition:

* The input and output port names must align with the Docker data directories and the task registry definition.
* The S3 locations of the input images can be specified as the "value" of the input ports. The platform's workflow system copies the contents of those locations into the data directory passed to the Docker container.

This is the example workflow definition JSON document for the workflow "pansharpen_stagetos3":

```json
{
    "name": "pansharpen_stagetos3",
    "tasks": [
        {
            "name": "PanSharpen",
            "taskType": "pansharpen",
            "inputs": [
                {
                    "name": "pan",
                    "value": "http://bucket.s3.amazonaws.com/test_pan_image"
                },
                {
                    "name": "multi",
                    "value": "http://bucket.s3.amazonaws.com/test_multi_image"
                },
                {
                    "name": "resample",
                    "value": "nearest"
                }
            ],
            "outputs": [
                {
                    "name": "result"
                }
            ]
        },
        {
            "name": "StageToS3",
            "taskType": "StageDataToS3",
            "inputs": [
                {
                    "name": "data",
                    "source": "PanSharpen:result"
                },
                {
                    "name": "destination",
                    "value": "http://bucket.s3.amazonaws.com/output"
                }
            ]
        }
    ]
}
```

To run the workflow, submit this document to the workflow endpoint as described in section F above.