{"_id":"581ba673f749100f00964af5","category":{"_id":"5601aee850ee460d0002224c","__v":20,"project":"55faeacad0e22017005b8265","version":"55faeacad0e22017005b8268","pages":["56023786930fe1170074bd2c","561d53a09463520d00cd11ef","561d546d31d9630d001eb5d1","561d54af31d9630d001eb5d3","561d54e56386060d00e0601e","561d554d9463520d00cd11f2","564246059f4ed50d008be1af","5643712a0d9748190079defb","564372751ecf381700343c1e","5643742008894c0d00031ed3","5643747a0d9748190079df01","564375c988f3a60d00ac86b0","56437d0f0d9748190079df13","56437e83f49bfa0d002f560a","56437f7d0d9748190079df15","5643810508894c0d00031ef5","5643826f88f3a60d00ac86cb","564382de88f3a60d00ac86ce","56e07ba14685db1700d94873","56e08c9b903c7a29001d5352"],"sync":{"url":"","isSync":false},"reference":false,"createdAt":"2015-09-22T19:41:28.703Z","from_sync":false,"order":9,"slug":"tasks-and-workflows-guide","title":"Tasks and Workflows Guide"},"project":"55faeacad0e22017005b8265","__v":0,"parentDoc":null,"user":"55fae9d4825d5f19001fa379","version":{"_id":"55faeacad0e22017005b8268","project":"55faeacad0e22017005b8265","__v":35,"createdAt":"2015-09-17T16:31:06.800Z","releaseDate":"2015-09-17T16:31:06.800Z","categories":["55faeacbd0e22017005b8269","55faf550764f50210095078e","55faf5b5626c341700fd9e96","55faf8a7825d5f19001fa386","560052f91503430d007cc88f","560054f73aa0520d00da0b1a","56005aaf6932a00d00ba7c62","56005c273aa0520d00da0b3f","5601ae7681a9670d006d164d","5601ae926811d00d00ceb487","5601aeb064866b1900f4768d","5601aee850ee460d0002224c","5601afa02499c119000faf19","5601afd381a9670d006d1652","561d4c78281aec0d00eb27b6","561d588d8ca8b90d00210219","563a5f934cc3621900ac278c","5665c5763889610d0008a29e","566710a36819320d000c2e93","56ddf6df8a5ae10e008e3926","56e1c96b2506700e00de6e83","56e1ccc4e416450e00b9e48c","56e1ccdfe63f910e00e59870","56e1cd10bc46be0e002af26a","56e1cd21e416450e00b9e48e","56e3139a51857d0e008e77be","573b4f62ef164e2900a2b881","57c9d1335fd8ca0e006308ed","57e2bd9d1e7b7220000d7fa5","57f2b992ac30911900c7c2b6","58adb5c275df0f1b001ed59b","58c81b5c6dc7140f003c3c46","595412446ed4d9001b3e7b37","59e76ce41938310028037295","5a009de510890d001c2aabfe"],"is_deprecated":false,"is_hidden":false,"is_beta":false,"is_stable":true,"codename":"v1","version_clean":"1.0.0","version":"1"},"updates":[],"next":{"pages":[],"description":""},"createdAt":"2016-11-03T21:04:51.283Z","link_external":false,"link_url":"","githubsync":"","sync_unique":"","hidden":false,"api":{"results":{"codes":[]},"settings":"","auth":"required","params":[],"url":""},"isReference":false,"order":4,"body":"##Overview\n\nWhen a task is run, its output(s) can be used as the input to a subsequent task, saved to the GBDX customer S3 location, or both. This section describes how to save a task output to the customer S3 location.\n\nTo save a task output to the  GBDX customer S3 location, add the \"persist\" flag to the task output port. When there are multiple output ports for a task, set the \"persist\" flag for each one if the output from that port should be saved. \n\nTo specify the location where the output file will be saved, use \"persistLocation\". If no \"persistLocation\" is specified, the file output will be saved to the default location (explained below). 
\n[block:callout]\n{\n  \"type\": \"warning\",\n  \"title\": \"The \\\"persist\\\" flag is the recommended way to save task outputs to the S3 customer location.\",\n  \"body\": \"The \\\"stagedatatoS3\\\" task will be deprecated soon.\"\n}\n[/block]\nName |Description\n--- | ---\npersist |To save an output, the output port descriptor \"persist\":true must be set. If this descriptor is not provided, or the value is set to false, the output file will not be saved and will be lost. \npersistLocation | Specify the location the output file will be staged to on S3. If no location is specified, the file will be saved to the default location. Only the directory and subdirectory names should be specified. \nThis diagram represents a workflow with two tasks. Both tasks have 3 outputs.\n![Persist flag ](https://s3.amazonaws.com/gbdx-doc-images/Persist+Diagram.png)\n\n##Step-by-Step\n\n*Note: gbdxtools users, see [Saving Output Data to S3](http://gbdxtools.readthedocs.io/en/latest/running_workflows.html#saving-output-data-to-s3).*\n\n1. Define the tasks to run in the Workflow Definition.\n\n2. For each task, review the output ports:\n    A. If the output from a port will only be used as input to the subsequent task, do nothing.\n    B. If the output from a port needs to be saved to an S3 location, move on to Step 3.\n    \n3. To save the output from a port, add  \"persist\": true to the output port descriptors.\n\n4. To save the output to a specific location, add \"persistLocation\" and specify the directory name (do not specify the full path). If a location is not specified, the output will be saved to the default location.\n\n## How to Set the \"Persist\" flag on an output port\n\nTo save the output from a task, set the \"persist\" flag as an output port descriptor for any output that should be saved. \nThis example shows an output port with the descriptors \"name\" and \"persist\".\n\n\n     \"outputs\": [{\n                \"name\": \"outputfile\",\n                \"persist\": true\n    \nIf \"persist\": true is not set, the output file will not be saved, and cannot be retrieved later. \n\n## Default Output file Location\nIf \"persist\":true is set for an output, but persistLocation is not specified, the output will be saved to the default location.\n\nBy default, the task output will be saved to the following location:\n    s3://gbd-customer-data/<account id>/workflow_output/<workflow id>/<task name>/<output port name>/<file>.  \n \n## How to Specify the Output file location \n\nYou can specify a location or directory path where the output file should be saved. This is done by adding \"persistLocation\" to the output port descriptors under \"persist\": true. It may be easier to find the output files with a deeper directory structure.\nTo specify the output location, add \"persistLocation\" to the output port with the \"persist\" flag. \n\n            \"outputs\": [{\n                \"name\": \"outputfile\",\n                \"persist\": true\n                \"persistLocation\": \"specify_name\"\n    \nDo not include the full path to the S3 location. The directory you specify will automatically be prepended with: \n      s3://gbd-customer-data/<account id>\n\nThe output will be saved to this location:\n       s3://gbd-customer-data/<account id>/<specify_name>/\nSubdirectories should be separated by a single forward slash.  
Double forward slashes will return a 400 error.\n\nFor example, if  a persistLocation of \"Task_1/output_dir\" is set, and the user's account ID is   734875da-2059-42lz-ad90-03e4o5198fz6,\nthe output will be saved to this full-path location:\n      s3://gbd-customer-data/734875da-2059-42lz-ad90-03e4o5198fz6`enter code here`/Task_1/output_dir/<file_name>\n\nTo see the full S3 location path after the workflow has been submitted, make a request to the Workflow Status endpoint with the workflow ID. \n\n## Business Rules\n1. \"Persist\" and \"persistLocation\" should be set on each output port of a task if that port's output should be saved. This is also true for multiplex output ports. If the \"persist\" flag is not set on an output port, the output from that port will not be saved. \n\n2. If \"persist\" = true, but no \"persistLocation\" is set, the default location,  will be used. The default location is: \n`s3://gbd-customer-data/<account id>/workflow_output/<workflow id>/<task name>/<output port name>/<file>`\n\n3. \"persistLocation\" should not be used alone. The file will only be saved to a specified location if the \"persist\" flag is set to \"true\" and persistLocation is specified.\n\n4. Output files cannot be saved to a personal S3 bucket. Files can be downloaded from the GBDX customer S3 location to a personal S3 location by a separate process if needed. See \"access the output files\" for a list of options for accessing and downloading the contents from the GBDX customer s3 location.\n\n5. If the same persistLocation is used for several outputs, data in the folder will be accumulated. If there are conflicts, the data will be overwritten with the latest file. If two tasks run in parallel with the same outputs and the same persist location, the behavior is undefined. \n \n### Example Workflow request body\n\nThis example shows the \"persist\" and \"persistLocation\" descriptors on a task's output port. This is part of a workflow definition.\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \" {\\n   \\\"name\\\": \\\"test_persis\\\",\\n   \\\"tasks\\\": [{\\n       \\\"name\\\": \\\"task_1\\\",\\n       \\\"taskType\\\": \\\"test-string-to-file\\\",\\n       \\\"inputs\\\": [{\\n           \\\"name\\\": \\\"inputstring\\\",\\n           \\\"value\\\": \\\"for the demo!!!\\\"\\n       }],\\n       \\\"outputs\\\": [{\\n           \\\"name\\\": \\\"outputfile\\\",\\n           \\\"persist\\\": true,\\n           \\\"persistLocation\\\": \\\"for_demo\\\"\\n       }]\\n   }]\\n}\\n\",\n      \"language\": \"json\"\n    }\n  ]\n}\n[/block]\n### Example Workflow response with persistLocation\n\nThis is the JSON response from the workflow request:\n\n*Note: Currently the response will list persistLocation as persist_location. This will be updated to \"persistLocation\" in our next release, scheduled for 11/10/2016*.\n\nThe persist location shown in the response is accountID/persistLocation name. 
The full path would be:\n\n     s3://gbd-customer-data/7b216bd3-6523-4ca5-aa3b-1d8a5994f052/for_demo/filename\n\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"{\\n \\\"tasks\\\": [\\n   {\\n     \\\"inputs\\\": [\\n       {\\n         \\\"source\\\": null,\\n         \\\"dataflow_channel\\\": null,\\n         \\\"type\\\": \\\"string\\\",\\n         \\\"name\\\": \\\"inputstring\\\",\\n         \\\"value\\\": \\\"for the demo!!!\\\"\\n       }\\n     ],\\n     \\\"outputs\\\": [\\n       {\\n         \\\"persistLocation\\\": \\\"7b216bd3-6523-4ca5-aa3b-1d8a5994f052/for_demo\\\",\\n         \\\"dataflow_channel\\\": null,\\n         \\\"name\\\": \\\"outputfile\\\",\\n         \\\"value\\\": null,\\n         \\\"source\\\": null,\\n         \\\"type\\\": \\\"textfile\\\"\\n       }\\n     ],\\n     \\\"start_time\\\": null,\\n     \\\"taskType\\\": \\\"test-string-to-file\\\",\\n     \\\"id\\\": \\\"4458592653599309096\\\",\\n     \\\"name\\\": \\\"task_1\\\",\\n     \\\"note\\\": null,\\n     \\\"callback\\\": null,\\n     \\\"state\\\": {\\n       \\\"state\\\": \\\"pending\\\",\\n       \\\"event\\\": \\\"submitted\\\"\\n     },\\n     \\\"run_parameters\\\": {\\n       \\\"mounts\\\": [\\n         {\\n           \\\"read_only\\\": false,\\n           \\\"local\\\": \\\"/mnt/glusterfs\\\",\\n           \\\"container\\\": \\\"/mnt/glusterfs\\\"\\n         },\\n         {\\n           \\\"read_only\\\": false,\\n           \\\"local\\\": \\\"$task_data_dir\\\",\\n           \\\"container\\\": \\\"/mnt/work\\\"\\n         }\\n       ],\\n       \\\"image\\\": \\\"tdgp/test-string-to-file\\\",\\n       \\\"command\\\": \\\"\\\",\\n       \\\"devices\\\": []\\n     },\\n     \\\"timeout\\\": 7200,\\n     \\\"instance_info\\\": {\\n       \\\"domain\\\": \\\"default\\\"\\n     }\\n   }\\n ],\\n \\\"completed_time\\\": null,\\n \\\"callback\\\": null,\\n \\\"state\\\": {\\n   \\\"state\\\": \\\"pending\\\",\\n   \\\"event\\\": \\\"submitted\\\"\\n },\\n \\\"submitted_time\\\": \\\"2016-11-03T21:27:42.632147+00:00\\\",\\n \\\"owner\\\": \\\"workflow owner's name\\\",\\n \\\"id\\\": \\\"4458592653599519360\\\"\\n}\",\n      \"language\": \"json\"\n    }\n  ]\n}\n[/block]\n## Access the saved Output Files\n\nUse your preferred method to access and download the output files from the S3 location.\n\n[S3 Browser](http://s3browser.geobigdata.io/login.html)\nUse your GBDX username (this is typically your email address) and password to log in. \n\n[S3 Storage Service Course](doc:s3-storage-service-course) \n\n[gbdxtools](http://gbdxtools.readthedocs.io/en/latest/user_guide.html#getting-your-s3-information)\n\n## Error Conditions\n1. If \"persist\":true is not set for an output port descriptor, the file will not be saved. \n\n2.  If a persistLocation is set. but persist does not equal \"true\", a 400 error will be returned. \n\n3.  If [invalid characters](http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html) are used in the \"persistLocation\" name, a 400 error will be returned.","excerpt":"","slug":"how-to-save-task-outputs","type":"basic","title":"How to Save Task Outputs"}

# How to Save Task Outputs


## Overview

When a task is run, its output(s) can be used as the input to a subsequent task, saved to the GBDX customer S3 location, or both. This section describes how to save a task output to the customer S3 location.

To save a task output to the GBDX customer S3 location, add the "persist" flag to the task output port. When a task has multiple output ports, set the "persist" flag on each port whose output should be saved.

To specify the location where the output file will be saved, use "persistLocation". If no "persistLocation" is specified, the output file will be saved to the default location (explained below).

> **Warning:** The "persist" flag is the recommended way to save task outputs to the S3 customer location. The "stagedatatoS3" task will be deprecated soon.

Name | Description
--- | ---
persist | To save an output, the output port descriptor "persist": true must be set. If this descriptor is not provided, or the value is set to false, the output file will not be saved and will be lost.
persistLocation | Specifies the S3 location the output file will be staged to. If no location is specified, the file will be saved to the default location. Specify only the directory and subdirectory names, not the full path.

This diagram represents a workflow with two tasks. Both tasks have three outputs.

![Persist flag](https://s3.amazonaws.com/gbdx-doc-images/Persist+Diagram.png)

## Step-by-Step

*Note: gbdxtools users, see [Saving Output Data to S3](http://gbdxtools.readthedocs.io/en/latest/running_workflows.html#saving-output-data-to-s3) or the gbdxtools sketch below.*

1. Define the tasks to run in the Workflow Definition.

2. For each task, review the output ports:
    A. If the output from a port will only be used as input to a subsequent task, do nothing.
    B. If the output from a port needs to be saved to an S3 location, move on to Step 3.

3. To save the output from a port, add "persist": true to the output port descriptors.

4. To save the output to a specific location, add "persistLocation" and specify the directory name (do not specify the full path). If a location is not specified, the output will be saved to the default location.

## How to Set the "Persist" Flag on an Output Port

To save the output from a task, set the "persist" flag as an output port descriptor on every output that should be saved.
This example shows an output port with the descriptors "name" and "persist":

    "outputs": [{
        "name": "outputfile",
        "persist": true
    }]

If "persist": true is not set, the output file will not be saved and cannot be retrieved later.
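For gbdxtools users, the same effect comes from Workflow.savedata, per the gbdxtools documentation linked above. The following is a minimal sketch, assuming the Interface/Task/Workflow API from those docs; the task and port names reuse the test-string-to-file example from this page, and savedata's location argument is assumed to map to "persistLocation".

```python
# Hedged gbdxtools sketch: persist one output port of a simple workflow.
from gbdxtools import Interface

gbdx = Interface()  # reads GBDX credentials from your local gbdxtools config

# Build the demo task used elsewhere on this page
task = gbdx.Task("test-string-to-file", inputstring="for the demo!!!")

workflow = gbdx.Workflow([task])

# savedata marks the "outputfile" port with "persist": true;
# the location argument (assumed) becomes the "persistLocation" directory
workflow.savedata(task.outputs.outputfile, location="for_demo")

workflow.execute()
print(workflow.id)  # keep the workflow ID to check status later
```

If savedata is called without a location, the output is persisted to the default location described in the next section.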
## Default Output File Location

If "persist": true is set for an output but "persistLocation" is not specified, the output will be saved to the default location:

    s3://gbd-customer-data/<account id>/workflow_output/<workflow id>/<task name>/<output port name>/<file>

## How to Specify the Output File Location

You can specify a directory path where the output file should be saved by adding "persistLocation" to the output port descriptors alongside "persist": true. A deeper directory structure can make the output files easier to find.

    "outputs": [{
        "name": "outputfile",
        "persist": true,
        "persistLocation": "specify_name"
    }]

Do not include the full path to the S3 location. The directory you specify will automatically be prepended with:

    s3://gbd-customer-data/<account id>

The output will be saved to this location:

    s3://gbd-customer-data/<account id>/<specify_name>/

Subdirectories should be separated by a single forward slash. Double forward slashes will return a 400 error.

For example, if a persistLocation of "Task_1/output_dir" is set, and the user's account ID is 734875da-2059-42lz-ad90-03e4o5198fz6, the output will be saved to this full-path location:

    s3://gbd-customer-data/734875da-2059-42lz-ad90-03e4o5198fz6/Task_1/output_dir/<file_name>

To see the full S3 location path after the workflow has been submitted, make a request to the Workflow Status endpoint with the workflow ID.

## Business Rules

1. "persist" and "persistLocation" should be set on each output port of a task whose output should be saved. This is also true for multiplex output ports. If the "persist" flag is not set on an output port, the output from that port will not be saved.

2. If "persist" is true but no "persistLocation" is set, the default location will be used:
`s3://gbd-customer-data/<account id>/workflow_output/<workflow id>/<task name>/<output port name>/<file>`

3. "persistLocation" should not be used alone. The file will only be saved to a specified location if the "persist" flag is set to true and "persistLocation" is specified.

4. Output files cannot be saved to a personal S3 bucket. Files can be downloaded from the GBDX customer S3 location to a personal S3 location by a separate process if needed. See "Access the Saved Output Files" below for a list of options for accessing and downloading the contents of the GBDX customer S3 location.

5. If the same persistLocation is used for several outputs, data will accumulate in that folder. If there are naming conflicts, the data will be overwritten with the latest file. If two tasks run in parallel with the same outputs and the same persist location, the behavior is undefined.

### Example Workflow request body

This example shows the "persist" and "persistLocation" descriptors on a task's output port. This is part of a workflow definition.

```json
{
  "name": "test_persis",
  "tasks": [{
    "name": "task_1",
    "taskType": "test-string-to-file",
    "inputs": [{
      "name": "inputstring",
      "value": "for the demo!!!"
    }],
    "outputs": [{
      "name": "outputfile",
      "persist": true,
      "persistLocation": "for_demo"
    }]
  }]
}
```
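One way to submit this request body is with a short Python script. This is a hedged sketch rather than a definitive client: the endpoint URL path and the Bearer-token header are assumptions about the Workflow API, and you must supply your own GBDX access token.

```python
# Hedged sketch: POST the example workflow definition with persist settings.
import requests

TOKEN = "<your GBDX access token>"  # assumed: obtained separately from the GBDX auth service

workflow = {
    "name": "test_persis",
    "tasks": [{
        "name": "task_1",
        "taskType": "test-string-to-file",
        "inputs": [{"name": "inputstring", "value": "for the demo!!!"}],
        "outputs": [{
            "name": "outputfile",
            "persist": True,               # save this port's output to S3
            "persistLocation": "for_demo"  # directory under s3://gbd-customer-data/<account id>/
        }]
    }]
}

response = requests.post(
    "https://geobigdata.io/workflows/v1/workflows",  # assumed Workflow API endpoint
    json=workflow,
    headers={"Authorization": "Bearer " + TOKEN},
)
response.raise_for_status()
print(response.json()["id"])  # workflow ID, used to check status and output locations
```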
### Example Workflow response with persistLocation

This is the JSON response from the workflow request.

*Note: Currently the response will list persistLocation as persist_location. This will be updated to "persistLocation" in our next release, scheduled for 11/10/2016.*

The persist location shown in the response is <account ID>/<persistLocation name>. The full path would be:

    s3://gbd-customer-data/7b216bd3-6523-4ca5-aa3b-1d8a5994f052/for_demo/filename

```json
{
  "tasks": [
    {
      "inputs": [
        {
          "source": null,
          "dataflow_channel": null,
          "type": "string",
          "name": "inputstring",
          "value": "for the demo!!!"
        }
      ],
      "outputs": [
        {
          "persistLocation": "7b216bd3-6523-4ca5-aa3b-1d8a5994f052/for_demo",
          "dataflow_channel": null,
          "name": "outputfile",
          "value": null,
          "source": null,
          "type": "textfile"
        }
      ],
      "start_time": null,
      "taskType": "test-string-to-file",
      "id": "4458592653599309096",
      "name": "task_1",
      "note": null,
      "callback": null,
      "state": {
        "state": "pending",
        "event": "submitted"
      },
      "run_parameters": {
        "mounts": [
          {
            "read_only": false,
            "local": "/mnt/glusterfs",
            "container": "/mnt/glusterfs"
          },
          {
            "read_only": false,
            "local": "$task_data_dir",
            "container": "/mnt/work"
          }
        ],
        "image": "tdgp/test-string-to-file",
        "command": "",
        "devices": []
      },
      "timeout": 7200,
      "instance_info": {
        "domain": "default"
      }
    }
  ],
  "completed_time": null,
  "callback": null,
  "state": {
    "state": "pending",
    "event": "submitted"
  },
  "submitted_time": "2016-11-03T21:27:42.632147+00:00",
  "owner": "workflow owner's name",
  "id": "4458592653599519360"
}
```

## Access the Saved Output Files

Use your preferred method to access and download the output files from the S3 location:

* [S3 Browser](http://s3browser.geobigdata.io/login.html): use your GBDX username (typically your email address) and password to log in.
* [S3 Storage Service Course](doc:s3-storage-service-course)
* [gbdxtools](http://gbdxtools.readthedocs.io/en/latest/user_guide.html#getting-your-s3-information)

## Error Conditions

1. If "persist": true is not set for an output port descriptor, the file will not be saved.

2. If a persistLocation is set but persist is not set to true, a 400 error will be returned.

3. If [invalid characters](http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html) are used in the "persistLocation" name, a 400 error will be returned.
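To confirm where persisted outputs will land, you can look up the submitted workflow by ID and read the persistLocation on each output port, as the example response above shows. This is a hedged sketch: the GET endpoint path mirrors the assumed POST endpoint used earlier, and the response fields follow the example response in this section.

```python
# Hedged sketch: read back a workflow and print the S3 paths of persisted outputs.
import requests

TOKEN = "<your GBDX access token>"
WORKFLOW_ID = "4458592653599519360"  # ID returned when the workflow was submitted

response = requests.get(
    "https://geobigdata.io/workflows/v1/workflows/" + WORKFLOW_ID,  # assumed endpoint
    headers={"Authorization": "Bearer " + TOKEN},
)
response.raise_for_status()

for task in response.json().get("tasks", []):
    for port in task.get("outputs", []):
        location = port.get("persistLocation")  # "<account id>/<persistLocation>"
        if location:
            print("s3://gbd-customer-data/" + location + "/")
```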