I've been working for a little while on a gem to configure and deploy AWS Data Pipelines. At my day job, we use Data Pipeline to schedule various types of repeating jobs. If you're interested, you can see more details on the company blog. To summarize, we wanted a library to configure Data Pipelines as Ruby objects so we could easily compose, reuse, and version-control them.
In developing waterworks, we decided to build on top of the Ruby AWS SDK. We chose Ruby mostly because no single AWS SDK seemed to have a significant advantage for Data Pipeline, and secondarily because most people in the company are comfortable with Ruby. Another advantage of the Ruby SDK over something like the Java one is that the hashes it expects are very similar to the JSON documents used by the CLI and web console.
There is a caveat, however: the two formats are subtly different.
JSON (see reference):
{
  "objects": [
    {
      "id": "my_id",
      "name": "my_name",
      "field_key_1": "field_value_1",
      ...
      "field_key_n": "field_value_n"
    },
    ...
  ]
}
Ruby (see reference):
{
  pipeline_objects: [ # required
    {
      id: "my_id", # required
      name: "my_name", # required
      fields: [ # required
        {
          key: "field_key_1", # required
          string_value: "field_value_1",
          # or
          ref_value: "field_value_1",
        },
        ...
      ],
    },
    ...
  ]
}
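To make the comparison concrete, here is a minimal sketch of how a hash in this format gets handed to the SDK. It assumes the aws-sdk gem (aws-sdk-datapipeline under v3), a configured client, and a placeholder pipeline id; it is not the waterworks code itself:

require 'aws-sdk-datapipeline' # or 'aws-sdk' if you are on v2 of the SDK

client = Aws::DataPipeline::Client.new(region: 'us-east-1')

# Upload a definition for an existing pipeline. The pipeline id below is a
# placeholder; in practice it comes back from create_pipeline.
client.put_pipeline_definition(
  pipeline_id: 'df-EXAMPLE',
  pipeline_objects: [
    {
      id: 'Default',
      name: 'Default',
      fields: [
        { key: 'scheduleType', string_value: 'cron' },
        { key: 'schedule',     ref_value: 'DefaultSchedule' }
      ]
    }
  ]
)

# Once the definition is accepted, the pipeline can be activated.
client.activate_pipeline(pipeline_id: 'df-EXAMPLE')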
The main difference is that the fields of each Pipeline Object are encapsulated differently. In the JSON format, the fields sit at the same level as the id and name. In the Ruby Hash format, they are nested in a fields array whose elements have a key and an explicit string_value or ref_value, whereas in the JSON format the distinction between string and reference is implicit. This is a slight annoyance, but since each field is bound to a type (see the fields tables in an example object), we can easily build this into our logic.
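To illustrate that last point, here is a rough sketch of how such a conversion could be built. This is not the actual waterworks logic, and REF_KEYS is a hypothetical stand-in for the per-field types listed in the documentation for each object type:

# Hypothetical list of keys whose values are references to other pipeline
# objects rather than plain strings; the real list comes from the field
# tables in the AWS docs for each object type.
REF_KEYS = %w[schedule runsOn input output dependsOn].freeze

# Convert a JSON-style object (fields at the same level as id/name) into
# the nested structure the Ruby SDK expects.
def to_pipeline_object(json_obj)
  fields = json_obj.reject { |k, _| %w[id name].include?(k) }.map do |key, value|
    if REF_KEYS.include?(key)
      { key: key, ref_value: value }
    else
      { key: key, string_value: value }
    end
  end

  { id: json_obj['id'], name: json_obj['name'], fields: fields }
end

json_obj = {
  'id'       => 'MyCopyActivity',
  'name'     => 'MyCopyActivity',
  'type'     => 'CopyActivity',
  'schedule' => 'DefaultSchedule' # a reference to another object
}

to_pipeline_object(json_obj)
# => { id: "MyCopyActivity", name: "MyCopyActivity",
#      fields: [{ key: "type", string_value: "CopyActivity" },
#               { key: "schedule", ref_value: "DefaultSchedule" }] }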