mdaniel
You seem to have stepped on the same landmine that Ansible did, by defaulting to the jinja2 [aka the text/template silliness in golang] pattern of using double mustaches in YAML. I hope you enjoy quoting things, because you're going to be quoting everything for all time: "{" is a meaningful character in YAML. Contrast

      parameters:
        status: "{{ var('order_status') }}"
with

      parameters:
        # made famous by GitHub Actions
        status: ${{ var('order_status') }}

        # or the ASP.Net flavor:
        status2: <%= var('order_status2') %>

        # or the PHP flavor:
        status3: <?= var('order_status3') ?>
and, just like Ansible, it's going to get insaneo when your inner expression has a quote character too, since you'll need to escape it from the YAML parser, leading to leaning toothpick syndrome, e.g.

      parameters:
        status: "{{ eval('echo \"hello\"') }}"
---

If you find my "but what about the DX?" compelling, also gravely consider why in the world `data_expression:` seems to get a pass, in that it is implicitly wrapped in the mustaches
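(i.e. the real `data_expression:` usage from elsewhere in this thread, next to a hypothetical mustachioed `data:` counterpart for contrast:)

      # the _expression flavor needs no quoting at all:
      data_expression: response.json()['orders']
      # while the template flavor drags the YAML parser into it:
      data: "{{ response.json()['orders'] }}"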

---

edit: ah, that's why https://github.com/paloaltodatabases/sequor/blob/v1.2.0/src/... but https://github.com/paloaltodatabases/sequor/blob/v1.2.0/src/... is what I would suggest changing before you get a bunch of tech debt and have to introduce a breaking change. From

    str_rendered = Template(template_str, undefined=StrictUndefined).render(jinja_context)
to

      str_rendered = Template(template_str, undefined=StrictUndefined,
          variable_start_string="${{",
          variable_end_string="}}"
      ).render(jinja_context)
      # et al, if you want to fix the {# and {%, too

per https://jinja.palletsprojects.com/en/stable/api/#jinja2.Temp...
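For example, a minimal check of the effect (plain jinja2, nothing Sequor-specific):

    from jinja2 import StrictUndefined, Template

    t = Template(
        "status: ${{ order_status }}",
        undefined=StrictUndefined,
        variable_start_string="${{",
        variable_end_string="}}",
    )
    print(t.render(order_status="shipped"))  # -> status: shipped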
maxgrinev OP
Thank you for such an insightful suggestion and deep dive into the code - this is amazing feedback! I'll definitely switch to the ${{}} syntax you suggested.

Quick clarification on _expression: we intentionally use two templating systems - Jinja {{ }} for simple variable injection, and Python *_expression for complex logic that Jinja can't handle.

Actually, since we only use Jinja for variable substitution, should I just drop it entirely? We have another version implemented in Java/JavaScript that uses simple ${var-name} syntax, and we already have Python expressions for advanced scenarios. Might be cleaner to unify on ${var-name} + Python expressions.
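(For reference, if we do drop Jinja, Python's stdlib string.Template already does ${...} substitution; a hypothetical sketch that also allows hyphens in names, since ${var-name} isn't a Python identifier by default:)

    from string import Template

    class VarTemplate(Template):
        # widen the identifier pattern so ${store-name} style names match
        idpattern = r"[A-Za-z][A-Za-z0-9_-]*"

    url = VarTemplate("https://${store-name}.myshopify.com").substitute(
        {"store-name": "acme"}
    )
    print(url)  # -> https://acme.myshopify.com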

Given how deeply you've looked into our system, would you consider using Sequor? I can promise full support including fundamental changes like these - your technical insight would be invaluable for getting the design right early on.

mdaniel
I'm not the target audience for this product, but I experience the pain from folks who embed jinja2/golang templates in YAML every single day, so I am trying to do whatever I can to nip those problems in the bud, so that maybe one day it'll stop being the default pattern.

As for "complex logic that jinja can't handle," I am not able to readily identify what that would mean given that jinja has executable blocks but I do agree with you that its mental model can make writing imperative code inside those blocks painful (e.g. {% set _ = my_dict.update({"something":"else}) %} type silliness)

It ultimately depends on whether those _expression: stanzas are always going to produce a Python result or could produce arbitrary output. If the former, then I agree with you: jinja2 would be terrible for that, since it's a templating language[1]. If the latter, then using jinja2 would be a harmonizing choice, so the author didn't have to keep two different invocation styles in their head at once.

1: one can see that in ansible via this convolution:

  body: >-
    {%- set foo = {} -%}
    {%- for i in ... -%}
    {%- endfor -%}
    {# now emit the dict as json #}
    {{ foo | to_json }}
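(If, on the other hand, those stanzas always produce a Python value, a hypothetical *_expression spelling of the same idea is one honest block:)

  body_expression: |
    {item["id"]: item["total"] for item in orders}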
vivzkestrel
Forgive me for asking a few daft questions, but I want to know a few things:

- who is the target audience for this (programmers / SQL admins / companies employing them)?

- what are they gaining by using this tool?

- who are some other providers that offer similar products?

- how is your offering different from theirs?

- is this a commercial product, and do you have plans to commercialize it, e.g. by turning it into a subscription-based model?
maxgrinev OP
Great questions! Let me break this down:

Target audience:

1) Enterprise IT teams who already know SQL/YAML - they can build complex integrations after ~1 hour of training using our examples, no prior Python needed

2) Modern data teams using dbt - Sequor complements it perfectly for data ingestion and activation

What they gain:

Full flexibility with structure. Enterprise IT folks go from zero to building end-to-end solutions in an hour without needing developer support. Think "dbt but for API integrations."

Competitors & differentiation:

1) Zapier/n8n: GUI looks easy but gets complex fast, poor database integration, can't handle bulk data

2) Fivetran/Airbyte: Pre-built connectors only, zero customization, ingestion-only

3) Us: the only code-first solution built on an open tech stack (SQL+YAML+Python) - gives you full flexibility with Fivetran-level reliability

Business model:

1) Core engine: Open source, free forever

2) Revenue: On-premise server with enterprise features (RBAC, observability and execution monitoring with notifications, audit logs) - flat fee per installation, no per-row costs like competitors

3) Services: Custom connector development and app-to-app integration flows (we love this work!)

4) Cloud version maybe later - everyone wants on-premise now

The key difference:

We're the only tool that's both easy to learn AND highly customizable for all major API integration patterns: data ingestion, reverse ETL, and multi-step iPaaS workflows - all in one platform.

bz_bz_bz
Recalculating customer metrics like that in your main example seems like a massive waste of snowflake resources, no?
maxgrinev OP
Good catch! Yes, recalculating metrics across all historical data every run would be expensive in Snowflake. I chose this example for simplicity to show how the three operations work together, but you're absolutely right about the inefficiency. The flow can easily be optimized for incremental processing - pull only recent orders and update metrics for just the affected customers:

steps:

  # Step 1: Pull only NEW orders since last run
  - op: http_request
    request:
      source: "shopify"
      url: "https://{{ var('store_name') }}.myshopify.com/admin/api/{{ var('api_version') }}/orders.json"
      method: GET
      parameters:
        status: any
        # _expression values are Python expressions, so no Jinja mustaches:
        updated_at_min_expression: "last_run_timestamp() or '2024-01-01'"
      headers:
        "Accept": "application/json"
    response:
      success_status: [200]
      tables:
        - source: "snowflake"
          table: "shopify_orders_incremental"
          columns: { ... }
          data_expression: response.json()['orders']

  # Step 2: Update metrics ONLY for customers with new/changed orders
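  # (note: assumes the new rows in shopify_orders_incremental are also
  # appended to shopify_orders upstream, so the SUM/COUNT below see them)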
  - op: transform
    source: "snowflake"
    query: |
      MERGE INTO customer_metrics cm
      USING (
        SELECT 
          customer_id,
          SUM(total_price::FLOAT) as total_spend,
          COUNT(*) as order_count
        FROM shopify_orders 
        WHERE customer_id IN (
          SELECT DISTINCT customer_id 
          FROM shopify_orders_incremental
        )
        GROUP BY customer_id
      ) new_metrics
      ON cm.customer_id = new_metrics.customer_id
      WHEN MATCHED THEN 
        UPDATE SET 
          total_spend = new_metrics.total_spend,
          order_count = new_metrics.order_count,
          updated_at = CURRENT_TIMESTAMP()
      WHEN NOT MATCHED THEN
        INSERT (customer_id, total_spend, order_count, updated_at)
        VALUES (new_metrics.customer_id, new_metrics.total_spend, new_metrics.order_count, CURRENT_TIMESTAMP())

  # Step 3: Sync only customers whose metrics were just updated
  - op: http_request
    input:
      source: "snowflake"
      query: |
        SELECT customer_id, email, total_spend, order_count
        FROM customer_metrics 
        WHERE updated_at >= '{{ run_start_timestamp() }}'
    request:
      source: "mailchimp"
      # (Mailchimp member IDs are the MD5 of the lowercased email)
      url_expression: |
        f"https://us1.api.mailchimp.com/3.0/lists/{var('list_id')}/members/{hashlib.md5(record['email'].lower().encode()).hexdigest()}"
      method: PATCH
      body_expression: |
        {
          "merge_fields": {
            "TOTALSPEND": record['total_spend'],
            "ORDERCOUNT": record['order_count']
          }
        }
This scales much better: if you have 100K customers but only 50 new orders, you're recalculating metrics for ~50 customers instead of all 100K. Same simple workflow pattern, just production-ready efficiency.

Does this address your concern or did you mean something else? Would you suggest I use a slightly more complex but optimized example for the main demo? Your feedback is welcome and appreciated!

bz_bz_bz
I appreciate the response and detail. The code in your response definitely piqued my interest in the product more than the initial demo code does, but I do understand why you’d want simplicity on your homepage.
maxgrinev OP
Dynamic YAML with computed properties could have applications beyond API integrations. We use Python since it's familiar to data engineers, but our original prototype with JavaScript had even more compact syntax. Would love feedback on our approach and other use cases for dynamic YAML.
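To make "dynamic YAML with computed properties" concrete, here is a hypothetical sketch of the evaluation model (not Sequor's actual code): keys ending in _expression are evaluated as Python against a context; everything else passes through literally.

    import yaml

    def resolve(node, ctx):
        # walk the parsed YAML; evaluate *_expression values as Python
        if isinstance(node, dict):
            out = {}
            for key, value in node.items():
                if key.endswith("_expression"):
                    out[key[: -len("_expression")]] = eval(value, {}, ctx)
                else:
                    out[key] = resolve(value, ctx)
            return out
        if isinstance(node, list):
            return [resolve(item, ctx) for item in node]
        return node

    doc = yaml.safe_load("""
    request:
      url_expression: f"https://{host}/api"
      method: GET
    """)
    print(resolve(doc, {"host": "example.com"}))
    # {'request': {'url': 'https://example.com/api', 'method': 'GET'}}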
seebeen
What a great idea - let's combine one of the worst languages ever invented with a database backend that was never meant to be used as a "processing engine"

/s
