This is very interesting, but I'm wondering how it compares to just using a dyna...

everforward · 2025-09-30T16:57:04 1759251424

From a glance, it looks like very similar tradeoffs vs bash. Much harder to read in a medium-large application, but much more ergonomic IO and process control.

I.e. much faster to use dgsh for a basic processing DAG, much more painful to use dgsh for a large ETL pipeline.

Python with something like Prefect isn't something you'd use a REPL to bang out a one-off on, but it'd be more maintainable. dgsh would let you use a REPL to bang out a quick and dirty DAG.

DSpinellis · 2025-09-30T17:10:28 1759252228

I've found creating pipelines with Python to be messy and intuitive. Other than creating a DSL to express them I can't see how DAGs can be expressed naturally with Python's syntax.

Even creating tools in Python that can be connected together in a Unix shell pipeline isn't trivial. By default if a downstream program stops processing Python's output you get an unsightly broken pipe exception, so you need to execute signal.signal(signal.SIGPIPE, signal.SIG_DFL) to avoid this.

PaulHoule · 2025-09-30T15:34:08 1759246448

There is a lot of stuff for Python which follows the "express computation as a dag" approach, especially Apache Airflow

https://airflow.apache.org/

croemer · 2025-09-30T17:17:42 1759252662

I was curious but the docs are a nightmare. I clicked through a couple of pages and couldn't see a single simple non-trivial example.

DSpinellis · 2025-09-30T16:58:56 1759251536

Apache Airflow solves a very different problem. Its DAGs are static dependencies between sequentially executed processing steps, whereas the DAGs of dgsh express live direct data flows.

PaulHoule · 2025-09-30T17:20:12 1759252812

Yeah, there are also the boxes and lines tools like

https://www.knime.com/

which have their own subculture. You could solve the same problems they do with pandas and scikit-learn but people who use those tools would never use pandas and scikit-learn and vice versa.

Circa 2015 I was thinking those tools all had the architectural flaw that they pass relational rows over the lines as opposed to JSON objects (or equivalent) which means you had to realize joins as highly complex graphs where things that seem like local concerns to me require a global structure and where what seems like a little change to management changes the whole graph in a big way.

I found the people who were buying up that sort of tools didn’t give a damn because they thought customers demanded the speed of columnar execution which our way couldn’t deliver.

I made a prototype that gave the right answers every time and then went to work for a place which had some luck selling their own version that didn’t always give the right answers because: they didn’t know what algebra it supported, didn’t believe something like that had an algebra, and didn’t properly tear the pipeline down at the end.

jpitz · 2025-09-30T20:10:12 1759263012

Do you mean to say that two non-dependant tasks in an Airflow DAG aren't able to concurrently execute? Thats not my experience. I'm also confused by the use of 'static' in this context.

DSpinellis · 2025-09-30T20:24:42 1759263882

That's the point: non-dependant tasks can run concurrently in Airflow. In sh/BAsh/dgsh dependant tasks can also run concurrently, as in tar cf - . | xz.

jpitz · 2025-10-01T02:10:51 1759284651

Ok. thank you!

sunshine-o · 2025-09-30T15:54:34 1759247674

I respect Python but the upgrade to Python 3 showed that data processing workloads that can be handled by standard Unix tooling should stay there.

The upgrade was a nightmare for so many organizations. It shouldn't be that way but it was.

procaryote · 2025-09-30T16:53:55 1759251235

spawning shell commands and the equivalent of piping is surprisingly hard in python. It's almost easier to do in C

There are probably libraries that could help, but then you need to install dependencies which is sad in python for other reasons

croemer · 2025-09-30T17:14:24 1759252464

We use snakemake a lot in bioinformatics to take advantage of parallelism in workflows while staying close to Python: https://github.com/snakemake/snakemake

Others use nextflow but that requires learning Groovy and it's less intuitive.