Thinking in data
11/5/2025 by Tommy Falkowski


Fuzzy input, structured data and how I started to love DSPy

#Programming #Technology #AI #DSPy

A small step for mankind, a giant leap for me

I recently had somewhat of an epiphany when I realized something that must be obvious to most software devs: Once you start thinking in data, many things make much more sense! I know this revelation wouldn't win anyone the Nobel prize but for me it was like taking off a blindfold. Even though I am technically a mechanical engineer and have always tinkered with computers, I did not, in fact, think much about data for most of my life. But ever since I started building my own software tools and moved most of my workflow to the terminal, I have become data-pilled! And this kind of thinking becomes even more powerful when working with AI! Let me explain why this might be important for you even if you've had this wisdom for much longer than me.

Data, data everywhere

While doing some research on the topic I stumbled upon the DIKW pyramid. I had seen it before, but apparently it hadn't left a big impression. Now I think it's a great model for understanding the importance of data, even if it's very simplistic. The model has four layers, from bottom to top: Data, Information, Knowledge and Wisdom. Each layer is required to form the next one, so without data, you wouldn't get anywhere. And when I say "think in data", what I mean specifically is structured or semi-structured data. Data can be anything, from a measurement taken by a sensor, to a report created by someone, to a form on a website. But unless you're a programmer or a data scientist, you rarely think deeply about the form and shape of this data. For most people, I would argue, it's just text and numbers. But as more and more people are looking to automate things with AI, we really should start thinking about data structures much more. This is especially true for engineers, who are already used to thinking in systems. Now that AI is becoming more and more ingrained in those systems, they should also start to think in data!

Predicting the next token

Most systems can be broken down into individual components that have inputs, a function and outputs. So far, so simple. LLMs are exactly the same: given some input tokens, they loop over predicting the next token until reaching an end-of-sequence token. The accumulated generated tokens form the output of the system. It feels like magic to some, while others just see a bunch of matrix operations running on a GPU. I fall somewhere in between, but I must admit that I'm a pretty heavy user of next-token prediction. Using these kinds of systems as chatbots is fine and especially useful for learning about new topics, but the biggest potential for businesses, in my opinion, comes from automating tasks with the help of LLMs.
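That loop is simpler than it sounds. Here is a minimal sketch of greedy next-token generation, with a toy stand-in model (`EchoModel` and its `predict_next` method are invented for illustration, not a real LLM interface):

```python
class EchoModel:
    """Toy stand-in for an LLM: emits a fixed script, then end-of-sequence."""

    def __init__(self, script, eos_id):
        self.script = iter(script)
        self.eos_id = eos_id

    def predict_next(self, tokens):
        # A real model would compute a probability distribution over the
        # vocabulary from `tokens` and pick the next token from it.
        return next(self.script, self.eos_id)


def generate(model, tokens, eos_id, max_new=256):
    """The generation loop: predict one token, append it, repeat until EOS."""
    for _ in range(max_new):
        next_id = model.predict_next(tokens)
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens
```

Everything an LLM produces, chat answers included, comes out of a loop shaped like this; the system's "output" is just the accumulated list of tokens.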

Process automation is by no means a recent innovation; it has probably existed ever since there were processes to begin with. Well, maybe not that long, but at least since the beginning of the industrial revolution. But even with the advent of computers, the internet and robotic process automation, there was always one major bottleneck: the real world is messy, and many things don't naturally come in the form of structured data. So unless you figured out a good way of reliably extracting data from arbitrary input automatically, you still had to do this step manually. For certain kinds of information, a simple regular expression might be good enough, for example for detecting phone numbers or email addresses. But for anything even a bit more vague or heterogeneous, you had to start creating heuristics or even train specialized NLP models. Think of something like sorting documents into different categories based on content, entering information from documents that never look the same into an ERP system, or judging the sentiment and urgency of a message in your inbox. Tasks like these were traditionally out of reach for many smaller businesses. I'm not saying it was impossible, but the amount of front-loading required was substantial and often resulted in people deciding not to pursue it further and to keep doing the steps manually.

Eliminating the bottleneck

People in Silicon Valley might be surprised by this, but the reality is that many companies still run countless manual processes today. It's mind-blowing: even without AI, there's huge potential for process automation. What makes AI, and especially large language models, such a game changer in my opinion is that they can take fuzzy input and transform it into structured data. Suddenly, the bottleneck is gone. You're able to automate an entire workflow almost completely, depending on the process, of course.

If we're being honest, a lot of current work still consists of copying and pasting information from one format to another: opening up PDF documents, scanning them with your eyes, extracting the relevant bits, and then putting those into an ERP system, for example. The reason I find this realization so important is that most people don't think in data. They have no idea what data structures exist or even what a data type is. So when you talk to companies about process automation, they often assume AI can do everything by default. If you can convince people to start thinking just a little bit in structured data, the potential is enormous. People can start assessing for themselves which parts of their work, ideally the most tedious ones, could be automated. Once you learn to recognize these patterns, you can apply them everywhere. In your daily work, you can ask yourself: can I extract structured data from this process?

Personally, I absolutely hate form-driven workflows like the ones in e.g. SAP. They're just horrible. Having to click through endless forms and sometimes enter the same value, like a date, multiple times drives me crazy. For example, booking a work trip might require entering the same date four times. That's not good process design; that's legacy. And legacy is everywhere. You can't replace these systems overnight. That's why, once you start seeing these patterns, you realize something powerful: maybe you can't replace the legacy system, but you can take fuzzy, natural-language input, a user simply saying what kind of trip they need, where they want to go, and when, and let AI handle the rest. That's easy for AI. To make this robust and reliable, though, the key isn't structured input to structured output. It's fuzzy input to structured output. The input should remain unstructured because that's what drives adoption: people prefer natural interactions. But the output must be structured so that systems can easily parse it.
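To make that concrete, here is a minimal sketch of what the structured side of such a trip-booking flow could look like. The `TripRequest` fields and the example values are invented for illustration, not taken from any real system:

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class TripRequest:
    """Structured target the AI has to fill in; the user only types a sentence."""
    destination: str
    start: date      # each date is captured exactly once
    end: date
    purpose: str


# What an extraction step would ideally produce from the fuzzy input
# "I need to visit a customer in Berlin for a workshop, March 3rd to 5th":
example = TripRequest(
    destination="Berlin",
    start=date(2026, 3, 3),
    end=date(2026, 3, 5),
    purpose="customer workshop",
)
```

Every downstream form in the legacy system can then be filled from this one object, instead of asking the user for the same date four times.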

And I know what you might be thinking: "Why not just build an AI agent that does everything?" Of course, you could try this: chain a bunch of AI calls together and call it a day, one handling the input, another parsing it, and so on. But if we're being sensible engineers, then we need to ask ourselves: do we really need to throw AI at every step of the problem? Classical software can handle many parts efficiently. The only real bottleneck is the fuzzy-to-structured step. That's why I believe that learning to think in data and data structures is such an invaluable skill. Once you understand that pattern, you'll see it everywhere. You don't even have to design a full pipeline right away. You can start with as much AI as possible just to see what works, and then gradually strip it down, replacing parts with more deterministic systems where it makes sense.
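As a sketch of that division of labor, here is a hypothetical inbox-to-ERP pipeline (all function names invented): the AI is confined to a single step, and everything around it stays deterministic, testable code.

```python
def handle_inbox_item(text: str, extract, validate, store) -> dict:
    """Run one fuzzy document through the pipeline.

    extract:  the single AI call, fuzzy text -> structured dict
    validate: plain code, schema and business-rule checks
    store:    plain code, e.g. writing into the ERP system
    """
    record = extract(text)   # the only step that needs an LLM
    validate(record)
    store(record)
    return record
```

If the extraction step later proves solvable with a regex or a small classifier, you swap that one function out; the rest of the pipeline never changes.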

Inputs and Outputs

Getting an LLM to return structured output used to be a bit of a pain. There are tons of memes about "bro, just return JSON, please!", and it's still a somewhat valid method. In the early days, LLMs could be overly verbose and return conversational filler before and/or after a JSON block. But the larger models have gotten so good that you often don't need special tricks anymore to get valid JSON. Still, if you want automation to be reliable and always produce the same structured output, you might need something more formal, like a structured output mode. OpenAI's API supports this through JSON schemas, ensuring the model only returns outputs matching the schema.

Local models can do this too. For example, llama.cpp supports grammars, a method that defines valid generation patterns; the model can only produce outputs that conform to those rules. It's great for guaranteeing valid JSON (or emojis, or any defined structure you might think of), though it can restrict the model's output a bit, resulting in reduced accuracy or creativity. Another approach is fine-tuning, which requires a good dataset. Or you could just rely on good old prompt engineering, optimizing the prompt bit by bit until the model gives you exactly what you need (Please, bro!). But why do something by hand if the AI can potentially do it better than you? That's where DSPy comes in. DSPy has been around for a while, but it takes some getting used to, which is why it took me much longer than I'd like to admit until it clicked for me.

Prompt Programming with DSPy

The core idea of DSPy is that you program LLMs instead of prompting them. Instead of writing long prompts, you define input and output behavior through a signature (like function signatures in typed languages such as Rust). You provide a short docstring, define the structure, and DSPy handles the rest. The most powerful part is its optimizers. The latest one, called GEPA, takes your signature, a few example input-output pairs (I used about ten), and automatically experiments with different prompts until it finds the one that yields the best score according to your metric.

And defining a good metric is key. For my use case, which is turning unstructured text into structured JSON, I use a metric that checks for the following, among other things:

  • Is the output valid JSON?
  • Does it match the schema?
  • How many fields are correct versus missing or hallucinated?

Each of these factors contributes to the score, guiding the teacher model in helping the student improve. The result is an optimized prompt that often outperforms anything you'd come up with manually. You might be able to find an equally good prompt by hand eventually, but it would probably take much longer. DSPy automates that work, and the only requirement is that you can run the teacher and student models (for local models) or that you pay for the tokens (for cloud models). I'm definitely in the former camp, having run gpt-oss-120B on an M2 Ultra Mac Studio and Qwen3-Coder-30B on my local M4 MacBook Pro for the last couple of days. Here is an overview of the metrics I currently use:

```toml
[metrics]
# Weight for extra/hallucinated fields (0.0-1.0, default: 0.5)
# Lower values = lighter penalty for hallucinations vs wrong values
extra_field_weight = 0.5

# Beta parameter for F-beta score (default: 1.5)
# Values > 1.0 favor recall over precision, < 1.0 favor precision
# 1.5 means recall is 1.5x more important than precision
beta = 1.5

# Base score awarded for valid JSON parsing (default: 0.2)
base_parse_score = 0.2

# Additional score for schema validation (default: 0.2)
base_schema_score = 0.2

# Weight for field-level quality (F-beta of precision/recall) (default: 0.5)
field_weight = 0.5

# Weight for coverage bonus (encourages completeness) (default: 0.1)
coverage_weight = 0.1
```
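To show how such a metric might combine these ideas, here is a simplified stand-alone sketch, not my actual DSPy metric: it awards the parse and schema base scores, then scores field-level quality with a flat penalty for hallucinated fields instead of the full F-beta machinery.

```python
import json


def extraction_metric(expected: dict, raw_output: str) -> float:
    """Score one extraction: parseability, then schema, then field quality."""
    try:
        predicted = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0                       # invalid JSON scores nothing
    score = 0.2                          # base_parse_score: it parsed
    if not isinstance(predicted, dict):
        return score                     # minimal stand-in for schema validation
    score += 0.2                         # base_schema_score
    if expected:
        correct = sum(1 for k, v in expected.items() if predicted.get(k) == v)
        hallucinated = sum(1 for k in predicted if k not in expected)
        field_score = correct / len(expected)
        # hallucinated fields cost half as much as outright wrong values
        penalty = 0.5 * hallucinated / max(len(predicted), 1)
        score += 0.6 * max(field_score - penalty, 0.0)
    return round(score, 3)
```

A metric like this gives GEPA a smooth gradient to climb: a prompt that at least produces parseable JSON beats one that doesn't, and every correctly extracted field nudges the score higher.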

And here's the beauty of it: once you start running this locally, you're no longer dependent on cloud AI models. Local models are good enough for many tasks, especially when you break complex workflows into smaller parts. You could even use DSPy's optimized prompts to generate new training data for future fine-tuning, though in many cases you might not need to: DSPy alone can improve results dramatically. Check out the prompt I got as a result from DSPy for extracting contact details.

DSPy is also available for languages other than Python: there's DSRs for Rust, Ax for TypeScript, and I'm sure many others will follow. If you want to experiment with this yourself, check out my example repo. It demonstrates exactly this: converting unstructured text into structured output based on a predefined JSON schema. DSPy is not only for structured output, of course, but the signature-based approach certainly lends itself perfectly to thinking in structured data!