ICT for Civic Data — Crash Course 2026
# Foundations
Self-Paced Review — Part 0
civicliteraci.es
---

# The Data Pipeline

---

## The data pipeline has seven steps in two groups

The data pipeline structures all data work from beginning to end.
Acquisition
- **Define** — formulate the question, hypothesis, and horizon table
- **Find** — locate data sources that could answer the question
- **Get** — retrieve the data into your workspace
Processing
- **Verify** — check that the data is what you think it is
- **Clean** — fix errors, standardise, consolidate
- **Analyse** — answer the questions you defined at the start
**Present** cuts across both groups. It is about communication: maps, charts, reports, dashboards. The output format depends on the audience and the story.

Note: The pipeline comes from the School of Data, developed at the Open Knowledge Foundation. It applies at every scale: a quick Wikipedia check, a national data collection, or anything in between. Present is not just "make a chart": it is the translation of data work into something a stakeholder can use.

---

## The pipeline is iterative

Two loops appear in practice:

**Define ↔ Find ↔ Get**: discovering what data exists changes what questions you can answer. Retrieval issues may send you back to Find, and what you find may change the question itself.

**Verify ↔ Clean ↔ Analyse**: during cleaning you may discover quality issues that require re-verification. During analysis, something may look wrong and send you back to cleaning.

The pipeline is a living process, not a rigid sequence. Returning to an earlier step is a feature, not a failure.

Note: The Define-Find loop was central to the crash course: students chose their angle on Day 1, then revised it after discovering what data actually existed on Day 2. The Verify-Clean-Analyse loop appeared on Days 3-4 when cleaning revealed unexpected data quality issues.

---

# AI and the Pipeline

---

## The pipeline as a harness

AI is powerful but walks while looking at its feet. It does not have a world model. It needs fences to stay on track. The data pipeline IS the harness:

- Each step constrains what the AI does
- Each step produces a verifiable output
- Each boundary is a checkpoint where you inspect the work

Just as a saddle and reins guide a horse, the pipeline guides AI. Without it, AI will jump from Get to Analyse without human verification, and you will not know where it went wrong.

Note: From Lecture 1: the harness metaphor. The pipeline prevents the "happy path" problem where AI skips directly to analysis from raw data.
Every organisation that uses AI effectively in data work constrains it to isolated, verifiable steps.

---

## Analysis is the easy part

Data analysis is generally **almost the easy part**. Identifying which data you need, making sure it is correct and usable, and then cleaning it — those are the hard parts. Once you have a clean dataset, the analysis itself can be as simple as a pivot table.

**Example:** Mediterranean island wildfires — compare 60 reported rural wildfires to 20 urban ones over 50 years. That is analysis: answering your initial question with data. It does not have to be complex.

People who work with data often say that **80% of the work is just cleaning**, because it is so rare to work with data that is already clean. And if a dataset is always clean, it is probably not the data with the most value from a social change perspective.

Note: From Day 5. This demystifies analysis. Students often think analysis means statistics or machine learning. In most civic data projects, analysis is counting, comparing, and mapping. The hard work is everything that comes before: finding the right data, getting it, verifying it, and cleaning it. The 80% figure is widely cited by data practitioners and was reinforced throughout the crash course.

---

## AI concentrates at Get and Clean
| Pipeline step | Human only | AI can help | AI can do it |
|---|---|---|---|
| **Define** | Research question, editorial framing | Brainstorming | |
| **Find** | | Source discovery | |
| **Get** | | | Extraction, scraping |
| **Verify** | Data quality judgment | Cross-referencing | |
| **Clean** | | | Tidying, formulas |
| **Analyse** | Hypothesis testing, interpretation | | |
| **Present** | Editorial message | Visualisation drafts | |
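As a concrete flavour of the "AI can do it" cells: a typical Clean task is a small, inspectable script rather than a judgment call. A minimal sketch, with invented sample data:

```python
import csv
import io

def clean_rows(raw_csv: str) -> list[dict]:
    """Tidy a CSV: normalise headers, strip whitespace, drop empty rows."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    cleaned = []
    for row in reader:
        # snake_case the headers, trim the cell values
        tidy = {k.strip().lower().replace(" ", "_"): (v or "").strip()
                for k, v in row.items()}
        if any(tidy.values()):  # skip fully empty rows
            cleaned.append(tidy)
    return cleaned

sample = "Region , Fires\n Coastal , 12\n , \nInland,30\n"
print(clean_rows(sample))
```

Every choice the script makes (what counts as empty, how headers are normalised) is visible in the code, which is what makes this step verifiable.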
Get and Clean are mechanical, repetitive, and text-based: exactly what AI is good at. Define, Verify, and Analyse require judgment about meaning: exactly what AI is bad at.

Note: From Lecture 2: the AI involvement heatmap. This pattern held throughout the crash course. AI wrote scraping scripts and cleaning formulas. Humans decided what data to collect, whether results made sense, and what the data meant for the proposal.

---

## More automation means less visibility

| Level | Method | Example |
|---|---|---|
| **Manual** | Copy-paste | Select table, paste in spreadsheet |
| **Semi-manual** | Built-in functions | IMPORTHTML in Google Sheets |
| **Tool-assisted** | Visual, no code | WebScraper.io Chrome extension |
| **AI-assisted** | Chatbot writes code | "Write me a script to scrape this" |
| **AI-driven** | Agent does everything | AI navigates, extracts, saves |

Each level gives you more power but **less visibility** into what is happening. As automation increases, the burden on **Verify** increases proportionally.

Note: From Lecture 2: the verification burden. Students experienced this: when AI built a map in one step, they could not tell whether the data was correct. When they separated Find, Get, and Verify, errors became visible.

---

# Working With AI

---

## Structure your prompts with five elements

When a task has multiple steps, the prompt needs structure. Five elements:

1. **Objective** — what are you trying to achieve?
2. **Steps** — what should the AI do, in what order?
3. **Reasoning** — why each step? (so the AI does not skip or improvise)
4. **Output shape** — what should the result look like?
5. **Human verification** — how will you check the result?

A prompt without these elements is an invitation for the AI to fill in the blanks with hallucinations. The more complex the task, the more important it is to think through all five. When prompts fail, come back to this framework to diagnose what is missing.
Note: Introduced on Day 3 of the crash course, after students had enough hands-on experience to appreciate why structure matters. The framework is specifically for data work with AI agents, not general chatbot usage.

---

## Tasks stay within pipeline steps

A task must live entirely within **one** pipeline step. If it crosses step boundaries, it is two tasks.

**Wrong:** "Find the data and make a map of it."

**Right:** Two tasks:

1. **Find** — produce a list of data sources. You review the list.
2. **Get** — download data from a verified source. You check the file.

Why: if you ask AI to do both at once and the map shows wrong data, you cannot tell whether the AI found the wrong source, fetched bad data, or rendered it incorrectly. Separating steps = separating failure modes.

Note: From Lectures 2-3. The election results example: without separating Find and Get, the AI might hit anti-bot protection, silently fall back to an older dataset, and you would never know.

---

## Trust the code, not the reasoning
Trust
The AI's **technical capability**: it knows how to write code, query APIs, convert formats. Scripts are **deterministic**: the same script on the same data produces the same result every time.
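A sketch of what "deterministic" means in practice (the sample data is invented): the same script on the same input always yields the same output, which is what makes it checkable.

```python
import csv
import io
import json

def csv_to_json(raw: str) -> str:
    """Convert CSV text to a JSON array of records."""
    rows = list(csv.DictReader(io.StringIO(raw)))
    return json.dumps(rows, sort_keys=True)

data = "name,count\na,1\nb,2\n"

# Determinism: repeated runs on the same input give byte-identical output.
assert csv_to_json(data) == csv_to_json(data)
print(csv_to_json(data))
```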
Don't trust
The AI's **reasoning about data content**: it does not understand what the data means. Never ask the AI to directly modify the content of your data. Ask it to write a script that does it.
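One way to picture the rule, as a hypothetical sketch: the script below changes the data's *shape* (filters rows, derives a column) but never rewrites a value it was given.

```python
import csv
import io

def filter_since(raw: str, year: int) -> list[dict]:
    """Shape-only transform: keep rows from `year` onwards and add a
    derived column. Existing cell values are never edited or invented."""
    out = []
    for row in csv.DictReader(io.StringIO(raw)):
        if int(row["year"]) >= year:
            row["decade"] = f"{int(row['year']) // 10 * 10}s"  # derived, not edited
            out.append(row)
    return out

raw = "year,fires\n1975,4\n1988,7\n2003,9\n"
print(filter_since(raw, 1980))
```

If a value looks wrong in the output, the original data is untouched and the transformation is a script you can read.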
Exception: AI writing a formula inside a spreadsheet cell IS trustable because you see the result immediately in the cell and can verify it in place.

Note: From Day 3 and Day 4 of the crash course. The principle: AI can modify data shape (filter rows, create columns, convert formats) via scripts. It should not modify data content (edit values, fill in blanks, correct typos) by itself.

---

## AI risk depends on what the code does
Lower risk: functional code
AI writes code for a website menu, a scraper, a map. The goal is **software that works**. You can test it: does the menu open? Did the scraper get the right rows? Does the map render? Verification is **direct and immediate**.
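A small sketch of what "direct and immediate" verification looks like (the table content is invented): after extraction, you can assert the shape you expected and fail loudly if it is wrong.

```python
import csv
import io

def extract_rows(raw: str) -> list[list[str]]:
    """Stand-in for a 'Get' step: parse a downloaded table into rows."""
    return [r for r in csv.reader(io.StringIO(raw)) if r]

page_table = "station,pm25\nNicosia,18\nLimassol,22\n"
rows = extract_rows(page_table)

# Functional code is directly testable: did we get the shape we expected?
assert len(rows) == 3  # header + 2 data rows
assert rows[0] == ["station", "pm25"]
print("extraction checks passed")
```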
Higher risk: data processing code
AI writes code that ranks, compares, or classifies data. The goal is **decisions about the world**. How do you know the ranking logic is correct, unless you can read the code and understand every choice it made?
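A toy illustration of why reading the code matters (the figures are invented): two defensible ranking choices give opposite answers on the same data.

```python
# Two equally plausible rankings of the same regions. Which choice did the
# AI make, and would you have noticed without reading the code?
regions = {"A": {"fires": 10, "area": 100}, "B": {"fires": 6, "area": 30}}

by_count = sorted(regions, key=lambda r: regions[r]["fires"], reverse=True)
by_rate = sorted(regions, key=lambda r: regions[r]["fires"] / regions[r]["area"],
                 reverse=True)

print(by_count)  # ['A', 'B'] — A has more fires in total
print(by_rate)   # ['B', 'A'] — B has more fires per unit area
```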
Note: From Lecture 2. The key distinction: AI-generated code that builds software is testable: it either works or it doesn't. AI-generated code that processes data to inform decisions brings back all the data trust problems.

---

# Professional Practice

---

## Core vs peripheral competencies
Core (build skill here)
- Data cleaning and validation
- Formula logic and spreadsheet work
- Judging whether results make sense
- Understanding what the data represents

These require **your judgment**. AI cannot replace it.
Peripheral (AI can handle)
- Writing HTML and CSS
- Memorising git commands
- Producing visual polish
- Code syntax

These are mechanical. AI does them well.
Best AI use: ask **"what function would I use for this?"**, not "do it for me."

Note: From Day 4. As a data practitioner, knowing whether data looks right is essential. Knowing how to write HTML is not. AI teaches tool mechanics; the instructor teaches which tools matter and why. Recognising that index.html is the homepage of your website is enough; you don't need to write it yourself.

---

## Build good taste in data

If you rely on AI to do all the data work, you will never develop **intuition** for what right data looks like and what wrong data looks like.

- Sort the column. Scroll through. Does anything stand out?
- Check the min and max. Do they make sense?
- Count the categories. Are there too many?

This is why we do manual spreadsheet work, even when AI could do it faster. **You are building judgment**, not just producing output.

AI gets lazy with large files: it reads the first 50 rows carefully, assumes the rest are similar, and may make up or skip data afterwards. Verify values near the end of outputs, not just the top.

Note: From Day 4. "You need to build a good taste in data." The deepest reason for manual cleaning exercises. Also: AI laziness with large files is a concrete failure mode students should watch for.

---

# Quick Reference

---

## The pipeline at a glance
Define → Find → Get → Verify → Clean → Analyse → Present
| Step | Key question |
|---|---|
| **Define** | What specific question am I trying to answer? |
| **Find** | Where is the data that could answer it? |
| **Get** | How do I retrieve it into my workspace? |
| **Verify** | Is this data what I think it is? |
| **Clean** | Is it ready for analysis? |
| **Analyse** | What does the data say about my question? |
| **Present** | How do I communicate this to my audience? |
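The processing half of the pipeline can be sketched as a chain of small functions with a checkpoint at each boundary. This is illustrative only, reusing the wildfire figures from earlier; Define, Find, and Present stay with the human.

```python
# A toy end-to-end run of the pipeline, with a check after each step.
# Data and numbers are invented for illustration.

def get() -> list[dict]:
    """Get: retrieve the data into the workspace."""
    return [{"type": "rural", "n": 60}, {"type": "urban", "n": 20}]

def verify(rows: list[dict]) -> list[dict]:
    """Verify: is this data what we think it is?"""
    assert all({"type", "n"} <= set(r) for r in rows)
    return rows

def clean(rows: list[dict]) -> list[dict]:
    """Clean: standardise categories before analysis."""
    return [{**r, "type": r["type"].lower()} for r in rows]

def analyse(rows: list[dict]) -> float:
    """Analyse: answer the question defined at the start."""
    counts = {r["type"]: r["n"] for r in rows}
    return counts["rural"] / counts["urban"]

ratio = analyse(clean(verify(get())))
print(f"Rural wildfires were reported {ratio:.0f}x as often as urban ones.")
```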
Note: Reference slide. Students can return to this when they need a quick reminder of the pipeline steps and what each one asks.

---

## The prompt framework at a glance
{objective}What you want to achieve.{/objective}
{steps}What the AI should do, in what order.{/steps}
{output}What the result should look like.{/output}
{reasoning}Why: explain your approach.{/reasoning}
The fifth element, **human verification**, is how you will check the result. Design this before you start, not after.

Note: Reference slide. The colour coding matches the cli-highlight-prompt component used throughout the course slides. Students can refer to this when structuring their own prompts.
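A hypothetical prompt filled in with the same tags (the task, columns, and file name are invented for illustration):

```
{objective}Build a CSV of municipal wildfire reports, one row per incident.{/objective}
{steps}1. Open the source page I give you. 2. Extract the incidents table.
3. Save it as incidents.csv. Stop after each step and wait for my go-ahead.{/steps}
{output}A CSV with columns: date, municipality, type.{/output}
{reasoning}Stopping between steps lets me verify each output before you continue.{/reasoning}
```

The fifth element stays outside the prompt: you open incidents.csv and spot-check rows against the source before any analysis.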