ICT for Civic Data — Turin University 2025-26
# Enrichment
and Cleaning
Crash Course — Day 3
civicliteraci.es
--- # Day 3 Note: "Yesterday you defined your angle and started finding data. This morning you finish that work, then we learn how to enrich a map by combining datasets. This afternoon: cleaning messy data." --- ## Today's plan **Morning:** - Finish your individual data search and map from yesterday - Shared tutorial: enrich the flood map with health facility data from OpenStreetMap **Afternoon:** - Verification and cleaning exercise in Google Sheets using the FloodArchive data Note: Two distinct blocks. Morning is building on the map work. Afternoon switches to spreadsheet work on data quality. --- # Finish Your Map Note: "First, let's give you time to finish what you started yesterday. If you already have a layer on your map, keep improving it. If you're stuck, ask for help." --- ## Continue from yesterday Pick up where you left off: - If you have an angle but no data yet: **Find** a source, **Get** it, put it on the map - If you have data but it's not on the map yet: ask Gemini CLI to add it as a layer - If your map works: improve it (popups, colours, legend) or add a second dataset Use the **chatbot** for questions and discovery, the **CLI** for file work. Remember: ask for a **"simple Leaflet map"** to avoid the agent over-engineering. Note: Give 45-60 minutes for this. Circulate and help. Priority: everyone should have at least one layer on their map before we move to the enrichment exercise. Help stuck students first. 
--- ## Reminder: common issues and solutions - **Agent publishes to wrong branch?** Tell it "publish on the main branch, not gh-pages" — or change Settings → Pages to use the gh-pages branch - **Agent thinking too long?** Press **Escape** (maybe multiple times), then ask it why it's taking so long - **Map not updating?** Hard refresh: **Ctrl+Shift+R** (Windows) or **Cmd+Shift+R** (Mac) - **Agent fails to download a file?** Download it yourself and drag-and-drop it into the Codespace - **Agent builds something too complex?** Ask for a **"simple Leaflet map"** Note: Quick reminder from yesterday. Project this slide while students work so they can refer to it. --- # From Case Study
to Demonstration Note: "Now let's do something more ambitious together. But first, let's walk through the thinking process: how do we go from a case study idea to something we can actually build and show?" --- ## The case study Health facilities in flood-prone areas are at risk. When a flood hits, the facilities that should serve the community may themselves be damaged, understaffed, or inaccessible. This is a **concrete case study**: specific, grounded, observable. Note: This is the same structure as Day 2. Start from a concrete case, not an abstract idea. --- ## From case study to angle to proposal
The angle
Knowing which facilities are at risk helps with: - **Prevention**: strengthen facilities *before* the next event - **Staff safety**: contingency plans for health workers in flood zones - **Response**: knowing which facilities are safe supports evacuation and shelter decisions
The proposal
A **facility risk dashboard** that field staff can use on the go during response, showing which facilities are in flood zones and their status. Built on data. Grounded, realistic, approachable, impactful.
Note: Walk through the logic from Day 2: case study → angle → proposal. The case study is specific (facilities at risk). The angle is the broader theme (prevention, staff safety). The proposal is what you deliver (facility assessment using data). Same pattern students should follow for their own work. --- ## What do we build to demonstrate this? The proposal says: "we can help you identify at-risk facilities." The demonstration says: **here, look — we already did it for one country.** We will build a map that shows flood events and health facilities, with a filter to highlight facilities in flood risk zones. That map is the "show don't tell" artefact. Note: Connect back to Day 1: the strongest response demonstrates capacity rather than just describing it. The map with the filter is the tangible proof that the team can do the work. --- ## What we will build A map with: - Flood events from FloodArchive (filtered to one country) - Health facilities from OpenStreetMap - A **filter** (dropdown or checkbox) to show only facilities within 1km of a flood event To get there, we need to go through several pipeline steps. Let's walk through them. Note: Preview the end result so students know where they're going. The filter is the "shiny deliverable" that makes this feel like a real tool. --- ## Before we start: how to prompt for complex tasks By now you have used your AI tool for many tasks. Some worked well. Some didn't. When a task has multiple steps, the prompt needs **structure**. Five elements: 1. **Objective**: what are you trying to achieve? 2. **Steps**: what should the AI do, in what order? 3. **Reasoning**: why each step? (so the AI doesn't skip or improvise) 4. **Output shape**: what should the result look like? 5. **Human verification**: how will you check the result? A prompt without these elements is an invitation for the AI to fill in the blanks. 
Note: This is the task framework from Lecture 2, now applied after students have enough hands-on experience to appreciate it. They've seen what happens when they give vague prompts. Now they see the structured alternative. We will apply these five elements to each step of the exercise. --- ## Trust the code, not the reasoning
Trust
The AI's **technical capability**: it knows how to write code, query APIs, convert formats. Scripts are **deterministic**: the same script on the same data produces the same result every time.
Don't trust
The AI's **reasoning about data content**: it does not understand what the data means. Never ask the AI to directly modify the content of your data. Ask it to write a script that does it. You review the script.
Note: Core principle. The AI can write a script to filter rows by country (deterministic, verifiable). But if you ask it to "clean the data" directly, it may change values based on its own assumptions. If the AI directly modifies a file with 5,000 rows, are you going to check every row by hand? That's why scripts matter. --- ## AI gets lazy with large files When processing large datasets, AI models tend to: - Read the first 50 rows carefully - **Assume** the rest is similar - Make up or skip data after that This is why we break work into steps and verify at each stage. If the AI handles your whole dataset in one pass, the end of the file is less reliable than the beginning. Note: This is a concrete failure mode students should watch for. If a script processes 5,000 rows, check values near the end of the output, not just the top. If the AI summarises a dataset, it may miss patterns that only appear in later rows. --- ## Step 1: Get the flood data for one country If you don't already have the FloodArchive file, upload it to your repo. Ask Gemini CLI to **write a script** that filters it to your chosen country and saves the result.
{objective}Write a script that reads FloodArchive_clustered.csv and keeps only the rows where Country is [your country].{/objective} {output}Save the result as a new CSV file.{/output} {reasoning}I'm a beginner, please explain what you're doing in simple terms.{/reasoning}
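For reference, the script the agent produces for this step might look roughly like the sketch below (Python standard library only; the output filename `floods_filtered.csv` and the example country `"Kenya"` are placeholders, not names from the course material):

```python
import csv

def filter_country(in_path, out_path, country):
    """Keep only the rows whose Country column matches the given name."""
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        kept = 0
        for row in reader:
            # strip() guards against stray whitespace; exact matching still
            # misses spelling variants, so check the row count afterwards
            if row["Country"].strip() == country:
                writer.writerow(row)
                kept += 1
    return kept

if __name__ == "__main__":
    n = filter_country("FloodArchive_clustered.csv", "floods_filtered.csv", "Kenya")
    print(f"Kept {n} rows")
```

Whatever the agent writes, read it before approving and compare the reported row count with what you expect: exact string matching silently drops rows where the country name is spelled differently.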
This is a **Clean** step: reducing the dataset to what you need. Note: Simple first step to get everyone started. The script is a cleaning operation: filtering. Students should check the output: how many rows? Does the country name match exactly (watch for the spelling variants we saw in the data)? --- ## Step 2: Create a map for your country Ask Gemini CLI to build a simple Leaflet map from the filtered data.
{objective}Create a simple Leaflet map that can be published on GitHub Pages.{/objective} {output}Display the flood events from [your filtered CSV] as circles. Add popups with date, cause, and severity. Center the map on [your country].{/output} {steps}Publish it on GitHub Pages from the main branch.{/steps} {reasoning}I'm a beginner, please explain what you're doing in simple terms.{/reasoning}
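Eyeballing the live map is the main check here, but a tiny script gives you numbers to compare against it. This sketch reports the row count and bounding box of the filtered file (the column names `lat` and `long` are assumptions about the FloodArchive export; adjust them to the real headers in your CSV):

```python
import csv

def coordinate_summary(path, lat_col="lat", lon_col="long"):
    """Row count and bounding box of a CSV with coordinate columns."""
    lats, lons = [], []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            lats.append(float(row[lat_col]))
            lons.append(float(row[lon_col]))
    return {
        "rows": len(lats),
        "lat_range": (min(lats), max(lats)),
        "lon_range": (min(lons), max(lons)),
    }

if __name__ == "__main__":
    # do the ranges fall inside your country's bounding box?
    print(coordinate_summary("floods_filtered.csv"))
```

If the latitude or longitude range stretches far outside your country, some rows were mis-filtered or mis-coded, and the map will show stray points.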
Push and check the live page. This is a **Verify** step: does the data look right on the map? Note: Emphasise "simple Leaflet map that can be published on GitHub Pages" to prevent over-engineering. Students should check: are the points in the right country? Does the count seem right? Any obvious outliers? --- ## Step 3: Get health facilities from OpenStreetMap OpenStreetMap has health facility locations worldwide via the **Overpass API**: - **No registration, no API key** - Queries can take **30+ seconds** for larger countries. That is normal.
{objective}Query the Overpass API for all hospitals and clinics in [your country].{/objective} {output}Save the result as a GeoJSON file.{/output} {reasoning}I'm a beginner, please explain what you're doing in simple terms.{/reasoning}
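Under the hood, the agent's script will do something like the sketch below: build an Overpass QL query, POST it to a public Overpass endpoint, and convert the response to GeoJSON. The endpoint URL and output filename here are common defaults, not fixed requirements:

```python
import json
import urllib.parse
import urllib.request

OVERPASS_URL = "https://overpass-api.de/api/interpreter"

def build_query(iso_code):
    """Overpass QL: hospitals and clinics inside a country, identified
    by its ISO 3166-1 alpha-2 code (e.g. "KE" for Kenya)."""
    return f"""
    [out:json][timeout:120];
    area["ISO3166-1"="{iso_code}"][admin_level=2]->.country;
    (
      node["amenity"~"^(hospital|clinic)$"](area.country);
      way["amenity"~"^(hospital|clinic)$"](area.country);
    );
    out center;
    """

def to_geojson(elements):
    """Convert Overpass elements to GeoJSON points (ways use their centre)."""
    features = []
    for el in elements:
        lon = el.get("lon", el.get("center", {}).get("lon"))
        lat = el.get("lat", el.get("center", {}).get("lat"))
        if lon is None or lat is None:
            continue
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": el.get("tags", {}),
        })
    return {"type": "FeatureCollection", "features": features}

if __name__ == "__main__":
    data = urllib.parse.urlencode({"data": build_query("KE")}).encode()
    with urllib.request.urlopen(OVERPASS_URL, data=data, timeout=180) as resp:
        result = json.load(resp)
    geojson = to_geojson(result["elements"])
    with open("health_facilities.geojson", "w", encoding="utf-8") as f:
        json.dump(geojson, f)
    print(f"Saved {len(geojson['features'])} facilities")
```

You describe the query in plain language and the agent writes the Overpass QL; this sketch just shows why the request can take a while, as the server has to resolve the country area before searching inside it.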
This is a **Get** step. Check: how many facilities? Does the number seem reasonable? Note: Overpass is the query engine for OpenStreetMap. The query syntax is complex, but Gemini writes it. Students just describe what they want. Kenya has ~1,800 facilities in OSM. A country with only 20 results likely means poor OSM coverage rather than few real facilities. --- ## Step 4: Add facilities to the map
{objective}Add the health facilities from [your GeoJSON file] as a new layer on my Leaflet map.{/objective} {output}Use a different colour from the flood events. Add popups with the facility name and type.{/output} {steps}Push and publish on GitHub Pages from the main branch.{/steps}
Push and check. This is another **Verify** step: do facilities appear where you expect them? Do flood events and facilities overlap in some areas? Note: Two datasets on the same map. The visual overlap is what we're looking for: areas where both floods happen and health facilities exist. --- ## Step 5: Separate at-risk facilities Now the key step. Ask Gemini to **write a script** that identifies which facilities are near flood events.
{objective}Write a script that takes the health facilities GeoJSON and the filtered flood CSV.{/objective} {steps}For each facility, check if any flood event occurred within 1km.{/steps} {output}Produce two output files: one with facilities within 1km of a flood event, one with the rest.{/output} {reasoning}Explain your approach before running the script.{/reasoning}
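The core of the script the agent is likely to produce is a haversine distance check, roughly like this sketch (the 1 km radius matches the prompt; reading your actual files and writing the two outputs is left to the agent's script):

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def split_by_risk(facilities, flood_points, radius_km=1.0):
    """Partition GeoJSON features into (at_risk, safe) lists.
    flood_points is a list of (lat, lon) tuples from the filtered CSV."""
    at_risk, safe = [], []
    for feature in facilities["features"]:
        lon, lat = feature["geometry"]["coordinates"]  # GeoJSON order is lon, lat
        near = any(haversine_km(lat, lon, f_lat, f_lon) <= radius_km
                   for f_lat, f_lon in flood_points)
        (at_risk if near else safe).append(feature)
    return at_risk, safe
```

When the agent explains its approach, this is the shape to look for: a distance function you can verify, applied to every facility, with the two outputs together covering the whole input.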
Note: This is where the five-element prompt structure pays off. The objective is clear (separate at-risk from not-at-risk). The output shape is defined (two files). The verification is built in (explain before running). The agent will likely use haversine distance. Students should read the explanation and confirm it makes sense before approving. --- ## Step 6: Add a filter to the map The final step: an interactive filter.
{objective}Update my Leaflet map to load both facility files (at-risk and not-at-risk).{/objective} {output}Show at-risk facilities in red and others in blue. Add a checkbox or dropdown filter that lets the user show only at-risk facilities.{/output} {steps}Publish on GitHub Pages from the main branch.{/steps} {reasoning}Keep it simple and explain your choices.{/reasoning}
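One quick sanity check before trusting the filter: the two files from Step 5 should add up to the full facility list. A short sketch (the three filenames are hypothetical; use whatever your script actually produced):

```python
import json

def feature_count(path):
    """Number of features in a GeoJSON FeatureCollection file."""
    with open(path, encoding="utf-8") as f:
        return len(json.load(f)["features"])

if __name__ == "__main__":
    # hypothetical filenames; substitute the real ones from Step 5
    total = feature_count("health_facilities.geojson")
    at_risk = feature_count("facilities_at_risk.geojson")
    safe = feature_count("facilities_safe.geojson")
    assert at_risk + safe == total, f"{at_risk} + {safe} != {total}"
    print(f"{at_risk} at risk, {safe} safe, {total} total")
```

If the counts don't add up, some facilities were dropped or duplicated in Step 5, and the map will quietly show the wrong picture.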
Push and check. You now have an interactive map that answers the question: **which health facilities are in flood risk zones?** Note: The filter is the deliverable. It turns the map from a visualisation into a tool. A funder can look at this and immediately see which facilities need attention. This is "show don't tell" in action. --- ## Exercise: build the enriched map **Individual work — follow the six steps above** Work through each step in order. At each step: - Read what the agent proposes before approving - Check the output before moving on - If something goes wrong, go back one step The goal is the interactive map with the filter. But getting through steps 1-4 is already valuable. Note: Main exercise. Circulate and help. Some students will get through all six steps. Some will get stuck at step 3 or 5. Both are fine. The important thing is they experience the structured approach: each step has a clear input, output, and verification. --- ## The five elements in practice Look at what we just did: | Element | How we used it | |---|---| | **Objective** | "Which facilities are in flood risk zones?" | | **Steps** | Filter → Map → Get facilities → Add layer → Separate → Filter | | **Reasoning** | Each step has a clear pipeline role (Get, Clean, Verify) | | **Output shape** | Two GeoJSON files + an interactive map with filter | | **Verification** | Check the map at every step. "Explain your approach" before scripts. | When you give the AI all five, the result is **structured, verifiable, and documented**. Note: Connect back to the prompt structure slide. The five elements are not abstract theory: students just used them. Each step had a clear objective, a defined output, and a verification method. This is the discipline that makes AI-driven work trustworthy. --- ## Build data intuition before visualising Before asking the AI to make charts or maps, **look at the data yourself** in a spreadsheet. - Sort columns. Scroll through. What stands out? 
- What are the min and max values? Do they make sense? - How many unique values are in each column? This is how you develop intuition: "oh, I could visualise it like this." The intuition comes from **seeing the data**, not from asking the AI what to do with it. Note: Reinforce the habit from Day 2. At no point did we ask the AI to analyse the data for us. The AI handles technical tasks. The analytical judgment comes from you looking at the data and understanding what matters. --- ## When data sources disagree You may find two sources that show **different numbers** for the same thing. Be careful: they may have been built with **a different objective in mind**. Different collection methods, different definitions, different time periods. They may **both be true**, even though they show a different reality. Document the discrepancy and explain which source you chose and why. Note: This comes up when combining flood data from different sources, or comparing health facility counts between OSM and government registries. Both can be valid. The key is acknowledging and explaining the difference, not pretending it doesn't exist. --- ## What we covered this morning - Finished individual map work from yesterday - Learned the five-element prompt structure for complex AI tasks - Built an enriched map: flood events + health facilities + interactive filter - Practised Get, Clean, and Verify as separate steps with clear outputs **This afternoon:** we look at the FloodArchive data up close and clean it in Google Sheets. Note: Quick recap before lunch.