CUSP London Seminar – Dani Arribas-Bel

Thursday 24th January 2019

Article by Jon Reades

This past Thursday we were really lucky to catch Dani Arribas-Bel, Senior Lecturer in Geographic Data Science at the University of Liverpool and major contributor to PySAL, on his way back home following two weeks’ teaching in the Caribbean. Dani kindly agreed to give a talk in two parts on “Infusing Urban and Regional analysis with Geographic Data Science” (‘GDS’) which we will summarise below… As one of the first CUSP London-branded seminars, it was great to see so many Urban Informatics staff and students there (and even a few from UCL’s CASA!).

Geography & Computers

The first half of Dani’s talk covered highlights from a recently-published paper in Geography Compass titled “Geography & Computers: Past, Present, and Future” (an author pre-print is available via KCL’s Institutional Repository); in it, Dani and KCL’s Jon Reades link shifts in computing power and access to shifts in the ways in which geographers use computers to ‘do’ geography.

The basic contention is that there have been three waves of change that they (we) summarise as: 1) a computer in every institution (50s–70s); 2) a computer in every office (80s–00s); 3) a computer in every thing (10s–). We don’t need to revisit the article in full here since highlights are available in a previous blog post, but Dani’s focus was on the links to ‘data science’, the ‘sexiest job of the 21st century’.

This led through to a discussion of ‘data-driven methods’ which, to a geographer, can sound like putting the cart before the horse. However, it’s important to keep in mind that we, as researchers, have little to no control over how the kinds of data underpinning a (geographic) data science are created and therefore need to adapt our approach to the data, and not the other way around.

I particularly appreciated Dani’s observation on the importance of data processing/handling as part of this shift: sometimes dismissed as ‘mere cleaning’, this stage is critical to ensuring that the data is both well-understood (shows what we think it shows) and fit-for-purpose (does what we want it to do).

I’ve seen the term ‘feature engineering’ pop up in my own news feeds with increasing regularity, and it has a nice ring to it (it’s engineering, not cleaning!), but it doesn’t quite capture the full scope of what good data science really entails. Nor does it take into account the ‘baking’ of geo-data that is really required to ensure methods and models are appropriate.

Dani wrapped up this section with a discussion of how GDS can serve as the interface between geographers and data scientists, supporting the co-production of systems (a.k.a. tools), methods (spatially aware ML), and epistemologies (ways of knowing that are appropriate to these types of data).

Applications of Geographic Data Science

The second half of Dani’s talk covered a work-in-progress using a large building data set from Spain to delineate urban and employment boundaries. This nicely illustrated one of the key concepts elaborated in the first half of the talk: the importance of data-driven methods in geographical data science.

The question Dani and his co-authors are exploring is how one can meaningfully delimit the spatial extent of urban areas and economic activity with the minimum number of prior assumptions about spatial configuration or ‘auxiliary geographies’; by this we mean using other steps or data, such as rasterisation or regional boundaries, to constrain the process to our preconceived notions of what the answer ‘should be’.

The issues with rasterisation and the MAUP are well-known, but what do you do when you have 15 million data points to cluster and can no longer load the data set into memory? This is what we mean by data-driven methods: Dani’s exciting addition (which prompted a good deal of questioning from the audience) makes an existing algorithm work in a large-data context, and it does so in a way that works around what I feel is an important conceptual flaw in that algorithm, giving you insights into the robustness of your results!

Such a method is not without theory, nor without empirical input: Dani and his colleagues use research findings on commuting distances and employment to provide essential parameters. I’m not able to share additional details at this stage, but I’m really looking forward to seeing this algorithm ‘in the wild’ since it addresses a number of issues that I have with some work that I’m (slowly) undertaking…
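To give a flavour of the kind of clustering involved, here is a minimal sketch using scikit-learn’s standard DBSCAN on synthetic building coordinates. To be clear: this is not Dani’s algorithm (the details aren’t public, and his contribution is precisely about going beyond the off-the-shelf version at scale); the point is simply that dense settlements emerge from the points themselves, with the distance parameter as the place where empirical findings, such as typical commuting distances, could enter.

```python
# Hypothetical illustration only -- NOT the method from the talk.
# Density-based delineation of 'urban' clusters from building points,
# using scikit-learn's DBSCAN on synthetic coordinates (in km).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)
town_a = rng.normal(loc=(0, 0), scale=0.5, size=(500, 2))    # dense town
town_b = rng.normal(loc=(10, 10), scale=0.5, size=(500, 2))  # dense town
rural = rng.uniform(low=-5, high=15, size=(100, 2))          # scattered buildings
buildings = np.vstack([town_a, town_b, rural])

# eps is where empirical input could live: a threshold informed by
# research on commuting or walking distances, not an arbitrary choice.
labels = DBSCAN(eps=1.0, min_samples=20).fit_predict(buildings)

# Label -1 marks 'noise' (buildings outside any dense settlement)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)  # the two dense towns emerge as clusters
```

No rasterisation and no administrative boundaries are involved: the urban extents fall out of the point pattern and two parameters, which is the appeal of the data-driven approach even before Dani’s large-data and robustness extensions come into play.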