GeoCUP: supporting a flexible student computing environment

Over the past year, we’ve been supporting our first cohort of Geocomputation & Spatial Analysis (GSA) students as they learn to code and work with geo-data in an open computing context (predominantly FOSS). This post reflects on some of the problems – and solutions – that emerged as a result.

GeoCUP.v1

The first incarnation of GeoCUP (short for GeoComputation on a USB Platform) was a system-on-a-key described in a previous post. With the support of the Department and Faculty, USB keys were supplied to students at the start of term as follows:

  • 64GB USB 3.0 keys
  • Ubuntu Linux 14 LTS release (32-bit)
  • Pre-installed software:
    • R
    • QGIS
    • Canopy
    • Assortment of specified Python libs
    • Mozilla Firefox
    • Dropbox

The idea was that students could launch GeoCUP at boot time on a cluster machine from the USB key and would thus be running a full Linux distribution over which they had complete control. In an institutional computing context this was as close as we could get to giving them their own computer to play with, break, and manage.

We had also expected, based on what we’d seen with Linux ‘Live’ distributions, that it would be feasible to have a key that would work with multiple types of firmware (including Apple’s EFI) and that students could therefore also run GeoCUP at home.

A final advantage would be the ease of replacing a lost key: since all their code was in Dropbox all they needed to do was reconnect Dropbox on a replacement key and they’d be up and running again in no time.

Unexpected Issues

No well-laid plan survives much contact with the real world, and several issues emerged in the run-up to launch day:

  1. It is not (yet?) possible to have a full Linux distribution (as opposed to an essentially static ‘live’ distribution) that will start up at boot time on both Macs and PCs. Indeed, there are also issues with different vendors’ PC hardware being different enough from the machine on which GeoCUP.v1 was developed for this facility to be patchy, at best, on generic PCs as well. So portability proved to be rather more limited than we’d expected and hoped.
  2. Formatting the keys took much longer than expected. Since the keys needed to be bootable, the only way to write them was with the ‘disk duplication’ utility, dd. However, dd cannot distinguish between used space and largely empty space because it blindly copies the entire disk, so even though only about 20GB of the 64GB was in actual use, each key took about 5 hours to write. We were able to write up to 7 keys at once by combining dd with tee, as follows:
    dd if=/Volumes/GeoCUP/geocup-20150917.bak/backup bs=524288 \
      | sudo tee /dev/disk3 ... /dev/disk9 > /dev/null

    We’d also note that using dd meant that we could only use 64GB USB keys, so if students lost a key and needed to replace it, they had to source exactly the same-sized key.
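
The write times above can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below is illustrative only: it assumes decimal gigabytes (1GB = 10^9 bytes) and a constant per-key write rate, neither of which was measured.

```python
# Rough arithmetic on the dd write times reported above.
# Assumptions (not measurements): 1 GB = 10**9 bytes, and the
# per-key write rate is constant for the whole copy.

GB = 10**9

key_size_bytes = 64 * GB   # dd copies the whole 64GB device
used_bytes = 20 * GB       # roughly what was actually in use
write_hours = 5            # reported time to write one key

# Effective per-key write rate implied by "64GB in about 5 hours".
rate_mb_per_s = key_size_bytes / (write_hours * 3600) / 10**6

# Time a copy of only the used data would take at that same rate.
used_only_hours = used_bytes / key_size_bytes * write_hours

print(f"Implied write rate: {rate_mb_per_s:.1f} MB/s per key")
print(f"Copying only the used 20GB: about {used_only_hours:.1f} hours")
```

Under those assumptions each key was being written at only a few megabytes per second, and roughly two thirds of the 5 hours went on copying space that held no data.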

These start-up issues were then supplemented by performance issues after roll-out:

  1. Hardware buffering was much worse than expected. We had, naively, assumed that USB3 would provide sufficient bandwidth for our purposes and that reads and writes would be fairly modest. We were wrong: the system frequently blocked completely for up to 10–12 seconds while data was written to or read from the USB key, and the entire Linux UI became unresponsive… which was rather frustrating for the students.
  2. In addition, the sustained I/O load of a full Linux distribution tended to expose any physical weaknesses in the flash devices, so we had to re-flash probably 10–20% of the students’ keys over the course of the year.
  3. These performance issues then led some students to begin using their own laptops running OSX or various flavours of Windows instead. Because platform support for some geodata and spatial analysis libraries is limited, this produced a proliferation of students using the wrong Python libraries.
  4. All of this was compounded by the fact that some students remembered to run
    sudo apt-get update

    on a regular basis (and then installing the upgrades it made available), while others didn’t. So we even ended up with different versions of libraries on GeoCUP itself, and that led to code that would fail to run on one system but have no issues on another.

  5. A final ‘nail in the coffin’ of GeoCUP.v1 was the fact that one of our Ubuntu repositories was accidentally pointing at a development repository, not the stable one, and so one of the updates knocked out most of QGIS’ modelling functionality!

These were all serious issues, but in spite of them a number of students reported that using GeoCUP had nonetheless helped them in the module: it gave them full control of their system, exposed them to power-user features such as the bash shell, and opened their eyes to some of the practical problems entailed in managing a system and a codebase. They also got to watch us doing some fairly frenetic on-the-fly debugging.

So with that in mind…

GeoCUP.v2

Part way through the year we began to experiment with Oracle’s VirtualBox platform as a way to enable students to run GeoCUP on their own computers (as that had signally not happened with GeoCUP.v1). Although there are higher-performance virtualisation platforms out there, VirtualBox is free, open source software so there were no licensing or cost implications to rolling this out on cluster systems or in suggesting that students download it to their personal computer.

GeoCUP.v2 is built as follows:

  • Ubuntu Linux 16 LTS (64-bit)
  • Anaconda Python
  • Rodeo & Atom IDEs
  • Dropbox
  • Google Chrome
  • QGIS

We’ve adapted installation scripts posted by Dani, up at Liverpool University, for use with our own GeoCUP distribution, since this speeds up the configuration and updating of the system as new Ubuntu distributions are released. You can find them on GitHub: github.com/jreades/GeoCUP-Vagrant.

The main advantages of this shift are:

  1. The VDI (Virtual Disk Image) file is decoupled from the physical storage media, so as long as the image fits on the device then students can bring in whatever hardware they like (hard drive, flash drive, personal computer…) and run GeoCUP from that hardware.
  2. The VDI file is smaller and copying to new hardware uses the normal file copying mechanisms so ‘installation’ is also radically faster (we also only copy 20GB of data, instead of 64GB).
  3. By ditching Canopy for Anaconda we can also ‘fix’ the Python libraries using a configuration file so as to avoid last-minute problems caused by the release of new versions. We can then update those libraries to new, stable versions by distributing an upgrade script to the students rather than relying on manually-typed commands.
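
The idea of ‘fixing’ library versions from a configuration file can be sketched in a few lines of Python. This is an illustration only, not the GeoCUP scripts themselves: the conda-style `name=version` pin format, the package names and versions, and the helpers `parse_pins`, `installed_version` and `check_pins` are all assumptions made for the example.

```python
# Sketch: compare installed library versions against a pinned list.
# The pin format, package names and helper names are hypothetical.
from importlib import metadata


def parse_pins(text):
    """Parse 'name=version' lines, skipping blanks and '#' comments."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, _, version = line.partition("=")
        pins[name.strip().lower()] = version.strip()
    return pins


def installed_version(name):
    """Return the installed version of a package, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Return {name: (wanted, found)} for every package that differs."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    example_pins = """
    # hypothetical pins for a teaching environment
    numpy=1.11.1
    pandas=0.18.1
    """
    for name, (wanted, found) in check_pins(parse_pins(example_pins)).items():
        print(f"{name}: pinned {wanted}, found {found or 'not installed'}")
```

A check along these lines, run at the start of a practical, would flag a machine whose libraries have drifted from the pinned set before the drift shows up as code that fails on one system but not another.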

Alongside this, however, we retain the flexibility to give students administrator rights over their (virtual) machine, to install new software on the fly, and to take advantage of software updates without having to embed them in a centralised IT upgrade cycle. We also think that the virtualisation approach has significant advantages for IT services because they don’t have to monkey about with the BIOS of the cluster machines since the entire process is now software-based.

GeoCUP.v3 & Beyond

In the long run we’d like to automate even more of the distribution process so that we are no longer even responsible for ‘burning’ new USB keys or giving students a drive from which to copy the latest version of GeoCUP.

Tools that enable just this sort of approach are beginning to surface: Vagrant and Docker are the two leading contenders at the moment, though they do slightly different things. I’ve been impressed by the way that Dani’s Vagrant-based distribution allows you to download a 2GB file containing a full Linux server distribution, have it automatically configured when it first runs, and then interact with the system via Jupyter Notebooks: it’s a fairly lightweight, but fully-functional Python-based geodata analytics ‘server’.

There are several problems with using this approach in our context:

  1. I’ve had a lot of problems getting Vagrant to also run in a ‘headed’ context, and since we want students to use the latest versions of QGIS as well as unsupported (by IT Services) IDEs such as Rodeo or Atom, we can’t drop the Linux desktop entirely and just run the notebook server.
  2. We can’t have students downloading even a 2GB file on to the cluster machines since a) they have nowhere to keep it in their allocated 200MB of online storage, and b) multiplying that 2GB overhead by 30 students is suddenly quite a big ‘hit’ to the network at the start of every class.
  3. We also can’t run Jupyter on a server somewhere on campus since, as I understand it, every user runs with the same permissions as Jupyter and there’s no separation of user spaces.

I suspect that these issues will be remedied in the not-too-distant future, and James and I will be exploring some of the possibilities with colleagues at ASU and UNSW over the coming year.

Finally, a h/t to Ryan Barnes, one of our own Geography grads, who did the heavy lifting on version 1 of GeoCUP.