GeoCUP: supporting a flexible student computing environment

Over the past year, we’ve been supporting our first cohort of Geocomputation & Spatial Analysis (GSA) students as they learn to code and work with geo-data in an open computing context (predominantly FOSS). This post reflects on some of the problems – and solutions – that emerged as a result.

GeoCUP.v1

The first incarnation of GeoCUP (short for GeoComputation on a USB Platform) was a system-on-a-key described in a previous post. With the support of the Department and Faculty, USB keys were supplied to students at the start of term as follows:

  • 64GB USB 3.0 keys
  • Ubuntu Linux 14 LTS release (32-bit)
  • Pre-installed software:
    • R
    • QGIS
    • Canopy
    • Assortment of specified Python libs
    • Mozilla Firefox
    • Dropbox

The idea was that students could launch GeoCUP at boot time on a cluster machine from the USB key and would thus be running a full Linux distribution over which they had complete control. In an institutional computing context this was as close as we could get to giving them their own computer to play with, break, and manage.

We had also expected, based on what we’d seen with Linux ‘Live’ distributions, that it would be feasible to have a key that would work with multiple types of firmware (including Apple’s EFI), and that students could therefore also run GeoCUP at home.

A final advantage would be the ease of replacing a lost key: since all their code was in Dropbox, all they needed to do was reconnect Dropbox on a replacement key and they’d be up and running again in no time.

Ubuntu Screen Grab

Unexpected Issues

No well-laid plan survives much contact with the real world, and several issues emerged in the run-up to launch day:

  1. It is not (yet?) possible to have a full Linux distribution (as opposed to an essentially static ‘live’ distribution) that will start up at boot time on both Macs and PCs. Indeed, different vendors’ PC hardware differed enough from the machine on which GeoCUP.v1 was developed that booting was patchy, at best, on generic PCs as well. So portability proved to be rather more limited than we’d expected and hoped.
  2. Formatting the keys took much longer than expected. Since the keys needed to be bootable, the only way to write them was with the ‘disk duplication’ utility, dd; however, dd cannot distinguish between used space and largely empty space since it blindly copies the entire disk. So even though only about 20GB of the 64GB was in actual use, each key took about 5 hours to write. We were able to write up to 7 keys at once by combining dd with tee as follows:
    dd if=/Volumes/GeoCUP/geocup-20150917.bak/backup bs=524288 \
      | sudo tee /dev/disk3 ... /dev/disk9 > /dev/null

    We’d also note that using dd meant that we could only use 64GB USB keys, so if students lost a key and needed to replace it, they had to source exactly the same-sized key.
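
The dd-and-tee fan-out can be tried safely with ordinary files standing in for the USB devices. This is only a minimal sketch under that assumption: the temporary directory, the file names and the 1MiB stand-in image are all hypothetical, and the cmp check at the end is an extra verification step rather than something we describe doing above.

```shell
# Sketch: read an image once with dd and let tee write it to several targets.
# Ordinary files stand in for /dev/diskN; every name here is hypothetical.
workdir=$(mktemp -d)
image="$workdir/master.img"

# Build a small 1 MiB stand-in for the master image.
dd if=/dev/urandom of="$image" bs=1024 count=1024 2>/dev/null

# One read of the image; tee copies the same byte stream to every target.
dd if="$image" bs=524288 2>/dev/null \
  | tee "$workdir/key1.img" "$workdir/key2.img" "$workdir/key3.img" > /dev/null

# cmp -s exits non-zero if a copy differs from the image, so count the matches.
verified=0
for copy in "$workdir"/key*.img; do
  cmp -s "$image" "$copy" && verified=$((verified + 1))
done
echo "verified copies: $verified"

rm -rf "$workdir"
```

Because tee receives the whole stream, every target still gets all of the image, empty space included, which is why the write time depends on the size of the key rather than on how much of it is in use.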

These start-up issues were then supplemented by performance issues after roll-out:

  1. Hardware buffering was much worse than expected. We had, naively, assumed that USB3 would provide sufficient bandwidth for our purposes and that reads and writes would be fairly modest. We were wrong: the system frequently blocked completely for up to 10-12 seconds while data was written to or read from the USB key, and the entire Linux UI became unresponsive… which was rather frustrating for the students.
  2. In addition, the sustained I/O load of a full Linux distribution tended to expose any physical weaknesses in the flash devices, so we had to re-flash probably 10–20% of the students’ keys over the course of the year.
  3. These performance issues then led some students to switch to their own laptops running OS X or various flavours of Windows instead. Because platform support for some geodata and spatial analysis libraries is limited, this produced a proliferation of students using the wrong Python libraries.
  4. All of this was compounded by the fact that some students were remembering to run
    sudo apt-get update

    on a regular basis, while others weren’t. So we even ended up with different versions of libraries on GeoCUP itself, which led to code that would fail to run on one system but have no issues on another.

  5. A final ‘nail in the coffin’ of GeoCUP.v1 was the fact that one of our Ubuntu repositories was accidentally pointing at a development repository, not the stable one, and so one of the updates knocked out most of QGIS’ modelling functionality!
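
Version drift of the kind described in item 4 can be spotted by comparing package manifests from two systems. The sketch below assumes each manifest is a plain text file with one "name version" pair per line; the report_drift helper, the file names and the version numbers are all hypothetical and purely illustrative.

```shell
# Print every package that appears in both manifests with a different version.
# Each manifest holds one "name version" pair per line.
report_drift() {
  awk 'NR == FNR { version[$1] = $2; next }
       ($1 in version) && (version[$1] "") != ($2 "") {
         print $1 ": " version[$1] " vs " $2
       }' "$1" "$2"
}

workdir=$(mktemp -d)

# Hypothetical manifests from two students' systems. On Ubuntu, something like
#   dpkg-query -W -f='${Package} ${Version}\n'
# could be used to generate a real one.
cat > "$workdir/student_a.txt" <<'EOF'
numpy 1.10.4
pandas 0.17.1
qgis 2.12.0
EOF

cat > "$workdir/student_b.txt" <<'EOF'
numpy 1.10.4
pandas 0.18.0
qgis 2.14.0
EOF

drift=$(report_drift "$workdir/student_a.txt" "$workdir/student_b.txt")
echo "$drift"

rm -rf "$workdir"
```

An empty result would mean the two systems agree on every package they share; anything printed is a candidate explanation for code that runs on one key but not on another.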

These were all serious issues, but in spite of them a number of students reported that using GeoCUP had nonetheless helped them with the module, as it gave them full control of their system, exposed them to power-user features such as the bash shell, and opened their eyes to some of the practical problems entailed in managing a system and a codebase. They also got to watch us doing some fairly frenetic on-the-fly debugging.

So with that in mind…

GeoCUP.v2

Part way through the year we began to experiment with Oracle’s VirtualBox platform as a way to enable students to run GeoCUP on their own computers (as that had signally not happened with GeoCUP.v1). Although there are higher-performance virtualisation platforms out there, VirtualBox is free, open-source software, so there were no licensing or cost implications to rolling it out on cluster systems or to suggesting that students download it to their personal computers.

GeoCUP.v2 is built as follows:

  • Ubuntu Linux 16 LTS (64-bit)
  • Anaconda Python
  • Rodeo & Atom IDEs
  • Dropbox
  • Google Chrome
  • QGIS

We’ve adapted installation scripts posted by Dani, up at Liverpool University, for use with our own GeoCUP distribution since this speeds up the configuration and updating of the system as new Ubuntu distributions are released. You can find them on GitHub: github.com/jreades/GeoCUP-Vagrant.

The main advantages of this shift are:

  1. The VDI (Virtual Disk Image) file is decoupled from the physical storage media, so as long as the image fits on the device then students can bring in whatever hardware they like (hard drive, flash drive, personal computer…) and run GeoCUP from that hardware.
  2. The VDI file is smaller and copying to new hardware uses the normal file copying mechanisms so ‘installation’ is also radically faster (we also only copy 20GB of data, instead of 64GB).
  3. By ditching Canopy for Anaconda we can also ‘fix’ the Python libraries using a configuration file so as to avoid last-minute problems caused by the release of new versions. We can then update those libraries to new, stable versions by distributing an upgrade script to the students rather than relying on manually-typed commands.
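
As an illustration of item 3, one way to ‘fix’ library versions with Anaconda is a conda environment file with explicit pins. This is a sketch rather than the actual GeoCUP configuration: the environment name, the package list, the version numbers and both helper functions are hypothetical.

```shell
# Write a conda environment file with pinned versions to the given path.
# The environment name, packages and versions are illustrative only.
write_pinned_env() {
  cat > "$1" <<'EOF'
name: geocup
dependencies:
  - numpy=1.11.1
  - pandas=0.18.1
  - geopandas=0.2.1
EOF
}

# A hypothetical 'upgrade script' step: bring an existing environment into
# line with the file. Defined here but not called, since it needs conda.
apply_pinned_env() {
  conda env update --file "$1" --prune
}

workdir=$(mktemp -d)
env_file="$workdir/environment.yml"
write_pinned_env "$env_file"
cat "$env_file"
```

Moving to new, stable versions would then mean editing the pins in one place and redistributing the file, for example by running apply_pinned_env "$env_file" on each student's machine.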

Alongside this, however, we retain the flexibility to give students administrator rights over their (virtual) machine, to install new software on the fly, and to take advantage of software updates without having to embed them in a centralised IT upgrade cycle. We also think that the virtualisation approach has significant advantages for IT services because they don’t have to monkey about with the BIOS of the cluster machines since the entire process is now software-based.

GeoCUP.v3 & Beyond

In the long run we’d like to automate even more of the distribution process so that we are no longer even responsible for ‘burning’ new USB keys or giving students a drive from which to copy the latest version of GeoCUP.

Tools that enable just this sort of approach are beginning to surface: Vagrant and Docker are the two leading contenders at the moment, though they do slightly different things. I’ve been impressed by the way that Dani’s Vagrant-based distribution allows you to download a 2GB file containing a full Linux server distribution, have it automatically configured when it first runs, and then interact with the system via Jupyter Notebooks: it’s a fairly lightweight, but fully-functional Python-based geodata analytics ‘server’.

There are several problems with using this approach in our context:

  1. I’ve had a lot of problems getting Vagrant to also run in a ‘headed’ context, and since we want students to use the latest versions of QGIS as well as unsupported (by IT Services) IDEs such as Rodeo or Atom, we can’t drop the Linux desktop entirely and just run the notebook server.
  2. We can’t have students downloading even a 2GB file on to the cluster machines since a) they have nowhere to keep it in their allocated 200MB of online storage, and b) multiplying that 2GB overhead by 30 students is suddenly quite a big ‘hit’ to the network at the start of every class.
  3. We also can’t run Jupyter on a server somewhere on campus since, as I understand it, every user runs with the same permissions as Jupyter and there’s no separation of user spaces.
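
The figures in item 2 can be worked through quickly. The sketch below simply restates them as arithmetic, treating 2GB as 2000MB; the variable names are hypothetical.

```shell
# Figures taken from item 2 above, with 1GB treated as 1000MB.
image_mb=2000    # size of the downloadable image
students=30      # students in a class
quota_mb=200     # online storage allocated to each student

total_gb=$(( image_mb * students / 1000 ))
quota_multiple=$(( image_mb / quota_mb ))

echo "network transfer per class: ${total_gb}GB"
echo "image size relative to quota: ${quota_multiple}x"
```

So a single class would pull about 60GB across the network at start-up, and each image is ten times larger than the space a student has to keep it in.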

I suspect that these issues will be remedied in the not-too-distant future, and James and I will be exploring some of the possibilities with colleagues at ASU and UNSW over the coming year.

Finally, a h/t to Ryan Barnes, one of our own Geography grads, who did the heavy lifting on version 1 of GeoCUP.


‘GeoCUP’: Linux System-on-a-Key for Geospatial Analysis Project

We have received funding to develop a system for managing and distributing a full Linux system-on-a-key to students on our new undergraduate pathway. We are looking for an Informatics student (PhD, MSc, or BSc) to research, recommend, develop and test an appropriate solution that meets our needs. Read on for more information.

Background

This Autumn, the Department of Geography is launching an innovative new undergraduate ‘pathway’ in Geocomputation and Spatial Analysis (GSA). The pathway responds to a recognised gap not only in our own module offerings, but across the offerings of UK universities as a whole: the need for geographers who have the programming skills to process ‘big geo-data’ using Free and Open Source Software (FOSS) and who are able to tackle pressing geographical challenges in commercial, governmental, and third-sector data analysis and visualisation.

Effective delivery of this pathway will require students to store and manipulate large data sets, to install and manage new ‘code libraries’ and applications on-demand and as-needed, and to be able to collaborate flexibly on- and off-line across multiple platforms (mobile, personal, and institutional). Within the constraints of managed IT infrastructure these needs can only be met through the use of ‘bootable’ USB flash drives that provide a platform on which open-source geocomputation and spatial analysis tools can be hosted and run.

To meet this need this project will develop the GeoComputation USB Platform (GeoCUP). GeoCUP will allow students to manage and run a Linux-based operating system over which they have full administrative control. This capability is integral to successful learning on the GSA pathway as the innovative nature of student assignments and independent projects requires the use of compiled open source software libraries and tools.

This project therefore seeks to research, configure, develop, and test a management strategy to support this bootable USB flash drive approach so that it: i) enhances student experience of the College’s computing environment; ii) minimises the maintenance demands on staff as this approach cannot be supported by central IT; and iii) creates opportunities for other staff to deploy a similar system when flexibility and agility in computing are called for.

Swiss Army knife with USB key (swissarmy365.co.uk).

Objectives

There are several overarching objectives for how GeoCUP will improve the student learning experience:

  1. An operating system over which students have full control will allow them to maintain and customise their individual instance of GeoCUP to suit their personal computing needs. As the students develop competence in programming and analytical techniques, they will begin to pursue separate, distinct challenges requiring the ability to compile and install code libraries, or even entirely new applications, on-the-fly. This is impossible to achieve in a traditional, tightly-managed computing environment.
  2. We will be able to maintain and update the ‘master version’ of GeoCUP so that incoming students to the pathway will always be working with the most up-to-date system possible. In addition, should a student lose a USB drive or suffer some other type of data loss, we will be able to quickly provide them with a fully-functioning and up-to-date version of GeoCUP from which to recover. We will also be able to enforce data-protection requirements such as the use of encrypted partitions to ensure that the USB flash drives are unusable and inaccessible without the student’s password.
  3. GeoCUP will be configured with the full set of programming support tools needed to ensure the development of computational (spatial) data analysis skills, including not only Enthought Canopy and QGIS, but also open collaboration and development tools used by technology firms such as PayPal and Google. Many of the required tools are not available at all through managed IT systems; these include the GitHub versioning tool, the Postgres+PostGIS spatial database, the routino routing application, the RStudio IDE, Dropbox, and the Slack collaboration tool, amongst others. Our intention is to promote students’ employability by grounding their experience in a realistic computing environment as used by commercial and other organisations.

The ‘real world’ environment that GeoCUP provides will also bring incidental – but by no means insignificant – benefits to student experience, including:

  1. The ‘Slack’ collaboration system functions on all computing platforms, including all major mobile ones, and creates a series of ‘channels’ across which students and staff can communicate in a way that more closely mirrors student preferences: content (including code) is ‘pushed’ in real-time to all devices, can be categorised using hashtags, and serves as an instantly-searchable archive of interactions. This complements the 1-to-1 and 1-to-many format of email and the KEATS ‘broadcasting’ tool, and is expected to encourage dynamic peer support and collaboration, while avoiding repeated “Can you tell me…” messages to staff.
  2. The GitHub version control platform is now the de facto standard for collaborative programming projects in all sectors. It also brings the additional benefit of mitigating data loss in the event of corruption, loss of a USB flash drive, or other unforeseen events. We will therefore be reinforcing for students the importance of integrating code-management into their workflow.
  3. Students will also be able to take advantage of more open, platform-independent cloud-computing resources such as Dropbox and Amazon Web Services (AWS), which is not possible on the existing Microsoft-based SharePoint solution.

More Information

The selected researcher will be paid in accordance with King’s College London guidelines. Project work can begin immediately and must be complete by late August.

For more information about the project timeline and for expressions of interest (by Thurs 25 June), please contact Jonathan Reades or James Millington in the Department of Geography.