Blackbox is now live

About a year ago, I wrote about a new project we were starting: Blackbox. Back then, we had a plan and some goals. Now we have reached the next step: Blackbox is now live.

A few weeks ago, we released BlueJ 3.1.0, which now incorporates Blackbox. We are now collecting data from all participating BlueJ users with the goal of making that data available to research projects. Neil Brown (@twistedsq), who has done most of the work on the implementation, has just published a blog post on this – go and read it for more detail.

In short, we are currently collecting data from 25,000 users who have agreed to participate, and expect to extend that to over 100,000 within three months.

For more information about the data collected and how it is done, look here.

If you are a computing education researcher at a recognised research institution, you can request access to this data. To do this, please mail us at


The Blackbox servers are here

Here they are:


The new Blackbox servers (and one more for BlueJ)

These are the two servers that will run the Blackbox project. They are two Dells, each with 12 cores (24 threads), 32 GB RAM, 2x 500 GB HD to mirror the OS, and 4x 2 TB HD to get a 6 TB RAID 5.
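In case the 6 TB figure looks odd next to four 2 TB disks: RAID 5 spends one disk's worth of space on parity, so the usable capacity is (number of disks - 1) times the disk size. A minimal sketch of that arithmetic (the function name is just for illustration, not something from our setup):

```python
def raid5_usable_tb(disk_count, disk_size_tb):
    """Usable capacity of a RAID 5 array; one disk's worth of space holds parity."""
    if disk_count < 3:
        raise ValueError("RAID 5 needs at least three disks")
    return (disk_count - 1) * disk_size_tb


# Four 2 TB disks, as in each Blackbox server.
print(raid5_usable_tb(4, 2))
```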

There are actually three machines in the picture. The two Blackboxes are at the top. Below that, somewhat smaller, is a new server for our research group. Currently, we are running one machine that serves the BlueJ website, the Greenfoot website (including the Greenfoot Gallery), the Greenroom, the Blueroom, the CAS public website, the CAS Online site, the two book websites, our source repositories (Subversion) for BlueJ, Greenfoot, and other projects, our Trac site, various mailing lists, and a whole lot more. And all that on a machine that’s about seven years old with a whopping 2 GB of memory. My laptop has twice as much!

SIGCSE grant for Blackbox

Great news: a few weeks ago, we were awarded an ACM SIGCSE special projects grant to support the Blackbox project. We had applied for this grant to help pay for two servers that will be central to the project. One of them will collect the incoming data, and the other one will host a mirror of the database for researchers to work on and run queries and other evaluations.

We did some worst-case calculations (worst case here meaning close to a 100% opt-in ratio of our users, which, in another sense, is really a best-case scenario). According to these, we could have up to about 40 incoming network connections per second and generate about 3 TB of data per year. That’s more than our current BlueJ server can handle. Way more.
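To get a feel for those two numbers, here is a rough back-of-the-envelope sketch. It assumes decimal terabytes (1 TB = 10^12 bytes), a 365-day year, and that both worst-case figures hold at the same time, all day, every day. Those are simplifications for illustration, not how the original estimate was made.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def average_rates(bytes_per_year, connections_per_second):
    """Return (average bytes per second, average bytes per connection)."""
    bytes_per_second = bytes_per_year / SECONDS_PER_YEAR
    bytes_per_connection = bytes_per_second / connections_per_second
    return bytes_per_second, bytes_per_connection


# Assumed inputs: 3 TB per year (decimal) and 40 connections per second.
per_second, per_connection = average_rates(3e12, 40)
print(f"{per_second / 1e3:.0f} kB/s on average, "
      f"{per_connection / 1e3:.1f} kB per connection")
```

Under these assumptions the average data rate is modest, and each connection carries only a couple of kilobytes. That suggests the sustained connection count and the storage volume are the harder parts, rather than raw bandwidth.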

So, thanks SIGCSE!

Project Blackbox – Better data for programming education research

A lot of people are very aware that programming education is still quite difficult. We are a young discipline, and pedagogical principles are not generally very well established. Many people teach programming, and how they do it is generally based on gut feeling. There are many grey areas where we just don’t know what works and what doesn’t, or why something works, or, most importantly, how to improve our teaching.

For example, students often fall into a bimodal distribution in programming classes: some learn programming quite easily, and some really struggle to “get it”. So far, there are many theories about why this may be, and a good number of studies, but nobody really knows the answer.

It’s similar for the design of educational tools and environments – do we really know which aspects have an effect and which don’t? No, we don’t. Our discipline is really only at the beginning of understanding how people learn to program.

There has been work in this area for some time. Many people have studied data about early programming interaction. Getting your hands on this data can be hard work. Often, researchers collect data (either interaction data from the computer system, or interview or observation data), and then evaluate it. If the class is small, it is sometimes hard to be sure how much the results can be generalised. Collecting larger data sets, however, is hard, because most teachers have access only to their own students.

In our BlueJ project group, we discussed some time ago that we are in a fairly unique position to be able to gather data. BlueJ has a large user community, and there is potential to make use of this to further our work: not only BlueJ development specifically, but programming education research in general.

So, some time last year we decided to initiate a new project: Project Blackbox.

The Blackbox idea is to collect data about the way beginners interact with BlueJ, and to make this data available to any interested research group to conduct their own studies with it.

For BlueJ users, this would only happen with explicit consent (opt-in) even though the data collected will be entirely anonymous. For researchers, we hope that this may create a treasure trove of data that might spark research that was not previously possible.

BlueJ is currently being downloaded over 2 million times a year, and has over 200,000 active users every month. Even if only 10% of users were to opt in to our project, we are still looking at hundreds of thousands of sessions per month, generating millions of interaction events.

We presented this idea at a special session at the last SIGCSE conference (session abstract, subscription needed), and several people expressed an interest.

So, we have now started on the design and implementation of this system, and I will occasionally give you an update here on my blog. If you are interested in keeping a closer eye on it or getting involved in the design discussions, you are welcome to join our mailing list for the Blackbox project.