Pragmatic Geographer

SOFTWARE GEOGRAPHY ECONOMICS HISTORY

Crunch Mode Does Not Work

Killing the Crunch Mode Antipattern

If you want a “knowledge worker” to be as ineffective and produce the lowest level of quality possible, deprive them of their sleep and hold them to an unrealistic deadline….

It makes people lazy and less productive. This may seem ironic, but when someone puts in heroic levels of effort, they start to place less value on each minute.

In the last crunch mode I experienced, virtually none of the code could be salvaged. We did a reasonable review of the work and realized more than half of it would need to be significantly refactored.

It turns out that there has been some study of when a project should be rewritten versus subjected to an extensive refactoring. It depends, but generally the cut-off is about 20-25% – if you need to change more than that, you are likely better off just rewriting it (Thomas, Delis, Basili, 1997).
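That rule of thumb is easy to express as a function. The ~25% threshold comes from the cited study; the function and the numbers below are just an illustration:

```python
# Illustrative sketch of the rewrite-vs-refactor rule of thumb.
# The ~25% threshold is the paper's; everything else here is made up.
def refactor_or_rewrite(lines_needing_change, total_lines, threshold=0.25):
    """Return the recommended strategy given how much code must change."""
    fraction = lines_needing_change / total_lines
    return "rewrite" if fraction > threshold else "refactor"

print(refactor_or_rewrite(550, 1000))  # over half must change -> rewrite
print(refactor_or_rewrite(150, 1000))  # 15% -> refactor
```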

Access vs. Validity in Pragmatic HTTP APIs

It is obviously important to define the difference between the access a current client has to a resource and all of the valid methods that any client could potentially have.

This is the difference between the capabilities of a particular client (access) vs. the capabilities of the server (validity). Missing either one causes significant problems:

Missing Access: Clients have to test every possible action to determine what they are actually allowed to do.

Missing Validity: The client knows what it can do, but not what the server can do. Should it have to hunt around for a mechanism to be granted the permissions it needs?

And indeed, there appears to be a perfectly good HTTP header that means to address at least one of these:

RFC 2616

The Allow entity-header field lists the set of methods supported by the resource identified by the Request-URI. The purpose of this field is strictly to inform the recipient of valid methods associated with the resource.

That appears to cover validity. But is there some other standard means of defining permitted actions? Must a user try and fail with HTTP 403 Forbidden (rather than HTTP 405 Method Not Allowed)?

I’ve come across three potential solutions.

  1. Pragmatism.

    If there isn’t any granting mechanism or similar, just use the Allow header to capture access. Without having to make additional requests, a client immediately knows what actions it can take next.

  2. The OPTIONS method

    The what?

    The OPTIONS method…allows the client to determine the options and/or requirements associated with a resource, or the capabilities of a server, without implying a resource action or initiating a resource retrieval.

    Seemingly few servers implement this method, and it is a damn shame, because it could largely eliminate verbose API documentation and vastly increase the programmability of a resource.

    You could define not just the permitted methods for that specific client, but also better define how to interact with the resource and describe how additional access rights might be granted.

    Shout out to Django Rest Framework for including this out-of-the-box.

  3. Add an additional header

    Something like a Permitted header might make sense, but like all extra headers it raises concerns – it is less easily discovered by clients, hostile intermediaries may drop it, or it might be superseded by modifications to the HTTP spec itself.
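To make the OPTIONS/Allow idea concrete, here is a minimal sketch of a server that advertises its valid methods and a client that discovers them with a single OPTIONS request. This is not any particular framework's behavior – the resource path and the method list are invented for illustration:

```python
# Minimal WSGI app + client showing method discovery via OPTIONS.
# The resource and its method list are hypothetical.
import threading
import http.client
from wsgiref.simple_server import make_server

ALLOWED = "GET, PUT, OPTIONS"

def app(environ, start_response):
    if environ["REQUEST_METHOD"] == "OPTIONS":
        # Advertise valid methods instead of forcing trial-and-error
        start_response("200 OK", [("Allow", ALLOWED), ("Content-Length", "0")])
        return [b""]
    start_response("405 Method Not Allowed", [("Allow", ALLOWED), ("Content-Length", "0")])
    return [b""]

server = make_server("127.0.0.1", 0, app)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
conn.request("OPTIONS", "/resource")
allowed = conn.getresponse().getheader("Allow")
print(allowed)  # GET, PUT, OPTIONS
server.shutdown()
```

One request, and the client knows everything the resource supports – no probing, no 405 guessing games.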

Some Benefits of Failure - Gallipoli

It is a common trope, in the software world and elsewhere, that failure can breed success. This is evident at the micro scale – failing code/tests are an inevitable starting condition for working code and passing tests.

At the more macro scale, there is an entire conference on the theme of startup failure and lessons learned from that failure. While no one should glorify failure, you want this kind of culture – having some acceptance of failure reduces the social risk of starting a new venture.

Some acceptance of failure is not unheard of in the military world either. I actually chose my first screen name based on a major event on this theme.

The Campaign

If a bigger and more comprehensive failure increases learning experience, then the Gallipoli campaign of 1915 was a serious education for the Allies. For them, everything went wrong.

First the backstory:

The Gallipoli Campaign…took place on the Gallipoli peninsula (Gelibolu in modern Turkey) in the Ottoman Empire between 25 April 1915 and 9 January 1916, during World War I. […] an amphibious landing was undertaken on the Gallipoli peninsula, to capture the Ottoman capital of Constantinople (Istanbul). After eight months the land campaign also failed with many casualties on both sides, and the invasion force was withdrawn to Egypt.

The Many Problems Faced by the Anzacs

  • They faced a motivated, dynamic, and prepared opposition led by the founding father of modern Turkey.[1]

  • Exceedingly poor inter-branch communication/cooperation. There was no unified command for the operation – the navy was commanded independently and failed to clear the Dardanelles straits.

  • Incompetent logistics work. Many ships were not combat loaded and had to stop in Alexandria to be reorganized, delaying the operation by a month.

  • Contemporary military technology heavily favored the defensive, and tactics to mitigate massed defensive firepower were primitive.

  • Poor training and preparation. Both forces acted with bravery and stoicism under incredible hardships, but both received limited training in the run-up to the battle.

  • “…it took place in circumstances in which nearly everything was experimental: in the use of submarines and aircraft, in the trial of modern naval guns against artillery on the shore…the use of radio…land mines” (Alan Moorehead, Gallipoli)

The Failure

Gallipoli was a fiasco. A commission was set up to examine the incident and

…concluded that the expedition was poorly planned and executed and that difficulties had been underestimated, problems which were exacerbated by supply shortages and by personality clashes and procrastination at high levels.

Much of the blame fell on this guy:

First Lord of the Admiralty, Winston Churchill. He was demoted and resigned. But as it turns out, he wasn’t done with politics.

The Lesson

Gallipoli is the textbook example of an unsuccessful modern amphibious attack.

the campaign “became a focal point for the study of amphibious warfare” in the United Kingdom and United States because it involved “all four types of amphibious operations: the raid, demonstration, assault and withdrawal”.

Unified command. Understanding of the terrain. The element of surprise. Rapid expansion from the beachhead. Overwhelming fire suppression. Proper logistical and training preparation. Lessons hard learned.

The Triumph

This information directly informed planners and the strategic thinking of Allied leaders when formulating the Normandy invasion. From the BBC:

[The planners] had grasped the vital necessity for an adequate period of planning for all three services – the army, navy and air force.

Meanwhile the slaughter on the Gallipoli beaches had taught the planners the necessity of smothering the immediate beach areas with massed fire from rocket ships and mortars to neutralise German defensive positions.

Churchill certainly never forgot Gallipoli. In the 1920s, he published The World Crisis, which went into great detail – step by step – about the political and strategic background of the campaign (Moorehead). And later he was instrumental in making the Normandy landings possible and successful.

[1] The commander of Ottoman forces was one Mustafa Kemal, now more commonly known as Atatürk or “Father of the Turks”, who went on to found the modern nation of Turkey. Reacting quickly to the Anzac landings, he launched a vicious counterattack which seized the high ground and essentially doomed the invasion.

REST and File Uploads/Attachments

Your web application will support uploading files. At first glance an upload is an action, and you might consider modeling it as an RPC endpoint rather than REST – a verb rather than a noun.

There isn’t anything really wrong with that, but I would argue there are significant advantages to treating it as a noun (a REST resource). Here are a few:

Staging an upload to external datastore

An upload may not go directly to you, and it might not be performed by the requesting client – signed S3 forms, one-time URL endpoints, other protocols like BitTorrent, and other mechanisms allow direct client uploads.

Example:

{
    "upload_to_url": "https://example.com/one/time/endpoint/hashhashhash",
    "signed_token": "blahblahblah",
    "expires": "2013-07-12T19:10:19.491Z",
    "etc": "..."
}
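A client-side sketch of that two-step flow: create the upload resource, then push the bytes straight to the returned one-time URL. The server responses here are faked in-memory; the function names are invented and the fields simply mirror the JSON above – this is not any real service's API:

```python
# Hypothetical two-step direct-upload flow; responses are faked in-memory.
import datetime

def create_upload_resource():
    """Stand-in for POST /uploads -- returns staging info like the JSON above."""
    expires = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(hours=1)
    return {
        "upload_to_url": "https://example.com/one/time/endpoint/hashhashhash",
        "signed_token": "blahblahblah",
        "expires": expires.isoformat(),
    }

def direct_upload(staging, data):
    """Stand-in for the client PUT straight to the external store (S3, etc.)."""
    assert staging["signed_token"]  # a real store would validate this token
    return {"url": staging["upload_to_url"], "bytes_sent": len(data)}

staging = create_upload_resource()
result = direct_upload(staging, b"file contents")
print(result["bytes_sent"])  # 13
```

The upload resource owns the staging details, so the same pattern works whether the bytes end up on S3, a CDN, or somewhere else entirely.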

Tracking/auditing – both internally and externally

What if a user wants to see which uploads are currently in progress? All of the successful ones? The failures? Those are all useful metrics internally as well.

{
    "createuser": "https://example.com/user/1234",
    "modifieduser": "https://example.com/user/1234",
    "createdate": "2013-07-12T19:10:19.491Z",
    "modifieddate": "2013-07-12T19:10:19.491Z"
}
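With uploads modeled as resources, those questions become simple filtered queries on the collection (e.g. GET /uploads?status=in_progress). A sketch – the records and status values are invented:

```python
# Invented upload records; a real API would back this with a datastore.
uploads = [
    {"id": 1, "status": "complete"},
    {"id": 2, "status": "in_progress"},
    {"id": 3, "status": "failed"},
    {"id": 4, "status": "in_progress"},
]

def list_uploads(status=None):
    """Stand-in for the collection endpoint with an optional status filter."""
    return [u for u in uploads if status is None or u["status"] == status]

print([u["id"] for u in list_uploads("in_progress")])  # [2, 4]
```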

Attaching additional resources as a means of post-upload action

The file being uploaded is unlikely to exist in a vacuum. You will have related resources and possibly related actions. Consider, for instance, that you want to send alerts to some people when the upload is complete:

{
    "subscribers": [
      "https://example.com/user/1234",
      "https://example.com/user/288",
      "https://example.com/user/3"
    ],
    "etc": "..."
}

Explicit vs. implicit

Bottom line – your upload has state information. You are probably capturing it anyway in logs or other resources. If you have some subscribers as above, you want to make that information explicit, and in many cases, client controlled.

Testing Search (Haystack) in Django

Django’s built-in testing framework is extremely handy. As long as you use the ORM with a supported data store, a test database is used for the duration of the tests and is cleaned up between unit tests. There is no need for elaborate mocking – something I had grown accustomed to in .NET.

Here is a quick sample, edited for brevity:

$ ./manage.py test appname -v 2

Creating test database for alias 'default' ('test_projectname')
Syncing...
Creating tables ...
test_first (projectname.test.SampleTestClass) ... ok
test_second (projectname.test.SampleTestClass) ... ok
test_third (projectname.test.SampleTestClass) ... ok
Ran 3 tests in 1.260s
OK
Destroying test database for alias 'default' ('test_projectname')

But if you are using some external source of data, it is necessary to create a mock or some fake environment (as Django does).

Haystack is a handy library that abstracts away the details of various search engines. You get powerful features built into something like Elasticsearch – high availability, full-text search, spelling correction, more-like-this, etc. – in functions and data structures familiar to Django developers.

But if you are integration testing – and you should be – with the tests calling your views directly and your views updating or retrieving data from an external search engine, you are potentially going to have a bad time. Data will persist between unit tests and your results will likely be inconsistent.

The solution is actually pretty simple. Fire up a new index, override the settings so that the new index is the target for Haystack calls for the duration of the tests, and clear the index between tests.

import haystack
from django.core.management import call_command
from django.test import TestCase
from django.test.utils import override_settings

TEST_INDEX = {
    'default': {
        'ENGINE': 'haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine',
        'URL': 'http://127.0.0.1:9200/',
        'TIMEOUT': 60 * 10,
        'INDEX_NAME': 'test_index',
    },
}


@override_settings(HAYSTACK_CONNECTIONS=TEST_INDEX)
class BaseTestCase(TestCase):

    def setUp(self):
        # Point Haystack at the test index before each test
        haystack.connections.reload('default')
        super(BaseTestCase, self).setUp()

    def tearDown(self):
        # Wipe the test index so state does not leak between tests
        call_command('clear_index', interactive=False, verbosity=0)
        super(BaseTestCase, self).tearDown()
Gist

AI/Machine Learning Python Samples

I have a new repository on GitHub that demonstrates a couple of basic machine learning and AI techniques, principally picked up from CS_373 and Stanford’s Introduction to AI. It’s all explained there, and I intend to add to it as I continue my education in the field.

Machine learning is something I rarely hear talked about in the spatial developer field. This is unfortunate, as machine learning can be an effective means of analyzing, managing, and generating spatial information.

Another cool thing is the documentation I put together for this. I used pycco, a very easy-to-use annotated-code documentation generator (a port of Docco). Here is a particle filter. Here is a Kalman filter.

Spatial Correction Using a Particle Filter

The Problem – old data, important data

Some of the most important spatial data is old. It was built up and maintained over decades on paper and early computer systems, and it represents power lines, roads, water pipes, and property lines. It would be good to know the precise location of this stuff. The PLC power system was designed on in-house drawn lotlines. Today, the difference between those lotlines and the actual parcel locations is as much as 100ft, in no consistent direction. What follows are attempts to correct the locations of more than 20,000 structures without doing a significant portion by hand, using techniques picked up in Stanford’s free AI class.

The correct location is the “Hidden” bit

Education in some very advanced and useful algorithms is now within the grasp of anyone with an internet connection and a decade-old computer. More than a hundred thousand people, myself included, participated in the recently completed Stanford AI course. One particular technique caught my eye: the particle filter, used there to localize a robot – location is the hidden variable that needs to be estimated in continuous space. Why couldn’t I do something similar for static assets like poles and underground vaults? With enough control points I could then move everything else relative to them (inverse distance weighted rubbersheeting) and vastly improve the data.

A Naive Approach

I wanted to start with the simplest possible implementation. I loaded the lotlines (old, hand-drawn), parcel polygons, and the poles into PostGIS. I then converted the lines and polygons to points, and decided to use the total sum distance as the mechanism for comparing candidate particles to the poles. Again, very naive (and the data is too noisy for it to work), but it served a purpose – getting everything set up for my next iteration: comparing candidates based on tangent and distance, as the robot sensors above undoubtedly do.
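The naive approach can be sketched in a few lines – a bare-bones particle filter over candidate offsets, scored by total point-to-point distance. The toy coordinates, particle counts, and jitter below are invented for illustration; the real data lived in PostGIS:

```python
# Bare-bones particle filter over candidate (dx, dy) correction offsets.
# Toy data: poles are drawn a consistent (+3, +2) away from reference points.
import math
import random

random.seed(7)

reference = [(10.0, 10.0), (20.0, 15.0), (12.0, 30.0)]   # parcel-derived points
poles = [(x + 3.0, y + 2.0) for x, y in reference]        # mis-drawn pole locations

def score(offset):
    """Total distance from shifted poles to reference points (lower is better)."""
    dx, dy = offset
    return sum(math.dist((px + dx, py + dy), r) for (px, py), r in zip(poles, reference))

# Random candidate offsets; each iteration: weight by inverse score, resample, jitter.
particles = [(random.uniform(-10, 10), random.uniform(-10, 10)) for _ in range(500)]
for _ in range(30):
    weights = [1.0 / (1e-9 + score(p)) for p in particles]
    particles = random.choices(particles, weights=weights, k=len(particles))
    particles = [(dx + random.gauss(0, 0.3), dy + random.gauss(0, 0.3)) for dx, dy in particles]

best = min(particles, key=score)
print(best)  # should land near the true correction offset of (-3, -2)
```

With real, noisy data the scoring function matters far more than the filter mechanics, which is exactly where the sum-distance metric fell down.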

TileMill - What It Does and Some Reasons to Try It

We were promised jetpacks, but I’ll take [TileMill](http://mapbox.com/tilemill/) as a temporary replacement.


The MapBox/DevelopmentSeed team has created one of the last pieces really needed for mainstream open source GIS to gain really massive appeal.

TileMill is used for making web maps - or more specifically - for generating the tiles that make up the now-ubiquitous slippy maps we see online.

There are other desktop applications that do this, the most notable being ArcGIS Desktop. But Desktop was built for other things first: advanced analysis tools, some pretty powerful editing capabilities, and authoring paper maps.

TileMill does one thing and it does it well. It costs nothing (compared to several thousand dollars for some flavor of ArcMap), and outputs an open tile format that you can wire up to a web map or iPad in less time than it takes to install ArcMap.

And it is smooth. The user experience is the best I have had with a desktop application in a long while.

It also has sane, plaintext CSS-like styling (MSS). This may sound like a no-brainer, but your options before this were basically some proprietary binary format from ESRI (not extensible, difficult to automate, limiting, vendor specific) or SLD, which is open but widely regarded as something of a mess for other reasons.

There is also the training issue. ArcMap is giant and powerful - and extremely complex. The market for “GIS Analysts” is still strong in large part because of this complexity. Less experienced users will find TileMill easier to pick up, and web designers (of which there is a large pool of talent) will find it very easy.

It is out for every operating system of note. Go give it a try.

Introducing (Belatedly) nx_spatial

It’s been more or less done a while, but here is finally a blog post about it.

nx_spatial is a collection of addon functions for the networkx python graph library. What can you do with it?

  1. Load GIS formats into networkx graphs (where you can do all sorts of crazy analytics on them)
  2. Perform upstream and downstream traces with stopping points.
  3. Set sources and find/repair edges that don’t have the correct to/from nodes.

Example from the wiki:

>>> import nx_spatial as ns
>>> net = ns.read_shp('/shapes/lines.shp')
>>> net.edges()
[[(1.0, 1.0), (2.0, 2.0)], [(2.0, 2.0), (3.0, 3.0)], [(0.9, 0.9), (4.0, 2.0)]]
>>> net.nodes()
[(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (0.9, 0.9), (4.0, 2.0)]
>>> source = (2.0, 2.0)
>>> ns.setdirection(net, source)
>>> net.edges()
[[(2.0, 2.0), (1.0, 1.0)], [(2.0, 2.0), (3.0, 3.0)], [(0.9, 0.9), (4.0, 2.0)]]

Available on PyPI or Bitbucket. Eventually I want to integrate it with networkx trunk (loading shapefiles is already in 1.4).

Posted via email from The Pragmatic Geographer

HTML5 File API: First Impressions

I recently went to an HTML5 Hackathon at Google Kirkland. My group’s project was an in-browser IDE Chrome extension that zipped up a user-provided series of HTML/CSS/JS files into a package that could be uploaded to the Chrome Store. Issac Lewis came up with the idea after trying to develop Chrome extensions on his Chromebook and finding it basically impossible. Storing the files was a perfect use case for the FileSystem API, but I spent most of my time beating my head against the wall trying to get it working. Here are some of the things I wish I had known going in.

The FileSystem API is not LocalStorage.

LocalStorage is a key-value store; the FileSystem API really is an entire virtual file system, sandboxed within the user’s local file system. You write, read, and create files asynchronously. It is also currently implemented only in Chrome. The documentation says 9+, but I hit errors until I switched from Chromium 12 to Chrome 13.

There’s no limit to the storage, currently.

Hell yeah, cache all your map data on the user’s local file system without needing an explicit download or local client built for it. That’s a big deal for conditions or places with little to no connectivity. Also a big deal for massive games with a ton of art assets. They go through some good use cases here.

Debugging is a pain.

You will hit the dreaded SECURITY_ERR or QUOTA_EXCEEDED_ERR at some point, and it will be because debugging locally (file://) doesn’t work well in my experience. The documentation suggests it’s possible by opening Chrome with the --unlimited-quota-for-files and --allow-file-access-from-files flags, but my problems were only resolved when I started debugging as an extension rather than as a local file. You also need to be careful about the flux the API is in. Throwing around BlobBuilder() and other pieces of the newer APIs can throw errors that are difficult to track down. BlobBuilder didn’t work for me; I needed window.WebKitBlobBuilder. That webkit prefixing shows up elsewhere as well (like window.webkitRequestFileSystem).

Feel no guilt in lifting gratuitously from the sample docs when starting out. Async file access isn’t really any weirder than any other browser async work, but there is some boilerplate code that is worth snapping up. Example:

//error handling
function errorHandler(e) {
  var msg = '';

  switch (e.code) {
    case FileError.QUOTA_EXCEEDED_ERR:
      msg = 'QUOTA_EXCEEDED_ERR';
      break;
    case FileError.NOT_FOUND_ERR:
      msg = 'NOT_FOUND_ERR';
      break;
    case FileError.SECURITY_ERR:
      msg = 'SECURITY_ERR';
      break;
    case FileError.INVALID_MODIFICATION_ERR:
      msg = 'INVALID_MODIFICATION_ERR';
      break;
    case FileError.INVALID_STATE_ERR:
      msg = 'INVALID_STATE_ERR';
      break;
    default:
      msg = 'Unknown Error';
      break;
  }

  console.log('Error: ' + msg);
}

//success callback - receives the FileSystem object once it is ready
function FSCreatedSuccess(fs) {
  console.log('Opened file system: ' + fs.name);
}

//file system instantiation
window.requestFileSystem(window.PERSISTENT, 5*1024*1024 /*5MB*/, FSCreatedSuccess, errorHandler);

This kind of thing is okay starting out, but you’ll want a lot more out of the error handling eventually. The message is fine, but the code tells you nothing about where the error occurred or in reference to what object or operation.

It’s not CRUD, mostly. Don’t look for an explicit create method somewhere; the default is get-or-create via [filesystem_obj].[directory].get[Directory|File]. All reading, writing, and updating will probably live in a closure that starts with that first get.

Don’t rush.

I made the mistake of looking at the limited time allocated and just throwing the example code in willy-nilly. That is not what you do with an unfamiliar and very new API. The typical help is not online yet because the API hasn’t seen widespread use – throwing those error messages into Google is not going to help you (unless that is how you got to this page, naturally). Start with the example code, sure, but I would carefully read the entirety of the short intro before trying random things to get it to work.