Visualizing California's Counties

Use this dataset from the US Census Bureau:

http://www2.census.gov/geo/docs/reference/county_adjacency.txt

Download the dataset.

In [1]:
!wget http://www2.census.gov/geo/docs/reference/county_adjacency.txt -O county_adjacency.txt
--2017-04-04 12:13:19--  http://www2.census.gov/geo/docs/reference/county_adjacency.txt
Resolving www2.census.gov... 104.68.125.234, 2600:1406:34:29d::208c, 2600:1406:34:29a::208c
Connecting to www2.census.gov|104.68.125.234|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 726724 (710K) [text/plain]
Saving to: ‘county_adjacency.txt’

county_adjacency.tx 100%[===================>] 709.69K  --.-KB/s    in 0.04s   

2017-04-04 12:13:20 (16.2 MB/s) - ‘county_adjacency.txt’ saved [726724/726724]

Examine the file format.

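It's a tab-separated format with four columns: county name, county code, neighbor name, neighbor code. The county name and code appear only on the first row for each county; on the following rows those two columns are left blank.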

In [2]:
!head county_adjacency.txt
"Autauga County, AL"	01001	"Autauga County, AL"	01001
		"Chilton County, AL"	01021
		"Dallas County, AL"	01047
		"Elmore County, AL"	01051
		"Lowndes County, AL"	01085
		"Montgomery County, AL"	01101
"Baldwin County, AL"	01003	"Baldwin County, AL"	01003
		"Clarke County, AL"	01025
		"Escambia County, AL"	01053
		"Mobile County, AL"	01097
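To make the blank-column convention concrete, here is a minimal sketch of a reader that carries the last named county forward over the blank rows. The helper name `iter_adjacency` and the inline `sample` text are illustrative assumptions, not part of the original notebook; the sample rows are copied from the `head` output above.

```python
def iter_adjacency(lines):
    """Yield (county, neighbor) name pairs, carrying the county forward over blank columns."""
    current = None
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        if fields[0]:
            current = fields[0].strip('"')  # a new county starts here
        yield current, fields[2].strip('"')

# Hypothetical sample: four rows in the same layout as the real file.
sample = (
    '"Autauga County, AL"\t01001\t"Autauga County, AL"\t01001\n'
    '\t\t"Chilton County, AL"\t01021\n'
    '"Baldwin County, AL"\t01003\t"Baldwin County, AL"\t01003\n'
    '\t\t"Clarke County, AL"\t01025\n'
)

pairs = list(iter_adjacency(sample.splitlines()))
for county, neighbor in pairs:
    print("%s -> %s" % (county, neighbor))
```

Note that each county also lists itself as a neighbor, so self-pairs need to be filtered out before building a graph.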

Extract California Counties

Store them in a graph: a dict mapping each county name to the set of its neighbors.

In [3]:
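graph = {}
a = None  # county named by the most recent non-blank first column

with open("county_adjacency.txt") as f:
    for line in f:
        t = line.rstrip().split("\t")
        if len(t) < 4:
            continue  # skip blank or malformed lines

        # A blank first column means "same county as the previous row".
        if t[0]:
            a = t[0][1:-1]  # strip the surrounding quotes
        b = t[2][1:-1]

        if a == b:
            continue  # every county lists itself; skip self-loops
        if not a.endswith(", CA") or not b.endswith(", CA"):
            continue  # keep only edges with both endpoints in California

        graph.setdefault(a, set()).add(b)

print("%d total counties" % len(graph))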
58 total counties
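The `a < b` filter used later when writing edges only emits every edge if each adjacency is recorded in both directions. A quick sanity check for that property on a dict-of-sets graph could look like the sketch below; the function name `asymmetric_edges` and the `toy` graph are hypothetical, chosen only to illustrate the check.

```python
def asymmetric_edges(graph):
    """Return sorted (a, b) pairs where b is listed under a but a is not listed under b."""
    return sorted(
        (a, b)
        for a, neighbors in graph.items()
        for b in neighbors
        if a not in graph.get(b, set())
    )

# Hypothetical toy graph with one one-way edge (Beta -> Gamma).
toy = {
    "Alpha County, CA": {"Beta County, CA"},
    "Beta County, CA": {"Alpha County, CA", "Gamma County, CA"},
    "Gamma County, CA": set(),
}

missing = asymmetric_edges(toy)
print(missing)
```

Calling `asymmetric_edges(graph)` on the real graph and getting an empty list would confirm that no edge is lost by writing each pair only once.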

Build the Graphviz input file

First we need to give each county a short identifier that is valid as an unquoted DOT node name: lowercase letters and underscores only.

In [4]:
import re

# e.g. "San Luis Obispo County, CA" -> "san_luis_obispo"
idents = {
    name: re.sub("[^a-z]", "_", name.replace("County, CA", "").lower().strip())
    for name in graph
}

print(list(idents.items())[:10])
[('Nevada County, CA', 'nevada'), ('Alameda County, CA', 'alameda'), ('Kings County, CA', 'kings'), ('Del Norte County, CA', 'del_norte'), ('El Dorado County, CA', 'el_dorado'), ('San Joaquin County, CA', 'san_joaquin'), ('Imperial County, CA', 'imperial'), ('San Luis Obispo County, CA', 'san_luis_obispo'), ('Modoc County, CA', 'modoc'), ('Colusa County, CA', 'colusa')]
In [5]:
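with open("counties.dot", "w") as f:
    f.write("graph {\n")
    f.write("  santa_cruz [color=red];\n")  # highlight one county
    for a, bs in graph.items():
        for b in bs:
            # Each adjacency should be listed under both counties,
            # so write the undirected edge only once.
            if a < b:
                f.write("  %s -- %s;\n" % (idents[a], idents[b]))
    f.write("}\n")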
In [6]:
!head counties.dot
graph {
  santa_cruz [color=red];
  nevada -- sierra;
  nevada -- yuba;
  nevada -- placer;
  alameda -- san_joaquin;
  alameda -- san_francisco;
  alameda -- contra_costa;
  alameda -- san_mateo;
  alameda -- stanislaus;
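The short identifiers lose the original spelling of each county, so the rendered nodes show names like `san_luis_obispo`. One possible variation, sketched here under the assumption that readable labels are wanted, declares every node with a DOT `label` attribute and sorts the output so the file is reproducible. The function `to_dot` and the two-county toy inputs are illustrative, not taken from the notebook.

```python
def to_dot(graph, idents, highlight=None):
    """Return DOT source for an undirected graph with human-readable node labels."""
    lines = ["graph {"]
    for name in sorted(graph):
        attrs = ['label="%s"' % name.replace(" County, CA", "")]
        if name == highlight:
            attrs.append("color=red")
        lines.append("  %s [%s];" % (idents[name], ", ".join(attrs)))
    for a in sorted(graph):
        for b in sorted(graph[a]):
            if a < b:  # write each undirected edge once
                lines.append("  %s -- %s;" % (idents[a], idents[b]))
    lines.append("}")
    return "\n".join(lines) + "\n"

# Hypothetical two-county example.
toy_graph = {
    "Santa Cruz County, CA": {"Santa Clara County, CA"},
    "Santa Clara County, CA": {"Santa Cruz County, CA"},
}
toy_idents = {
    "Santa Cruz County, CA": "santa_cruz",
    "Santa Clara County, CA": "santa_clara",
}

dot = to_dot(toy_graph, toy_idents, highlight="Santa Cruz County, CA")
print(dot)
```

Building the text in memory first also makes it easy to inspect or test before writing it to `counties.dot`.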

Run Graphviz

In [7]:
%%time
!fdp counties.dot -Tpng -o counties.png
CPU times: user 28.5 ms, sys: 16 ms, total: 44.5 ms
Wall time: 2.34 s
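Outside a notebook the `!fdp` shell escape is not available. A rough equivalent using the standard library is sketched below, assuming Python 3 (for `shutil.which`) and that Graphviz is installed; the helper names `fdp_command` and `render` are made up for this example.

```python
import shutil
import subprocess

def fdp_command(dot_path, png_path):
    """Build the argument list equivalent to `fdp <dot> -Tpng -o <png>`."""
    return ["fdp", "-Tpng", dot_path, "-o", png_path]

def render(dot_path, png_path):
    """Run fdp if it is on PATH; return False when Graphviz is not installed."""
    if shutil.which("fdp") is None:
        return False
    # check_call raises CalledProcessError if fdp exits with a non-zero status.
    subprocess.check_call(fdp_command(dot_path, png_path))
    return True

cmd = fdp_command("counties.dot", "counties.png")
print(" ".join(cmd))
```

Calling `render("counties.dot", "counties.png")` would then produce the same image as the cell above, provided `counties.dot` exists.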
In [8]:
from IPython.display import Image
Image("counties.png")
Out[8]: