Update Wed Sept 24

The author of the NYTimes interactive graphic of restaurant ratings pointed me to his data, which I have also uploaded to CartoDB as an open data set and visualization. He collected the data by scraping the city’s (Department of Health and Mental Hygiene) servers over several days. His data set has 12,290 restaurants, whereas the data set I geocoded from NYC’s Open Data Portal has ~24,957.

I have struggled to find a public database of all NYC restaurants and their geocoded addresses. There is a Quora question on the topic, but no great answer. Dan Kozikowski of FirstMark Capital has a great post from two years ago with a heatmap of restaurant density, but sadly he didn’t have the raw data anymore. The NYTimes has an interactive graphic of geocoded NYC Health Department restaurant ratings, but I couldn’t get the data from the author or extract it from the graphic.

So, I am open sourcing a geocoded version of NYC’s Health Department Restaurants Ratings database. To get this data set I:

  • Downloaded all the health ratings from NYC’s Open Data Portal
  • Subsetted to just the business name, building, street, and zipcode
  • Uploaded the subset to a Google Sheet
  • Created and ran an Apps Script to geocode each address and save down the lat / lon

The result isn’t perfect: a few restaurants geocoded outside of NYC, and a few addresses couldn’t be geocoded at all.
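For reference, the geocoding step doesn’t have to live in Apps Script. Here is a minimal Ruby sketch of the same idea against the Google Geocoding API — the API key and the usage line are placeholders, and a real run over ~25k rows would need throttling and retries.

```ruby
require "json"
require "net/http"
require "uri"

# Build a Google Geocoding API request URL for one address.
# The key is a placeholder -- supply your own.
def geocode_url(address, key)
  URI("https://maps.googleapis.com/maps/api/geocode/json" \
      "?address=#{URI.encode_www_form_component(address)}&key=#{key}")
end

# Pull lat / lon out of a Geocoding API JSON response (nil when no match,
# which covers the handful of addresses that couldn't be coded).
def extract_latlng(body)
  result = JSON.parse(body)["results"].first
  return nil unless result
  loc = result["geometry"]["location"]
  [loc["lat"], loc["lng"]]
end

# Usage (one request per address):
# lat, lng = extract_latlng(Net::HTTP.get(geocode_url(address, key)))
```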

Here is the Google Sheet

Here is the CartoDB visualization

Diffbot is a powerful scraping tool that can be used to automatically parse a wide variety of web pages using computer vision. Developers are using the product to power reading services (Instapaper, Digg, Reverb, Onswipe, Longform), price comparison engines, media monitoring tools, and many other apps and systems.

When analyzing consumer products and platforms, I am eager to explore novel ways to find data that can help inform my decisions. This fall, I was looking at a handful of fashion resale marketplaces that had seen rapid growth. The specific marketplace we were analyzing provided us with lots of data, but we wanted a more thorough understanding of their competitors. I spun up Diffbot’s Crawlbot tool to quickly run crawls on the three main marketplaces and automatically extract pricing data using Diffbot’s Product API.
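As a rough illustration of the extraction step, here is a Ruby sketch that parses a single Diffbot Product API response into list and offer prices. The field names (objects, regularPrice, offerPrice) follow my reading of the v3 product schema, so treat them as assumptions and check the current API docs.

```ruby
require "json"

# Turn one Diffbot Product API response into title / list / offer prices.
def parse_product(body)
  obj = (JSON.parse(body)["objects"] || []).first
  return nil unless obj
  # Prices come back as strings like "$1,234.00"; strip $ and , then to_f.
  to_dollars = ->(s) { s && s.delete("$,").to_f }
  { title: obj["title"],
    list:  to_dollars.call(obj["regularPrice"]),
    offer: to_dollars.call(obj["offerPrice"]) }
end
```

Run over every product URL a Crawlbot crawl surfaces, this yields the list/offer pairs the analysis below is built on.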

The first set of analyses helped us understand the relative size of each marketplace and the prices of their products. Prices are important, as they may inform the potential profile of customers and also the basket size from which the platform will earn a percentage. Four months later we ran the same analysis on two of the marketplaces to compare how they had evolved over time. Out of respect, I have anonymized the marketplaces as MPA, MPB, and MPC.


Number of Items – Diffbot’s Crawlbot found 61,240 items on MPB with offer prices vs. 18,992 items on MPA[1].

List Price - Items on MPA have a noticeably higher list price than those on MPB. The mean list price on MPA is 277% that of MPB ($521 vs. $188), and 75% of the items on MPC are less than $120 vs. $388 on MPA. This may indicate that MPA is attracting higher quality items for resale.

Offer Price - MPA’s mean offer price is 219% that of MPB ($234 vs. $107). MPC sells significantly higher priced items, but discounts steeply, so the mean offer price ($275) is only 117% that of MPA but 257% that of MPB. 75% of items on MPB are less than $50, vs. $143 on MPA and $235 on MPC. MPB clearly skews significantly cheaper.

Average Discount – The distribution of percent discount is roughly equal for MPA and MPB. MPC, however, offers a much steeper discount from list price.
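The three metrics above reduce to a few small helpers, sketched here in Ruby (the function names are mine, and the test-style numbers are illustrative):

```ruby
# Arithmetic mean of a numeric array.
def mean(xs)
  xs.sum(0.0) / xs.size
end

# Linearly interpolated percentile, p in 0..100 --
# percentile(prices, 75) gives the "75% of items are under $X" figures.
def percentile(xs, p)
  sorted = xs.sort
  rank = (p / 100.0) * (sorted.size - 1)
  lo, hi = sorted[rank.floor], sorted[rank.ceil]
  lo + (hi - lo) * (rank - rank.floor)
end

# Percent discount from list price to offer price, e.g.
# discount_pct(521.0, 234.0) is roughly 55%.
def discount_pct(list, offer)
  (1.0 - offer / list) * 100.0
end
```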

The figure below clearly indicates two important details: 1) MPB has significantly more products in their marketplace and 2) MPA has a much healthier long tail of higher priced products. The long tail MPA has captured will help increase AOV and may attract and retain higher value consumers.


  • In December MPA had ~19k items for sale vs. ~63k on MPB
  • In March MPA had ~264k items for sale (14.6x growth) vs. ~278k on MPB (4.4x growth)
  • In December MPA had a much healthier long tail of higher priced items (the 75th percentile was $149 vs. $49 on MPB)
  • In March MPA had an almost equal number of products but maintained a healthier long tail of higher-priced items (the 75th percentile was $87 vs. $45), which will drive higher AOV and help attract higher-value customers

Thanks in large part to the NSA, we are increasingly aware of the extent to which our digital lives are being tracked, recorded, and analyzed. But it is easy to forget that our seemingly analogue activities can leave just as significant a digital footprint as our digital services do.

Citibike is a fine example. Over the past 10 months I made 268 trips on Citibike, and like all Citibike trips, each of these was meticulously recorded and stored away online by the service. Unfortunately, Citibike does not currently have an API, nor any export functionality, making it difficult for anyone (NSA or otherwise) to explore their data.

Enter Kimono Labs, to the rescue!

Kimono recently released Authenticated API creation. This new tool makes it incredibly simple to scrape data from behind log-in portals. In less than five minutes I set one up to extract my Citibike trips.
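To give a sense of what consuming such an API looks like, here is a hedged Ruby sketch that flattens a Kimono response into trip records. The collection name and response shape are assumptions based on Kimono’s default output, and the endpoint placeholders are not real credentials — adjust everything to your own API.

```ruby
require "json"

# Flatten a Kimono API response into trip hashes. "collection1" is
# Kimono's default collection name; the four field names match the ones
# configured in the API setup steps below.
def parse_trips(body)
  rows = JSON.parse(body).dig("results", "collection1") || []
  rows.map do |r|
    { start_station: r["Start_Station"], start_time: r["Start_Time"],
      end_station:   r["End_Station"],   end_time:   r["End_Time"] }
  end
end

# Usage (KIMONO_API_ID and KIMONO_API_KEY are placeholders):
# body  = Net::HTTP.get(URI("https://www.kimonolabs.com/api/#{KIMONO_API_ID}?apikey=#{KIMONO_API_KEY}"))
# trips = parse_trips(body)
```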

With the raw data in hand, I turned to CartoDB to visualize it. CartoDB is the best product I have found for visualizing geo-temporal data. I teamed up with Andrew Hill, developer evangelist at CartoDB, to make a beautiful moving map of my life zipping around NYC.

Looking at the trips and parsing the data, it becomes clear that de Blasio could easily figure out things like where I live and work, whether I lost my job, and whether I was dating someone (or having an affair, if I were listed as married in the census).

Animation of My Roommate Bay Gross’ Trips

Static Maps of Our Trips

Some Fun Facts

  • 324 miles traveled (that’s the equivalent of biking from New York to Boston and halfway back again)
  • 1,963 minutes spent biking (that’s over 31 hours on my tush!)
  • Average distance traveled per trip: 1.26 miles (max: 3.68, min: 0.27)
  • Average trip time: 7mins and 49secs
  • Average trip speed: 10.57 mph
  • Weekday morning commute average speed: 12.10mph (oh shit, I’m late!)
  • Weekday evening commute average speed: 8.97mph
  • Weekday average speed: 10.87mph
  • Weekend average speed: 9.68mph
  • Average speed vs. Google estimate: 1.81% faster
  • Average arrival time to work: 8:42 (28.8% of the time I arrive after 9am and 10.58% of the time I arrive before 8am)
  • Evening weekday trips that begin outside our office start on average at 7:36pm (31.57% start after 8pm)
  • Weekday trips that end at the station closest to my apartment finish on average at 8:50pm (31.03% end after 10pm)
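A quick note on how these figures combine: 324 miles over 1,963 minutes works out to roughly 9.9 mph, so the 10.57 mph figure is presumably a mean of per-trip speeds, where short fast trips and long slow trips weigh in equally. A minimal Ruby sketch of that arithmetic (illustrative numbers only):

```ruby
# Per-trip speed in mph from miles ridden and minutes elapsed.
def speed_mph(miles, minutes)
  miles / (minutes / 60.0)
end

# Mean of per-trip speeds -- note this differs from total distance over
# total time, since every trip counts equally regardless of length.
def average_speed(trips)
  trips.sum(0.0) { |miles, minutes| speed_mph(miles, minutes) } / trips.size
end
```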

In the graph below you can easily see how much faster I bike during morning commute relative to other trips, how my weekend rides tend to start later in the day, and also the handful of late night bikes home.

Your Data

Data Collection

  • Set up an API with Kimono
  • Name each of the four fields: Start_Station, Start_Time, End_Station, End_Time
  • Get the API credentials and add to the Ruby script
  • Sign up for a Google API account and turn on Google Distance Matrix API and add the API key to the ruby script
  • Download a copy of the station JSON feed to geocode the station names
  • Run this ruby script as ruby citibike.rb
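Since the full script is linked rather than inlined here, this is a hedged sketch of just the station-geocoding step: building a name-to-coordinates lookup from the downloaded station feed. The keys (stationBeanList, stationName, latitude, longitude) reflect the feed’s shape at the time — verify against your own copy of the JSON.

```ruby
require "json"

# Index the downloaded station feed by name so each trip's start and end
# stations can be turned into lat / lon pairs for the CSV.
def station_index(feed_json)
  stations = JSON.parse(feed_json)["stationBeanList"] || []
  stations.each_with_object({}) do |s, index|
    index[s["stationName"]] = [s["latitude"], s["longitude"]]
  end
end
```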



  • Upload the CSV to CartoDB
  • Create linestrings: UPDATE table_name SET the_geom = ST_MakeLine(cdb_latlng(start_station_lat::numeric, start_station_lon::numeric), cdb_latlng(end_station_lat::numeric, end_station_lon::numeric))
  • Download and upload the OSM street data
  • Load Andrew’s SQL functions into the SQL Editor
  • Snap the linestrings to roads: UPDATE tablename SET the_geom = axh_blend_lines(the_geom)
  • Generate points for a Torque map: WITH a AS (SELECT (axh_linetime_to_points(the_geom, start_time, end_time, 20)).* FROM table) SELECT geom as the_geom, when_at FROM a
  • Run ‘Table from Query’ in the interface to create a table for the Torque map

Thanks: A big thanks to Andrew Hill for doing things in PostGIS I have no idea about, and to Bay Gross for his edits.