Diffbot is a powerful scraping tool that can be used to automatically parse a wide variety of web pages using computer vision. Developers are using the product to power reading services (Instapaper, Digg, Reverb, Onswipe, Longform), price comparison engines, media monitoring tools, and many other apps and systems.

When analyzing consumer products and platforms, I am eager to explore novel ways to find data that can help inform my decisions. This fall, I was looking at a handful of fashion resale marketplaces that had seen rapid growth. The specific marketplace we were analyzing provided us with lots of data, but we wanted a more thorough understanding of their competitors. I spun up Diffbot’s Crawlbot tool to quickly run crawls on the three main marketplaces and automatically extract pricing data using Diffbot’s Product API.
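
For the curious, a call to the Product API for a single listing looks roughly like the sketch below. The token is a placeholder, the listing URL is made up, and the exact response field names may vary by API version; Crawlbot simply runs this kind of extraction across an entire site.

    # product_price.rb - a rough sketch, not the exact crawl configuration used here.
    # The token is a placeholder and the response field names may differ by API version.
    require 'net/http'
    require 'json'
    require 'cgi'

    DIFFBOT_TOKEN = ENV['DIFFBOT_TOKEN']

    def product_prices(product_url)
      uri  = URI("https://api.diffbot.com/v3/product?token=#{DIFFBOT_TOKEN}&url=#{CGI.escape(product_url)}")
      data = JSON.parse(Net::HTTP.get(uri))
      item = (data['objects'] || []).first || {}
      { list_price: item['regularPrice'], offer_price: item['offerPrice'] }
    end

    # The listing URL below is made up; Crawlbot runs this extraction across a whole site.
    p product_prices('https://example-marketplace.com/listing/123')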

The first round of analysis helped us understand the relative size of each marketplace and the prices of their products. Prices matter because they may inform the likely customer profile as well as the basket size from which the platform earns a percentage. Four months later we ran the same analysis on two of the marketplaces to compare how they had evolved over time. Out of respect, I have anonymized the marketplaces as MPA, MPB, and MPC.

MARKETPLACE DATA

Number of Items – Diffbot’s Crawlbot found 61,240 items on MPB with offer prices vs. 18,992 items on MPA[1].

List Price - Items on MPA have a noticeably higher list price than those on MPB. The mean list price on MPA is 277% that of MPB ($521 vs. $188), and 75% of the items on MPC are less than $120 vs. $388 on MPA. This may indicate that MPA is attracting higher quality items for resale.

Offer Price - MPA’s mean offer price is 219% that of MPB ($234 vs. $107). MPC sells significantly higher priced items, but discounts steeply, so the mean offer price ($275) is only 117% that of MPA but 257% that of MPB. 75% of items on MPB are less than $50, vs. $143 on MPA and $235 on MPC. MPB clearly skews significantly cheaper.

Average Discount – The distribution of percent discount is roughly equal for MPA and MPB. MPC, however, offers a much steeper discount from list price.

The figure below clearly indicates two important details: 1) MPB has significantly more products in their marketplace and 2) MPA has a much healthier long tail of higher priced products. The long tail MPA has captured will help increase AOV and may attract and retain higher value consumers.

ANALYSIS OVER TIME

  • In December MPA had ~19k items for sale vs. ~63k on MPB
  • In March MPA now has ~264k items for sale (14.6x growth) vs. ~278k on MPB (4.4x growth)
  • In December MPA had a much healthier long tail of higher priced items (the 75th percentile was $149 vs. $49 on MPB)
  • In March MPA has an almost equal number of products but has maintained a healthier long tail of higher priced items (the 75th percentile was $87 vs. $45) which will drive higher AOV and help attract higher value customers

Thanks in large part to the NSA, we are increasingly aware of the extent to which our digital lives are being tracked, recorded, and analyzed. But it is easy to forget that our seemingly analogue activities can leave just as significant a digital footprint as our digital services do.

Citibike is a fine example. Over the past 10 months I made 268 trips on Citibike, and like all Citibike trips, each one was meticulously recorded and stored away online by the service. Unfortunately, Citibike does not currently have an API or any export functionality, making it difficult for anyone (NSA or otherwise) to explore their data.

Enter Kimono Labs, to the rescue!

Kimono recently released Authenticated API creation. This new tool makes it incredibly simple to scrape data from behind log-in portals. In less than five minutes I set one up to extract my Citibike trips.

With the raw data in hand, I turned to CartoDB to visualize it. CartoDB is the best product I have found for visualizing geo-temporal data. I teamed up with Andrew Hill, developer evangelist at CartoDB, to make a beautiful moving map of my life zipping around NYC.

Looking at the trips and parsing the data, it becomes clear that de Blasio could easily figure out things like where I live and work, whether I lost my job, and whether I was dating someone (or having an affair, if I were listed as married in the census).

Animation of My Roommate Bay Gross’ Trips

Static Maps of Our Trips

Some Fun Facts

  • 324 miles traveled (that’s the equivalent of biking from New York to Boston and halfway back again)
  • 1,963 minutes spent biking (that’s over 32 hours on my tush!)
  • Average distance traveled per trip: 1.26 miles (max: 3.68, min: 0.27)
  • Average trip time: 7 minutes and 49 seconds
  • Average trip speed: 10.57 mph
  • Weekday morning commute average speed: 12.10 mph (oh shit, I’m late!)
  • Weekday evening commute average speed: 8.97 mph
  • Weekday average speed: 10.87 mph
  • Weekend average speed: 9.68 mph
  • Average speed vs. Google’s estimate: 1.81% faster
  • Average arrival time at work: 8:42am (28.8% of the time I arrive after 9am and 10.58% of the time I arrive before 8am)
  • Weekday evening trips that begin outside our office start on average at 7:36pm (31.57% start after 8pm)
  • Weekday trips that end at the station closest to my apartment finish on average at 8:50pm (31.03% end after 10pm)

In the graph below you can easily see how much faster I bike during my morning commute relative to other trips, how my weekend rides tend to start later in the day, and the handful of late-night rides home.

Your Data

Data Collection

  • Set up an API with Kimono
  • Name each of the four fields: Start_Station, Start_Time, End_Station, End_Time
  • Get the API credentials and add them to the Ruby script
  • Sign up for a Google API account, turn on the Google Distance Matrix API, and add the API key to the Ruby script
  • Download a copy of the station JSON feed to geocode the station names
  • Run this Ruby script as ruby citibike.rb (a sketch of what such a script might look like follows this list)
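
The original script isn’t reproduced here, but below is a minimal sketch of what it might look like. The Kimono endpoint, the collection1 wrapper, and the station feed keys (stationBeanList, stationName, latitude, longitude) are assumptions based on the steps above, so adjust them to match your own API.

    # citibike.rb - a minimal sketch; endpoint, field names, and feed keys are assumptions.
    require 'net/http'
    require 'json'
    require 'csv'

    KIMONO_URL = 'https://www.kimonolabs.com/api/YOUR_API_ID?apikey=YOUR_KIMONO_API_KEY'
    GOOGLE_KEY = 'YOUR_GOOGLE_API_KEY'

    def fetch_json(url)
      JSON.parse(Net::HTTP.get(URI(url)))
    end

    # Geocode station names using the downloaded copy of the station JSON feed
    station_feed = JSON.parse(File.read('stations.json'))['stationBeanList']
    stations = station_feed.map { |s| [s['stationName'], [s['latitude'], s['longitude']]] }.to_h

    # Kimono usually wraps scraped rows in results -> collection1; adjust to your API's shape
    trips = fetch_json(KIMONO_URL)['results']['collection1']

    CSV.open('trips.csv', 'w') do |csv|
      csv << %w[start_station start_time end_station end_time
                start_station_lat start_station_lon end_station_lat end_station_lon distance_miles]
      trips.each do |trip|
        origin      = stations[trip['Start_Station']]
        destination = stations[trip['End_Station']]
        next unless origin && destination

        # Ask the Google Distance Matrix API for the cycling distance between the two stations
        dm = fetch_json('https://maps.googleapis.com/maps/api/distancematrix/json' \
                        "?origins=#{origin.join(',')}&destinations=#{destination.join(',')}" \
                        "&mode=bicycling&key=#{GOOGLE_KEY}")
        meters = dm.dig('rows', 0, 'elements', 0, 'distance', 'value')

        csv << [trip['Start_Station'], trip['Start_Time'], trip['End_Station'], trip['End_Time'],
                origin[0], origin[1], destination[0], destination[1],
                meters ? (meters / 1609.34).round(2) : nil]
      end
    end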

Analysis
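
The analysis itself lived in a separate script. As an illustration, a sketch like the one below (assuming the trips.csv produced above, with parseable timestamps and a distance_miles column) reproduces a few of the summary stats listed earlier.

    # analysis.rb - a sketch of a few of the summary stats, computed from trips.csv above.
    require 'csv'
    require 'time'

    trips = CSV.read('trips.csv', headers: true)

    total_miles   = trips.sum { |t| t['distance_miles'].to_f }
    total_minutes = trips.sum { |t| (Time.parse(t['end_time']) - Time.parse(t['start_time'])) / 60.0 }

    puts "Trips: #{trips.size}"
    puts "Miles traveled: #{total_miles.round(1)}"
    puts "Minutes spent biking: #{total_minutes.round}"
    puts "Average distance per trip: #{(total_miles / trips.size).round(2)} miles"
    puts "Average trip speed: #{(total_miles / (total_minutes / 60.0)).round(2)} mph"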

Visualization

  • Upload the CSV to CartoDB
  • Create linestrings: UPDATE table_name SET the_geom = ST_MakeLine(cdb_latlng(start_station_lat::numeric, start_station_lon::numeric), cdb_latlng(end_station_lat::numeric, end_station_lon::numeric))
  • Download and upload the OSM street data
  • Load Andrew’s SQL functions into the SQL Editor
  • Snap the linestrings to roads: UPDATE table_name SET the_geom = axh_blend_lines(the_geom)
  • Generate points for a Torque map: WITH a AS (SELECT (axh_linetime_to_points(the_geom, start_time, end_time, 20)).* FROM table_name) SELECT geom as the_geom, when_at FROM a
  • Run ‘Table from Query’ in the interface to create a table for the Torque map

Thanks: A big thanks to Andrew Hill for doing things in PostGIS I have no idea about, and to Bay Gross for his edits.

Consumer products often have great data that can be used to inform a perspective on trends and market sizing. Two basic scripts I have written for this purpose are outlined below.

Twitter

I frequently want to know how often a specific @handle is mentioned, or term is used, on Twitter. There are existing tools like Topsy that help answer this question, but I have never particularly liked any of them.

I ended up writing my own simple Ruby script to query the Twitter API for a given term. I then combined this with an R script so that I could graph the number of mentions per day for several terms.[1]

You can find the Ruby and bash scripts on GitHub.
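
The repo isn’t reproduced here, but a minimal sketch of the counting step, using the twitter gem, looks something like this; the credentials are placeholders, the actual script on GitHub may differ in detail, and the public search API only reaches back about a week.

    # count_mentions.rb - a minimal sketch using the 'twitter' gem; credentials are
    # placeholders and the script on GitHub may differ in detail.
    require 'twitter'
    require 'date'

    client = Twitter::REST::Client.new do |config|
      config.consumer_key        = ENV['TWITTER_CONSUMER_KEY']
      config.consumer_secret     = ENV['TWITTER_CONSUMER_SECRET']
      config.access_token        = ENV['TWITTER_ACCESS_TOKEN']
      config.access_token_secret = ENV['TWITTER_ACCESS_TOKEN_SECRET']
    end

    term = ARGV[0] || '@handle'

    # The public search API only covers roughly the last week of tweets
    counts = Hash.new(0)
    client.search(term, result_type: 'recent').take(1000).each do |tweet|
      counts[tweet.created_at.to_date] += 1
    end

    # CSV-style output that the R script can read and graph
    counts.sort.each { |day, n| puts "#{day},#{n}" }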

Yelp

Occasionally I want to know how many of a certain type of SMB there are in particular US cities. Yelp is one of the best free sources of this data,[2] and using their API it is fairly trivial to set up a basic Ruby script to programmatically get a count.

You can do this for a single search term in a single city using this script I wrote.

You could query for a variety of terms across many different US cities using a different script.
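
Here is a rough sketch of the single-term, single-city version using the official yelp gem for the v2 Search API; the keys are placeholders and the response’s total field is my assumption about where the count lives. Looping the same call over lists of terms and cities gives you the second script.

    # yelp_count.rb - a rough sketch using the official 'yelp' gem (v2 Search API);
    # keys are placeholders and the 'total' field is my assumption for where the count lives.
    require 'yelp'

    client = Yelp::Client.new(
      consumer_key:    ENV['YELP_CONSUMER_KEY'],
      consumer_secret: ENV['YELP_CONSUMER_SECRET'],
      token:           ENV['YELP_TOKEN'],
      token_secret:    ENV['YELP_TOKEN_SECRET']
    )

    term = ARGV[0] || 'nail salon'
    city = ARGV[1] || 'New York'

    # Request a single result; the search response reports the overall match count
    response = client.search(city, term: term, limit: 1)
    puts "#{term} in #{city}: #{response.total}"

    # Looping this call over arrays of terms and cities gives the multi-city version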

Footnotes

  1. It would be awesome if Twitter actually had its own version of Google Trends

  2. I haven’t used Factual, which I think may be a more accurate data set, but you have to pay for it