Diffbot is a powerful scraping tool that can be used to automatically parse a wide variety of web pages using computer vision. Developers are using the product to power reading services (Instapaper, Digg, Reverb, Onswipe, Longform), price comparison engines, media monitoring tools, and many other apps and systems.
When analyzing consumer products and platforms, I am eager to explore novel ways to find data that can help inform my decisions. This fall, I was looking at a handful of fashion resale marketplaces that had seen rapid growth. The specific marketplace we were analyzing provided us with lots of data, but we wanted a more thorough understanding of their competitors. I spun up Diffbot’s Crawlbot tool to quickly run crawls on the three main marketplaces and automatically extract pricing data using Diffbot’s Product API.
The first round of analysis helped us understand the relative size of each marketplace and the prices of its products. Prices matter because they hint at the likely customer profile and determine the basket size from which the platform earns a percentage. Four months later we ran the same analysis on two of the marketplaces to see how they had evolved. Out of respect, I chose to anonymize the marketplaces as MPA, MPB, and MPC.
Number of Items – Diffbot’s Crawlbot found 61,240 items on MPB with offer prices vs. 18,992 items on MPA.
List Price – Items on MPA have a noticeably higher list price than those on MPB. The mean list price on MPA is 277% that of MPB ($521 vs. $188), and 75% of the items on MPB are less than $120 vs. $388 on MPA. This may indicate that MPA is attracting higher-quality items for resale.
Offer Price – MPA’s mean offer price is 219% that of MPB ($234 vs. $107). MPC sells significantly higher-priced items, but discounts steeply, so the mean offer price ($275) is only 117% that of MPA but 257% that of MPB. 75% of items on MPB are less than $50, vs. $143 on MPA and $235 on MPC. MPB clearly skews significantly cheaper.
Average Discount – The distribution of percent discount is roughly equal for MPA and MPB. MPC, however, offers a much steeper discount from list price.
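The comparisons above boil down to a handful of summary statistics per marketplace. As a rough Python sketch, assuming each crawled item has already been reduced to a `(list_price, offer_price)` pair (the `summarize` and `pct_of` helpers are hypothetical and not part of Diffbot’s API):

```python
from statistics import mean, quantiles


def summarize(items):
    """Summarize one marketplace from (list_price, offer_price) pairs."""
    list_prices = [lp for lp, _ in items]
    offer_prices = [op for _, op in items]
    # Fractional discount from list price, skipping items with no list price.
    discounts = [(lp - op) / lp for lp, op in items if lp > 0]
    return {
        "count": len(items),
        "mean_list": mean(list_prices),
        "mean_offer": mean(offer_prices),
        # 75th percentile: three quarters of items are priced at or below this.
        "p75_offer": quantiles(offer_prices, n=4, method="inclusive")[2],
        "mean_discount": mean(discounts),
    }


def pct_of(a, b):
    """Express a as a whole-number percentage of b."""
    return round(100 * a / b)


# Means quoted in the post: MPA vs. MPB list price, then offer price.
print(pct_of(521, 188))  # 277
print(pct_of(234, 107))  # 219
```

Running `summarize` once per marketplace and comparing the resulting dictionaries gives the kind of side-by-side view described here.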
The figure below indicates two important details: 1) MPB has significantly more products in its marketplace and 2) MPA has a much healthier long tail of higher-priced products. The long tail MPA has captured will help increase average order value (AOV) and may attract and retain higher-value consumers.
ANALYSIS OVER TIME
In December MPA had ~19k items for sale vs. ~63k on MPB
In March MPA had ~264k items for sale (~13.9x growth) vs. ~278k on MPB (~4.4x growth)
In December MPA had a much healthier long tail of higher priced items (the 75th percentile was $149 vs. $49 on MPB)
In March MPA had almost as many products as MPB and maintained a healthier long tail of higher-priced items (the 75th percentile was $87 vs. $45), which should drive higher AOV and help attract higher-value customers
Thanks in large part to the NSA, we are increasingly aware of the extent to which our digital lives are being tracked, recorded, and analyzed. But it is easy to forget that our seemingly analogue activities can leave just as significant a digital footprint as our use of digital services.
Citibike is a fine example. Over the past 10 months I made 268 trips on Citibike, and like all Citibike trips, each of these was meticulously recorded and stored away online by the Citi service. Unfortunately Citibike does not currently have an API, nor any export functionality, making it difficult for anyone (NSA or otherwise) to explore their data.
Enter Kimono Labs, to the rescue!
Kimono recently released Authenticated API creation. This new tool makes it incredibly simple to scrape data from behind log-in portals. In less than 5 minutes I set one up to extract my Citibike trips.
With the raw data in hand, I turned to CartoDB to visualize it. CartoDB is the best product I have found for visualizing geo-temporal data. I teamed up with Andrew Hill, developer evangelist at CartoDB, to make a beautiful moving map of my life zipping around NYC.
Looking at the trips and parsing the data, it becomes clear that de Blasio could easily figure out things like where I live and work, whether I lost my job, and whether I was dating someone (or having an affair, if I was listed as married in the census).
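To make the first of those inferences concrete, here is a minimal Python sketch. It assumes each trip is a dict with `Start_Station`, `Start_Time`, and `End_Station` keys and that timestamps look like `2014-03-03 08:30:00`. The timestamp layout, the 6am to 11am window, and the `guess_home_and_work` helper are all assumptions made for illustration. The idea is simply that the most common weekday-morning start station is probably home, and the most common destination of those trips is probably work.

```python
from collections import Counter
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def guess_home_and_work(trips, morning=(6, 11)):
    """Guess (home, work) stations from weekday-morning trips."""
    starts, ends = Counter(), Counter()
    for trip in trips:
        start = datetime.strptime(trip["Start_Time"], TIME_FORMAT)
        # weekday() is 0-4 for Monday-Friday.
        if start.weekday() < 5 and morning[0] <= start.hour < morning[1]:
            starts[trip["Start_Station"]] += 1
            ends[trip["End_Station"]] += 1
    if not starts:
        return None, None
    return starts.most_common(1)[0][0], ends.most_common(1)[0][0]
```

A sudden stop in weekday-morning trips to the "work" station would be the kind of signal behind the lost-job inference.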
324 miles traveled (that’s the equivalent of biking to Boston from New York and halfway back again)
1,963 minutes spent biking (that’s over 31 hours on my tush!)
Average distance traveled per trip: 1.26 miles (max: 3.68, min: 0.27)
Average trip time: 7 mins and 49 secs
Average trip speed: 10.57 mph
Weekday morning commute average speed: 12.10mph (oh shit, I’m late!)
Weekday evening commute average speed: 8.97mph
Weekday average speed: 10.87mph
Weekend average speed: 9.68mph
Average speed vs. Google estimate: 1.81% faster
Average arrival time to work: 8:42 (28.8% of the time I arrive after 9am and 10.58% of the time I arrive before 8am)
Evening weekday trips that begin outside our office start on average at 7:36pm (31.57% start after 8pm)
Weekday trips that finish at the station closest to my apartment end on average at 8:50pm (31.03% end after 10pm)
In the graph below you can easily see how much faster I bike during morning commute relative to other trips, how my weekend rides tend to start later in the day, and also the handful of late night bikes home.
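Figures like these fall out of the start and end timestamps plus a distance for each trip. The sketch below is an illustration in Python, not the script used for this post: it assumes timestamps shaped like `2014-03-03 08:30:00` and a hypothetical `Miles` field holding the distance between the two stations, and it treats any weekday trip starting before noon as a morning commute.

```python
from datetime import datetime
from statistics import mean

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def trip_minutes(trip):
    """Trip duration in minutes from the Start_Time and End_Time fields."""
    start = datetime.strptime(trip["Start_Time"], TIME_FORMAT)
    end = datetime.strptime(trip["End_Time"], TIME_FORMAT)
    return (end - start).total_seconds() / 60


def speed_mph(miles, minutes):
    """Average speed in miles per hour."""
    return miles / (minutes / 60)


def is_weekday_morning(trip):
    """True for trips starting Monday-Friday before noon (assumed cutoff)."""
    start = datetime.strptime(trip["Start_Time"], TIME_FORMAT)
    return start.weekday() < 5 and start.hour < 12


def mean_speed(trips, keep=lambda trip: True):
    """Mean of per-trip speeds over the trips selected by keep."""
    speeds = [speed_mph(t["Miles"], trip_minutes(t)) for t in trips if keep(t)]
    return mean(speeds)
```

Calling `mean_speed(trips, keep=is_weekday_morning)` would give a morning-commute figure. A mean of per-trip speeds generally differs from total distance over total time; the totals above work out to roughly 9.9 mph via `speed_mph(324, 1963)`.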
Set up an API with Kimono
Name each of the four fields: Start_Station, Start_Time, End_Station, End_Time
Get the API credentials and add them to the Ruby script
Sign up for a Google API account, turn on the Google Distance Matrix API, and add the API key to the Ruby script
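The Ruby script itself is not reproduced here. As a rough illustration of the Distance Matrix step, written in Python rather than Ruby, the sketch below builds a bicycling request for one pair of stations and reads the distance and Google’s estimated duration out of a decoded JSON response. The endpoint and the `origins`, `destinations`, `mode`, and `key` parameters belong to Google’s Distance Matrix web service; the helper names and the `pct_faster` comparison are assumptions about how such a script might be organized.

```python
from urllib.parse import urlencode

ENDPOINT = "https://maps.googleapis.com/maps/api/distancematrix/json"
METERS_PER_MILE = 1609.344


def distance_matrix_url(origin, destination, api_key):
    """Build a bicycling Distance Matrix request for one origin/destination."""
    params = {
        "origins": origin,
        "destinations": destination,
        "mode": "bicycling",
        "key": api_key,
    }
    return ENDPOINT + "?" + urlencode(params)


def miles_and_minutes(response):
    """Extract (miles, minutes) from a decoded Distance Matrix response."""
    element = response["rows"][0]["elements"][0]
    if element["status"] != "OK":
        return None  # e.g. no route found between the two points
    miles = element["distance"]["value"] / METERS_PER_MILE  # value is meters
    minutes = element["duration"]["value"] / 60  # value is seconds
    return miles, minutes


def pct_faster(actual_mph, estimate_mph):
    """How much faster the actual speed is than the estimate, in percent."""
    return 100 * (actual_mph - estimate_mph) / estimate_mph
```

Each scraped trip’s `Start_Station` and `End_Station` would be passed in as the origin and destination, and the returned minutes compared against the trip’s actual duration.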