Crawling for music

6 Jan. 2014
Tags: Podcast, Python

In early December last year I realized I really liked the music played between the subtopics in the Freakonomics podcast. I found a forum post asking for a convenient way to get at the music, but the only answer was that the tracks were listed in the transcripts in [MUSIC...] tags.

So, how can we get access to them via Spotify?

Since the information we are looking for, artist and track name, is available in a predefined format, all that is needed is to scrape the site, meaning having a program visit each page and collect the tags. Then the tags must be cleaned to remove noise, and lastly the tracks must be searched for on Spotify and added to a playlist.

The first step is quite easy using a web crawler. I reused some Python crawling code with a few modifications and wrote a simple parser for the content. The output is a comma-separated list of artist and song name. To turn this into a Spotify playlist there is a cool online tool called Ivy that does just that, given the prepared input.
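As a rough illustration of the parsing step, here is a minimal sketch that pulls the contents of [MUSIC: ...] tags out of a piece of text. It uses the same regular expression as the crawl script further down, but is written for Python 3 rather than the Python 2 of the original scripts. The function name `extract_music_tags` and the sample text are made up for the example.

```python
import re

# Same pattern the crawl script uses for [MUSIC: ...] tags (case-insensitive).
MUSIC_TAG = re.compile(r'\[\s*MUSIC\s*:([^\]]+)\]', re.I)


def extract_music_tags(text):
    """Return the stripped contents of every [MUSIC: ...] tag in text."""
    return [match.strip() for match in MUSIC_TAG.findall(text)]


# Hypothetical transcript snippet, invented for illustration only.
sample = (
    'Some talk. [MUSIC: Pearl Django, "Saskia"] '
    'More talk. [music : Soulglue, "Broken"]'
)

for tag in extract_music_tags(sample):
    print(tag)
```

Each printed line is one artist and title pair, which is the raw material the cleaning script later turns into the comma-separated list.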

Ivy was able to find 60 of the 136 unique songs at the time of the "experiment".

This is the initial code:

crawler (6.2 KiB, sha256: 39c5675fe2...87f708548a)
crawl (1.0 KiB, sha256: 801ed4339a...a9cc18e91d)
clean_crawl_result (946.0 bytes, sha256: 3e697655ec...82a863c44a)

#-*- coding: utf-8 -*-
#
# Crawler.py
#
# Copyright (C) 2010 - Wei-Ning Huang (AZ) <aitjcize@gmail.com>
# All Rights reserved.
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software Foundation,
# Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

# André: Did some minor modifications

import httplib
import re
import sys

from posixpath import join, dirname, normpath
from threading import Thread, Lock
from urllib import quote

class Document(object):
    def __init__(self, res, url):
        self.url = url
        self.query = '' if not '?' in url else url.split('?')[-1]
        self.status = res.status
        self.text = res.read()

class Crawler(object):
    '''
    A Crawler that crawls through cplusplus.com
    '''
    F_ANY, F_SAME_DOMAIN, F_SAME_HOST, F_SAME_PATH = range(4)
    def __init__(self):
        self.host = None
        self.visited = {}
        self.targets = set()
        self.threads = []
        self.concurrency = 0
        self.max_outstanding = 1 # ORIGINAL 16

        self.follow_mode = self.F_SAME_HOST
        self.content_type_filter = '(text/html)'
        self.url_filters = []
        self.prefix_filter = '^(#|javascript:|mailto:)'

        self.targets_lock = Lock()
        self.concurrency_lock = Lock()

    def set_content_type_filter(self, cf):
        self.content_type_filter = '(%s)' % ('|'.join(cf))

    def add_url_filter(self, uf):
        self.url_filters.append(uf)

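    def set_follow_mode(self, mode):
        # Only the four F_* constants defined above are valid modes
        if mode not in (self.F_ANY, self.F_SAME_DOMAIN,
                        self.F_SAME_HOST, self.F_SAME_PATH):
            raise RuntimeError('invalid follow mode.')
        self.follow_mode = mode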

    def set_concurrency_level(self, level):
        self.max_outstanding = level

    def process_document(self, doc):
        print 'GET', doc.status, doc.url

    def crawl(self, url):
        self.root_url = url

        rx = re.match('(https?://)([^/]+)([^\?]*)(\?.*)?', url)
        self.proto = rx.group(1)
        self.host = rx.group(2)
        self.path = rx.group(3)
        self.dir_path = dirname(self.path)
        self.query = rx.group(4)

        self.targets.add(url)
        self._spawn_new_worker()

        while self.threads:
            try:
                for t in self.threads:
                    t.join(1)
                    if not t.isAlive():
                        self.threads.remove(t)
            except KeyboardInterrupt, e:
                sys.exit(1)

    def _url_domain(self, host):
        parts = host.split('.')
        if len(parts) <= 2:
            return host
        elif re.match('^[0-9]+(?:\.[0-9]+){3}$', host): # IP
            return host
        else:
            return '.'.join(parts[1:])

    def _follow_link(self, url, link):
        # Skip prefix
        if re.search(self.prefix_filter, link):
            return None

        # Filter url
        for f in self.url_filters:
            if re.search(f, link):
                return None

        rx = re.match('(https?://)([^/:]+)(:[0-9]+)?([^\?]*)(\?.*)?', url)
        url_proto = rx.group(1)
        url_host = rx.group(2)
        url_port = rx.group(3) if rx.group(3) else ''
        url_path = rx.group(4) if len(rx.group(4)) > 0 else '/'
        url_dir_path = dirname(url_path)

        rx = re.match('((https?://)([^/:]+)(:[0-9]+)?)?([^\?]*)(\?.*)?', link)
        link_full_url = rx.group(1) != None
        link_proto = rx.group(2) if rx.group(2) else url_proto
        link_host = rx.group(3) if rx.group(3) else url_host
        link_port = rx.group(4) if rx.group(4) else url_port
        link_path = quote(rx.group(5), '/%') if rx.group(5) else url_path
        link_query = quote(rx.group(6), '?=&%') if rx.group(6) else ''
        link_dir_path = dirname(link_path)

        if not link_full_url and not link.startswith('/'):
            link_path = normpath(join(url_dir_path, link_path))

        link_url = link_proto + link_host + link_port + link_path + link_query

        if self.follow_mode == self.F_ANY:
            return link_url
        elif self.follow_mode == self.F_SAME_DOMAIN:
            # Compare against link_host (link is a plain string) and
            # return the full URL, like the other follow modes do
            return link_url if self._url_domain(self.host) == \
                    self._url_domain(link_host) else None
        elif self.follow_mode == self.F_SAME_HOST:
            return link_url if self.host == link_host else None
        elif self.follow_mode == self.F_SAME_PATH:
            if self.host == link_host and \
                    link_dir_path.startswith(self.dir_path):
                return link_url
            else:
                return None

    def _add_target(self, target):
        if not target:
            return

        self.targets_lock.acquire()
        if self.visited.has_key(target):
            self.targets_lock.release()
            return
        self.targets.add(target)
        self.targets_lock.release()

    def _spawn_new_worker(self):
        self.concurrency_lock.acquire()
        self.concurrency += 1
        t = Thread(target=self._worker, args=(self.concurrency,))
        t.daemon = True
        self.threads.append(t)
        t.start()
        self.concurrency_lock.release()

    def _worker(self, sid):
        while self.targets:
            try:
                self.targets_lock.acquire()
                url = self.targets.pop()
                self.visited[url] = True
                self.targets_lock.release()

                rx = re.match('https?://([^/]+)(.*)', url)
                host = rx.group(1)
                path = rx.group(2)

                conn = httplib.HTTPConnection(host, timeout=10)
                conn.request('GET', path)
                res = conn.getresponse()

                if res.status == 301 or res.status == 302:
                    rlink = self._follow_link(url, res.getheader('location'))
                    self._add_target(rlink)
                    continue

                # Check content type
                try:
                    if not re.search(self.content_type_filter,
                            res.getheader('Content-Type')):
                        continue
                except TypeError: # getheader result is None
                    continue

                doc = Document(res, url)
                self.process_document(doc)

                # Make unique list
                links = re.findall('''href\s*=\s*['"]\s*([^'"]+)['"]''',
                        doc.text, re.S)
                links = list(set(links))

                for link in links:
                    if re.search('''freakonomics.com/\d{4}/\d{2}/\d{2}/([a-z-]+)/$''', link, re.I): # ADDED
                        rlink = self._follow_link(url, link.strip())
                        self._add_target(rlink)

                if self.concurrency < self.max_outstanding:
                    self._spawn_new_worker()
            except KeyError as e:
                # Pop from an empty set
                break
            except (httplib.HTTPException, EnvironmentError) as e:
                #print '%s, retrying' % str(e)
                self.targets_lock.acquire()
                self.targets.add(url)
                self.targets_lock.release()

        self.concurrency_lock.acquire()
        self.concurrency -= 1
        self.concurrency_lock.release()

from creepy import Crawler
import re
import sys
import time
from random import randint

"""
Crawl pages that look like podcast scripts, starting from the given URL.
** Should be fixed to only crawl the links on the first page; as it is, it just keeps going.
Modifications have also been made in the creepy.py file by AZ Huang <aitjcize@gmail.com>:
    setting threads to 1
    restricting the kinds of URLs allowed to be added to the queue
"""

class MyCrawler(Crawler):
    def process_document(self, doc):
        if doc.status == 200:
            print '*** [%d] %s' % (doc.status, doc.url)
            songs = re.findall('''\[\s*MUSIC\s*:([^\]]+)\]''', doc.text, re.I)
            # https://pythex.org/
            for song in songs:
                print ("*%s") % song

            sleep_time = randint(500, 2000) / 1000.0
            time.sleep(sleep_time) # be nice to the server
        else:
            pass

crawler = MyCrawler()
crawler.set_follow_mode(Crawler.F_SAME_HOST)
crawler.add_url_filter('\.(jpg|jpeg|gif|png|js|css|swf)$')
crawler.crawl('http://freakonomics.com/radio/freakonomics-radio-podcast-archive/') # the podcast overview page
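The restriction on which URLs get queued comes from the line marked ADDED in the crawler's worker loop. The following is a small sketch of how that pattern behaves, written for Python 3. The helper name `is_episode_link` is made up for the example, and the last URL is a hypothetical variation without a trailing slash; the first two are taken from the crawl output shown later in this post.

```python
import re

# Pattern from the line marked ADDED in the crawler: a dated post URL
# (year/month/day) followed by a slug of letters and hyphens and a final slash.
EPISODE_LINK = re.compile(
    r'freakonomics.com/\d{4}/\d{2}/\d{2}/([a-z-]+)/$', re.I)


def is_episode_link(link):
    """Return True if the crawler would queue this link."""
    return EPISODE_LINK.search(link) is not None


candidates = [
    'http://freakonomics.com/2013/11/22/the-startup-party/',
    'http://freakonomics.com/radio/freakonomics-radio-podcast-archive/',
    'http://freakonomics.com/2013/11/22/the-startup-party',  # hypothetical
]

for link in candidates:
    print(is_episode_link(link), link)
```

Only the first candidate is accepted. Note that the pattern says nothing about how deep the crawl goes: any accepted page can lead to more accepted pages, which is why the crawl keeps going past the archive page.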

#!/usr/bin/python
# -*- coding: utf-8 -*-

"""
Simple cleaning of the result of the crawler output
usage: pipe result to a file and upload to Ivy or similar service for Spotify links
"""

import sys
import re
import HTMLParser

h = HTMLParser.HTMLParser()

try:
    filename = sys.argv[1]
except IndexError:
    # Format the usage string before handing it to sys.exit
    sys.exit("Usage: %s filename" % sys.argv[0])
with open(filename) as data:
    lines = data.read().splitlines()

all_songs = []

for index, line in enumerate(lines):
    # "***" lines only record which episode the songs came from; skip them
    if line[0:3] != "***":

        line = unicode(line, "utf-8")
        line = h.unescape(line)                      # decode HTML entities
        line = re.sub('''^\*\s{0,2}''', r'', line)   # drop the leading "*"
        line = re.sub('''<[^>]+>''', r'', line)      # drop HTML tags
        line = re.sub('''\([^\)]+\)''', r'', line)   # drop "(from ...)" notes

        line = line.replace(";",",")
        line = line.replace(" -",",")

        line = line.encode('ascii', 'ignore')        # drop non-ASCII characters
        all_songs.append(line)

# Remove exact duplicates and sort
all_songs = sorted(set(all_songs))
for item in all_songs:
    print item
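To make the cleaning steps easier to follow, here is a sketch of the same sequence applied to two lines taken from the crawl output below. It is a Python 3 port, so it assumes `html.unescape` in place of the Python 2 `HTMLParser` call; the function name `clean_song_line` is made up for the example.

```python
import html
import re


def clean_song_line(line):
    """Apply the same cleaning steps as the script above to one output line."""
    line = html.unescape(line)               # decode entities such as &amp;
    line = re.sub(r'^\*\s{0,2}', '', line)   # drop the leading "*" marker
    line = re.sub(r'<[^>]+>', '', line)      # drop HTML tags
    line = re.sub(r'\([^\)]+\)', '', line)   # drop parenthesised notes
    line = line.replace(';', ',')
    line = line.replace(' -', ',')
    # Dropping non-ASCII characters also removes the curly quotes around titles.
    return line.encode('ascii', 'ignore').decode('ascii')


raw_lines = [
    '* Mark J. Scetta, “Three Men In A Tub”',
    '* Mark J. Scetta, “Three Men In A Tub”',
    '* <a href="http://doriancharnis.com/Site/Welcome.html">Dorian Charnis</a>, “Modern Bebop”',
]

# Exact duplicates collapse in the set, as in the original script.
for song in sorted(set(clean_song_line(line) for line in raw_lines)):
    print(song)
```

Because non-ASCII characters are dropped rather than transliterated, accented letters and apostrophes in names and titles disappear too, and only exact duplicates are merged.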

The result of running the scripts:

crawl_result (180.0 KiB, sha256: 57311cd838...f6dc113136)
musikk (4.7 KiB, sha256: 6eb04b6133...8b4bfe6a64)

*** [200] http://freakonomics.com/radio/freakonomics-radio-podcast-archive/
*** [200] http://freakonomics.com/2013/11/27/a-tiny-improvement-but-still/
*** [200] http://freakonomics.com/2013/11/22/the-startup-party/
*** [200] http://freakonomics.com/2012/06/21/riding-the-herd-mentality-a-new-freakonomics-radio-podcast/
*** [200] http://freakonomics.com/2013/10/03/how-to-think-about-money-choose-your-hometown-and-buy-an-electric-toothbrush-a-new-freakonomics-radio-podcast/
* John Philip Sousa, “Manhattan Beach” (from <a href="http://www.amazon.com/gp/product/B000QQXGD6/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=B000QQXGD6&amp;linkCode=as2&amp;tag=freakonomic08-20"><em>J.P. Sousa’s Marches and Dances</em></a>)
* Heavy G and the Boogaloo Communicators, “Broad Street Boogaloo” (from:<em> </em></strong><a href="http://www.amazon.com/gp/product/B000QQRSFS/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=B000QQRSFS&amp;linkCode=as2&amp;tag=freakonomic08-20"><strong><em>Makin’ It Happen</em></strong></a><strong>)
* The Diplomats of Solid Sound, “Hot Stick” (from:<em> </em><a href="http://www.amazon.com/gp/product/B000QZTIS4/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=B000QZTIS4&amp;linkCode=as2&amp;tag=freakonomic08-20"><em>Instrumental, Action, Soul</em></a>)
* Mark J. Scetta, “Three Men In A Tub”
* Mark J. Scetta, “Three Men In A Tub”
* </strong><a href="https://soundcloud.com/donvision"><strong>Donvision</strong></a><strong>, “Flip Flop”
* <a href="http://doriancharnis.com/Site/Welcome.html">Dorian Charnis</a>, “Modern Bebop”
* <a href="http://doriancharnis.com/Site/Welcome.html">Dorian Charnis</a>, “Modern Bebop”
* All Good Funk Alliance, “Timely Convo” (from:<em> </em><a href="http://www.amazon.com/gp/product/B00B209A2Q/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=B00B209A2Q&amp;linkCode=as2&amp;tag=freakonomic08-20"><em>Social Comment</em></a>)
*** [200] http://freakonomics.com/2012/06/21/would-paying-politicians-more-attract-better-politicians/
*** [200] http://freakonomics.com/2011/08/17/new-freakonomics-radio-podcast-the-economists-guide-to-parenting/
*** [200] http://freakonomics.com/2013/10/03/how-to-think-about-money-choose-your-hometown-and-buy-an-electric-toothbrush-a-new-freakonomics-radio-podcast-full-transcript/
* John Philip Sousa, “Manhattan Beach” (from <a href="http://www.amazon.com/gp/product/B000QQXGD6/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=B000QQXGD6&amp;linkCode=as2&amp;tag=freakonomic08-20"><i>J.P. Sousa’s Marches and Dances</i></a>)
---removed till I get memcache up and running---

3 Leg Torso, B  Gs
3 Leg Torso, BGs
Aaron Saloman, Hip Check
Airbus, Deep In A Dream
All Good Funk Alliance, Timely Convo
Artist Name, Song Title
Beau Blues Band, Nice and Easy
Blindfold, Rotation
Bronze Radio Return, M.O.T.R.
Cale Pellick, Sunday Stroll
Carson Henley, Fire
Christopher Norman, Cant Let Go
Clay Ross, Sixth City Waltz
Clay Ross, Street Sweep
Collective Acoustics, Does Your Laptop Have A Soul
Crytzers Blue Rhythm Band, Someday Sweetheart
D. James Goodwin, A New Team
Dan Sistos, Caravan Jam
Danielle French, Harsh Reality
Das Vibenbass, Cloak and Dagger
Das Vibenbass, The Beast
Das Vibenbass, Third Tongue
Dave Carter, Moanna
Dave Chisolm, C-Minor
Disk Eyes, Snow Angels
Donvision, Flip Flop
Donvision, Indian Summer
Donvision, Waiting For You
Dorian Charnis, Modern Bebop
Drazy Hoops, Happy Birthday To Me
Ed Hartman, Happy Marimba
Ed Hartman, Simple Life
Eleggua Productions, Diosa
Eleggua Productions, Sistema Mayoridad
Erik Janson, One More Time
Espionage, Girl From Orange County
Euforquestra, Elegua
Euforquestra, Obatala
Fairuz, Yesed Sabahak
Fooling April, Too Late
Glenn Crytzer and His Syncopators, Century Stomps
Glenn Crytzer and his Syncopators, Trepak
Glenn Crytzer and his Syncopators, Witching Hour Blues
Glenn Crytzers Savoy Seven, Focus Pocus
Green Tea, Something Like This
Greg Ruby Quartet, Zephyr
Greg Ruby, Easy for You to Say
Heavy G and The Boogaloo Communicators, Wee-Lee
Heavy G and the Boogaloo Communicators, Broad Street Boogaloo
Heavy G and the Boogaloo Communicators, Into Somethin
Heavy G and the Boogaloo Communicators, Theme From The Green Scarab
Heavy G and the Boogaloo Communicators, Wantu Wazuri
Hird, Keep You Hird
In The Nursery, Partnership
In The Nursery, Police Station
J-Hype, Isley
JOL, Life In The Sun
James King, Understand
Jason Marsalis, Hand Jivin
Jessica Lurie, Pudding
Jessica Lurie, Solitaria
John Philip Sousa, Manhattan Beach
Jonathan Geer, Draggin The Bow
Josh Bernasconi, Baby on the Beach
Louis Thorne, La Sauterelle
Louis Thorne, Mon Verrerie
Mark J. Scetta, Three Men In A Tub
Mark Petrie, Country Sunrise
Matthew Aguiluz, Binary
Melani L. Skybell, Days Like This
Melani L. Skybell, The Stars In Your Eyes
Nathan Mathes, Cheer On
Nathan Mathes, So Much Riding On
Niels Nielsen, We Are Youth
Niklas Aman, Rays of Light
Pat Andrews, 1960s Bachelor Pad
Pearl Django, Blues for Venetia
Pearl Django, Bohme Auberge
Pearl Django, Dragonfly
Pearl Django, Eleventh Hour
Pearl Django, La Rive Gauche
Pearl Django, Rhythm Oil
Pearl Django, Samba du Cabaret Rouge
Pearl Django, Saskia
Pearl Django, Seaside Adventure
Pearl Django, The Conversation
Pearl Django, Zingaerelli
Peter Mulvey, You
Phil Symonds, Caravan Cookoo
Phil Symonds, Gipsy Jacks
Phil Symonds, Rusty Tear
Reid Willis, Cub
Rob Bridgett, aurau
Ruby Velle & The Soulphonics, Looking For A Better Thing
Ruby Velle & The Soulphonics, Mr. Wrong
Ruby Velle & The Soulphonics, My Dear
Ruby Velle and The Soulphonics, Longview
Ruby Velle and The Soulphonics, The Man Says
Ruby Velle, Used Me Again
Sonogram, Certainly Obscured
Soulglue, Broken
Soulglue, Freakbus
Soulglue, Reggaeesque
Soulglue, Steve McQueen
Soulphonic Soundsystem, Mr. Sparkle
Spencer Garn, Deco Nuevo
Spencer Garn, Funky Zapatos
Spencer Garn, Living In Harmony
Spencer Garn, Pink Champagne Paradise Machine
Stephen Flinn, Jewels In My Teeth
Susie Ibarra, Azul
Susie Ibarra, Fractal 4
Susie Ibarra, The Dance
Tangria Jazz Group, Breathe Easy
Tangria Jazz Group, Ethans Song
Teddy Presberg, 82nd Ave Strut
Teddy Presberg, Free Love
Teddy Presberg, Juicy Peach
Teddy Presberg, Outcries From A Sea Of Red
Teddy Presberg, Sunrise on St. Johns
Texas Gypsies, Maxwell Swing
The Diplomats of Solid Sound, Bullfrog Bugaloo
The Diplomats of Solid Sound, Dont Touch My Popcorn
The Diplomats of Solid Sound, Growin In It
The Diplomats of Solid Sound, Hot Stick
The Diplomats of Solid Sound, Pistol Alien
The Diplomats of Solid Sound, Shadow Of Your Soul
The Diplomats of Solid Sound, The Cuber Bake
The Jaguars, By By Mai Thai
The Jaguars, Leave Me Alone
The Jaguars, Snake Charmer
The Jaguars, The Swagger
The Mackrosoft, Angiogenesis
The Mackrosoft, Bolero
The Mackrosoft, The Immortality Project
The Mackrosoft, Three Views Of A Secret
The Morrie Morrison Orchestra Get Away Get Away Get Away Get Away From Me Now
The Rosewood Project, Never Coming Down
The San Andreas Fault Encantada
The San Andreas Fault, Sympatico
The Sound Room, Just Cant Help It
The Tiptons Saxophone Quartet, Laws of Motion
The Willie August Project, Diamonds in the Darkness )
The Willie August Project, Suite for a Dancer, Movement 5
Two Dark Birds, Pie Eyed
Two Dark Birds, Run For Daylight
Two Dark Birds, Start All Over Again
Vagabond Opera, Goodnight Moon
Vagabond Opera, Hanumonsoon
Vunt Foom, Beatcutter
Vunt Foom, Grease
Winston Giles Orchestra, Over And Out