Getting started with Scrapy on macOS High Sierra

This post documents installing Scrapy on a MacBook Pro (13-inch, 2017, Four Thunderbolt 3 Ports) running macOS High Sierra. The TL;DR setup steps would be:

  1. sudo easy_install pip
  2. sudo pip install virtualenv
  3. Create a directory to be used with Scrapy, create a virtualenv in it with virtualenv ENV and activate it with source ENV/bin/activate
  4. pip install scrapy into this virtualenv
  5. Get started with scrapy startproject craigslist and follow a tutorial, e.g. scraping Craigslist

The long, real-time documented version including mildly funny commentary can be found below.

I am on High Sierra, 10.13.5, nowadays and my laptop is so futuristic it has a touchy touch bar that no one has found a really good use for yet, and the buses are so fresh that no contemporary device can physically connect to it.
However, today I am trying to scrape some web with this amazing shiny piece of aluminium.
After having worked with lots of manual scraping techniques from the command line, like urllib2 and Beautiful Soup and whatnot, today I will give Scrapy a go.
[Image: Scrapy logo showing a spatula]
The cute Scrapy spatula icon gives me hope that this “open source and collaborative framework for extracting the data you need from websites” really works “In a fast, simple, yet extensible way.” So here we go.

  1. This computer is pretty much naked when it comes to any useful command line tools so to install Scrapy I need pip first.
    sudo easy_install pip
    Bummer: the fingerprint reader from the future doesn’t work on the command line and I actually need to type out my very safe and strong password. A couple of lines get printed in my window. So far so good.
  2. pip install scrapy
    A couple of lines dumped in my terminal suggest this is working out until I get some red warnings around pillow, nose and tornado. Dependencies, here we come…
  3. sudo pip install pillow
    sudo pip install nose
    sudo pip install tornado
    Unclear if that just worked? – No, not really.
  4. pip complains: Cannot uninstall 'six'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.

OK, fail, back to start. I take the time and actually read a bit more. It is recommended to install scrapy in a virtual environment. Well here we go.

  1. sudo pip install virtualenv
    So far so good, let’s just ignore the warnings for now.
  2. Looks good. I have my project folder /Users/.../scrapy, so this is where I create my virtual environment ENV with virtualenv ENV
  3. I activate this environment with source ENV/bin/activate
  4. Here we go. I am in my virtual environment. Hurrah. So what happens when I install Scrapy in here? pip install scrapy
    My terminal joyously floods with words and vintagely appealing ASCII progress bars until we gleefully end up having installed all kinds of dubious packages by the names of Automat-0.7.0 PyDispatcher-2.0.5 PyHamcrest-1.9.0 Twisted-18.7.0 asn1crypto-0.24.0 attrs-18.1.0 cffi-1.11.5 constantly-15.1.0 cryptography-2.3 cssselect-1.0.3 enum34-1.1.6 functools32-3.2.3.post2 hyperlink-18.0.0 idna-2.7 incremental-17.5.0 ipaddress-1.0.22 lxml-4.2.3 parsel-1.5.0 pyOpenSSL-18.0.0 pyasn1-0.4.3 pyasn1-modules-0.2.2 pycparser-2.18 queuelib-1.5.0 scrapy-1.5.1 service-identity-17.0.0 six-1.11.0 w3lib-1.19.0 zope.interface-4.5.0
  5. I follow the Getting Started tutorial, run scrapy startproject craigslist and end up getting a message that confirms the successful creation of the craigslist Scrapy project.
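
The two attempts above differ only in whether a virtualenv was active when pip ran. Here is a minimal sketch of a guard against repeating the first attempt, assuming a POSIX shell. in_virtualenv and safe_pip_install are hypothetical helper names, and the check relies on the VIRTUAL_ENV variable that virtualenv's activate script exports:

```shell
#!/bin/sh
# Succeeds (status 0) only when a virtualenv is active.
# The activate script exports VIRTUAL_ENV, so an empty or unset
# value means we are still on the system Python.
in_virtualenv() {
    [ -n "${VIRTUAL_ENV:-}" ]
}

# Hypothetical wrapper: refuse to install outside a virtualenv.
safe_pip_install() {
    if in_virtualenv; then
        pip install "$@"
    else
        echo "No virtualenv active; run 'source ENV/bin/activate' first." >&2
        return 1
    fi
}
```

After source ENV/bin/activate, safe_pip_install scrapy behaves like pip install scrapy. Without an active environment it prints the hint and returns a non-zero status. pip has a comparable built-in switch: with PIP_REQUIRE_VIRTUALENV=true exported, pip itself refuses to install outside a virtual environment.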

Working on a forked repo

I found this wonderful repository for adding photos to a Leaflet layer.

The original project by GitHub user turban appears abandoned, but user HeikkiVesanto has forked the repo and added neat functionality: the use of a local directory for photos, which allows a completely self-hosted solution (the original project used Instagram and the [also abandoned] Picasa).

To ensure the longevity of my project I need a self-hostable solution. However, when I fork the repo (even when on the desired forked branch) I only get the original author’s work and not the more recent project. The network diagram shows that the version I forked is older and misses some work from other contributors.

[Screenshot: network graph before the pull]

So I forked the repo and then cloned it to my local machine.

After some trial and error it turned out to be a quick fix to switch my local repo to the other author’s more recent fork.
First, make sure I’m on the correct branch (in this case gh-pages):

git checkout gh-pages

Then pull the repo of the author:

git pull https://github.com/HeikkiVesanto/Leaflet.Photo gh-pages

and then push the changes back to my GitHub with

git push origin gh-pages

And

git status

happily returns

On branch gh-pages
Your branch is up-to-date with 'origin/gh-pages'.
nothing to commit, working tree clean

And the resulting graph shows that my project is now ahead of the original and in tune with the most recent changes:

[Screenshot: network graph after the pull]
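
The pull above can also be done through a named remote, which keeps the other fork's URL around for later updates. This is a sketch under assumptions: sync_from_fork is a hypothetical helper and "upstream" is an assumed remote name, neither of which appears in the steps above. The --ff-only flag makes the merge refuse anything other than a fast-forward.

```shell
#!/bin/sh
# Hypothetical helper: fast-forward a local branch to another fork's branch.
# Usage: sync_from_fork <fork-url> <branch>
sync_from_fork() {
    fork_url=$1
    branch=$2

    # "upstream" is an assumed remote name; add it once, or repoint it.
    if git remote get-url upstream >/dev/null 2>&1; then
        git remote set-url upstream "$fork_url"
    else
        git remote add upstream "$fork_url"
    fi

    # Fetch the fork, switch to the branch, then fast-forward only.
    git fetch upstream &&
        git checkout "$branch" &&
        git merge --ff-only "upstream/$branch"
}
```

In this case the call would be sync_from_fork https://github.com/HeikkiVesanto/Leaflet.Photo gh-pages, followed by the same git push origin gh-pages as before.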

Mac OS install error: could not create ‘/anything/for/you’: Permission denied

This happens with brew install, pip install etc. And then if I grumpily hammer into my terminal

 sudo brew install the-software-i-need-right-now

brew kindly replies

Error: Running Homebrew as root is extremely dangerous and no longer supported.
As Homebrew does not drop privileges on installation you would be giving all
build scripts full access to your system.

and I furiously stare at the screen. How do I get my permissions back? Keep calm and type

sudo chown -R `whoami`:admin /this/is/my/computer/
and/i/want/to/install/malware/in/this/directory/
and/macosx/cant/stop/me

Because I like danger
Phew.

PS. Dear WordPress. Line breaks in long rant-directory-names are uncool.
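
Before reaching for a recursive chown, it can help to see which directories are actually not writable. This is a small sketch, assuming a POSIX shell. check_writable is a hypothetical helper name:

```shell
#!/bin/sh
# Report for each given directory whether the current user can write to it.
# A path that does not exist is reported as not writable as well.
check_writable() {
    for dir in "$@"; do
        if [ -w "$dir" ]; then
            echo "ok: $dir"
        else
            echo "not writable: $dir"
        fi
    done
}
```

For a Homebrew install, check_writable "$(brew --prefix)" would be one place to start, since brew --prefix prints the directory Homebrew installs into.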

MongoDB Mac Quickstart

I wanted to test MongoDB and had no idea how to quickstart it.

Install

I used Homebrew. So what I did was

brew update
brew install mongodb

Fine.

Start server

How do I use/start it? OK. I need two terminals open, one as server, one as console.

The server I started with

mongod --dbpath "$(pwd)/data/db"

after I manually created the data/db structure in my working/testing directory. No idea where the default location of this structure should be, but apparently that worked.
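
The manual directory creation and the server start can be combined. This is a minimal sketch, assuming a POSIX shell and the same data/db layout as above. prepare_dbpath is a hypothetical helper name, and the mongod line is left as a comment so the snippet does not start a server by itself:

```shell
#!/bin/sh
# Create <base>/data/db if it is missing and print its absolute path.
# <base> defaults to the current working directory.
prepare_dbpath() {
    base=${1:-$(pwd)}
    mkdir -p "$base/data/db" && (cd "$base/data/db" && pwd)
}

# Start the server on that directory (uncomment to actually run it):
# mongod --dbpath "$(prepare_dbpath)"
```

mkdir -p does nothing when the directory already exists, so the helper is safe to run again before every server start.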

Start console

The console in the other terminal starts with

mongo

and I switch to a database (MongoDB only creates it once something is written to it) with

use dbname

So far so good. And now?

I actually want to connect to this database with Robomongo, but the application has kept crashing ever since I started the server and console. Boo.

Quick one-liner to look up video format data with ExifTool

Ever wondered what format all the videos in your Video folder have?
I need this sort of information for video work to see whether I have all the footage in the right format (and can contact collaborators to re-export/re-shoot if not).
Looking through every single file by hand can be tiring when dealing with more than a handful of files,
so finally I managed to compile this one-liner that does all the work for me.

This script requires you to have ExifTool installed, which is a great tool anyways if you love image metadata.

find * \( -name '*mp4' -o -name '*mov' \) -exec exiftool -T -imagesize -videoframerate -compressorname {} \;

or with a filename

find * \( -name '*mp4' -o -name '*mov' \) -exec exiftool -T -filename -imagesize -videoframerate -compressorname {} \;

Without the filename, the output for one of my folders looks like this. Uh oh, I spot some DV PAL in there:

1920x1080 25.000000 H.264
1920x1080 25.000000 AVC Coding
1920x1080 25.000000 H.264
1920x1280 25.000000 Apple ProRes 4444
1920x1080 25.000000 AVC Coding
1920x1080 23.976024 H.264
1920x1080 24.000000 H.264
1920x1080 25 AVC Coding
1920x1080 25.000000 H.264
1920x1080 25.000000 H.264
1920x1080 23.976024 H.264
1920x1080 24 H.264
1920x1080 25.000000 AVC Coding
1920x1080 24.000000 AVC Coding
1920x1080 25.000000 AVC Coding
1920x1080 25.000000 H.264
1920x1080 25.000000 AVC Coding
1920x1080 25.000000 H.264
1920x1080 25 AVC Coding
1920x1080 25.000000 H.264
1920x1080 25.000000 AVC Coding
1920x1080 25.000000 Apple ProRes 4444
1920x1080 25 Apple ProRes 422
1920x1080 25.000000 H.264
1920x1080 25.000000 AVC Coding
1920x1080 24.000000 H.264
1920x1080 29.970030 AVC Coding
1920x1080 25.000000 AVC Coding
1920x1080 25.000000 H.264
1920x1080 25.000000 H.264
1920x1080 30.000000 H.264
1280x720 25.000000 H.264
1920x1080 25.000000 AVC Coding
1920x1080 25.000000 H.264
1920x1080 25.000000 H.264
1920x1080 25.000000 H.264
1920x1080 25.000000 AVC Coding
1920x1080 25.000000 AVC Coding
1920x1080 25.000000 H.264
1920x1080 29.970030 Animation
1920x1080 25.000000 AVC Coding
1920x1080 30.000000 AVC Coding
720x576 25.000000 DVCPRO - PAL
1280x720 24.000000 H.264
1920x1080 25.000000 Apple ProRes 4444
1920x1080 25 AVC Coding
1920x1080 25 AVC Coding
1280x720 25.000000 H.264
1280x720 24.000000 AVC Coding
1920x1080 24.000000 AVC Coding
1920x1080 25.000000 AVC Coding
1920x1080 25.000000 AVC Coding
1920x1280 25.000000 Apple ProRes 4444
1280x720 25 -
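
With a list this long, a tally per combination is easier to scan than the raw lines. This is a sketch that counts identical output lines, assuming the output of the one-liner above is piped in. count_formats is a hypothetical helper name:

```shell
#!/bin/sh
# Count identical lines read from stdin and print the most frequent first.
# uniq -c only merges adjacent duplicates, hence the sort in front of it.
count_formats() {
    sort | uniq -c | sort -rn
}
```

Usage would be to append | count_formats to the find command without the filename column. Lines are compared as plain text, so a frame rate printed as 25 and one printed as 25.000000 are counted separately.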

Thanks to Nate on Askubuntu and The Wolf on Unix Stackexchange, whose answers to someone else’s questions helped me figure out the correct syntax for this code.

Update 2019: I want to exclude dotfiles as well to have a cleaner result, so I use:

find * -not -path '*/\.*' \( -name '*mp4' -o -name '*mov' \) -exec exiftool -T -filename -imagesize -videoframerate -compressorname {} \;

Filmm’s answer on StackExchange helped me with the additional command.
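
Here is a variation on the find part, as a sketch. It matches the extensions case-insensitively (so .MOV files are included), requires the dot before the extension and only returns regular files. list_videos is a hypothetical helper name, and -iname and -not are extensions supported by GNU and BSD find rather than plain POSIX:

```shell
#!/bin/sh
# List video files below a directory (default: current directory),
# skipping dotfiles and anything inside dot-directories.
# The subshell keeps the cd from affecting the calling shell.
list_videos() {
    (
        cd "${1:-.}" || exit 1
        find . -type f -not -path '*/.*' \
            \( -iname '*.mp4' -o -iname '*.mov' \)
    )
}
```

The printed paths start with ./ and are relative to the directory passed in, because the helper changes into it before searching.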