Getting started with Scrapy on Mac OS X High Sierra

This post documents installing Scrapy on a MacBook Pro (13-inch, 2017, Four Thunderbolt 3 Ports) running macOS High Sierra. The TL;DR setup steps would be:

sudo easy_install pip
sudo pip install virtualenv
Create a directory to be used with Scrapy, create a virtualenv in it with virtualenv ENV and activate it with source ENV/bin/activate
pip install scrapy into this virtualenv
Get started with scrapy startproject craigslist by following a tutorial, e.g. scraping Craigslist
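
For copy-paste purposes, the steps above could be sketched as one script. This is a sketch and not what I literally typed: the PROJECT_DIR path and the DRY_RUN switch are my own additions, and by default the script only prints the commands instead of running them.

```shell
#!/bin/bash
# Sketch of the TL;DR steps. Prints each command by default;
# run with DRY_RUN=0 to actually execute them.
# PROJECT_DIR is a hypothetical location, adjust to taste.
DRY_RUN="${DRY_RUN:-1}"
PROJECT_DIR="${PROJECT_DIR:-$HOME/scrapy}"

# run CMD [ARGS...]: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

setup_scrapy() {
    run sudo easy_install pip
    run sudo pip install virtualenv
    run mkdir -p "$PROJECT_DIR"
    run cd "$PROJECT_DIR"
    run virtualenv ENV
    run source ENV/bin/activate
    run pip install scrapy
    run scrapy startproject craigslist
}

setup_scrapy
```

Printing first and executing second is a deliberate choice here: two of the steps use sudo, and I would rather read them once more before they run.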

The long, real-time documented version including mildly funny commentary can be found below.

I am on High Sierra, 10.13.5, nowadays and my laptop is so futuristic it has a touchy touch bar that no one has found a really good use for yet, and the buses are so fresh that no contemporary device can physically connect to it.
However, today I am trying to scrape some web with this amazing shiny piece of aluminium.
After having worked with lots of manual scraping techniques from the command line, like urllib2 and Beautiful Soup and whatnot, today I will give Scrapy a go.

The cute Scrapy spatula icon gives me hope that this “open source and collaborative framework for extracting the data you need from websites” really works “in a fast, simple, yet extensible way.” So here we go.

This computer is pretty much naked when it comes to any useful command line tools so to install Scrapy I need pip first.
sudo easy_install pip
Bummer: the fingerprint reader from the future doesn’t work on the command line and I actually need to type out my very safe and strong password. A couple of lines get printed in my window. So far so good.
pip install scrapy
~~A couple of lines dumped in my terminal suggest this is working out until I get some red warnings around pillow, nose and tornado. Dependencies, here we come…~~
~~sudo pip install pillow~~
~~sudo pip install nose~~
~~sudo pip install tornado~~
~~Unclear if that just worked? – No, not really.~~
~~Cannot uninstall 'six'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.~~

OK, fail, back to start. I take the time and actually read a bit more. It is recommended to install scrapy in a virtual environment. Well here we go.

sudo pip install virtualenv
So far so good, let’s just ignore the warnings for now.
Looks good. I have my project folder /Users/.../scrapy so this is where I create my virtual environment ENV with virtualenv ENV
I activate this environment with source ENV/bin/activate
Here we go. I am in my virtual environment. Hurrah. So what happens when I install Scrapy in here? pip install scrapy
My terminal joyously floods with words and vintagely appealing ASCII progress bars until we gleefully end up having installed all kinds of dubious packages by the names of Automat-0.7.0 PyDispatcher-2.0.5 PyHamcrest-1.9.0 Twisted-18.7.0 asn1crypto-0.24.0 attrs-18.1.0 cffi-1.11.5 constantly-15.1.0 cryptography-2.3 cssselect-1.0.3 enum34-1.1.6 functools32-3.2.3.post2 hyperlink-18.0.0 idna-2.7 incremental-17.5.0 ipaddress-1.0.22 lxml-4.2.3 parsel-1.5.0 pyOpenSSL-18.0.0 pyasn1-0.4.3 pyasn1-modules-0.2.2 pycparser-2.18 queuelib-1.5.0 scrapy-1.5.1 service-identity-17.0.0 six-1.11.0 w3lib-1.19.0 zope.interface-4.5.0
I follow the Getting Started tutorial scrapy startproject craigslist and end up getting a message that confirms the successful creation of the Craigslist Scrapy project.
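
Since my first attempt failed precisely because I was not in a virtual environment, a small guard before pip install would have saved me some red warnings. A minimal sketch, assuming the activate script exports the VIRTUAL_ENV variable (virtualenv does this); the function name in_virtualenv is my own:

```shell
# in_virtualenv: succeed only if a virtualenv is active in this shell.
# Relies on the VIRTUAL_ENV variable exported by ENV/bin/activate.
in_virtualenv() {
    [ -n "${VIRTUAL_ENV:-}" ]
}

if in_virtualenv; then
    echo "virtualenv active: $VIRTUAL_ENV"
else
    echo "no virtualenv active, pip install would not go into ENV" >&2
fi
```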

Working on a forked repo

I found this wonderful repository for adding Photos to a leaflet layer.

The original project by Github user turban appears abandoned but user HeikkiVesanto has forked the repo and added neat functionality, the use of a local directory for photos which allows a complete self-hosted solution (the original project used Instagram and the [also abandoned] Picasa).

To ensure longevity of my project I need a self-hostable solution. However, when I fork the repo (even when on the desired forked branch) I only get the original author’s work and not the more recent project. The network diagram shows that the version I forked is older and misses some work from other contributors.

[Screenshot: GitHub network graph before the pull, 2017-04-26]

So I forked the repo and then cloned it to my local machine.

After some trial and error it was a quick fix to actually switch my local repo to the other author’s more recent fork:
First I make sure I’m on the correct branch (in this case gh-pages):

git checkout gh-pages

Then pull the repo of the author:

git pull https://github.com/HeikkiVesanto/Leaflet.Photo gh-pages

and then push those back to my Github with

git push origin gh-pages

And

git status

happily returns

On branch gh-pages
Your branch is up-to-date with 'origin/gh-pages'.
nothing to commit, working tree clean

And the resulting graph shows that my project is now ahead of the original and in tune with the most recent changes:

[Screenshot: GitHub network graph after the push, 2017-04-26]
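
If I had to do this more than once, the checkout, pull and push could be wrapped in a function. This is a sketch under two assumptions of mine that are not in the steps above: the other fork gets registered as a named remote (any name will do), and the pull uses --ff-only, which only works as long as my fork has no commits of its own.

```shell
# sync_fork REMOTE_NAME REMOTE_URL BRANCH
# Fast-forward BRANCH to the state of another fork, then push it to origin.
sync_fork() {
    remote_name="$1"
    remote_url="$2"
    branch="$3"
    # Register the other fork once; ignore the error if it already exists.
    git remote add "$remote_name" "$remote_url" 2>/dev/null || true
    git checkout "$branch" &&
    git pull --ff-only "$remote_name" "$branch" &&
    git push origin "$branch"
}

# Usage (URL and branch from this post, the remote name "heikki" is made up):
# sync_fork heikki https://github.com/HeikkiVesanto/Leaflet.Photo gh-pages
```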

Quick folders from List

A text file with names à la

Ada Lovelace
Bill Gates
Steve Jobs

goes into a list of directories with

awk '{print $1 $2}' myfile | xargs mkdir

looking like

AdaLovelace
BillGates
SteveJobs

(No one likes spaces in file names anyways)
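
One caveat: the awk version only joins the first two words, so a name with three parts loses its tail. A sketch that strips all spaces and tabs from each line instead, whatever the number of words; the function name make_dirs_from_list is my own:

```shell
# make_dirs_from_list FILE: create one directory per non-empty line,
# with all spaces and tabs removed from the name.
make_dirs_from_list() {
    while IFS= read -r line || [ -n "$line" ]; do
        name=$(printf '%s' "$line" | tr -d ' \t')
        if [ -n "$name" ]; then
            mkdir -p -- "$name"
        fi
    done < "$1"
}

# Usage:
# make_dirs_from_list myfile
```

The `|| [ -n "$line" ]` part is there so a last line without a trailing newline is not skipped.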

Mac OS install error: could not create ‘/anything/for/you’: Permission denied

Happens with brew install, pip install etc… and then if I grumpily hammer into my terminal

 sudo brew install the-software-i-need-right-now

brew kindly replies

Error: Running Homebrew as root is extremely dangerous and no longer supported.
As Homebrew does not drop privileges on installation you would be giving all
build scripts full access to your system.

and I furiously stare at the screen. How do I get my permissions back? Keep calm and type

sudo chown -R `whoami`:admin /this/is/my/computer/
and/i/want/to/install/malware/in/this/directory/
and/macosx/cant/stop/me

Because I like danger
Phew.
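
Before unleashing a recursive chown it can be reassuring to see what it would actually touch. A minimal sketch, assuming a find that supports -user (both the macOS and GNU versions do); the function name list_not_mine is made up for this post:

```shell
# list_not_mine DIR: print every entry under DIR that the current user
# does not own, i.e. what a recursive chown to myself would change.
list_not_mine() {
    find "$1" ! -user "$(id -u)" -print
}

# Usage (the path is only an example):
# list_not_mine /usr/local | head
```

If this prints nothing, ownership is not the problem and the chown would not help anyway.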

PS. Dear WordPress. Line breaks in long rant-directory-names are uncool.

MongoDB Mac Quickstart

I wanted to test MongoDB and had no idea how to quickstart it.

Install

I used homebrew. So what I did was

brew update
brew install mongodb

Fine.

Start server

How do I use/start it? OK. I need two terminals open, one as server, one as console.

The server I started with

mongod --dbpath "$(pwd)/data/db"

after I manually created the data/db structure in my working/testing directory. No idea where the default of this structure should be but apparently that worked.
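
The manual mkdir and the server start fit into one small function. A sketch, assuming mongod is on the PATH after the brew install; the function name start_mongod_here is my own invention:

```shell
# start_mongod_here: create ./data/db if missing, then start mongod on it.
start_mongod_here() {
    dbpath="$(pwd)/data/db"
    mkdir -p "$dbpath" || return 1
    if command -v mongod >/dev/null 2>&1; then
        mongod --dbpath "$dbpath"
    else
        echo "mongod not found on PATH; only created $dbpath" >&2
        return 127
    fi
}

# Usage, from the working/testing directory:
# start_mongod_here
```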

Start console

The console in the other terminal starts with

mongo

and I create a database with

use dbname

So far so good. And now?

I actually want to connect to this database with Robomongo but the application just keeps crashing since I started server and console. Boo.

Quick EXIF look video format data one-liner

Ever wondered what format all the videos in your Video folder have?
I need this sort of information for video work to see whether I have all the footage in the right format (and can contact collaborators to re-export/re-shoot if not).
Looking through every single file by hand can be tiring when dealing with more than a handful of files,
so finally I managed to compile this one-liner that does all the work for me.

This script requires you to have ExifTool installed, which is a great tool anyways if you love image metadata.

find * \( -name '*mp4' -o -name '*mov' \) -exec exiftool -T -imagesize -videoframerate -compressorname {} \;

or with a filename

find * \( -name '*mp4' -o -name '*mov' \) -exec exiftool -T -filename -imagesize -videoframerate -compressorname {} \;

Without the filename, the output for one of my folders looks like this (uh oh, I spot some DV PAL there):

1920x1080 25.000000 H.264
1920x1080 25.000000 AVC Coding
1920x1080 25.000000 H.264
1920x1280 25.000000 Apple ProRes 4444
1920x1080 25.000000 AVC Coding
1920x1080 23.976024 H.264
1920x1080 24.000000 H.264
1920x1080 25 AVC Coding
1920x1080 25.000000 H.264
1920x1080 25.000000 H.264
1920x1080 23.976024 H.264
1920x1080 24 H.264
1920x1080 25.000000 AVC Coding
1920x1080 24.000000 AVC Coding
1920x1080 25.000000 AVC Coding
1920x1080 25.000000 H.264
1920x1080 25.000000 AVC Coding
1920x1080 25.000000 H.264
1920x1080 25 AVC Coding
1920x1080 25.000000 H.264
1920x1080 25.000000 AVC Coding
1920x1080 25.000000 Apple ProRes 4444
1920x1080 25 Apple ProRes 422
1920x1080 25.000000 H.264
1920x1080 25.000000 AVC Coding
1920x1080 24.000000 H.264
1920x1080 29.970030 AVC Coding
1920x1080 25.000000 AVC Coding
1920x1080 25.000000 H.264
1920x1080 25.000000 H.264
1920x1080 30.000000 H.264
1280x720 25.000000 H.264
1920x1080 25.000000 AVC Coding
1920x1080 25.000000 H.264
1920x1080 25.000000 H.264
1920x1080 25.000000 H.264
1920x1080 25.000000 AVC Coding
1920x1080 25.000000 AVC Coding
1920x1080 25.000000 H.264
1920x1080 29.970030 Animation
1920x1080 25.000000 AVC Coding
1920x1080 30.000000 AVC Coding
720x576 25.000000 DVCPRO - PAL
1280x720 24.000000 H.264
1920x1080 25.000000 Apple ProRes 4444
1920x1080 25 AVC Coding
1920x1080 25 AVC Coding
1280x720 25.000000 H.264
1280x720 24.000000 AVC Coding
1920x1080 24.000000 AVC Coding
1920x1080 25.000000 AVC Coding
1920x1080 25.000000 AVC Coding
1920x1280 25.000000 Apple ProRes 4444
1280x720 25 -
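
With this many files, a tally per distinct combination is easier to scan than the raw list. One way, sketched here, is to pipe the same exiftool output through sort and uniq -c; the function name tally_formats is mine:

```shell
# tally_formats: read exiftool -T lines on stdin and print each distinct
# line prefixed by how often it occurs, most frequent first.
tally_formats() {
    sort | uniq -c | sort -rn
}

# Usage:
# find * \( -name '*mp4' -o -name '*mov' \) -exec exiftool -T -imagesize -videoframerate -compressorname {} \; | tally_formats
```

The odd ones out (hello, DV PAL) then end up at the bottom of the list with a count of 1.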

Thanks to Nate on Askubuntu and The Wolf on Unix Stackexchange, whose answers to someone else’s questions helped me figure out the correct syntax for this code.

Update 2019: I want to exclude dotfiles as well to have a cleaner result, so I use:

find * -not -path '*/\.*' \( -name '*mp4' -o -name '*mov' \) -exec exiftool -T -filename -imagesize -videoframerate -compressorname {} \;

Filmm’s answer on StackExchange helped me with the additional command.

Homebrew – The missing package manager for OS X

https://github.com/mxcl/homebrew/
http://brew.sh/

merglindev

Category Command Lines