Scraping the web with Python and XQuery

During a JAWS for Windows training, I was introduced to the Research It feature of that screen reader. Research It uses web scraping to make working with complex web pages easier: it extracts specific information from a website that does not offer an API. For instance, you can look up a word in an online dictionary, or quickly check the status of a delivery. Strictly speaking, this feature does not belong in a screen reader, but it is a very helpful tool to have at your fingertips.

Research It uses XQuery (actually, XQilla) to do all the heavy lifting. This also means that Research It rulesets are, in theory, usable on other platforms as well. I was immediately hooked, because I have always had a love for XPath. XQuery code reads as self-explanatory to me; I just like the syntax and semantics.

So I immediately checked out XQilla on Debian and found #821329 and #821330, which were promptly fixed by Tommi Vainikainen. Thanks to him for the really quick response!

Unfortunately, making xqilla:parse-html available and upgrading to the latest upstream version is not enough to use XQilla on Linux with the typical webpages out there. Xerces-C++, which is what XQilla uses to fetch web resources, does not support HTTPS URLs at the moment. I filed #821380 to ask for HTTPS support in Xerces-C to be enabled by default.

And even with HTTPS support enabled in Xerces-C, the xqilla:parse-html function (which is based on HTML Tidy) fails for a lot of real-world webpages I tried. Manually upgrading the six-year-old version of HTML Tidy in Debian to the latest from GitHub (tidy-html5, #810951) did not help much either.

Python to the rescue

XQuery is still a very nice language for extracting information from markup documents. XQilla just has a bit of a hard time dealing with the typical HTML documents out there. After all, it was designed to deal with well-formed XML documents.

So I decided to build myself a little wrapper around XQilla which fetches the web resources with the Python Requests package, and cleans the HTML document with BeautifulSoup (which uses lxml to do HTML parsing). The output of BeautifulSoup can apparently be passed to XQilla as the context document. This is a fairly crazy hack, but it works quite reliably so far.

Here is what one of my web scraping rules looks like:

from click import argument, group

@group()
def xq():
  """Web scraping for command-line users."""
  pass

@xq.group('github.com')
def github():
  """Quick access to github.com."""
  pass

@github.command('code_search')
@argument('language')
@argument('query')
def github_code_search(language, query):
  """Search for source code."""
  scrape(get='https://github.com/search',
         params={'l': language, 'q': query, 'type': 'code'})

The function scrape automatically determines the XQuery filename from the caller's function name. Here is what github_code_search.xq looks like:

declare function local:source-lines($table as node()*) as xs:string*
{
  for $tr in $table/tr return normalize-space(data($tr))
};

let $results := html//div[@id="code_search_results"]/div[@class="code-list"]
for $div in $results/div
let $repo := data($div/p/a[1])
let $file := data($div/p/a[2])
let $link := resolve-uri(data($div/p/a[2]/@href))
return (concat($repo, ": ", $file), $link, local:source-lines($div//table),
        "---------------------------------------------------------------")

That is all I need to implement a custom web scraping rule: a few lines of Python to specify how and where to fetch the website from, and an XQuery file that specifies how to mangle the document content.
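The filename lookup mentioned above can be sketched in isolation. This is only a minimal illustration, assuming that a rule function and its query file share a name; the helper xquery_path_for_caller and the directory name 'rules' are hypothetical and not part of xq.

```python
from inspect import currentframe
from os import path


def xquery_path_for_caller(base_dir):
  """Return base_dir/<name of the calling function>.xq (hypothetical helper)."""
  # f_back is the frame of whoever called this helper.
  caller = currentframe().f_back.f_code.co_name
  return path.join(base_dir, caller + '.xq')


def github_code_search():
  # The rule's own function name selects the query file.
  return xquery_path_for_caller('rules')


print(github_code_search())
```

The price of this convenience is that renaming a rule function silently changes which query file is looked up, so it may be worth keeping an explicit override such as the xquery_name parameter.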

And thanks to the Python click package, the various entry points of my web scraping script can easily be called from the command line.

Here is a sample invocation:

fx:~/xq% ./xq.py github.com
Usage: xq.py github.com [OPTIONS] COMMAND [ARGS]...

  Quick access to github.com.

Options:
  --help  Show this message and exit.

Commands:
  code_search  Search for source code.

fx:~/xq% ./xq.py github.com code_search Pascal '"debian/rules"'
prof7bit/LazPackager: frmlazpackageroptionsdeb.pas
https://github.com/prof7bit/LazPackager/blob/cc3e35e9bae0c5a582b0b301dcbb38047fba2ad9/frmlazpackageroptionsdeb.pas
230 procedure TFDebianOptions.BtnPreviewRulesClick(Sender: TObject);
231 begin
232 ShowPreview('debian/rules', EdRules.Text);
233 end;
234
235 procedure TFDebianOptions.BtnPreviewChangelogClick(Sender: TObject);
---------------------------------------------------------------
prof7bit/LazPackager: lazpackagerdebian.pas
https://github.com/prof7bit/LazPackager/blob/cc3e35e9bae0c5a582b0b301dcbb38047fba2ad9/lazpackagerdebian.pas
205 + 'mv ../rules debian/' + LF
206 + 'chmod +x debian/rules' + LF
207 + 'mv ../changelog debian/' + LF
208 + 'mv ../copyright debian/' + LF
---------------------------------------------------------------

For the impatient, here is the implementation of scrape:

from bs4 import BeautifulSoup
from bs4.element import Doctype, ResultSet
from inspect import currentframe
from itertools import chain
from os import path
from os.path import abspath, dirname
from subprocess import PIPE, run
from tempfile import NamedTemporaryFile

import requests

def scrape(get=None, post=None, find_all=None,
           xquery_name=None, xquery_vars={}, **kwargs):
  """Execute an XQuery file.
  When either get or post is specified, fetch the resource and run it through
  BeautifulSoup, passing it as context to the XQuery.
  If find_all is given, wrap the result of executing find_all on
  the BeautifulSoup in an artificial HTML body.
  If xquery_name is not specified, the caller's function name is used.
  xquery_name combined with extension ".xq" is searched in the directory
  where this Python script resides and executed with XQilla.
  kwargs are passed to get or post calls.  Typical extra keywords would be:
  params -- To pass extra parameters to the URL.
  data -- For HTTP POST.
  """

  response = None
  url = None
  context = None

  if get is not None:
    response = requests.get(get, **kwargs)
  elif post is not None:
    response = requests.post(post, **kwargs)

  if response is not None:
    response.raise_for_status()
    context = BeautifulSoup(response.text, 'lxml')
    # Drop a leading DOCTYPE declaration, if there is one.
    dtd = next(context.descendants, None)
    if isinstance(dtd, Doctype):
      dtd.extract()
    if find_all is not None:
      context = context.find_all(find_all)
    url = response.url

  if xquery_name is None:
    xquery_name = currentframe().f_back.f_code.co_name
  cmd = ['xqilla']
  if context is not None:
    if type(context) is BeautifulSoup:
      soup = context
      context = NamedTemporaryFile(mode='w')
      print(soup, file=context)
      context.flush()  # make sure xqilla sees the complete document
      cmd.extend(['-i', context.name])
    elif isinstance(context, (list, ResultSet)):
      tags = context
      context = NamedTemporaryFile(mode='w')
      print('<html><body>', file=context)
      for item in tags: print(item, file=context)
      print('</body></html>', file=context)
      context.flush()
      cmd.extend(['-i', context.name])
  cmd.extend(chain.from_iterable(['-v', k, v] for k, v in xquery_vars.items()))
  if url is not None:
    cmd.extend(['-b', url])
  cmd.append(abspath(path.join(dirname(__file__), xquery_name + ".xq")))

  output = run(cmd, stdout=PIPE).stdout.decode('utf-8')
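  # NamedTemporaryFile is a factory function, not a class, so a type() check
  # against it never matches.  At this point context is either None or the
  # temporary file created above.
  if context is not None: context.close()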

  print(output, end='')
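The way scrape assembles the xqilla command line can be pulled out into a small pure function, which makes it easy to inspect without xqilla installed. This is a sketch, not the code used by xq: the name build_xqilla_command and its parameters are hypothetical, and the flags (-i for the context document, -v for a variable binding, -b for the base URI) are simply the ones that appear in scrape above.

```python
from itertools import chain


def build_xqilla_command(query_file, context_file=None, base_uri=None,
                         variables=None):
  """Assemble the argument list for an xqilla call (hypothetical helper)."""
  cmd = ['xqilla']
  if context_file is not None:
    cmd.extend(['-i', context_file])  # context document
  # Each external variable becomes "-v name value".
  cmd.extend(chain.from_iterable(
    ['-v', name, value] for name, value in (variables or {}).items()))
  if base_uri is not None:
    cmd.extend(['-b', base_uri])  # base URI, as passed by scrape
  cmd.append(query_file)
  return cmd


cmd = build_xqilla_command('github_code_search.xq',
                           context_file='/tmp/page.html',
                           base_uri='https://github.com/search',
                           variables={'lang': 'Pascal'})
print(cmd)
```

Passing the response URL as the base URI is presumably what allows resolve-uri in the query to turn relative links into absolute ones.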

The full source for xq can be found on GitHub. The project is just two days old, so I have only implemented three scraping rules as of now. However, adding new rules has been made deliberately easy, so that I can just write up a few lines of code whenever I find something on the web which I'd like to scrape on the command line. If you find this "framework" useful, make sure to share your insights with me. And if you implement your own scraping rules for a public service, consider sharing them as well.

If you have any comments or questions, send me mail. Oh, and by the way, I am now also on Twitter as @blindbird23.
