During a JAWS for Windows training, I was introduced to the Research It feature of that screen reader. Research It is a quick way to use web scraping to make working with complex web pages easier: it extracts specific information from a website that does not offer an API. For instance, look up a word in an online dictionary, or quickly check the status of a delivery. Strictly speaking, this feature does not belong in a screen reader, but it is a very helpful tool to have at your fingertips.
Research It uses XQuery (actually, XQilla) to do all the heavy lifting. This means that the Research It Rulesets are theoretically usable on other platforms as well. I was immediately hooked, because I have always had a love for XPath. XQuery code reads as totally self-explanatory to me; I just like the syntax and semantics.
So I immediately checked out XQilla on Debian and found #821329 and #821330, which were promptly fixed by Tommi Vainikainen. Thanks to him for the really quick response!
Unfortunately, making xqilla:parse-html available and upgrading to the latest upstream version is not enough to use XQilla on Linux with the typical webpages out there. Xerces-C++, which is what XQilla uses to fetch web resources, does not support HTTPS URLs at the moment. I filed #821380 to ask for HTTPS support in Xerces-C to be enabled by default.
And even with HTTPS support enabled in Xerces-C, the xqilla:parse-html function (which is based on HTML Tidy) fails for a lot of real-world webpages I tried. Manually upgrading the six-year-old version of HTML Tidy in Debian to the latest from GitHub (tidy-html5, #810951) did not help much either.
Python to the rescue
XQuery is still a very nice language for extracting information from markup documents. XQilla just has a bit of a hard time dealing with the typical HTML documents out there. After all, it was designed to deal with well-formed XML documents.
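To see why that is a problem, here is a tiny illustration (not part of xq, using Python's strict standard-library XML parser as a stand-in for the strictness of an XML toolchain): typical real-world markup with unclosed tags is simply rejected as not well-formed.

from xml.etree import ElementTree

# Typical real-world "tag soup": unclosed <li> and <br> elements.
tag_soup = '<ul><li>one<li>two<br></ul>'

try:
    ElementTree.fromstring(tag_soup)
except ElementTree.ParseError as error:
    print('not well-formed:', error)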
So I decided to build a little wrapper around XQilla which fetches the web resources with the Python Requests package, and cleans the HTML document with BeautifulSoup (which uses lxml to do HTML parsing). The output of BeautifulSoup can apparently be passed to XQilla as the context document. This is a fairly crazy hack, but it works quite reliably so far.
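For the curious, here is a minimal sketch of that pipeline, stripped of all the bells and whistles of the real scrape function shown further below. The URL and the query file name extract.xq are placeholders, and it assumes the xqilla binary is installed:

from subprocess import PIPE, run
from tempfile import NamedTemporaryFile

import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com/')  # placeholder URL
response.raise_for_status()

# lxml turns the tag soup into well-formed markup.
soup = BeautifulSoup(response.text, 'lxml')

with NamedTemporaryFile(mode='w') as cleaned:
    print(soup, file=cleaned)
    cleaned.flush()
    # Hand the cleaned document to XQilla as the context item (-i)
    # and set the base URI (-b) so relative links can be resolved.
    result = run(['xqilla', '-i', cleaned.name, '-b', response.url,
                  'extract.xq'],  # placeholder query file
                 stdout=PIPE)

print(result.stdout.decode('utf-8'), end='')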
Here is what one of my web scraping rules looks like:
from click import argument, group

@group()
def xq():
    """Web scraping for command-line users."""
    pass

@xq.group('github.com')
def github():
    """Quick access to github.com."""
    pass

@github.command('code_search')
@argument('language')
@argument('query')
def github_code_search(language, query):
    """Search for source code."""
    scrape(get='https://github.com/search',
           params={'l': language, 'q': query, 'type': 'code'})
The scrape function automatically determines the XQuery filename from the caller's function name. Here is what github_code_search.xq looks like:
declare function local:source-lines($table as node()*) as xs:string*
{
for $tr in $table/tr return normalize-space(data($tr))
};
let $results := html//div[@id="code_search_results"]/div[@class="code-list"]
for $div in $results/div
let $repo := data($div/p/a[1])
let $file := data($div/p/a[2])
let $link := resolve-uri(data($div/p/a[2]/@href))
return (concat($repo, ": ", $file), $link, local:source-lines($div//table),
"---------------------------------------------------------------")
That is all I need to implement a custom web scraping rule: a few lines of Python to specify how and where to fetch the website from, and an XQuery file that specifies how to mangle the document content.
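The caller-name lookup mentioned above boils down to a little frame introspection with the inspect module. Here is a standalone illustration of the trick (function names are just for demonstration):

from inspect import currentframe

def query_name():
    # f_back is the frame of whoever called us; co_name is that function's name.
    return currentframe().f_back.f_code.co_name + '.xq'

def github_code_search():
    return query_name()

print(github_code_search())  # prints: github_code_search.xq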
And thanks to the Python click package, the various entry points of my web scraping script can easily be called from the command-line.
Here is a sample invocation:
fx:~/xq% ./xq.py github.com
Usage: xq.py github.com [OPTIONS] COMMAND [ARGS]...
Quick access to github.com.
Options:
--help Show this message and exit.
Commands:
code_search Search for source code.
fx:~/xq% ./xq.py github.com code_search Pascal '"debian/rules"'
prof7bit/LazPackager: frmlazpackageroptionsdeb.pas
https://github.com/prof7bit/LazPackager/blob/cc3e35e9bae0c5a582b0b301dcbb38047fba2ad9/frmlazpackageroptionsdeb.pas
230 procedure TFDebianOptions.BtnPreviewRulesClick(Sender: TObject);
231 begin
232 ShowPreview('debian/rules', EdRules.Text);
233 end;
234
235 procedure TFDebianOptions.BtnPreviewChangelogClick(Sender: TObject);
---------------------------------------------------------------
prof7bit/LazPackager: lazpackagerdebian.pas
https://github.com/prof7bit/LazPackager/blob/cc3e35e9bae0c5a582b0b301dcbb38047fba2ad9/lazpackagerdebian.pas
205 + 'mv ../rules debian/' + LF
206 + 'chmod +x debian/rules' + LF
207 + 'mv ../changelog debian/' + LF
208 + 'mv ../copyright debian/' + LF
---------------------------------------------------------------
For the impatient, here is the implementation of `scrape`:
from bs4 import BeautifulSoup
from bs4.element import Doctype, ResultSet
from inspect import currentframe
from itertools import chain
from os import path
from os.path import abspath, dirname
from subprocess import PIPE, run
from tempfile import NamedTemporaryFile
import requests
def scrape(get=None, post=None, find_all=None,
           xquery_name=None, xquery_vars={}, **kwargs):
    """Execute a XQuery file.

    When either get or post is specified, fetch the resource and run it through
    BeautifulSoup, passing it as context to the XQuery.

    If find_all is given, wrap the result of executing find_all on
    the BeautifulSoup in an artificial HTML body.

    If xquery_name is not specified, the caller's function name is used.
    xquery_name combined with extension ".xq" is searched in the directory
    where this Python script resides and executed with XQilla.

    kwargs are passed to get or post calls.  Typical extra keywords would be:

    params -- To pass extra parameters to the URL.
    data -- For HTTP POST.
    """
    response = None
    url = None
    context = None

    if get is not None:
        response = requests.get(get, **kwargs)
    elif post is not None:
        response = requests.post(post, **kwargs)

    if response is not None:
        response.raise_for_status()
        context = BeautifulSoup(response.text, 'lxml')
        # Drop the doctype so only the cleaned markup is passed on.
        dtd = next(context.descendants)
        if type(dtd) is Doctype:
            dtd.extract()
        if find_all is not None:
            context = context.find_all(find_all)
        url = response.url

    if xquery_name is None:
        xquery_name = currentframe().f_back.f_code.co_name

    cmd = ['xqilla']

    if context is not None:
        if type(context) is BeautifulSoup:
            soup = context
            context = NamedTemporaryFile(mode='w')
            print(soup, file=context)
            context.flush()  # make sure the document is on disk before XQilla reads it
            cmd.extend(['-i', context.name])
        elif isinstance(context, list) or isinstance(context, ResultSet):
            # Wrap the matched tags in an artificial HTML body.
            tags = context
            context = NamedTemporaryFile(mode='w')
            print('<html><body>', file=context)
            for item in tags:
                print(item, file=context)
            print('</body></html>', file=context)
            context.flush()
            cmd.extend(['-i', context.name])

    cmd.extend(chain.from_iterable(['-v', k, v] for k, v in xquery_vars.items()))

    if url is not None:
        cmd.extend(['-b', url])

    cmd.append(abspath(path.join(dirname(__file__), xquery_name + ".xq")))

    output = run(cmd, stdout=PIPE).stdout.decode('utf-8')

    # Remove the temporary file, if any.
    if context is not None:
        context.close()

    print(output, end='')
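The find_all parameter is not used by the rule shown above. As a purely hypothetical example (the command name, the h2 selector and the corresponding headlines.xq are made up for illustration), such a rule would let BeautifulSoup pre-filter the document so that the XQuery only sees an artificial HTML body containing the matched tags:

@xq.command('headlines')
@argument('url')
def headlines(url):
    """List the second-level headings of an arbitrary page."""
    # scrape() wraps the find_all result in <html><body>...</body></html>
    # and will look for headlines.xq next to this script.
    scrape(get=url, find_all='h2')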
The full source for xq can be found on GitHub. The project is just two days old, so I have only implemented three scraping rules as of now. However, adding new rules has been made deliberately easy, so that I can just write up a few lines of code whenever I find something on the web that I'd like to scrape from the command line. If you find this “framework” useful, make sure to share your insights with me. And if you implement your own scraping rules for a public service, consider sharing that as well.
If you have any comments or questions, send me mail. Oh, and by the way, I am now also on Twitter as @blindbird23.