Every new year, I tell myself that this is the year I'm going to start blogging again. Sometimes, I even write posts saying "this year will be different," and then that ends up being my only post of the year.
I used to love blogging - I had a relatively popular blog on ScienceBlogs.com called "We, Beasties" - a not-so-subtle homage to Isaac Asimov and Paul de Kruif - which ended in 2013 when I got recruited to start an ill-fated blog at Scientific American.
What I do now is very different than what I was doing in 2013, but I'm always thinking I should get back into it. What better way to do that than to use my new skills to collect my old content?
"Scraping" old posts
Initially, I was expecting to have to do some complicated html parsing, crawling through each post and finding back links, but it turns out the old scienceblogs.com website is actually pretty well laid out.
First, I went to my author page, which lists all of my posts in paginated form. In the browser, I right-clicked on the page and selected "view source" to see the underlying html (which is how the page will be downloaded).
A few things jump out right away:
- The blocks that contain links to previous posts look like
<h5 class="field-content"><a href="/webeasties/2013/09/03/we-beasties-sproulates" hreflang="und">We, Beasties Sporulates</a></h5>
- The individual pages of posts all have the same url as the author page, ending in ?page=N, where N is a number in 0:7.
So really, all I need to do is download each of those pages, search for all of the webeasties urls, and then download all of those pages.
Pretty simple. The code in this post is all run with julia Version 1.6.0-DEV.1722 (2020-12-09) (but should work on any julia v1). Project files can be found here. In the code blocks below, I have a mix of script-like and REPL commands; the latter are just to easily show outputs. I do most of my julia coding and running using the VS Code julia extension.
Looping through author pages
Since each author page has the same url save one number, I just cycled through them in a loop of the numbers. In each case, I download the page into a temporary file, scan it for webeasties urls, and store them in a set.
Note: you should really never use Regex to parse an HTML file, but I'm not really parsing them here - I'm just looking for a simple pattern.
using Downloads: download
posturls = Set(String[])
baseurl = "https://scienceblogs.com/author/kbonham?page="
projectdir = normpath(pwd(), "_assets/literate/webeasties/")
isdir(projectdir) || mkpath(projectdir)
# for p in 0:7
let p = 0
    tmp = download(baseurl*string(p))
    # for line in eachline(tmp)
    for line in first(eachline(tmp), 5)
        for m in eachmatch(r"href=\"(/webeasties[\w/\-]+)\"", line)
            push!(posturls, m[1])
        end
    end
end
posturls
Set{String}()
length(posturls) # for the full set, this is 192
0
Explanation of regex:
- href=\"/webeasties : hopefully self-explanatory; the href looks specifically for links, and the " needs to be escaped.
- [\w/\-]+ : matches any number of word characters (a-z, A-Z, 0-9, and _), forward slashes, or dashes.
- \" : closes out the string.
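Just to illustrate (this snippet isn't part of the pipeline), here's that pattern run against the example link block from earlier:

```julia
# quick sanity check of the pattern against the example <h5> block shown above
sample = """<h5 class="field-content"><a href="/webeasties/2013/09/03/we-beasties-sproulates" hreflang="und">We, Beasties Sporulates</a></h5>"""
m = match(r"href=\"(/webeasties[\w/\-]+)\"", sample)
m[1] # "/webeasties/2013/09/03/we-beasties-sproulates"
```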
Downloading html pages
The next bit is using the same idea, except I suspected (correctly) that getting the parsing right would take me a few tries. So rather than fetch each page dynamically and parse it, I decided to save the html pages to a more permanent location.
In addition, I wanted to include the date and title in the file names, but not separate by directory, so I did a bit of parsing of the parent url to generate the new string with the date included.
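For example (just to show the naming scheme - this isn't part of the loop below), the example url from earlier turns into a file called 20130903_we-beasties-sproulates.html:

```julia
# illustrating the filename scheme with the example url from earlier
url = "/webeasties/2013/09/03/we-beasties-sproulates"
m = match(r"^/webeasties/(\d{4}/\d{2}/\d{2})/([\w/\-]+)$", url)
dt, title = m.captures
dt = replace(dt, '/'=>"")  # "20130903"
file = "$(dt)_$title.html" # "20130903_we-beasties-sproulates.html"
```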
Finally, I also put in a short sleep() to pause the loop so the site doesn't think it's being DDOS'ed. Not that ~200 requests is all that heavy a load, but it's an old site, and I didn't really notice the difference, since I could write what I did while I was waiting.
htmlout = joinpath(projectdir, "html_out")
isdir(htmlout) || mkdir(htmlout)
# this post wasn't available anymore, so I removed it
setdiff!(posturls, Set(["/webeasties/2010/12/26/weekend-review-all-about-the-g"]))
posturls
for url in posturls
    m = match(r"^/webeasties/(\d{4}/\d{2}/\d{2})/([\w/\-]+)$", url)
    isnothing(m) && error("url $url doesn't match")
    dt, title = m.captures
    dt = replace(dt, '/'=>"") # remove /, so eg 2013/01/25 becomes 20130125
    file = "$(dt)_$title.html"
    # skip files that already exist - I ran into some errors partway through and didn't want to re-download them
    isfile(joinpath(htmlout, file)) || download("https://scienceblogs.com" * url, joinpath(htmlout, file))
    sleep(0.1)
end
Alright, I have a bunch of html files, what now?
Parsing posts
h/t: For this section, I got a bunch of inspiration from here.
There's a bunch of schmutz in these html documents used for ads, SEO, and linking around the site, none of which I want. So my task in this section is to pull out just the main post content plus any other relevant info (like the title), and put them into a markdown file.
Here, I'm using the EzXML.jl package and its XPath query ability to parse and search my .html files (HTML is close enough to a flavor of XML for this to work).
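If you haven't seen XPath before, here's a tiny toy example of the kind of query I'll be doing (the XML string here is made up):

```julia
using EzXML

# toy example: "//title" finds every <title> node, wherever it is in the tree
toy = parsexml("""
<post>
  <meta><title>Hello, world</title></meta>
  <body>Some text</body>
</post>
""")
titles = findall("//title", root(toy))
first(titles).content # "Hello, world"
```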
A note on writing loops
When I'm going to do something in a loop like this, knowing it's going to take me a while to figure out exactly what to do, I'll often pull out some examples first. For example, knowing I have an array of html file paths, and that I'm going to loop through them with for p in paths, the first thing I do is pull out just one example to work on.
using EzXML
paths = readdir(htmlout, join=true)
p = first(paths) # for p in paths
post = readhtml(p)
doc = root(post)
# ...
# end
┌ Warning: XMLError: Tag section invalid from HTML parser (code: 801, line: 44)
└ @ EzXML ~/.julia/packages/EzXML/DL8na/src/error.jl:97
┌ Warning: XMLError: Tag header invalid from HTML parser (code: 801, line: 64)
└ @ EzXML ~/.julia/packages/EzXML/DL8na/src/error.jl:97
┌ Warning: XMLError: Tag section invalid from HTML parser (code: 801, line: 71)
└ @ EzXML ~/.julia/packages/EzXML/DL8na/src/error.jl:97
┌ Warning: XMLError: Tag nav invalid from HTML parser (code: 801, line: 116)
└ @ EzXML ~/.julia/packages/EzXML/DL8na/src/error.jl:97
⋮ (dozens more warnings like these, for section, article, mark, footer, and aside tags, omitted)
EzXML.Node()
This way, once I get it working on one instance, I can just delete the first line, uncomment the for loop bits, and then run it on the whole thing. I put in validation checks that throw errors if something violates my assumptions (like the fact that there should only be one content node and one title node), so that the loop will break and tell me where to look to fix my assumptions.
Getting stuff with XPath
My process here was not particularly fancy - I went to the first post on the web, copied the title and first couple of words of content, then went to the html file and searched for them. Happily, they appear in blocks that have unique selectors. Actually, the title is in two places:
<title>Why every "OMG we've cured cancer!!" article is about melanoma | ScienceBlogs</title>
<h1 class="page-header"><span>Why every "OMG we've cured cancer!!" article is about melanoma</span>
The <title> one is easier to grab, so I just went with that.
XPath lets you look for any node named "title", wherever it is.
Just to be sure there really is only one node,
I also made sure to check the length of the returned value.
title = findall("//title", doc) # could have done `findfirst` instead, but then couldn't check if it's unique
length(title) != 1 && error("Expected only 1 title block, got $(length(title)) from $p")
title = first(title) # get it out of the array
title.content
"Ebola Outbreak in Uganda - Both More and Less Frightening Than You Think | ScienceBlogs"
The post itself is wrapped in <div class="content">, which we can also easily find with XPath:
body = findall("//div[@class=\"content\"]", doc)
length(body) != 1 && error("Expected only 1 content block, got $(length(body)) from $p")
body = first(body)
body.content
"\n \n In case you missed it, over the past couple of days there have been reports of an outbreak of Ebola hemorrhagic fever virus in Uganda. As of this writing, the most recent report I've seen puts the death toll at 16, with a few other suspected cases. Ebola is terrifying for a number of reasons - it's readily transmissible, it has a remarkably high and rapid lethality (25-90% case fatality rate within days to weeks), and the way it kills is gruesome - causing massive bleeding from all orifices. There's no vaccine or cure.\nThe good news from an epidemiology standpoint though, is that Ebola is a kinda terrible pathogen in humans. It has a short incubation time, and infected people are obvious (what with the bleeding from their eyes). Further, it's not airborne, so you actually have to come into contact with bodily fluids from infected individuals to become infected yourself. Once an outbreak has been identified, health workers can take fairly simple precautions and dramatically reduce their risk of infection. What this means in a practical sense is that, after the first terrifying days of lots of sick people turning up, people know what to do to avoid infection, and the outbreak burns itself out. Since its discovery in 1976, Ebola has killed less than 2000 people, with the worst outbreak in 2000 killing 224. Contrast that with your garden variety influenza, which kills on average 40,000 people per year in the US alone, and hundreds of thousands more around the globe.\nRemember, the goal of a pathogen is not to kill you. Your death is an unintended side-effect of the real goal: replication and transmission. On this front, Ebola sucks. It's great at evading your immune system and replicating like crazy, but then it makes you terrifying to other potential hosts and incapacitates or kills you before you have a chance to spread it very far. The most successful pathogens strike a balance between replication and transmission. Influenza generally leaves you healthy enough to walk around and shake hands, press the elevator button at work, and use an ATM. Influenza doesn't want you dead - the only reason it kills so many people is that so many more people become infected. Successful pathogens that make most of their victims really really sick or dead have more efficient ways to be transmitted that don't require you to be healthy and mobile. Malaria has mosquitoes do the work of getting the infection mobile, cholera gives folks massive diarrhea to spread itself in water sources, Bacillus anthracis (anthrax) can form spores that survive time, heat and desiccation for decades while waiting for a new host.\nSo, Ebola is not really as scary as it seems. The really scary thing about this outbreak and others is that it reminds us that brand new diseases come out of previously isolated areas, and that globalization and urbanization means that the distance between a rural village in Africa and every major city on the planet is small and shrinking. This most recent outbreak of Ebola had victims in Kampala - the capital of Uganda. It's fairly easy to imagine an international traveler hoping a plane from Kampala to Paris and then... Well, Ebola probably wouldn't get much further for the reasons I already mentioned, but some other virus?\nNew infectious diseases generally pass into humans from animals, but they're poorly adapted for our immune systems. 
Generally, this means that they either don't replicate very well, or are difficult to transmit, or they're super deadly and wipe out their hosts too quickly to be transmitted. The trouble with cities is that high population density lowers the bar on transmission, allowing more virulent bugs to make it to the big time. And rapid global transit means that any local outbreak has the potential to cause a global pandemic.\n\n \n \n Tags\n \n Immune system\n Microbes always win\n \n \nLog in to post comments\n "
One issue with using the content here is that, in html, links are nested elements inside other elements, and in body.content that link markup is stripped out (only the text remains). For example, a phrase like I went to a seminar at TSRI that... in the content string is actually I went to a seminar at <a href="http://www.scripps.edu/e_index.html">TSRI </a>that... in the original html. But not to worry, those <a> links are just additional nodes in the original <div class="content"> node, so we can just find them with XPath.
Searching from the body node with // would still search the whole document, so we use .// to search just inside this node:
links = findall(".//a[@href]", body)
length(links)
9
But not all of those links are actually part of the post - it turns out there are some tags inside the body node that have their own links. So instead, I grab the first nested div, and then search inside that.
post = findfirst("./div", body)
links = findall(".//a[@href]", post)
length(links)
6
post.content # this ends up being a little nicer too
"In case you missed it, over the past couple of days there have been reports of an outbreak of Ebola hemorrhagic fever virus in Uganda. As of this writing, the most recent report I've seen puts the death toll at 16, with a few other suspected cases. Ebola is terrifying for a number of reasons - it's readily transmissible, it has a remarkably high and rapid lethality (25-90% case fatality rate within days to weeks), and the way it kills is gruesome - causing massive bleeding from all orifices. There's no vaccine or cure.\nThe good news from an epidemiology standpoint though, is that Ebola is a kinda terrible pathogen in humans. It has a short incubation time, and infected people are obvious (what with the bleeding from their eyes). Further, it's not airborne, so you actually have to come into contact with bodily fluids from infected individuals to become infected yourself. Once an outbreak has been identified, health workers can take fairly simple precautions and dramatically reduce their risk of infection. What this means in a practical sense is that, after the first terrifying days of lots of sick people turning up, people know what to do to avoid infection, and the outbreak burns itself out. Since its discovery in 1976, Ebola has killed less than 2000 people, with the worst outbreak in 2000 killing 224. Contrast that with your garden variety influenza, which kills on average 40,000 people per year in the US alone, and hundreds of thousands more around the globe.\nRemember, the goal of a pathogen is not to kill you. Your death is an unintended side-effect of the real goal: replication and transmission. On this front, Ebola sucks. It's great at evading your immune system and replicating like crazy, but then it makes you terrifying to other potential hosts and incapacitates or kills you before you have a chance to spread it very far. The most successful pathogens strike a balance between replication and transmission. Influenza generally leaves you healthy enough to walk around and shake hands, press the elevator button at work, and use an ATM. Influenza doesn't want you dead - the only reason it kills so many people is that so many more people become infected. Successful pathogens that make most of their victims really really sick or dead have more efficient ways to be transmitted that don't require you to be healthy and mobile. Malaria has mosquitoes do the work of getting the infection mobile, cholera gives folks massive diarrhea to spread itself in water sources, Bacillus anthracis (anthrax) can form spores that survive time, heat and desiccation for decades while waiting for a new host.\nSo, Ebola is not really as scary as it seems. The really scary thing about this outbreak and others is that it reminds us that brand new diseases come out of previously isolated areas, and that globalization and urbanization means that the distance between a rural village in Africa and every major city on the planet is small and shrinking. This most recent outbreak of Ebola had victims in Kampala - the capital of Uganda. It's fairly easy to imagine an international traveler hoping a plane from Kampala to Paris and then... Well, Ebola probably wouldn't get much further for the reasons I already mentioned, but some other virus?\nNew infectious diseases generally pass into humans from animals, but they're poorly adapted for our immune systems. 
Generally, this means that they either don't replicate very well, or are difficult to transmit, or they're super deadly and wipe out their hosts too quickly to be transmitted. The trouble with cities is that high population density lowers the bar on transmission, allowing more virulent bugs to make it to the big time. And rapid global transit means that any local outbreak has the potential to cause a global pandemic.\n"
What about those tags? They're inside the body node, in a <div class="field--item"> node. To get the tag, and do a bit of validation in case the organization is different, I decided to parse the link inside that div node. That is, take something like
<div class="field--item"><a href="/tag/other-uses-immune-system" hreflang="en">Other uses of the immune system</a></div>
and get the link "/tag/other-uses-immune-system", so that I could make sure it starts with /tag/.
function gettags(body)
    tags = String[]
    tagnodes = findall(".//div[@class=\"field--item\"]", body)
    for tagnode in tagnodes
        a = findfirst("./a[@href]", tagnode)
        link = first(attributes(a)).content # the href is the first attribute on these links
        m = match(r"^/tag/([\w\-]+)$", link)
        isnothing(m) && error("Expected tag, got $link")
        push!(tags, m.captures[1])
    end
    return tags
end
gettags(body)
2-element Vector{String}:
"immune-system"
"microbes-always-win"
Conclusion
So now we have all the pieces. I'll save converting the posts to markdown for the next post.
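Just to show how those pieces might fit together, here's a rough sketch of the full loop using only the functions developed above (the markdown conversion itself is a placeholder for next time):

```julia
# rough sketch assembling the steps above - the actual markdown conversion comes in the next post
for p in paths
    doc = root(readhtml(p))

    title = findall("//title", doc)
    length(title) != 1 && error("Expected only 1 title block, got $(length(title)) from $p")
    title = first(title)

    body = findall("//div[@class=\"content\"]", doc)
    length(body) != 1 && error("Expected only 1 content block, got $(length(body)) from $p")
    body = first(body)

    post = findfirst("./div", body)
    tags = gettags(body)

    # ... convert `post`, `title.content`, and `tags` to a markdown file (next post)
end
```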