Processing Wikipedia Pages, Part 1


By Nick

  • Global Timeline
  • Tags: encoding, global timeline, python, scraping, scripting
  • 13 Mar

As part of a larger project (Global Timeline), I set out to write a script to scrape Wikipedia pages for events, dates, and locations for an initial set of data. There were several major steps involved, each bringing its own challenge.

Gathering Pages

Which Pages?

The first step was to find a way to programmatically retrieve lists of events that I could parse into the format my project needs. Originally, I planned to use Wikipedia's births and deaths category pages for each year (e.g. https://en.wikipedia.org/wiki/Category:1904_births). That approach would have added unnecessary complications: it pulls in far too much data (I don't necessarily want every single individual with a Wikipedia page in my database), and it would require extra parsing steps, since a name is all those lists provide. I then found that Wikipedia has a summary page for each year covering major events, births, deaths, and other information. Since these pages already organize the entries in each category by date and include only the most important events and influential people, I decided to build my script around that layout.
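The year pages list each entry as a bulleted line starting with a linked date. As a rough sketch of the kind of parsing this layout allows, here is a regular expression pulling dates and event text out of such wikitext (the sample markup below is invented for illustration, and the en dash separator is an assumption; real pages vary in formatting):

```python
import re

# Invented sample in the style of a year page's "Events" section wikitext.
sample = """== Events ==
* [[March 13]] – An example event involving [[Some Article]].
* [[July 30]] – Another example event.
"""

# Each bullet: a linked date, an en dash, then the event text.
event_re = re.compile(r'^\*\s*\[\[([^\]]+)\]\]\s*–\s*(.+)$', re.MULTILINE)

for date, event in event_re.findall(sample):
    print(date, '->', event)
```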

Retrieval and Storage

To retrieve each page, I used pywikibot and created a short, single-use script:

import codecs

import pywikibot

site = pywikibot.Site()  # defaults to the wiki configured in user-config.py
for year in range(1900, 1999):  # 1900 through 1998
    page = pywikibot.Page(site, str(year))

    # Save the raw wikitext, replacing any characters that fail to encode.
    f = codecs.open('../../pages/input/' + str(year) + '.txt',
                    encoding='utf-8', mode='w+', errors='replace')
    f.write(page.text)
    f.close()
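As an aside, codecs.open predates Python 3's built-in open(), which accepts the same encoding and errors parameters and closes the file automatically when used as a context manager. A minimal equivalent (the path and text here are placeholders, not the script's real inputs):

```python
import os
import tempfile

page_text = 'Sample wikitext for the 1904 page…'

# Python 3's built-in open() takes the same encoding/errors parameters
# as codecs.open; the with-block closes the file automatically.
path = os.path.join(tempfile.gettempdir(), '1904.txt')
with open(path, mode='w', encoding='utf-8', errors='replace') as f:
    f.write(page_text)
```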

When I started testing this script, the encoding of the Wikipedia pages caused problems both when saving the text to a file and when displaying it in my console. To solve this, I decided to open the files with the matching utf-8 encoding and replace any characters that fail to encode, rather than letting the write raise an error.
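The errors='replace' handler is what prevents the crash: instead of raising UnicodeEncodeError on a character the target encoding cannot represent, Python substitutes a replacement character. A small illustration, using ASCII as a stand-in for a narrow target encoding (a console with a narrow default encoding fails the same way):

```python
text = 'Erdős–Rényi'  # article text frequently contains non-ASCII characters

# With the default errors='strict', a narrow target encoding raises:
try:
    text.encode('ascii')
except UnicodeEncodeError:
    failed = True

# errors='replace' substitutes '?' for each unencodable character instead:
print(text.encode('ascii', errors='replace').decode('ascii'))  # Erd?s?R?nyi
```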

Now that I have the pages I want to scrape data from, I can begin manipulating the text into a format better suited for my needs.
