Nick Cassady

Processing Wikipedia Pages, Part 2


By Nick

  • Global Timeline
  • Tags: global timeline, python, regex, scraping, scripting
  • 14 Mar

Before reading this, I recommend you read part 1 here.

File Handling

Now that I have Wikipedia pages downloaded, I need to parse through the text and markup to get just the information I need for my other project. The first step in parsing through all these files was to open every file in a specified input directory. Due to the structure of my project files, I set up these directory paths:

inputPath = '../pages/input/'
inputBackupPath = '../pages/input-finished/'
outputPath = '../pages/output/'

Now I need to loop through all the files, process each one, and then close it:

import codecs
import os
from os import listdir
from os.path import isfile, join
from pathlib import Path

files = [f for f in listdir(inputPath) if isfile(join(inputPath, f))]
for file in files:
	outputFile = codecs.open(outputPath + file, encoding='utf-8', mode='w', errors='replace')
	contents = Path(inputPath + file).read_text(encoding='utf-8')

	[...]

	outputFile.close()
	os.rename(inputPath + file, inputBackupPath + file)

Separating the sections

Looking through the raw Wikipedia files, I noticed there were two primary types of sections that I wanted to focus on. The first is called

== Events ==

and the others are

== Births ==

and

== Deaths ==

Planning ahead, I know I’m going to need to process the two types of sections differently, so I go ahead and write a regex to grab each section as a whole and take note of which section I am working in:

def processSection(name, nextName):
	# Capture everything between this section's heading and the next one.
	regexString = '== ?' + name + ' ?==(.*?)== ?' + nextName + ' ?=='
	sectionText = re.search(regexString, contents, re.S)
	if sectionText:
		output = parseItems(sectionText, name)
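A self-contained sketch of what that regex captures, run against a made-up page snippet (`sectionBody` is my stand-in name here; the original operates on the `contents` global):

```python
import re

# Invented sample text in the shape of a Wikipedia year page.
contents = """Intro text.
== Events ==
* [[April 7]] - [[Mount Vesuvius]] erupts.
== Births ==
* [[February 4]] - Someone is born.
== Deaths ==
"""

def sectionBody(name, nextName, text):
    # Same pattern idea as processSection, but returning the captured body.
    m = re.search('== ?' + name + ' ?==(.*?)== ?' + nextName + ' ?==', text, re.S)
    return m.group(1) if m else None

print(sectionBody('Events', 'Births', contents))
```

Each heading acts as the end marker for the section before it, so the Births line never leaks into the Events body.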

Formatting each event

Ideally, after my script is done, I would like to have one event per line, with only the date, description, and coordinates, so that I can easily add these events to a database to be used by Global Timeline. Looking at the markup inside each section, I see two different scenarios to look for. The first is when only one event, birth, or death is listed for a particular date in that year:

* [[April 7]] – [[Mount Vesuvius]] erupts, and devastates [[Naples]].

Otherwise, if multiple events happened on the same day, it appears like this:

* [[February 4]]
** [[Dietrich Bonhoeffer]], German religious, resistance leader (d. [[1945]])
** [[Clyde Tombaugh]], American astronomer (d. [[1997]])

To handle these two scenarios, I decided to have my script process all of the single-date events first and save the multi-event dates for later.

def parseItems(itemList, sectionName):
	outputitems = ""
	unprocessedLines = []

	items = re.findall(r'\*(.*)', itemList.group())
	for item in items:
		itemGroups = re.search(r'^ ?\[\[(.*? \d{1,2})\]\] ?–(.*)', item)
		if itemGroups:
			date = itemGroups.group(1)
			eventDescription = removeLinks(itemGroups.group(2))
		else:
			# Multi-event dates are saved and handled after this loop.
			unprocessedLines.append(item)

With this, I now have a date and description of the event. I also included a small helper function to remove the ugly double brackets denoting a link.

def removeLinks(line):
	# Strip the [[ and ]] wiki-link delimiters.
	return re.sub(r'\[\[|\]\]', '', line)
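One caveat: piped links like `[[Kingdom of Naples|Naples]]` keep their pipe and link target after that substitution. A slightly longer variant (my own addition, not from the original script) handles those too:

```python
import re

def removeLinksWithPipes(line):
    # [[target|label]] -> label, then [[target]] -> target
    line = re.sub(r'\[\[[^\]|]*\|([^\]]*)\]\]', r'\1', line)
    return re.sub(r'\[\[|\]\]', '', line)

print(removeLinksWithPipes('[[Kingdom of Naples|Naples]] and [[April 7]]'))
# → Naples and April 7
```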

Now I just need coordinates to add to the event. The first scenario I have to deal with is general events. In these descriptions, there is often a location name within the text that I can use to find coordinates. To make this easier, I used the GeoText library, which takes a string and returns the names of any cities and countries it finds. Because strings will sometimes contain multiple locations, I decided to find coordinates for each place mentioned and save them all for later use.

if sectionName == 'Events':
	allPlaces = []
	location = []
	places = GeoText(eventDescription)
	if places.cities:
		allPlaces += places.cities
	if places.countries:
		allPlaces += places.countries

Now I need to turn these place names into coordinates and format those coordinates for my purposes. I did this using another library called geopy, essentially as a wrapper for Nominatim, a tool to search through OpenStreetMap data.

if allPlaces:
	locationString = '('
	for place in allPlaces:
		coordinates = getCoordinates(place)
		if coordinates not in locationString:
			locationString += '[' + coordinates + ']'
	locationString += ')'
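With a stub in place of the real lookup, the loop above produces a string like the following (coordinates invented). Note that duplicates, and places the lookup can't resolve, are both skipped by the `not in` check, since an empty string is a substring of anything:

```python
# Stub standing in for the Nominatim-backed getCoordinates; values invented.
fake = {'Naples': '40.83, 14.25', 'Italy': '41.87, 12.56'}

def getCoordinates(place):
    return fake.get(place, '')

allPlaces = ['Naples', 'Italy', 'Naples', 'Atlantis']  # duplicates and misses happen
locationString = '('
for place in allPlaces:
    coordinates = getCoordinates(place)
    if coordinates not in locationString:
        locationString += '[' + coordinates + ']'
locationString += ')'

print(locationString)  # → ([40.83, 14.25][41.87, 12.56])
```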

Because Nominatim requires that requests not be sent more than once per second, and asks that users not spam the API with repeated requests for the same location, I set up a local MongoDB server to store place names and their associated coordinates, which I check before sending an API request. This also has the added benefit of not having to wait a full second for every set of coordinates.

def getCoordinates(placeName):
	# Newer geopy versions require a user_agent; the name here is arbitrary.
	geolocator = Nominatim(user_agent='global-timeline')
	storedLocation = locations.find_one({'name': placeName})

	if storedLocation:
		print("Location (%s) found in Mongo." % placeName)
		logFile.write("Location (%s) found in Mongo.\n" % placeName)
		location = storedLocation['coordinates'].split(', ')
	elif placeName.strip():
		print("Location (%s) not found, sleeping for 1 second." % placeName)
		logFile.write("Location (%s) not found, sleeping for 1 second.\n" % placeName)
		time.sleep(1)
		geocodeLocation = geolocator.geocode(placeName)
		if not geocodeLocation:
			return ''
		location = [str(geocodeLocation.latitude), str(geocodeLocation.longitude)]
		locations.insert_one({
			'name': placeName,
			'coordinates': location[0] + ', ' + location[1]
		})
	else:
		return ''  # empty place name: nothing to look up

	return location[0] + ', ' + location[1]
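The caching pattern boils down to check, store, reuse. Here is the same logic with plain dicts standing in for both the MongoDB collection and Nominatim (all names and coordinates are illustrative):

```python
import time

cache = {}                                      # stands in for the Mongo collection
fakeGeocoder = {'Naples': ('40.83', '14.25')}   # stands in for Nominatim

def getCoordinatesCached(placeName):
    if placeName in cache:
        return cache[placeName]        # cache hit: no API call, no sleep
    if not placeName.strip():
        return ''
    time.sleep(1)                      # respect the one-request-per-second limit
    result = fakeGeocoder.get(placeName)
    if not result:
        return ''
    coords = result[0] + ', ' + result[1]
    cache[placeName] = coords          # store so the next lookup is instant
    return coords
```

The first lookup of a place pays the one-second sleep; every later lookup of the same name returns immediately from the cache.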

Once all of the single-event dates are processed this way, I then parse through all of the multi-event dates. Getting the date and each event description is slightly different than the process above, but the rest remains largely unchanged.

date = ""
for line in unprocessedLines:
	dateObj = re.search(r'^ \[\[(.* \d{1,2})\]\]', line)
	if dateObj:
		date = dateObj.group(1)
	else:
		itemGroups = re.search(r'\*(.*)', line)
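Run on the [[February 4]] sample from earlier, this two-pass logic pairs each ** line with the date line above it. A self-contained sketch (`events` is my name for the collected pairs; the lines look the way the earlier findall leaves them, with the date line keeping its leading space and multi-event lines keeping one leading *):

```python
import re

unprocessedLines = [
    ' [[February 4]]',
    '* [[Dietrich Bonhoeffer]], German theologian (d. [[1945]])',
    '* [[Clyde Tombaugh]], American astronomer (d. [[1997]])',
]

events = []
date = ""
for line in unprocessedLines:
    dateObj = re.search(r'^ \[\[(.* \d{1,2})\]\]', line)
    if dateObj:
        date = dateObj.group(1)        # remember the shared date
    else:
        itemGroups = re.search(r'\*(.*)', line)
        if itemGroups:
            events.append((date, itemGroups.group(1).strip()))

print(events[0][0])  # → February 4
```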

Getting birth and death locations

Because each line in the births and deaths sections only has the individual’s name and a brief description, I need to load their Wikipedia page, like I did with the year summaries, and parse out the locations where they were born and died. The process is pretty similar to what I did with the individual events, but part of it does involve running another pywikibot script.

name = sys.argv[1].strip()
site = pywikibot.Site()
page = pywikibot.Page(site, name)

text = page.text
birthPlace = re.search(r'\| ?birth_place *?= ?\[?\[?(.*?)[\[\]\|]', text)
deathPlace = re.search(r'\| ?death_place *?= ?\[?\[?(.*?)[\[\]\|]', text)

if birthPlace or deathPlace:
	newPerson = {
		'name': name,
		'birthPlace': birthPlace.group(1) if birthPlace else '',
		'deathPlace': deathPlace.group(1) if deathPlace else ''
	}
	client = MongoClient()
	db = client.location_cache
	people = db.people

	result = people.insert_one(newPerson)
else:
	f = codecs.open('../pages/people/' + sys.argv[1].strip().replace(' ', '_') + '.txt', encoding='utf-8', mode='w+', errors='replace')
	f.write(page.text)
	f.close()
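To illustrate what the infobox regexes pull out, here they are applied to an invented snippet in the shape those fields usually take:

```python
import re

# Invented infobox text in the shape the regexes above expect.
text = """{{Infobox person
| name        = Example Person
| birth_place = [[Breslau]], [[German Empire]]
| death_place = [[Berlin]], Germany
}}"""

birthPlace = re.search(r'\| ?birth_place *?= ?\[?\[?(.*?)[\[\]\|]', text)
deathPlace = re.search(r'\| ?death_place *?= ?\[?\[?(.*?)[\[\]\|]', text)

print(birthPlace.group(1), '/', deathPlace.group(1))  # → Breslau / Berlin
```

The lazy capture stops at the first bracket or pipe, so only the first place name in the field is kept.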

The process of finding and saving where a person was born and died was pretty straightforward at this point. I also added some extra code for the event that I was unable to find the necessary information. When that is the case, I save the page to my machine, giving the file the same name as the individual, for later reference so I can improve my script.

def getPersonBirthAndDeathCoordinates(name):
	storedPerson = people.find_one({'name': name})
	birthCoordinates = ""
	deathCoordinates = ""
	if storedPerson:
		birthCoordinates = getCoordinates(storedPerson['birthPlace'])
		deathCoordinates = getCoordinates(storedPerson['deathPlace'])
	elif not isfile(personPagePath + name.strip().replace(' ', '_') + '.txt'):
		call(['python3', 'pywikibot/pwb.py', 'myscript2', name])

		storedPerson = people.find_one({'name': name})
		if storedPerson:
			birthCoordinates = getCoordinates(storedPerson['birthPlace'])
			deathCoordinates = getCoordinates(storedPerson['deathPlace'])
		else:
			return ''
	return '([' + birthCoordinates + '][' + deathCoordinates + '])'

Here I check if an individual is already saved to MongoDB, and if not, run my pywikibot script and check again. If they are still not in the database, I do not save any coordinates and move on.

Saving the output

At the end of each section, I call a helper function to format the information I have pulled out and processed and write it to the output file.

def formatLine(date, description, location=""):
	return (date + ' :;: ' + description.strip() + " :;: " + location + "\n")

outputitems += formatLine(date, eventDescription, locationString)
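For the Vesuvius example from earlier, the resulting output line would look like this (coordinates invented):

```python
def formatLine(date, description, location=""):
    # Repeated here so the snippet runs standalone.
    return (date + ' :;: ' + description.strip() + " :;: " + location + "\n")

line = formatLine('April 7', ' Mount Vesuvius erupts, and devastates Naples. ',
                  '([40.83, 14.25])')
print(line)
# → April 7 :;: Mount Vesuvius erupts, and devastates Naples. :;: ([40.83, 14.25])
```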

Once every section has finished, I close the output file and move the input year file to another folder so that I do not process it again if I have to stop my script or it crashes.

outputFile.close()
os.rename(inputPath + file, inputBackupPath + file)
