Before reading this, I recommend you read part 1 here.
Now that I have Wikipedia pages downloaded, I need to parse through the text and markup to get just the information I need for my other project. The first step in parsing through all these files was to open every file in a specified input directory. Due to the structure of my project files, I set up these directory paths:
inputPath = '../pages/input/'
inputBackupPath = '../pages/input-finished/'
outputPath = '../pages/output/'
Now I need to loop through all the files, process, then close them:
import codecs
import os
from os import listdir
from os.path import isfile, join
from pathlib import Path

files = [f for f in listdir(inputPath) if isfile(join(inputPath, f))]
for file in files:
    outputFile = codecs.open(outputPath + file, encoding='utf-8', mode='w', errors='replace')
    contents = Path(inputPath + file).read_text()
    [...]
    outputFile.close()
    os.rename(inputPath + file, inputBackupPath + file)
Looking through the raw Wikipedia files, I noticed there were 2 primary types of sections that I wanted to focus on. The first is called
== Events ==
and the others are
== Births ==
and
== Deaths ==
Planning ahead, I know I’m going to need to process these two types of sections differently, so I go ahead and write a regex to grab each section as a whole and take note of which section I am working in:
def processSection(name, nextName):
    regexString = '== ?' + name + ' ?==(.*)== ?' + nextName + ' ?=='
    sectionText = re.search(regexString, contents, re.S)
    if sectionText:
        output = parseItems(sectionText, name)
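To show what this pattern actually captures, here is a quick check against a toy wikitext fragment (not a real page) built in the same shape as the year summaries:

```python
import re

# A made-up wikitext sample, shaped like a year page, to exercise the regex.
contents = "== Events ==\n* [[April 7]] – Something happened.\n== Births ==\n* [[May 1]] – Someone was born.\n== Deaths =="

# Same pattern shape as processSection: capture everything between two headers.
regexString = '== ?' + 'Events' + ' ?==(.*)== ?' + 'Births' + ' ?=='
sectionText = re.search(regexString, contents, re.S)
print(sectionText.group(1).strip())
```

The `re.S` flag is what lets `.*` span multiple lines, so the whole section body lands in group 1.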
Ideally, after my script is done, I would like to have one event per line, with only the date, description, and coordinates, so that I can easily add these events to a database to be used by Global Timeline. Looking at the markup inside each section, I see two different scenarios to look for. The first is when only one event, birth, or death is listed on a particular date this year:
* [[April 7]] – [[Mount Vesuvius]] erupts, and devastates [[Naples]].
Otherwise, if multiple events happened on the same day, it appears like this:
* [[February 4]]
** [[Dietrich Bonhoeffer]], German religious, resistance leader (d. [[1945]])
** [[Clyde Tombaugh]], American astronomer (d. [[1997]])
To handle these two scenarios, I decided to have my script process all of the single-date events first and save the multi-event dates for later.
def parseItems(itemList, sectionName):
    outputitems = ""
    unprocessedLines = []
    items = re.findall(r'\*(.*)', itemList.group())
    for item in items:
        itemGroups = re.search(r'^ ?\[\[(.*? \d{1,2})\]\] ?–(.*)', item)
        if itemGroups:
            date = itemGroups.group(1)
            eventDescription = removeLinks(itemGroups.group(2))
With this, I now have a date and description of the event. I also included a small helper function to remove the ugly double brackets denoting a link.
def removeLinks(line):
    return re.sub(r'\[\[|\]\]', '', line)
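Running the date regex and removeLinks together on the Vesuvius example from above shows the two pieces that come out of each single-event line:

```python
import re

def removeLinks(line):
    # Strip the [[ ]] wiki-link markup.
    return re.sub(r'\[\[|\]\]', '', line)

# The example line from the Events section, as it looks after the leading *
# has been consumed by the earlier findall.
item = ' [[April 7]] – [[Mount Vesuvius]] erupts, and devastates [[Naples]].'
itemGroups = re.search(r'^ ?\[\[(.*? \d{1,2})\]\] ?–(.*)', item)
date = itemGroups.group(1)
eventDescription = removeLinks(itemGroups.group(2)).strip()
print(date, '->', eventDescription)
```

The lazy `.*?` followed by `\d{1,2}` is what keeps the date capture from running past the day number.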
Now I just need coordinates to add to the event. The first scenario I have to deal with is general events. In these descriptions, there is often a location name within the text that I can use to find coordinates. To make this easier, I used the library GeoText. This library takes a string and returns the names of cities and countries it finds. Because strings will sometimes contain multiple locations, I decided to find coordinates for each place mentioned and save them all for later use.
if sectionName == 'Events':
    allPlaces = []
    location = []
    places = GeoText(eventDescription)
    if places.cities:
        allPlaces += places.cities
    if places.countries:
        allPlaces += places.countries
Now I need to turn these place names into coordinates and format those coordinates for my purposes. I did this by using another library called geopy, which I essentially used as a wrapper for Nominatim, which is a tool to search through OpenStreetMap data.
if allPlaces:
    locationString = '('
    for place in allPlaces:
        coordinates = getCoordinates(place)
        if coordinates not in locationString:
            locationString += '[' + coordinates + ']'
    locationString += ')'
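To make the string-building concrete, here is the same loop run offline with a stubbed-out getCoordinates (the place names and coordinates below are made up for the example):

```python
# Stand-in for the real getCoordinates, so this sketch runs without geopy or
# MongoDB. The coordinates here are invented for illustration.
def getCoordinates(place):
    fakeCache = {'Naples': '40.83, 14.25', 'Italy': '42.63, 12.67'}
    return fakeCache.get(place, '')

allPlaces = ['Naples', 'Italy', 'Naples']  # GeoText can return duplicates
locationString = '('
for place in allPlaces:
    coordinates = getCoordinates(place)
    # The substring check doubles as deduplication: a repeated place's
    # coordinates are already in the string, so they are skipped.
    if coordinates not in locationString:
        locationString += '[' + coordinates + ']'
locationString += ')'
print(locationString)
```

Note that the `not in` check also quietly filters out empty strings, since `'' in locationString` is always true.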
Because Nominatim's usage policy requires that requests not be sent more than once per second, and asks that users not spam the API with repeated requests for the same location, I set up a local MongoDB server to store place names and their associated coordinates, which I check first before sending an API request. This also has the added benefit of not having to wait for a response for each set of coordinates.
def getCoordinates(placeName):
    # Newer versions of geopy require a user_agent; the name is arbitrary.
    geolocator = Nominatim(user_agent='global-timeline')
    storedLocation = locations.find_one({'name': placeName})
    location = ['', '']
    if storedLocation:
        print("Location (%s) found in Mongo." % placeName)
        logFile.write("Location (%s) found in Mongo.\n" % placeName)
        location = storedLocation['coordinates'].split(',')
    elif placeName.strip():
        print("Location (%s) not found, sleeping for 1 second." % placeName)
        logFile.write("Location (%s) not found, sleeping for 1 second.\n" % placeName)
        time.sleep(1)
        geocodeLocation = geolocator.geocode(placeName)
        if geocodeLocation:
            location = [str(geocodeLocation.latitude), str(geocodeLocation.longitude)]
        else:
            return ''
        newLocation = {
            'name': placeName,
            'coordinates': location[0] + ', ' + location[1]
        }
        result = locations.insert_one(newLocation)
    return location[0] + ', ' + location[1]
Once all of the single-event dates are processed this way, I then parse through all of the multi-event dates. Getting the date and each event description is slightly different than the process above, but the rest remains largely unchanged.
date = ""
for line in unprocessedLines:
    dateObj = re.search(r'^ \[\[(.* \d{1,2})\]\]', line)
    if dateObj:
        date = dateObj.group(1)
    else:
        itemGroups = re.search(r'\*(.*)', line)
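A quick sketch of that second pass, using the February 4 example from above. The exact shape of unprocessedLines depends on how the earlier findall split the ** markers; this assumes the date line kept its leading space and each ** line kept one leading asterisk:

```python
import re

# Lines left over from the first pass (assumed shape, see lead-in).
unprocessedLines = [
    ' [[February 4]]',
    '* [[Dietrich Bonhoeffer]], German religious, resistance leader (d. [[1945]])',
    '* [[Clyde Tombaugh]], American astronomer (d. [[1997]])',
]

date = ""
events = []
for line in unprocessedLines:
    dateObj = re.search(r'^ \[\[(.* \d{1,2})\]\]', line)
    if dateObj:
        # A bare date line: remember it for the event lines that follow.
        date = dateObj.group(1)
    else:
        itemGroups = re.search(r'\*(.*)', line)
        if itemGroups:
            events.append((date, itemGroups.group(1).strip()))

print(events)
```

Because the date line fails the `^ \[\[` test only for event lines, every event after it inherits the most recently seen date.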
Because each line in the births and deaths sections only has the individual’s name and a brief description, I need to load their Wikipedia page, like I did with the year summaries, and parse out the locations where they were born and died. The process is pretty similar to what I did with the individual events, but part of it does involve running another pywikibot script.
import codecs
import re
import sys

import pywikibot
from pymongo import MongoClient

name = sys.argv[1].strip()
site = pywikibot.Site()
page = pywikibot.Page(site, name)
text = page.text
birthPlace = re.search(r'\| ?birth_place *?= ?\[?\[?(.*?)[\[\]\|]', text)
deathPlace = re.search(r'\| ?death_place *?= ?\[?\[?(.*?)[\[\]\|]', text)
if birthPlace or deathPlace:
    newPerson = {
        'name': name,
        'birthPlace': '',
        'deathPlace': ''
    }
    if birthPlace:
        newPerson['birthPlace'] = birthPlace.group(1)
    if deathPlace:
        newPerson['deathPlace'] = deathPlace.group(1)
    client = MongoClient()
    db = client.location_cache
    people = db.people
    result = people.insert_one(newPerson)
else:
    f = codecs.open('../pages/people/' + name.replace(' ', '_') + '.txt', encoding='utf-8', mode='w+', errors='replace')
    f.write(page.text)
    f.close()
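The two infobox regexes can be checked in isolation against a minimal, made-up infobox fragment:

```python
import re

# An invented infobox snippet, just to exercise the patterns.
text = "| birth_place = [[Berlin]], Germany\n| death_place = [[Vienna]], Austria"

birthPlace = re.search(r'\| ?birth_place *?= ?\[?\[?(.*?)[\[\]\|]', text)
deathPlace = re.search(r'\| ?death_place *?= ?\[?\[?(.*?)[\[\]\|]', text)
print(birthPlace.group(1), '/', deathPlace.group(1))
```

The trailing `[\[\]\|]` stops the lazy capture at the first bracket or pipe, so only the place name inside the link is kept.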
The process of finding and saving where a person was born and died was pretty straightforward at this point. I also added some extra code in the event I was unable to find the necessary information. When that is the case, I save the page to my machine, giving the file name the same name as the individual, for later reference to improve my script.
def getPersonBirthAndDeathCoordinates(name):
    storedPerson = people.find_one({'name': name})
    birthCoordinates = "[]"
    deathCoordinates = "[]"
    if storedPerson:
        birthCoordinates = getCoordinates(storedPerson['birthPlace'])
        deathCoordinates = getCoordinates(storedPerson['deathPlace'])
    elif not isfile(personPagePath + name.strip().replace(' ', '_') + '.txt'):
        call(['python3', 'pywikibot/pwb.py', 'myscript2', name])
        storedPerson = people.find_one({'name': name})
        if storedPerson:
            birthCoordinates = getCoordinates(storedPerson['birthPlace'])
            deathCoordinates = getCoordinates(storedPerson['deathPlace'])
        else:
            return ''
    return '([' + birthCoordinates + '][' + deathCoordinates + '])'
Here I check if an individual is already saved to MongoDB, and if not, run my pywikibot script and check again. If they are still not in the database, I do not save any coordinates and move on.
At the end of each section, I call a helper function to format the information I have pulled out and processed and write it to the output file.
outputitems += formatLine(date, eventDescription, locationString)
def formatLine(date, description, location=""):
    return (date + ' :;: ' + description.strip() + " :;: " + location + "\n")
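For reference, here is what a finished output line looks like, again using the Vesuvius example (the coordinates are the made-up ones from earlier):

```python
def formatLine(date, description, location=""):
    # ' :;: ' is the field separator the importer splits on later.
    return (date + ' :;: ' + description.strip() + " :;: " + location + "\n")

line = formatLine('April 7', ' Mount Vesuvius erupts, and devastates Naples.', '([40.83, 14.25])')
print(line, end='')
```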
Once every section has finished, I close the output file and move the input year file to another folder so that I do not process it again if I have to stop my script or it crashes.
outputFile.close()
os.rename(inputPath + file, inputBackupPath + file)