How can you calculate the frequency of usage of different letters in English? What is the most commonly used word in Urdu?
In this assignment you will write a Python program which will, through its analysis of random but meaningful text in a language, tell you what the frequency of usage of different words and letters is for that language.
Our solution strategy is based on using Wikipedia articles as the source of random text in different languages. Here are the recommended steps to solve this problem:
This assignment isn't hard but you will need to think about it. It will take about an hour of your time. The solution isn't more than 50 lines of code.
A language code ('en' for English, 'ur' for Urdu, etc.). A complete list of Wikipedia language codes is available at http://meta.wikimedia.org/wiki/List_of_Wikipedias.
A sorted list of the top 10 words and letters in at least two languages of your choice together with their relative frequency (frequency of occurrence divided by total number). Also submit your code as a single Python file.
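As a rough illustration of what "relative frequency" means here, the sketch below turns a dictionary of raw counts into a sorted top-N list. The function name top_relative and the example counts are made up for this illustration; they are not part of the assignment or of the wikipedia module.

```python
def top_relative(counts, n=10):
    """Return the n most frequent items as (item, relative_frequency) pairs."""
    total = sum(counts.values())
    if total == 0:
        return []
    # Sort by count, largest first
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    # Relative frequency = count of the item divided by the total count
    return [(item, count / float(total)) for item, count in ranked[:n]]

# Hypothetical counts, invented purely for illustration
example_counts = {"the": 6, "of": 3, "cat": 1}
for word, freq in top_relative(example_counts, n=2):
    print("%s %.3f" % (word, freq))
```

The same function can be applied to letter counts and to word counts, since both are just dictionaries mapping an item to how often it occurred.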
You can install and use the Wikipedia API for Python to manage access to Wikipedia. Once properly installed, you can simply import wikipedia as wiki and start using it. Installation instructions are also given as hints in this assignment. All information related to wikipedia that you need for this assignment (and a little bit more!) is given in the code snippet below. You can also refer to the API's quickstart guide or documentation for more information. You can use either the content or the summary of a page to get some text from that article.
import wikipedia as wiki

wiki.set_lang("ur")  # Set the language to Urdu
# All language codes are available at: http://meta.wikimedia.org/wiki/List_of_Wikipedias

def getWikiPage(s):
    """
    This function returns the page associated with a given Wikipedia title string.
    If there are multiple pages associated with a title string, it picks the
    first one in the disambiguation of that page.
    Input: s (Wikipedia page title)
    Return: Wikipedia Python API page object
    """
    try:
        p = wiki.page(s)
    except wiki.exceptions.DisambiguationError as disambiguation:
        # This exception is raised if there are multiple Wikipedia pages associated with the given title
        print(disambiguation)  # We display the titles of all the pages
        print("Warning: Picking", disambiguation.options[0])  # But we pick the first one
        s = disambiguation.options[0]  # Like this!
        p = wiki.page(s)
    return p

s = 'کمپیوٹر پروگرام'  # Use wiki.random() to pick a random title from Wikipedia.
print("The string for wikipedia is:", s)
try:
    p = getWikiPage(s)
except Exception as e:
    # Just in case there are any unexpected errors
    print("Failed to access the page associated with '" + s + "'. The error returned is:", e)
else:  # If everything goes well, print stuff!
    print('*' * 10, "Title", p.title, '*' * 10 + '\n')
    print('*' * 10, "id", p.pageid, '*' * 10 + '\n')
    print('*' * 10, "Summary", '*' * 10 + '\n', p.summary)
    print('*' * 10, "Content", '*' * 10 + '\n', p.content)
    print('*' * 10, "Links", '*' * 10 + '\n', '\n'.join(p.links))  # Just for fun!
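Since the analysis needs text from many random articles rather than one, here is a minimal sketch of how you might gather it. The helper gather_text and its parameters are hypothetical names chosen for this example. It takes the title picker and the page fetcher as arguments so that you can try it offline with stand-in functions before connecting it to wikipedia.

```python
def gather_text(n_pages, random_title, fetch_text):
    """Concatenate the text of n_pages randomly chosen articles.

    random_title: a function returning one article title per call
    fetch_text:   a function mapping a title to that article's text
    Pages that raise an error are skipped rather than stopping the run.
    """
    chunks = []
    for _ in range(n_pages):
        title = random_title()
        try:
            chunks.append(fetch_text(title))
        except Exception as e:
            print("Skipping %r because of: %s" % (title, e))
    return "\n".join(chunks)

# With the wikipedia module installed and an Internet connection,
# one possible way to call it is:
#   import wikipedia as wiki
#   wiki.set_lang("ur")
#   text = gather_text(20, wiki.random, lambda t: wiki.page(t).summary)
```

Skipping failed pages is a design choice made here for simplicity: a random title may lead to a disambiguation page or a missing page, and for frequency counting it is usually enough to move on to the next one.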
To get meaningful results for this assignment, it's good to:
Also note that you need to be connected to the Internet for wikipedia to work. Fetching pages can be a little slow, so you may want to develop the functions that calculate word and letter frequency offline first, using any text in English.
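One possible starting point for that offline development is sketched below, using plain dictionaries and a hard-coded English sentence. The function names, the sample sentence, and the set of punctuation characters being stripped are all assumptions made for this example (other languages use other punctuation marks), and the code is written for Python 3.

```python
def count_letters(text):
    """Count alphabetic characters, ignoring case, digits, spaces and punctuation."""
    counts = {}
    for ch in text.lower():
        if ch.isalpha():
            counts[ch] = counts.get(ch, 0) + 1
    return counts

def count_words(text):
    """Count whitespace-separated words after stripping surrounding punctuation."""
    counts = {}
    for token in text.lower().split():
        # Assumed punctuation set; extend it for the language you analyse
        word = token.strip('.,;:!?"\'()[]{}')
        if word:
            counts[word] = counts.get(word, 0) + 1
    return counts

# Any offline text will do while developing; this sentence is just a placeholder
sample = "The cat sat on the mat. The mat was flat."
print(count_words(sample))
print(count_letters(sample))
```

Once these work on a small sample where you can check the counts by hand, you can feed them the text fetched from Wikipedia instead.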
As long as you take a large amount of random text, you don't need to worry about noise (e.g., from other languages used in an article). If you like, you can also filter out such noise by counting only the characters that belong to your language, or you can simply ignore any top results that don't make sense.
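If you decide to filter, one way to do it is by Unicode code point. The sketch below assumes that the letters you care about for Urdu lie in the Unicode Arabic block (U+0600 to U+06FF). That range, the constant ARABIC_BLOCK, and the function keep_script are assumptions of this example, so adjust them for your own language. It is written for Python 3, where strings are Unicode.

```python
# Assumption: Urdu letters fall inside the Unicode "Arabic" block
ARABIC_BLOCK = (0x0600, 0x06FF)

def keep_script(text, block=ARABIC_BLOCK):
    """Keep whitespace and the letters whose code points lie inside block."""
    lo, hi = block
    kept = []
    for ch in text:
        if ch.isspace():
            kept.append(ch)  # keep spaces so words stay separated
        elif lo <= ord(ch) <= hi and ch.isalpha():
            kept.append(ch)  # a letter in the chosen script
    return "".join(kept)

mixed = "abc \u06a9\u0645\u067e\u06cc\u0648\u0679\u0631 123"
print(keep_script(mixed).strip())
```

The isalpha() check is there because the block also contains digits and punctuation marks, which you probably do not want to count as letters.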
Should you choose to use content, be sure to ignore the section headings (which are surrounded by == on both sides). Also note that everything contained in summary is also part of content, so use one or the other, but not both, so as not to bias your results.
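A simple way to skip the headings is to go through the text line by line and drop any line that both starts and ends with ==. The sketch below does exactly that. The function name drop_headings and the sample text are invented for this example.

```python
def drop_headings(content):
    """Remove lines that look like section headings, e.g. '== History =='."""
    kept = []
    for line in content.splitlines():
        stripped = line.strip()
        if stripped.startswith("==") and stripped.endswith("=="):
            continue  # heading line: skip it
        kept.append(line)
    return "\n".join(kept)

# Made-up text in the same shape as a page's content
sample = "Intro text.\n\n== History ==\nSome history.\n=== Early years ===\nMore."
print(drop_headings(sample))
```

Sub-headings written with three or more = signs are caught by the same test, since they also begin and end with ==.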
You should not need to use any other modules.
Python modules are typically shipped as zipped archives. A huge repository of Python modules is available at the Python Package Index (PyPI). There are multiple ways of installing Python packages. The easiest way of installing the wikipedia API is to follow these steps:
setup.py in that folder. Open the command prompt (or Terminal) from within Anaconda (Tools menu > Open Command Prompt), cd to the folder, and then type: python setup.py install. That's all! If you need help with this, let me know. I would also recommend that you install pip so that you can install any package from PyPI (usually without worrying about its dependencies). However, this isn't required for this assignment.