Assignment 3

Finding the most common word and letter in any language

How can you calculate the frequency of usage of different letters in English? What is the most commonly used word in Urdu?

In this assignment you will write a Python program which will, through its analysis of random but meaningful text in a language, tell you what the frequency of usage of different words and letters is for that language.

Our solution strategy is based on using Wikipedia articles as the source of random text in different languages. Here are the recommended steps to solve this problem:

  1. Access a number of random Wikipedia articles in the language of your choice and obtain some text from them.
  2. After you have pooled sufficient text, use your knowledge of Python loops, lists, strings, sets and dictionaries to find the word and letter usage frequencies for that language.

Expected effort:

This assignment isn't hard but you will need to think about it. It will take about an hour of your time. The solution isn't more than 50 lines of code.

Input:

A language code ('en' for English, 'ur' for Urdu, etc.). The complete list of Wikipedia language codes is available at http://meta.wikimedia.org/wiki/List_of_Wikipedias.

Required output:

A sorted list of the top 10 words and the top 10 letters in at least two languages of your choice, together with their relative frequencies (the number of occurrences of a word or letter divided by the total number of words or letters counted). Also submit your code as a single Python file.
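
The relative-frequency calculation can be sketched as follows. This is only a minimal illustration, assuming Python 3 and assuming the counts are already available as a dictionary; the function name top_relative and the sample counts are made up for this example.

```python
def top_relative(counts, n=10):
    """Return the n most frequent items as (item, relative_frequency) pairs."""
    total = sum(counts.values())
    if total == 0:
        return []
    # Sort by descending count; ties are broken alphabetically so the
    # result is the same on every run.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(item, count / total) for item, count in ranked[:n]]


# Made-up counts, just to show the shape of the result.
example = {"the": 6, "of": 3, "cat": 1}
for word, freq in top_relative(example):
    print(word, round(freq, 3))
```

The same function works for letters, since it only looks at the dictionary values.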

Hints

Accessing Wikipedia Articles

You can install and use the Wikipedia API for Python to manage access to Wikipedia. Once it is properly installed, you can simply import wikipedia as wiki and start using it. Installation instructions are also given as hints in this assignment. All the information related to Wikipedia that you need for this assignment (and a little bit more!) is given in the code snippet below. You can also refer to the API's quickstart guide or documentation for more information. You can use either the content or the summary of a page to get some text from that article.

import wikipedia as wiki

wiki.set_lang("ur")  # Set the language to Urdu
# All language codes are available at: http://meta.wikimedia.org/wiki/List_of_Wikipedias

def getWikiPage(s):
    """
    Return the page associated with a given Wikipedia title string.
    If there are multiple pages associated with a title string, pick the
    first option listed in the disambiguation for that title.
    Input: s (Wikipedia page title)
    Return: Wikipedia Python API page object
    """
    try:
        p = wiki.page(s)
    except wiki.exceptions.DisambiguationError as disambiguation:
        # This exception is raised if there are multiple Wikipedia pages associated with the given title
        print(disambiguation)  # Display the titles of all the candidate pages
        print("Warning: Picking", disambiguation.options[0])  # But pick the first one
        s = disambiguation.options[0]  # Like this!
        p = wiki.page(s)
    return p

s = 'کمپیوٹر پروگرام'  # Use wiki.random() to pick a random title from Wikipedia.

print("The string for wikipedia is:", s)

try:
    p = getWikiPage(s)
except Exception as e:
    # Just in case there are any unexpected errors
    print("Failed to access the page associated with '", s, "'. The error returned is: ", e)
else:  # If everything goes well, print stuff!
    print('*'*10, "Title", p.title, '*'*10 + '\n')
    print('*'*10, "id", p.pageid, '*'*10 + '\n')
    print('*'*10, "Summary", '*'*10 + '\n', p.summary)
    print('*'*10, "Content", '*'*10 + '\n', p.content)
    print('*'*10, "Links", '*'*10 + '\n', '\n'.join(p.links))  # Just for fun!
    
The string for wikipedia is: کمپیوٹر پروگرام
********** Title کمپیوٹر پروگرام **********

********** id 11843 **********

********** Summary **********
پروگرام یا برنامہ (program / programme) کی جاتی ہے۔ بنیادی طور پر پروگرام سے مراد کسی بھی کام یا ہدایات کو ترتیب سے جاری کرنے کی ہوتی ہے۔ کمپیوٹر پروگرام (computer program) ایک ایسے پروگرام یا برنامہ کو کہا جاتا ہے جو کمپیوٹر کو کام کرنے کے لیۓ ہدایات کی شکل میں فراہم کیا جاتا ہے۔ یہ ہدایات ایک ترتیب وار شکل میں ہوتی ہیں جن پر یکے بعد دیگرے عمل پیرا ہوکر ایک کمپیوٹر وہ کام انجام دیتا ہے جو اس سے مطلوب ہو، اسی مرحلہ بہ مرحلہ خصوصیات کی وجہ سے کمپیوٹر پروگرام کو ایک الخوارزمیہ (algorithm) بھی تصور کیا جاسکتا ہے۔
********** Content **********
پروگرام یا برنامہ (program / programme) کی جاتی ہے۔ بنیادی طور پر پروگرام سے مراد کسی بھی کام یا ہدایات کو ترتیب سے جاری کرنے کی ہوتی ہے۔ کمپیوٹر پروگرام (computer program) ایک ایسے پروگرام یا برنامہ کو کہا جاتا ہے جو کمپیوٹر کو کام کرنے کے لیۓ ہدایات کی شکل میں فراہم کیا جاتا ہے۔ یہ ہدایات ایک ترتیب وار شکل میں ہوتی ہیں جن پر یکے بعد دیگرے عمل پیرا ہوکر ایک کمپیوٹر وہ کام انجام دیتا ہے جو اس سے مطلوب ہو، اسی مرحلہ بہ مرحلہ خصوصیات کی وجہ سے کمپیوٹر پروگرام کو ایک الخوارزمیہ (algorithm) بھی تصور کیا جاسکتا ہے۔


== مزید دیکھیۓ ==
کمپیوٹر پروگرامنگ (computer programming)
********** Links **********
الخوارزمیہ
پروگرام
کمپیوٹر
کمپیوٹر پروگرامنگ

Practical considerations

To get meaningful results for this assignment, it's good to:

  1. Start off by writing and rigorously testing a simple function that takes an input string (in English) and returns the number of times each word and each letter in the string is used as two Python dictionary objects. You will need this function for this assignment.
  2. Collect a large number of words (the more the better)
  3. Use a large number of articles
  4. Ignore small articles (say, those with fewer than 10 words in them)
  5. Remove punctuation, numbers, etc.
  6. Ignore case (treat uppercase and lowercase letters as the same)
  7. Seek help from books and the Internet on string and Unicode processing in Python
  8. Avoid using the same title more than once, as it may bias the results
  9. Run your code multiple times to get some consensus results
  10. Know that you are free to modify the proposed strategy or make an entirely new one as long as you get good practice of Python loops, strings, lists, sets, dictionaries and exception handling and it solves the problem.
  11. Use debugging techniques and take advantage of the scripting nature of Python to write good code fast.
  12. Think about making the code efficient, but only after you've made it functionally and structurally correct.
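
The counting function from item 1 could look something like the sketch below. It is one possible approach rather than the expected solution: it assumes Python 3, uses only plain dictionaries, and treats every character for which str.isalpha() is false as a word separator. The function name count_words_and_letters is made up for this example.

```python
def count_words_and_letters(text):
    """Return (word_counts, letter_counts) for a string, ignoring case,
    punctuation and digits."""
    word_counts = {}
    letter_counts = {}
    # Replace every non-letter character with a space so that punctuation
    # and digits act as word separators. str.isalpha() is Unicode-aware,
    # so this is not limited to English letters.
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    for word in cleaned.split():
        word_counts[word] = word_counts.get(word, 0) + 1
        for letter in word:
            letter_counts[letter] = letter_counts.get(letter, 0) + 1
    return word_counts, letter_counts


words, letters = count_words_and_letters("The cat saw the other cat, twice!")
print(words["the"], words["cat"], letters["t"])  # prints: 2 2 6
```

Note the simplifications: an apostrophe splits "don't" into "don" and "t", and combining marks (such as diacritics in some scripts) also count as separators, so the cleaning rule may need adjusting for your language.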

Also note that you need to be connected to the Internet for the wikipedia module to work. Fetching pages can be a little slow, so you may want to develop the functions that calculate word and letter frequencies offline, using any English text.
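
One way to keep the slow network part separate from the rest is to pass the fetching functions in as arguments, so the collection loop can be tested offline. The sketch below assumes Python 3; collect_text, its parameters and the fake articles are all hypothetical names invented for this illustration.

```python
def collect_text(get_title, get_text, n_articles, min_words=10, max_tries=1000):
    """Pool text from n_articles distinct articles.

    get_title() returns a random title and get_text(title) returns its text.
    Both are passed in so that the loop can be tested without a network.
    """
    seen = set()   # titles already tried, so none is counted twice
    pooled = []
    tries = 0
    while len(pooled) < n_articles and tries < max_tries:
        tries += 1
        title = get_title()
        if title in seen:
            continue
        seen.add(title)
        try:
            text = get_text(title)
        except Exception as error:  # e.g. disambiguation or network errors
            print("Skipping", title, "because of:", error)
            continue
        if len(text.split()) < min_words:
            continue                # ignore very small articles
        pooled.append(text)
    return "\n".join(pooled)


# Offline stand-ins for testing. With the wikipedia module one might pass
# wiki.random and (lambda t: wiki.page(t).summary) instead.
fake_articles = {
    "A": "one two three four five",
    "B": "too short",
    "C": "six seven eight nine ten",
}
titles = iter(["A", "B", "A", "C"])
text = collect_text(lambda: next(titles), fake_articles.get, n_articles=2, min_words=5)
print(text)
```

Here article "B" is skipped for being too short and the repeated title "A" is used only once.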

As long as you take a large amount of random text, you don't need to worry about noise (e.g., from other languages used in an article). If you like, you can also filter out such noise by counting only the characters that belong to your language, or you can simply ignore any top results that don't make sense.
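
As an illustration of counting only the characters of one language, the sketch below keeps the letters whose Unicode code points fall within a given range. It assumes Python 3 and assumes that the Unicode Arabic block (U+0600 to U+06FF) is a reasonable approximation for Urdu text; keep_script and ARABIC_BLOCK are names made up for this example.

```python
def keep_script(letter_counts, first, last):
    """Keep only the letters whose code point lies in [first, last]."""
    return {ch: n for ch, n in letter_counts.items() if first <= ord(ch) <= last}


# Assumption: the Unicode Arabic block covers the letters used in Urdu text.
ARABIC_BLOCK = (0x0600, 0x06FF)

# Made-up counts mixing two Urdu letters (kaf, U+06A9, and meem, U+0645)
# with two Latin letters that should be dropped.
mixed = {"\u06a9": 5, "a": 2, "\u0645": 3, "p": 1}
print(keep_script(mixed, *ARABIC_BLOCK))
```

For English, the range from ord("a") to ord("z") would play the same role once the text has been lowercased.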

Should you choose to use content, be sure to ignore the section headings (which are surrounded by == on both sides). Also note that everything contained in summary is also part of content, so use one or the other but not both, so as not to bias your results.
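
Dropping the section headings might be done along these lines. This is a rough sketch, assuming Python 3 and assuming that a heading always occupies a line of its own that begins and ends with ==; strip_headings and the sample text are made up for this example.

```python
def strip_headings(content):
    """Drop blank lines and lines of the form '== Heading ==' from page content."""
    kept = []
    for line in content.splitlines():
        stripped = line.strip()
        is_heading = (
            len(stripped) > 4
            and stripped.startswith("==")
            and stripped.endswith("==")
        )
        if stripped and not is_heading:
            kept.append(stripped)
    return "\n".join(kept)


sample = "Intro text here.\n\n\n== See also ==\nComputer programming\n=== Notes ===\nLast line."
print(strip_headings(sample))
```

Deeper heading levels such as === Notes === are also removed, since they start and end with == as well.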

You should not need to use any other modules.

Installation of Modules in Python

Python modules are typically shipped as zipped archives. A huge repository of Python modules is available at the Python Package Index (PyPI). There are multiple ways of installing Python packages. The easiest way of installing the Wikipedia API is to follow these steps:

  1. Download the archive (from here or from the data server).
  2. Unzip/Untar the archive using your favorite program into a folder.
  3. There will be a file setup.py in that folder. Open the command prompt (or Terminal) from within Anaconda (Tools menu > Open Command Prompt), cd to the folder and then type: python setup.py install.

That's all! If you need help with this, let me know. I would also recommend that you install pip to be able to install any package (usually without worrying about its dependencies) from PyPI. However, this isn't required for this assignment.

Possible Extensions (Optional and for extra-credit)

  1. An Urdu Caesar Cipher (We'll call it a Mughal Cipher!)
  2. Generate tag clouds for any given Wikipedia page.
  3. Wikipedia can be viewed as a graph. Use the API to see if you can find a path connecting two Wikipedia pages. How can this be useful? One use that I can think of is that the length of the path will tell us how closely related one topic is to the other.
  4. Compare Wikipedia pages to calculate the similarity between them. Now, what's the use of this? One use is to employ the similarity metric to cluster Wikipedia articles (google it!).
  5. If you have any other cool idea along these lines, come and talk to me!

(c) Dr. Fayyaz ul Amir Afsar Minhas, DCIS, PIEAS, Pakistan.