Right-click on the folder DBE-A1 (described later) to zip the folder and submit the zip file to the Assignment 1 submission area in the course website. The submission will not be marked if it is not a zip or a zip that is password protected.
Zip files larger than 10MB will not be accepted. If you get a zip larger than 10MB, it means that the zip includes either your program output temp.xml or the input files all_pol.csv and all_tweets.csv. None of these files should be in the submission.
Aims of Assignment
This assignment enables you to apply your knowledge of XML, DTD, XQuery, and Relational Databases to an application. The objectives include
• Data modeling in XML
• Data Modeling in the relational database
• Conversion of data formats
• Understanding the modelling power of different models
• Handling large amount of data
Unzip the DBE-work1.zip file to a convenient location and this will give you the input files you need and the DBE-A1 folder for your solutions. Here is a note about the location. A good option for the location is a cloud storage like dropbox or google drive which makes sharing easy. At the same time, you may need to back up the folder periodically on your local disk to prevent the situation where one member deletes all files and the deletion is propagated to everyone (then everything is lost). Periodically you may also zip the folder and send it to you email as a backup. In the cases of laptop collapse or being stolen, or a dispute, the email backup can cover you.
In the DBE-A1 folder, you see some empty files as illustrated by the screenshot below. You will gradually put your answers in these files. Modify the people.txt file to put your group members’ email IDs and their main tasks. Each member must lead the work to at least one task (main task) and contribute to the most of its solution. Each task can have one and only one leading member. Members are encouraged to contribute to tasks lead by other students so that the final solutions are well completed, checked and discussed.
The file people.txt is important, especially when a dispute arise from group members. Generally, the marker will give all group members the same mark. However, when the work quality is significant inconsistent among the tasks or in the case of a dispute, group members may be given marks based on how well individual tasks are done.
Application and requirements
Twitter is a popular platform for people to express their opinions to the public. Millions of tweets are published each day on the platform. These tweets generate huge amount of data and carry large amount of information (e.g., topics, sentiment, opinion drifting patterns, etc.). People are interested to extract the information from the data. To do this, data transformation is a necessary step.
In this assignment, we conduct data transformation on a small sample (data set) of the twitter data. The data set is about US politicians and their tweets published previously and the data set was downloaded from http://www.cs.washington.edu/research/xmldatasets/www/repository.html
. The data set contains two csv files all_pol.csv and all_tweets.csv. The first one contains the information of politicians, and the second one contains the tweets they posted. These two files are in the DBE-work1 folder downloaded together with the assignment specification.
An introduction to csv files is given in the extraRecordings tab on the course website. By default, csv files use common ‘,’ as the delimiter between values. However, these two csv files use semi-colon ‘;’ as the delimiter. When you open them in Excel, you will need to use the menu function “data>text to columns” to divide data into different columns.
The assignment contains the following tasks and each task is worth 3 marks. The first task is the basis for other tasks and needs to be completed first.
Task 1: Choose data from all_tweets.csv and all_pol.csv and convert the chosen data to XML. Task 2: Design XML DTD for data merged from the XML documents.
Task 3: Write an XQuery query to join the XML documents.
Task 4: Design a relational database schema for the tweet data and comments on the two data models. Task 5: Automate Task1 for all tweets in python or java.
The detailed requirements of the tasks are given below.
Task1. Choose data from all_tweets.csv and all_pol.csv and convert the chosen data to XML Your group chooses 10 tweet entries from all_tweets.csv and 3 or more politician entries from all_pol.csv. Save the chosen tweets in the file tweets.csv and the chosen politicians in the pol.csv. The chosen tweets are expected to be from multiple politicians, and to have various missing columns. At least one chosen politician must have at least two chosen tweets.
A group with 2 people or less chooses 5 tweets and 3 politicians.
You convert the chosen tweets into a wellformed XML document manually and save the XML document in tweets.xml. You then convert the chosen politicians into another wellformed XML document manually and save the document in pol.xml.
Task2. Design XML DTD for the data merged from XML documents
It is not convenient for the information to be used if it is in two XML files. We aim to merge the data in the two XML files you produced in Task 1. For this purpose, we need a DTD for the merged data to contain each chosen politician and his/her chosen tweets in one element. You design this DTD. The DTD should be based on your chosen data, not too loose (every element is optional), not too tight (requiring every element to appear), and should not allow redundancy (repeating the politician’s information for every his/her tweet). The root element of the DTD must be politweets. The DTD should be stored in the file merge.dtd.
Task 3: Write an XQuery query to join the XML documents
Write an XQuery query to combine the contents of pol.xml and tweets.xml to produce one valid XML document conforming to the DTD merge.dtd you designed in Task 2. The merged document should be written to the file merge.xml. Your query should be saved in the file merge.xql.
Copy and insert the supplied header in header.txt to the front of the file merge.xml. Use Editix or another tool to validate merge.xml (against merge.dtd). If there is any error, you modify the query and try it again.
Task 4: Design a relational database schema for the data and show advantages and disadvantages By now you have known the data set very well. You design a relational database (RDB) schema by assuming that the data set will be stored in a RDB. Your schema should be presented in the following format: You then compare the XML model and the RDB schema and comment on the advantages and the disadvantages of the XML and the relational models. In your comments, you need to use relevant parts of your chosen data as examples to justify your points. That is, each point of your comment must be followed by an example using your own data. Justification by your own examples is important for the marker to see your understanding. If writing is not cohesive and the points are not backed up with examples, very few marks will be given.
Write your answer for this section in the file ‘sqlDBcomm.docx’. Poor presentation and bad writing will get mark deducted.
Task 5: Automate Step 1.
Write a program to automate your solution to Task 1 on the whole file all_tweets.csv. The program is either in python or in java. Your program will be tested in the following way in a DOS window: The input filename and its path “..\all_tweets.csv” must be read as a command line argument. The output (a wellformed XML document) must be displayed on the screen. Your program code must not use any hard-coded path and filename.
At the beginning, you test your program on the tweets.csv file you created. That is, you use tweets.xml as the command line argument instead of ..\all_tweets.csv. For the large input file ..\all_tweets.csv, you use the following syntax to direct the output on the screen to the file temp.xml where “>” is the output redirection operator to redirect screen output to the file. A recording on how to compile and run a java program from DOS command line is in the extraRecordings tab on the course website. A recording on how to run a python program from DOS command line is also given in the same tab.
In the case of using java, any special libraries used by your program should be included as jar files in the DBE-A1 folder together with your own code.
Sample code has been placed in ToXML.java and ToXML.py for demonstration. You can remove the code if you do not need it.
Marking will consider filenames, correctness of results, their formats, and quality of writing.
If you attempt only some tasks and your results are correct, you get marks for the parts. The check points in marking will be derived from the above requirements.