Part II
=======
The data set we're using is an anonymized Web server log file from a public relations company whose clients were DVD distributors. The log file is in the udacity_training/data directory, and it's currently compressed with gzip, so you'll need to decompress it and then put it in HDFS. If you take a look at the file, you'll see that each line represents a hit to the Web server. It includes the IP address that accessed the site, the date and time of the access, and the name of the page that was visited.
The log file is in Common Log Format:
10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET /assets/js/lowpro.js HTTP/1.1" 200 10469
%h %l %u %t \"%r\" %>s %b
Where:
%h is the IP address of the client
%l is the identity of the client, or "-" if it's unavailable
%u is the username of the client, or "-" if it's unavailable
%t is the time that the server finished processing the request. The format is [day/month/year:hour:minute:second zone]
%r is the request line from the client, given in double quotes. It contains the method, path, query string, and protocol of the request.
%>s is the status code that the server sends back to the client. You will see mostly status codes 200 (OK - The request has succeeded), 304 (Not Modified) and 404 (Not Found). More information on status codes is available at W3C.org
%b is the size of the object returned to the client, in bytes. It will be "-" in case of status code 304.
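
To make the field layout above concrete, here is a minimal Python sketch that splits one Common Log Format line into its parts. The names CLF_PATTERN and parse_line are made up for this illustration and are not part of the assignment. The sketch assumes the request line contains no embedded double quotes, and it treats a "-" size as 0 bytes.

```python
import re

# One Common Log Format line: %h %l %u %t "%r" %>s %b
# (CLF_PATTERN and parse_line are illustrative names, not from the assignment.)
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # The request line is "METHOD PATH PROTOCOL"; tolerate shorter requests.
    parts = fields["request"].split()
    fields["path"] = parts[1] if len(parts) >= 2 else None
    # %b is "-" when no size is logged (for example with status 304).
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    fields["status"] = int(fields["status"])
    return fields


sample = ('10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] '
          '"GET /assets/js/lowpro.js HTTP/1.1" 200 10469')
record = parse_line(sample)
print(record["host"], record["path"], record["status"], record["size"])
# prints: 10.223.157.186 /assets/js/lowpro.js 200 10469
```

Lines that do not match the pattern come back as None, so a mapper can skip them instead of failing.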
For each of the problems below, write a MapReduce job that solves it. Once the job works, you should be able to answer the question that accompanies the problem.
Problems:
1. Write a MapReduce program which will display the number of hits for each different file on the Web site.
a. How many hits were made to the page: /assets/js/the-associates.js
2. Write a MapReduce program which determines the number of hits to the site made by each different IP Address.
a. How many hits were made by the IP address: 10.99.99.186
3. Find the most popular file on the Web site, that is, the file which had the most hits. Your Reducer should just write out the name of the file and number of hits into HDFS.
a. Full path to the most popular file: --------
b. Number of hits to that file: --------
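
The three problems share one counting pattern, which can be sketched in Python as below. This is only an illustration under assumptions: the problem statement does not say which language or framework to use, the function names (mapper, reducer, most_popular) are invented here, and the three log lines are made-up test data, not taken from the real log. The sketch simulates map, shuffle/sort, and reduce in a single local process. In an actual Hadoop Streaming job the mapper and reducer would be separate scripts that read stdin and write tab-separated key/value lines, and the maximum for problem 3 would be tracked inside a single reducer.

```python
from itertools import groupby


def mapper(lines, field="path"):
    """Emit (key, 1) per hit; key is the requested path or the client IP."""
    for line in lines:
        parts = line.split('"')
        if len(parts) < 3:
            continue  # skip lines without a quoted request
        if field == "ip":
            key = line.split(" ", 1)[0]
        else:
            request = parts[1].split()
            if len(request) < 2:
                continue  # skip requests without a path
            key = request[1]
        yield key, 1


def reducer(pairs):
    """Sum counts per key; pairs must arrive sorted by key, as after the shuffle."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


def most_popular(counts):
    """Return the (key, hits) pair with the most hits."""
    return max(counts, key=lambda kv: kv[1])


# Made-up log lines for a local dry run (not from the real data set).
log = [
    '10.0.0.1 - - [15/Jul/2009:15:50:35 -0700] "GET /a.js HTTP/1.1" 200 10',
    '10.0.0.2 - - [15/Jul/2009:15:50:36 -0700] "GET /b.js HTTP/1.1" 200 20',
    '10.0.0.1 - - [15/Jul/2009:15:50:37 -0700] "GET /a.js HTTP/1.1" 304 -',
]

# sorted() stands in for Hadoop's shuffle/sort between map and reduce.
hits_per_file = list(reducer(sorted(mapper(log))))
hits_per_ip = list(reducer(sorted(mapper(log, field="ip"))))

print(hits_per_file)                # [('/a.js', 2), ('/b.js', 1)]
print(hits_per_ip)                  # [('10.0.0.1', 2), ('10.0.0.2', 1)]
print(most_popular(hits_per_file))  # ('/a.js', 2)
```

Problems 1 and 2 differ only in the key the mapper emits, so the same reducer serves both. Problem 3 reuses the per-file counts and keeps the largest one.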