WEBVTT

0
00:03.360 --> 00:04.840
Welcome back.

1
00:04.860 --> 00:10.070
Let's say you have a log file that contains information about website visits.

2
00:10.290 --> 00:17.210
In this lecture, you're going to learn how to analyze, or in other words, how to parse a log file and generate

3
00:17.210 --> 00:24.440
a report. First of all, let me show you how the program is going to work.

4
00:24.450 --> 00:25.740
This is the log file.

5
00:25.830 --> 00:28.730
You can find it in the resources of this lecture.

6
00:29.040 --> 00:33.750
As you can see, there are two fields per line, separated by a space character.

7
00:33.870 --> 00:40.110
The first field is the domain name, and the second field is the number of visits for that domain in the

8
00:40.110 --> 00:40.760
log file.

9
00:40.800 --> 00:42.750
The domains are duplicated,

10
00:42.750 --> 00:47.940
so you need to aggregate them to print the total visits per domain.

11
00:47.970 --> 00:54.810
Let me run the program by redirecting its standard input to the file, like so. As you can see, it parses

12
00:54.810 --> 00:58.640
the file and prints the total visits per domain.

13
00:58.650 --> 01:00.930
All right, let's get started.

14
01:00.930 --> 01:06.050
First of all, let me create a new scanner that uses the standard input.

15
01:06.090 --> 01:13.340
Let's scan the log file line by line by calling the Scan method. There are two fields per line,

16
01:13.550 --> 01:15.800
so now I'm going to split them.

17
01:15.800 --> 01:23.290
Remember, the Text method returns the current line, and the Fields function splits the line into words.

18
01:23.300 --> 01:29.740
Now the fields variable contains a string slice. Let me print them so you can see how it works.

19
01:29.810 --> 01:30.320
I'm going to say:

20
01:30.330 --> 01:40.130
domain, dash, visits. I'm going to give it fields[0] and fields[1]. The program correctly parses the log

21
01:40.130 --> 01:43.380
file. To make the code easier to read,

22
01:43.430 --> 01:46.730
I'm going to put the fields in variables first.

23
01:46.760 --> 01:49.860
Let's declare a variable for the domain name.

24
01:49.970 --> 01:53.420
Let's convert the visits field to an integer.

25
01:53.420 --> 01:55.720
For now, I've skipped the error handling.

26
01:55.730 --> 01:57.260
I'll do that later.

27
01:57.410 --> 02:00.680
You need to keep track of the total visits per domain,

28
02:00.830 --> 02:07.810
so I'm going to create a map variable, like so. It will store the domain names and the total visits per

29
02:07.810 --> 02:08.670
domain.

30
02:08.680 --> 02:12.810
I also need to initialize the map. Okay.

31
02:12.840 --> 02:16.420
Now it's time to calculate the total visits for the domains.

32
02:16.650 --> 02:22.080
First, I'm going to get the total visits from the map, and I'm going to add the current visits to the

33
02:22.080 --> 02:22.900
map.

34
02:23.010 --> 02:29.310
So if the domain name doesn't exist, a new key will be created in the map for the domain, and it will be set to

35
02:29.310 --> 02:31.140
the value of the visits variable.

36
02:31.140 --> 02:35.280
Otherwise, it will only increase the total visits for the domain.

37
02:35.280 --> 02:35.630
Okay.

38
02:35.700 --> 02:38.080
Now it's time to print the map. First,

39
02:38.290 --> 02:44.090
let's print the header for the domain names and total visits. Next,

40
02:44.120 --> 02:47.800
let's draw a horizontal line. Forty-five

41
02:47.810 --> 02:49.620
characters are fine.

42
02:49.730 --> 02:51.230
Now I'm going to range over the map,

43
02:51.230 --> 02:51.680
like so,

44
02:54.630 --> 03:00.760
and I'm going to print the total visits per domain. It works.

45
03:00.820 --> 03:03.740
So it prints the total visits per domain.

46
03:04.310 --> 03:09.630
However, when I run it a few times, as you can see, it prints the domain names in a random order.

47
03:09.860 --> 03:15.160
As I explained in the last lecture, you usually shouldn't loop over maps.

48
03:15.230 --> 03:18.320
It's a telltale sign of a design mistake.

49
03:18.320 --> 03:19.490
So what should you do?

50
03:20.420 --> 03:23.360
I'm going to go to the top of the file here.

51
03:23.420 --> 03:29.740
I'm going to declare a slice to store the unique domain names. Instead of using the map for printing

52
03:29.740 --> 03:30.700
the domain names,

53
03:30.700 --> 03:32.510
I'm going to use this slice.

54
03:32.920 --> 03:34.720
Okay, let's go down here.

55
03:34.750 --> 03:37.690
I'm going to check whether the domain is in the map or not,

56
03:37.690 --> 03:41.980
like so. If the domain doesn't exist in the map,

57
03:42.010 --> 03:44.350
I'm going to add it to the slice.

58
03:44.350 --> 03:49.770
This way, the slice will only contain the unique domain names. Here,

59
03:49.810 --> 03:57.560
instead of looping over the map, I'm going to loop over the domain slice instead. So, for the total visits,

60
03:57.680 --> 03:59.900
I'm going to get them from the map instead, like so.

61
04:03.320 --> 04:08.810
Even though I run it several times, it always prints the result in the same order.

62
04:08.870 --> 04:16.970
Awesome. Let's also sort the domain names using the sort package, like so. Cool.

63
04:17.010 --> 04:17.420
Okay.

64
04:17.500 --> 04:21.050
Let me also print the total visits across all the domains.

65
04:21.190 --> 04:25.970
To do that, let's go to the top of the file here.

66
04:25.970 --> 04:28.470
I'm going to declare a variable, like so.

67
04:28.880 --> 04:31.310
Now let's go down to the loop here.

68
04:31.310 --> 04:35.560
I'm going to add visits to the total visits. OK.

69
04:35.600 --> 04:38.060
Let's go down and let's print it here.

70
04:42.710 --> 04:43.890
Awesome.

71
04:43.910 --> 04:46.010
Now it's time to handle the errors.

72
04:46.460 --> 04:48.790
Let's go to the beginning of the loop.

73
04:48.920 --> 04:53.480
First, I'm going to check whether the current line has two fields or not.

74
04:53.480 --> 04:56.390
If not, I'm going to say wrong input, like so.

75
04:59.950 --> 05:01.940
OK, let's test it.

76
05:02.080 --> 05:07.780
To do that, I'm going to duplicate the log file. Here, I'm going to remove the second field to check

77
05:07.780 --> 05:14.480
whether I correctly handled the error or not, like so. Let me save this file as log_err_missing.txt.

79
05:15.970 --> 05:19.400
Let's try it with the incorrect log file first.

80
05:19.480 --> 05:23.880
Cool, the program detects the error. Okay.

81
05:23.880 --> 05:26.180
Now let's try something else.

82
05:26.180 --> 05:28.420
Let me duplicate the log file again.

83
05:28.880 --> 05:36.710
Let me change the value of the visits to a string this time. I'm going to save it as log_err_str.

84
05:39.490 --> 05:41.090
The program skips that line,

85
05:41.200 --> 05:43.330
so it ignores the error.

86
05:43.330 --> 05:45.340
This is not good.

87
05:45.340 --> 05:52.480
Let's handle the error, like so. If there is an error, I'm going to say wrong input, then I'm going to

88
05:52.480 --> 05:59.950
terminate the program. Cool, it detects the error. But not so fast.

89
06:00.020 --> 06:03.680
Let's see what happens if there is a negative visits count.

90
06:03.770 --> 06:05.990
Let me duplicate the log file again.

91
06:06.170 --> 06:09.390
Here, I'm going to change this to a negative number.

92
06:09.530 --> 06:17.450
This time I'm going to save it as log_err_negative. Visits to a domain should always be zero or

93
06:17.450 --> 06:18.780
a positive number,

94
06:18.860 --> 06:19.700
right?

95
06:19.700 --> 06:23.060
So let's see what happens.

96
06:23.080 --> 06:27.260
It says minus forty-four people have visited all the domains.

97
06:27.310 --> 06:30.430
Clearly, I should handle this error as well.

98
06:30.430 --> 06:37.420
To do that, I'm going to add another condition here, like so. Cool, it works.

99
06:37.660 --> 06:43.000
Let's make the error message more descriptive by adding the line number of the error as well.

100
06:43.060 --> 06:45.790
To do that, let's go to the top here.

101
06:45.820 --> 06:48.790
I'm going to declare a new variable, lines.

102
06:48.790 --> 06:51.150
Now let's go to the loop here.

103
06:51.160 --> 06:52.270
I'm going to increment it.

104
06:52.810 --> 06:55.760
OK, let's add it to the error messages.

105
06:56.200 --> 06:59.430
First, I'm going to print the error line for the first message,

106
06:59.470 --> 06:59.890
like so.

107
07:03.120 --> 07:06.750
Cool, now it tells you that the error is in the third line.

108
07:07.560 --> 07:09.500
So, why stop here?

109
07:09.510 --> 07:13.140
Let's also add it to the second error message, like so.

110
07:18.850 --> 07:19.380
Excellent.

111
07:19.420 --> 07:20.750
It works.

112
07:20.810 --> 07:25.830
OK, now let's handle the error from the scanner as well.

113
07:25.870 --> 07:28.940
Let me also print it.

114
07:28.970 --> 07:29.960
All right.

115
07:29.960 --> 07:35.550
That's all for now. In the upcoming sections, you will revisit this project several times.

116
07:35.810 --> 07:37.880
So, well done, congrats!

117
07:37.880 --> 07:39.050
See you in the next section.