WEBVTT 0 00:02.160 --> 00:03.620 Welcome back. 1 00:03.660 --> 00:06.420 A set contains only unique values, 2 00:06.420 --> 00:07.350 just like the keys of a map. 3 00:07.620 --> 00:11.700 In this lecture, you're going to learn how to use a map as a set. 4 00:11.730 --> 00:17.790 You're going to build a program that can answer whether a given word exists in a file or not. 5 00:17.790 --> 00:19.100 All right, let's get started. 6 00:23.210 --> 00:27.120 First of all, let me show you the input file that you're going to use. 7 00:27.140 --> 00:31.300 This is a passage from Shakespeare's Romeo and Juliet. 8 00:31.310 --> 00:35.570 It contains 70 unique words and 99 words in total. 9 00:35.570 --> 00:41.600 So the map that you're going to create will store only 70 words, because map keys are unique. 10 00:41.600 --> 00:47.110 OK, now I'm going to create a new scanner that uses the standard input. 11 00:47.160 --> 00:53.250 How can you split the input into words? You can call the strings.Split function for each line, 12 00:53.280 --> 00:54.120 right? 13 00:54.180 --> 01:00.990 However, there is a more natural way: you can configure the scanner to scan for words instead 14 01:00.990 --> 01:02.240 of lines. 15 01:02.400 --> 01:08.740 All you need to do is call the Split method and tell it to scan for words, like so. 16 01:08.760 --> 01:10.980 Let me show you how it works. 17 01:11.010 --> 01:16.470 First I need to scan for the words, like so. Each time I call the Scan method, 18 01:16.470 --> 01:19.710 it will scan for a word, not a line. 19 01:19.710 --> 01:20.790 Let me show you. 20 01:21.030 --> 01:24.870 I'm going to print the scanned word by calling the Text method, like so. 21 01:28.230 --> 01:33.520 As you can see, it scans for words instead of lines. All right. 22 01:33.570 --> 01:40.530 Now that you have a program that can scan for words, it is time to index the words, and you'll use a map to do 23 01:40.530 --> 01:40.890 that.
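The on-screen code is not part of this transcript, so here is a minimal sketch of what the word-scanning step might look like. It assumes the scanner is a bufio.Scanner configured with bufio.ScanWords; the helper names scanWords and fromString are hypothetical and exist only for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
)

// scanWords reads r word by word and returns the words in input order.
// (Hypothetical helper; the lecture prints each word inside the loop instead.)
func scanWords(r io.Reader) []string {
	in := bufio.NewScanner(r)
	in.Split(bufio.ScanWords) // split on whitespace instead of on line ends

	var words []string
	for in.Scan() {
		words = append(words, in.Text())
	}
	return words
}

// fromString is a hypothetical convenience wrapper for trying scanWords
// on a literal string instead of piped input.
func fromString(s string) []string {
	return scanWords(strings.NewReader(s))
}

func main() {
	// Each scanned token is a whitespace-separated word, not a line.
	for _, w := range scanWords(os.Stdin) {
		fmt.Println(w)
	}
}
```

Note that bufio.ScanWords only splits on whitespace, so punctuation stays attached to the word it touches.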
24 01:40.890 --> 01:46.950 I'm going to create a new map, like so. The key type is string and the element type is bool. 25 01:47.250 --> 01:53.190 The keys will be the unique words, and the corresponding elements will tell you whether a word is in the 26 01:53.190 --> 01:54.900 map or not. 27 01:54.900 --> 02:01.170 So the user will be able to find out whether a word exists in the input or not. 28 02:01.170 --> 02:06.750 Now I'm going to remove the Println call and get the word using the Text method, like so. 29 02:07.530 --> 02:14.740 Let me also convert the word to lowercase, because string map keys are case sensitive. 30 02:14.930 --> 02:17.660 I also don't want to index tiny words. 31 02:17.840 --> 02:23.020 So I'm going to index a word only if it has at least three letters. 32 02:23.080 --> 02:23.350 OK. 33 02:23.360 --> 02:30.680 Now it's time to index the word in the map, like so. But why do you need to set it to true? 34 02:30.920 --> 02:34.410 It's because this is what makes the map a set. 35 02:34.460 --> 02:35.950 I'll explain it in a minute. 36 02:35.960 --> 02:36.580 No worries. 37 02:37.550 --> 02:41.720 Let's say the user asks for the word "sun". 38 02:41.900 --> 02:44.150 Let's check the word using the comma-ok 39 02:44.390 --> 02:50.660 idiom, like so. If the word is in the map, the ok variable will be true. 40 02:50.720 --> 02:56.660 So I'm going to say the input contains the word, and I'm going to terminate the program using a return 41 02:56.660 --> 02:57.950 statement. 42 02:57.950 --> 03:01.880 Otherwise, I'm going to say: sorry, the input does not contain the word. 43 03:05.600 --> 03:08.420 Since the word "sun" is in the input, 44 03:08.420 --> 03:11.220 it says the input contains "sun". 45 03:11.360 --> 03:18.750 Let's see whether Shakespeare talks about gophers. Clearly, he does not. 46 03:18.910 --> 03:21.250 I'm not surprised at this at all. 47 03:21.370 --> 03:23.800 However, this code is not ideal.
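The program as described up to this point could be sketched roughly as follows. This is an illustration under assumptions, not the lecture's exact code: the helper names addWord and contains are hypothetical, and the messages are paraphrased from the narration.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// addWord lowercases word and stores it as a key of the set.
// Words shorter than three bytes are skipped (for ASCII text,
// bytes and letters are the same thing).
func addWord(words map[string]bool, word string) {
	word = strings.ToLower(word)
	if len(word) >= 3 {
		words[word] = true
	}
}

// contains uses the comma-ok idiom: ok is true only if query is a key.
func contains(words map[string]bool, query string) bool {
	_, ok := words[query]
	return ok
}

func main() {
	in := bufio.NewScanner(os.Stdin)
	in.Split(bufio.ScanWords)

	// Key: a unique word. Element: true for every indexed word.
	words := make(map[string]bool)
	for in.Scan() {
		addWord(words, in.Text())
	}

	query := "sun" // hard-coded at this stage of the lecture
	if contains(words, query) {
		fmt.Printf("The input contains %q.\n", query)
		return
	}
	fmt.Printf("Sorry. The input does not contain %q.\n", query)
}
```

Storing the same word twice simply overwrites the existing key, which is why the map never holds duplicates.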
48 03:23.980 --> 03:32.350 For example, let me show you what the map returns when I ask for "sun" and "Tesla". As you can see, "sun" returns 49 03:32.350 --> 03:35.350 true, whereas "Tesla" returns false. 50 03:35.570 --> 03:40.850 It's because when the scanner returned the word "sun", I set it to true in the map. 51 03:40.900 --> 03:43.240 That's why it returns true here. 52 03:43.270 --> 03:50.410 However, the map doesn't contain the word "Tesla", so it returns false, because the zero value for bool is 53 03:50.530 --> 03:51.490 false. 54 03:51.490 --> 03:54.890 So I don't need to use the comma-ok idiom here. 55 03:55.120 --> 04:02.680 I can directly check whether a word exists in the map or not, like so. Let me also remove this code. As 56 04:02.680 --> 04:04.600 you can see, it still works. 57 04:04.610 --> 04:09.480 Let's also try it with the word "sun". It works. 58 04:09.500 --> 04:12.580 Now it's time to get the word from the user. 59 04:12.590 --> 04:17.120 I'm going to fast forward through this part because you already know how it works. 60 04:17.180 --> 04:20.330 I'm going to remove the query variable from here. 61 04:20.430 --> 04:22.520 I'm going to get it from the user. 62 04:22.520 --> 04:32.260 All right, let's try it. As you can see, it works. Although there are 99 words in the input file here, 63 04:32.270 --> 04:39.930 you only need to store 70 words, because map keys are unique, and that's efficient. By the way, 64 04:39.930 --> 04:41.760 you don't have to pipe a file. 65 04:41.880 --> 04:46.900 You can pipe anything, but for now let's enter the input directly. 66 04:46.980 --> 04:49.290 I'm going to look for the word "beautiful". 67 04:49.470 --> 04:57.030 Now I'm going to type "today is a beautiful day". As you can see, it says that the input contains the word 68 04:57.030 --> 04:59.200 "beautiful". Cool! 69 04:59.280 --> 05:04.130 You can even check whether a website contains a word that you are looking for.
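A sketch of the simplified lookup described above, where the map is indexed directly and the zero value of bool stands in for "not found". Reading the query from os.Args is an assumption (the lecture fast-forwards through that step), and the helper name addWord is hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// addWord lowercases word and adds it to the set if it has
// at least three bytes.
func addWord(words map[string]bool, word string) {
	word = strings.ToLower(word)
	if len(word) >= 3 {
		words[word] = true
	}
}

func main() {
	// Assumption: the query is the first command-line argument,
	// because standard input is reserved for the piped text.
	if len(os.Args) < 2 {
		fmt.Println("Please give me a word to search.")
		return
	}
	query := strings.ToLower(os.Args[1])

	in := bufio.NewScanner(os.Stdin)
	in.Split(bufio.ScanWords)

	words := make(map[string]bool)
	for in.Scan() {
		addWord(words, in.Text())
	}

	// A missing key yields false, the zero value of bool,
	// so the comma-ok idiom is not needed.
	if words[query] {
		fmt.Printf("The input contains %q.\n", query)
		return
	}
	fmt.Printf("Sorry. The input does not contain %q.\n", query)
}
```

Indexing a map with a missing key does not insert that key, so repeated failed lookups leave the set unchanged.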
70 05:04.260 --> 05:08.140 To do that, you can pipe the output of the curl command to your program. 71 05:08.300 --> 05:12.020 The curl tool can retrieve the contents of a website. 72 05:12.180 --> 05:19.020 For example, I can get a text document from the web, like so. The "-s" argument here tells curl 73 05:19.020 --> 05:21.810 to print only the content. 74 05:21.810 --> 05:22.470 Normally, 75 05:22.470 --> 05:30.570 it also shows the download progress; the -s flag prevents that. As you can see, it retrieves the source 76 05:30.570 --> 05:32.580 code of the article. 77 05:32.580 --> 05:36.000 Now I can search for a word in it using our program. 78 05:36.210 --> 05:39.570 First I need to type the same command again. 79 05:39.570 --> 05:42.260 Then I need to pipe the content to our program, 80 05:42.300 --> 05:45.480 like so. I'm going to search for the word "outside", 81 05:45.510 --> 05:46.030 like so. 82 05:47.580 --> 05:49.710 As you can see, the article contains the word. 83 05:50.430 --> 05:52.670 Let me show you the indexed words. 84 05:52.890 --> 06:01.460 I'm going to loop over the keys of the map, then I'm going to print each word. As you can see, the program 85 06:01.500 --> 06:09.270 not only indexes the words, it also indexes numbers and punctuation characters. Sometimes a word 86 06:09.420 --> 06:11.890 ends with a dot or a comma. 87 06:11.940 --> 06:18.870 So if you search for these words, it won't be able to find them. For example, let me search for the word 88 06:18.930 --> 06:23.290 "inclusive". Even though the document contains the word "inclusive", 89 06:23.370 --> 06:25.000 the program couldn't find it. 90 06:25.010 --> 06:30.120 It's because the map does an exact match and the word ends with a dot. 91 06:30.270 --> 06:34.850 For example, if I type "inclusive." with the dot, it can find the word. 92 06:34.860 --> 06:35.760 Now it works. 93 06:35.850 --> 06:37.290 But this is not good. 94 06:38.010 --> 06:42.140 So how can you remove the numbers and punctuation characters?
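To make the exact-match problem concrete, here is a small self-contained sketch. The sample sentence is made up for illustration, the helper name index is hypothetical, and the three-letter filter is left out for brevity.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// index builds a set of the lowercased words in text, exactly as the
// scanner returns them, with punctuation still attached.
func index(text string) map[string]bool {
	in := bufio.NewScanner(strings.NewReader(text))
	in.Split(bufio.ScanWords)

	set := make(map[string]bool)
	for in.Scan() {
		set[strings.ToLower(in.Text())] = true
	}
	return set
}

func main() {
	// Made-up sample input, not the article used in the lecture.
	set := index("Open and inclusive. Since 2009, open to all.")

	// Looping over a map visits its keys in an unspecified order.
	for word := range set {
		fmt.Println(word)
	}

	fmt.Println(set["inclusive"])  // false: the stored key has a trailing dot
	fmt.Println(set["inclusive."]) // true: exact match
}
```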
95 06:42.150 --> 06:48.210 Well, there are a lot of ways, but the simplest one for now is using a regular expression. 96 06:48.210 --> 06:51.360 This lecture is about maps and scanners, as you know. 97 06:51.360 --> 06:55.050 So I'm not going to talk about regular expressions in full detail. 98 06:55.090 --> 07:02.160 They're out of scope, but put simply, a regular expression allows us to find a pattern within a text. 99 07:02.790 --> 07:04.770 To create a regular expression, 100 07:04.770 --> 07:08.470 you can use the regexp package. 101 07:08.700 --> 07:13.790 I'm going to create a pattern that matches any characters except letters. 102 07:14.330 --> 07:19.890 I've called the MustCompile function outside the loop because it's a costly operation. 103 07:19.890 --> 07:26.700 You need to compile the regular expression pattern only once; then you can use it as many times 104 07:26.820 --> 07:27.880 as you want. 105 07:27.990 --> 07:31.970 Compiling a pattern allows the matcher to work faster. 106 07:31.980 --> 07:32.240 Okay. 107 07:32.250 --> 07:34.250 Now let's see how it works. 108 07:34.380 --> 07:39.090 The MustCompile function takes a string and returns a matcher. 109 07:39.270 --> 07:46.070 If you give it an incorrect pattern, it will crash your program. By convention, any function whose name starts with the word 110 07:46.080 --> 07:49.710 Must does that: it must work or it will crash. 111 07:49.710 --> 07:52.020 I know that my pattern is fine. 112 07:52.020 --> 07:55.450 So there is no harm in calling this function here. 113 07:55.470 --> 07:59.010 The square brackets mean: match a single character from the set inside them. 114 07:59.010 --> 08:05.760 The caret symbol negates the set: do not match these characters. a-z means any character from a to z. 115 08:06.300 --> 08:10.740 And lastly, the plus sign means: match the same pattern 116 08:10.890 --> 08:12.420 one or more times.
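The pattern described above is presumably `[^a-z]+`; the exact on-screen string is not in the transcript, so treat it as an assumption. The sketch below compiles it once at package level (the lecture only says "outside the loop") and shows what it matches; the variable name nonLetters is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// nonLetters matches one or more consecutive characters outside a-z.
// MustCompile panics if the pattern is invalid, so it suits patterns
// that are fixed and known to be correct.
var nonLetters = regexp.MustCompile(`[^a-z]+`)

func main() {
	fmt.Println(nonLetters.MatchString("inclusive"))  // false
	fmt.Println(nonLetters.MatchString("inclusive.")) // true

	// The space, the digits, the comma, and the next space form one run.
	fmt.Printf("%q\n", nonLetters.FindAllString("since 2009, open", -1)) // [" 2009, "]
}
```

Because the character class lists only lowercase letters, the pattern also treats uppercase letters as non-letters, which is why the words are lowercased first.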
117 08:12.900 --> 08:21.180 So in summary, this pattern means: find any character except letters, one or more times. The MustCompile 118 08:21.180 --> 08:27.860 function returns a pattern matcher, so I can use it to remove anything except letters. 119 08:27.900 --> 08:29.570 Let's use it down here. 120 08:29.580 --> 08:36.860 I'm going to call another method to replace anything but letters, like so. You give it a string and 121 08:36.860 --> 08:40.020 you say: search for the pattern inside the word, 122 08:40.010 --> 08:43.300 and if you find it, replace it with an empty string. 123 08:43.400 --> 08:46.310 So this will remove anything but the letters. 124 08:46.310 --> 08:47.330 Let me show you. 125 08:47.450 --> 08:53.190 As you can see, now there are only words, without any numbers or punctuation characters. 126 08:53.260 --> 08:54.070 Awesome. 127 08:54.140 --> 09:00.380 For example, the word "inclusive" doesn't end with a dot anymore because I have removed it using the regular 128 09:00.380 --> 09:01.050 expression. 129 09:01.730 --> 09:06.020 So now the program says that the document contains the word "inclusive". 130 09:06.110 --> 09:06.800 Cool. 131 09:06.800 --> 09:13.250 These were a few examples to show you that, using bufio.Scanner, you can get input from anywhere, 132 09:13.640 --> 09:16.160 and you can use a map as a set. 133 09:16.200 --> 09:16.750 All right. 134 09:16.850 --> 09:19.740 That's how you can use a map as a set. 135 09:19.790 --> 09:20.790 That's all for now. 136 09:20.810 --> 09:21.830 See you in the next lecture.
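For reference, the finished program described in this lecture might look roughly like the sketch below. It is a reconstruction under assumptions, not the lecture's exact code: the pattern string, the os.Args query, the messages, and the helper name clean are all hypothetical. The replacement call is regexp's ReplaceAllString.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strings"
)

// Compiled once, then reused for every scanned word.
var nonLetters = regexp.MustCompile(`[^a-z]+`)

// clean lowercases word and replaces every run of non-letter
// characters with an empty string.
func clean(word string) string {
	return nonLetters.ReplaceAllString(strings.ToLower(word), "")
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("Please give me a word to search.")
		return
	}
	query := strings.ToLower(os.Args[1])

	in := bufio.NewScanner(os.Stdin)
	in.Split(bufio.ScanWords)

	words := make(map[string]bool)
	for in.Scan() {
		word := clean(in.Text())
		if len(word) >= 3 { // skip tiny words and tokens that were all punctuation
			words[word] = true
		}
	}

	if words[query] {
		fmt.Printf("The input contains %q.\n", query)
		return
	}
	fmt.Printf("Sorry. The input does not contain %q.\n", query)
}
```

One side effect of this approach is that characters inside a word are removed too, so "don't" is indexed as "dont".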