1 00:00:00,000 --> 00:00:00,000 2 00:00:00,000 --> 00:00:03,950 MALE SPEAKER: Thank you for coming, everybody. 3 00:00:03,950 --> 00:00:05,160 Some of you have probably already 4 00:00:05,160 --> 00:00:07,720 heard of Linus Torvalds. 5 00:00:07,719 --> 00:00:09,129 Those of you who haven't, you're the people with 6 00:00:09,130 --> 00:00:10,380 Macintoshes on your laps. 7 00:00:10,380 --> 00:00:12,710 8 00:00:12,710 --> 00:00:16,370 He's a guy who delights in being cruel to people. 9 00:00:16,370 --> 00:00:18,570 His latest cruel act is to create a revision control 10 00:00:18,570 --> 00:00:21,269 system which is expressly designed to make you feel less 11 00:00:21,269 --> 00:00:25,199 intelligent than you thought you were. 12 00:00:25,199 --> 00:00:27,500 Thank you for coming down today, Linus. 13 00:00:27,500 --> 00:00:29,600 I've been getting emails for the past few days from people 14 00:00:29,600 --> 00:00:30,540 saying, where's Linus? 15 00:00:30,539 --> 00:00:31,960 Why hasn't he merged my tree? 16 00:00:31,960 --> 00:00:33,530 Doesn't he love me anymore? 17 00:00:33,530 --> 00:00:34,750 And he walked into my office this afternoon. 18 00:00:34,750 --> 00:00:36,609 What are you doing here? 19 00:00:36,609 --> 00:00:38,609 But thank you for taking the time off. 20 00:00:38,609 --> 00:00:41,594 So Linus is here today to explain to us why on Earth he 21 00:00:41,594 --> 00:00:45,030 would write a software tool which only he is smart enough 22 00:00:45,030 --> 00:00:45,880 to know how to use. 23 00:00:45,880 --> 00:00:47,130 Thanks, Linus.
24 00:00:47,130 --> 00:00:56,835 25 00:00:56,835 --> 00:01:00,609 LINUS TORVALDS: So I have a few words of warning, which is 26 00:01:00,609 --> 00:01:03,420 I don't actually do speaking very much, partly because I 27 00:01:03,420 --> 00:01:05,730 don't like speaking, partly because over the last few 28 00:01:05,730 --> 00:01:09,859 years everybody actually wants me to talk about nebulous 29 00:01:09,859 --> 00:01:13,209 visions for the next century about Linux. 30 00:01:13,209 --> 00:01:16,989 And I'm a tech geek, so I actually prefer talking about 31 00:01:16,989 --> 00:01:18,269 technology. 32 00:01:18,269 --> 00:01:21,149 So that's why I am not talking about the kernel, because it's 33 00:01:21,150 --> 00:01:24,380 just too big to cram into a one-hour talk. 34 00:01:24,379 --> 00:01:29,909 Although apparently, Andrew did that two days ago. 35 00:01:29,909 --> 00:01:32,450 And I'm instead talking about Git, which is the source 36 00:01:32,450 --> 00:01:34,859 control management system that we use for the kernel. 37 00:01:34,859 --> 00:01:37,790 38 00:01:37,790 --> 00:01:44,010 I'm really, really, really bad at doing slides, which means 39 00:01:44,010 --> 00:01:47,400 that if we actually end up following these slides, you 40 00:01:47,400 --> 00:01:51,150 will be bored out of your mind and the talk will probably not 41 00:01:51,150 --> 00:01:53,560 be very good anyway. 42 00:01:53,560 --> 00:01:57,010 So I am the kind of speaker who really 43 00:01:57,010 --> 00:01:58,760 enjoys getting questions. 44 00:01:58,760 --> 00:02:03,330 And if that means that we kind of veer off in a tangent, 45 00:02:03,329 --> 00:02:05,429 you'll be happier, I'll be happier, the talk will 46 00:02:05,430 --> 00:02:07,510 probably be more interesting anyway. 47 00:02:07,510 --> 00:02:10,620 I don't know how you do things here at the Google talks, but 48 00:02:10,620 --> 00:02:14,050 I'm just saying don't feel shy as far as I'm concerned. 
49 00:02:14,050 --> 00:02:18,510 If your manager will shoot you, that's your problem. 50 00:02:18,509 --> 00:02:22,310 So next slide. 51 00:02:22,310 --> 00:02:26,080 I want to give a few credits before I start. 52 00:02:26,080 --> 00:02:32,960 Credit CVS in a very, very negative way because in many 53 00:02:32,960 --> 00:02:38,310 ways when I designed Git, it's the what would Jesus do? 54 00:02:38,310 --> 00:02:43,009 Except it's what would CVS never, ever do kind of 55 00:02:43,009 --> 00:02:46,590 approach to source control management. 56 00:02:46,590 --> 00:02:50,020 I've never actually used CVS for the kernel. 57 00:02:50,020 --> 00:02:53,680 For the first 10 years of kernel maintenance, we 58 00:02:53,680 --> 00:02:57,260 literally used tarballs and patches, which is a much 59 00:02:57,259 --> 00:03:02,840 superior source control management system than CVS is. 60 00:03:02,840 --> 00:03:06,009 But I did end up using CVS for seven years at a commercial 61 00:03:06,009 --> 00:03:10,169 company and I hated it with a passion. 62 00:03:10,169 --> 00:03:13,759 When I say I hate CVS with a passion, I have to also say 63 00:03:13,759 --> 00:03:17,759 that if there are any SVN users, Subversion users, in 64 00:03:17,759 --> 00:03:24,519 the audience, you might want to leave because my hatred of 65 00:03:24,520 --> 00:03:29,120 CVS has meant that I see Subversion as being the most 66 00:03:29,120 --> 00:03:34,800 pointless project ever started, because the slogan 67 00:03:34,800 --> 00:03:39,469 for Subversion for a while was, CVS done right or 68 00:03:39,469 --> 00:03:42,060 something like that. 69 00:03:42,060 --> 00:03:45,789 And if you start with that kind of slogan, there's 70 00:03:45,789 --> 00:03:48,919 nowhere you can go. 71 00:03:48,919 --> 00:03:51,959 There is no way to do CVS right. 72 00:03:51,960 --> 00:03:55,349 So that's the negative kind of credit. 73 00:03:55,349 --> 00:03:57,969 The positive credit is BitKeeper.
74 00:03:57,969 --> 00:04:00,900 And I realize that a lot of people thought there was a lot 75 00:04:00,900 --> 00:04:04,539 of strife over BitKeeper and that the parting was very 76 00:04:04,539 --> 00:04:06,120 painful in many ways. 77 00:04:06,120 --> 00:04:12,200 As far as I'm concerned, the parting was amicable, even 78 00:04:12,199 --> 00:04:16,889 though it looked very non-amicable to outsiders. 79 00:04:16,889 --> 00:04:23,409 And BitKeeper was not only the first source control system 80 00:04:23,410 --> 00:04:27,820 that I ever felt was worth using at all, it was also the 81 00:04:27,819 --> 00:04:30,769 source control system that taught me why there's a point 82 00:04:30,769 --> 00:04:33,779 to them and how you actually can do things. 83 00:04:33,779 --> 00:04:37,199 So Git in many ways, even though from a technical angle 84 00:04:37,199 --> 00:04:40,490 it is very, very different from BitKeeper, which was 85 00:04:40,490 --> 00:04:43,490 another design goal because I wanted to make it clear that 86 00:04:43,490 --> 00:04:48,240 it wasn't a BitKeeper clone, a lot of the flows we use with 87 00:04:48,240 --> 00:04:52,629 Git come directly from the flows we 88 00:04:52,629 --> 00:04:55,550 learned from BitKeeper. 89 00:04:55,550 --> 00:05:01,319 And I don't think you use BitKeeper here inside Google. 90 00:05:01,319 --> 00:05:05,139 As far as I know, BitKeeper is the only commercial source 91 00:05:05,139 --> 00:05:08,629 control management system that actually does distribution. 92 00:05:08,629 --> 00:05:11,810 And if you need a commercial one, that's the one you should 93 00:05:11,810 --> 00:05:13,910 use, for that reason.
94 00:05:13,910 --> 00:05:18,470 I'd also like to point out that I've been doing Git now 95 00:05:18,470 --> 00:05:24,640 for slightly over two years, but while I started it and I 96 00:05:24,639 --> 00:05:27,990 made all the initial coding design, it's actually being 97 00:05:27,990 --> 00:05:32,310 maintained by a much more pleasant person, Junio 98 00:05:32,310 --> 00:05:34,259 Hamano, for the last year and a half. 99 00:05:34,259 --> 00:05:39,529 And he's really the person who actually made it more 100 00:05:39,529 --> 00:05:42,359 approachable for mere mortals. 101 00:05:42,360 --> 00:05:46,060 Early versions of Git did require a certain amount of 102 00:05:46,060 --> 00:05:50,209 brainpower to really wrap your mind around. 103 00:05:50,209 --> 00:05:53,500 It's gotten much, much easier since. 104 00:05:53,500 --> 00:05:57,439 Obviously the way I always do everything is I try to get 105 00:05:57,439 --> 00:05:59,910 everybody else to do as much as possible so that I can sit 106 00:05:59,910 --> 00:06:03,160 back and sip my pina colada, so there's been a lot of other 107 00:06:03,160 --> 00:06:04,860 people involved, too. 108 00:06:04,860 --> 00:06:06,270 That's the credits. 109 00:06:06,269 --> 00:06:07,680 With those out of the way. 110 00:06:07,680 --> 00:06:10,660 111 00:06:10,660 --> 00:06:17,010 So this slide is now one day old, and I didn't actually do 112 00:06:17,009 --> 00:06:20,550 the slides last night because last night I was out carousing 113 00:06:20,550 --> 00:06:22,360 and eating sushi. 114 00:06:22,360 --> 00:06:29,370 But the slides will talk about implementation of a high 115 00:06:29,370 --> 00:06:31,899 performance distributed content management thing. 116 00:06:31,899 --> 00:06:35,319 And the keyword here is actually the distributed part. 117 00:06:35,319 --> 00:06:37,959 I will start off trying to explain why 118 00:06:37,959 --> 00:06:40,159 distribution is so important.
119 00:06:40,160 --> 00:06:42,530 If we never get past that point, I 120 00:06:42,529 --> 00:06:44,529 will actually be happy. 121 00:06:44,529 --> 00:06:47,549 If we never get to actually what Git implementation 122 00:06:47,550 --> 00:06:51,079 internally is, it's fine. 123 00:06:51,079 --> 00:06:54,490 I am also not trying to teach you how to use Git. 124 00:06:54,490 --> 00:06:57,600 There is this thing called google.com. 125 00:06:57,600 --> 00:07:00,640 126 00:07:00,639 --> 00:07:02,029 You may have seen it. 127 00:07:02,029 --> 00:07:04,639 It has this thing you can type things into. 128 00:07:04,639 --> 00:07:07,719 You type Git and then you press the I'm Feeling Lucky 129 00:07:07,720 --> 00:07:11,450 button, and you will actually get the home page. 130 00:07:11,449 --> 00:07:17,259 The home page has tutorials, it has the user manual, 131 00:07:17,259 --> 00:07:20,459 they're all in HTML. 132 00:07:20,459 --> 00:07:23,979 If you actually want to learn to use Git, that's where you 133 00:07:23,980 --> 00:07:26,530 should start, not at this talk. 134 00:07:26,529 --> 00:07:30,109 But as mentioned, if we actually start veering off 135 00:07:30,110 --> 00:07:31,939 topic into other tangents because of 136 00:07:31,939 --> 00:07:36,199 questions, it's all good. 137 00:07:36,199 --> 00:07:40,779 I already gave you kind of a heads up warning on this. 138 00:07:40,779 --> 00:07:45,799 I use the term SCM, which I consider to mean Source Code 139 00:07:45,800 --> 00:07:48,750 Management, that is, revision control. 140 00:07:48,750 --> 00:07:52,810 Some other people think SCM means Software Configuration 141 00:07:52,810 --> 00:07:56,370 Management and see it as a much bigger feature, including 142 00:07:56,370 --> 00:07:58,280 release management and stuff like that. 143 00:07:58,279 --> 00:08:01,079 That's not what I'm talking about, although Git is clearly 144 00:08:01,079 --> 00:08:04,329 relevant in that setting, too.
145 00:08:04,329 --> 00:08:07,019 CVS, we already went there. 146 00:08:07,019 --> 00:08:11,689 You can disagree with me as much as you want, but during 147 00:08:11,689 --> 00:08:15,829 this talk, by definition anybody who disagrees is 148 00:08:15,829 --> 00:08:18,055 stupid and ugly. 149 00:08:18,055 --> 00:08:20,319 So keep that in mind. 150 00:08:20,319 --> 00:08:23,469 When I'm done speaking, you can go on with your lives. 151 00:08:23,470 --> 00:08:26,070 Right now, yes. 152 00:08:26,069 --> 00:08:32,399 I have strong opinions and CVS users, if you actually like 153 00:08:32,399 --> 00:08:36,319 using CVS, you shouldn't be here. 154 00:08:36,320 --> 00:08:41,320 You should be in some mental institution somewhere else. 155 00:08:41,320 --> 00:08:44,550 156 00:08:44,549 --> 00:08:47,039 So before I actually go and talk about the whole distribution 157 00:08:47,039 --> 00:08:50,009 thing, which I think is the most important part, I'll talk 158 00:08:50,009 --> 00:08:53,169 a bit about the background because it invariably comes up 159 00:08:53,169 --> 00:08:57,259 because people, if they have heard about Git, a lot of the 160 00:08:57,259 --> 00:09:00,669 things they've heard about is the background for doing it in 161 00:09:00,669 --> 00:09:01,889 the first place. 162 00:09:01,889 --> 00:09:06,410 One piece of background information is I really am not 163 00:09:06,410 --> 00:09:07,849 an SCM person. 164 00:09:07,850 --> 00:09:09,509 I have never been very 165 00:09:09,509 --> 00:09:10,919 interested in revision control. 166 00:09:10,919 --> 00:09:14,729 I thought it was evil until I met BitKeeper. 167 00:09:14,730 --> 00:09:18,100 I actually credit that to some degree for why Git is so much 168 00:09:18,100 --> 00:09:19,750 better than everything else. 169 00:09:19,750 --> 00:09:22,840 It's because my brain did not rot from years and years of 170 00:09:22,840 --> 00:09:26,259 thinking CVS did something sane.
171 00:09:26,259 --> 00:09:28,000 I needed a replacement for BitKeeper. 172 00:09:28,000 --> 00:09:30,549 The reason for that was BitKeeper is a commercial 173 00:09:30,549 --> 00:09:34,500 product, but BitMover and Larry McVoy allowed it to be 174 00:09:34,500 --> 00:09:36,570 used freely for open source projects, as 175 00:09:36,570 --> 00:09:38,710 some of you may know. 176 00:09:38,710 --> 00:09:41,860 The only restriction was you were not supposed to reverse 177 00:09:41,860 --> 00:09:44,990 engineer it and you weren't supposed to try to create a 178 00:09:44,990 --> 00:09:46,980 competing product. 179 00:09:46,980 --> 00:09:49,039 And I was happy with that because, quite frankly, as far 180 00:09:49,039 --> 00:09:52,110 as I'm concerned I do open source because I think it's 181 00:09:52,110 --> 00:09:54,009 the only right way to do software. 182 00:09:54,009 --> 00:09:57,740 But at the same time, I'll use the best tool for the job and, 183 00:09:57,740 --> 00:10:00,730 quite frankly, BitKeeper was it. 184 00:10:00,730 --> 00:10:03,779 However, not everybody agreed with me. 185 00:10:03,779 --> 00:10:05,490 They are ugly and stupid. 186 00:10:05,490 --> 00:10:11,789 But they cause problems and it resulted in the fact that 187 00:10:11,789 --> 00:10:15,169 Larry and I had several telephone conversations which 188 00:10:15,169 --> 00:10:20,129 ended up saying we'll all be much happier if we just part 189 00:10:20,129 --> 00:10:22,629 ways and don't make this any worse. 190 00:10:22,629 --> 00:10:24,189 So we did. 191 00:10:24,190 --> 00:10:30,610 And I made the Linux 2.6.12-rc2 release about two 192 00:10:30,610 --> 00:10:34,399 years ago and said, I'm not going to touch Linux until I 193 00:10:34,399 --> 00:10:40,139 have a replacement for BitKeeper for doing source 194 00:10:40,139 --> 00:10:41,250 code maintenance. 
195 00:10:41,250 --> 00:10:43,990 And one of the replacement options was going back to 196 00:10:43,990 --> 00:10:45,789 tarballs and patches, but nobody 197 00:10:45,789 --> 00:10:48,399 really liked that anymore. 198 00:10:48,399 --> 00:10:53,370 So I actually looked at a lot of alternatives. 199 00:10:53,370 --> 00:10:58,600 Most of them I could discard without even trying them out. 200 00:10:58,600 --> 00:11:01,330 If you're not distributed, you're not worth using. 201 00:11:01,330 --> 00:11:02,340 It's that simple. 202 00:11:02,340 --> 00:11:05,060 If you perform badly, you're not worth using. 203 00:11:05,059 --> 00:11:06,619 It's that simple. 204 00:11:06,620 --> 00:11:10,529 And if you cannot guarantee that the stuff I put into an 205 00:11:10,529 --> 00:11:16,059 SCM comes out exactly the same, you're not worth using. 206 00:11:16,059 --> 00:11:19,099 Quite frankly, that pretty much took care of 207 00:11:19,100 --> 00:11:20,759 everything out there. 208 00:11:20,759 --> 00:11:24,120 There's a lot of SCM systems that do not guarantee that 209 00:11:24,120 --> 00:11:28,409 what you get out of it again is the same thing you put in. 210 00:11:28,409 --> 00:11:33,370 If you have memory corruption, if you have disk corruption, 211 00:11:33,370 --> 00:11:34,789 you may never know. 212 00:11:34,789 --> 00:11:38,699 The only way you'll know is you notice that there's 213 00:11:38,700 --> 00:11:42,360 corruption in the files when you check them out. 214 00:11:42,360 --> 00:11:45,800 The source control management system does not protect you at 215 00:11:45,799 --> 00:11:48,409 all, and this is not even uncommon. 216 00:11:48,409 --> 00:11:51,549 It is very, very common. 
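The requirement that what goes into an SCM must come out exactly the same can be illustrated with a small sketch. This is not Git's actual storage code, and nothing at this point in the talk says how Git does it; it is a toy under stated assumptions. The class name `VerifyingStore` and its methods are hypothetical, and SHA-256 is an arbitrary choice of hash for the illustration. The idea shown is content addressing: each object is stored under the hash of its own bytes, so silent disk or memory corruption is detected when the object is read back.

```python
import hashlib


class VerifyingStore:
    """Toy content-addressed store (hypothetical, for illustration only).

    Every object is keyed by the hash of its bytes, so a read can always
    check that the bytes coming out match the bytes that went in.
    """

    def __init__(self):
        self._objects = {}  # hex digest -> stored bytes

    def put(self, data: bytes) -> str:
        # The key is derived from the content itself.
        key = hashlib.sha256(data).hexdigest()
        self._objects[key] = data
        return key

    def get(self, key: str) -> bytes:
        data = self._objects[key]
        # Re-hash on the way out; a mismatch means the stored bytes changed.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError(f"object {key[:12]} is corrupt")
        return data


if __name__ == "__main__":
    store = VerifyingStore()
    key = store.put(b"int main(void) { return 0; }\n")
    print(store.get(key) == b"int main(void) { return 0; }\n")
```

A store that only keeps file contents under a name has nothing to compare against on checkout, which is the failure described above: corruption is noticed only when someone happens to look at the files.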
217 00:11:51,549 --> 00:11:55,839 The performance issue, one of the things I kind of liked was 218 00:11:55,840 --> 00:11:59,769 a system called monotone, which actually, I think there 219 00:11:59,769 --> 00:12:02,259 was a talk at Google about them some time 220 00:12:02,259 --> 00:12:05,200 ago, I'm not sure. 221 00:12:05,200 --> 00:12:08,170 It had a lot of interesting ideas, but performance was so 222 00:12:08,169 --> 00:12:14,389 horrendously bad that I tried it for a day and realized that 223 00:12:14,389 --> 00:12:16,230 I cannot use it. 224 00:12:16,230 --> 00:12:19,690 The end result was I decided I can write something better 225 00:12:19,690 --> 00:12:23,700 than anything out there in two weeks, and I was right. 226 00:12:23,700 --> 00:12:28,730 So now we get to distribution. 227 00:12:28,730 --> 00:12:31,279 And this is the worst slide of them all, and I'm not very 228 00:12:31,279 --> 00:12:32,409 proud of it. 229 00:12:32,409 --> 00:12:34,549 And the problem is distribution is really, really 230 00:12:34,549 --> 00:12:37,289 important, but when I tried to make slides about it I 231 00:12:37,289 --> 00:12:38,759 could not do it. 232 00:12:38,759 --> 00:12:42,689 And part of it is my obvious artistic talents, which are on 233 00:12:42,690 --> 00:12:45,660 display for all of you, but part of it is that it's really 234 00:12:45,659 --> 00:12:48,399 hard to explain. 235 00:12:48,399 --> 00:12:51,809 So before you can start, I'd like to know how many people 236 00:12:51,809 --> 00:12:55,709 are used to the notion of a truly distributed source 237 00:12:55,710 --> 00:12:56,990 control management system? 238 00:12:56,990 --> 00:13:01,460 239 00:13:01,460 --> 00:13:04,300 Are most of you kernel developers? 240 00:13:04,299 --> 00:13:05,469 No, OK. 241 00:13:05,470 --> 00:13:09,080 So there were maybe 10 hands coming up. 
242 00:13:09,080 --> 00:13:15,920 Being distributed very much means that you do not have one 243 00:13:15,919 --> 00:13:20,009 central location that keeps track of your data. 244 00:13:20,009 --> 00:13:23,799 No single place is more important than any other 245 00:13:23,799 --> 00:13:25,309 single place. 246 00:13:25,309 --> 00:13:28,239 So for example, this is why I would never touch Subversion 247 00:13:28,240 --> 00:13:29,940 with a 10 foot pole. 248 00:13:29,940 --> 00:13:34,240 There is a massive Subversion repository, and it's where 249 00:13:34,240 --> 00:13:38,039 everybody has to write. 250 00:13:38,039 --> 00:13:41,539 The centralized model just doesn't work when you want to 251 00:13:41,539 --> 00:13:47,269 be-- let's look at a few of the cases. 252 00:13:47,269 --> 00:13:52,189 I say it's so much more than just offline work, but the 253 00:13:52,190 --> 00:13:56,040 offline work part is actually maybe the most obvious thing, 254 00:13:56,039 --> 00:13:59,819 which is that you can take a truly distributed source 255 00:13:59,820 --> 00:14:02,530 control management system, you can take it on a plane and 256 00:14:02,529 --> 00:14:06,959 even if they don't offer Wi-Fi and satellite hookups, you 257 00:14:06,960 --> 00:14:09,940 just continue working, you can look at all your logs, you can 258 00:14:09,940 --> 00:14:12,680 commit, you can do everything you would do even if you were 259 00:14:12,679 --> 00:14:15,429 connected to a nice gigabit ethernet 260 00:14:15,429 --> 00:14:17,819 directly to the backbone. 261 00:14:17,820 --> 00:14:19,830 And that is really important. 262 00:14:19,830 --> 00:14:24,610 It is doubly important when you have hundreds or thousands 263 00:14:24,610 --> 00:14:30,060 of people working on the same project and they may not be 264 00:14:30,059 --> 00:14:33,919 literally disconnected, but in practice they aren't really 265 00:14:33,919 --> 00:14:36,049 well-connected either. 
266 00:14:36,049 --> 00:14:40,559 So part of distribution is this offline work theme. 267 00:14:40,559 --> 00:14:43,859 Even if it's not completely offline, it is important to be 268 00:14:43,860 --> 00:14:47,340 able to do everything you want to do from any location 269 00:14:47,340 --> 00:14:49,639 without having to be able to access the server. 270 00:14:49,639 --> 00:14:53,519 271 00:14:53,519 --> 00:15:00,819 What that basic fact actually results in is that you 272 00:15:00,820 --> 00:15:04,629 effectively have a lot more branching because everybody 273 00:15:04,629 --> 00:15:09,259 who has a complete repository and can do commits on his own 274 00:15:09,259 --> 00:15:13,019 will effectively have his own branch, even if you don't 275 00:15:13,019 --> 00:15:13,689 realize it. 276 00:15:13,690 --> 00:15:17,650 Even if you think of your project as just having a 277 00:15:17,649 --> 00:15:20,750 single branch, every single time you disconnect your 278 00:15:20,750 --> 00:15:23,629 laptop and start working with it, you 279 00:15:23,629 --> 00:15:25,590 are on your own branch. 280 00:15:25,590 --> 00:15:28,680 And this is really, really important and is very 281 00:15:28,679 --> 00:15:32,819 different from anybody who's used CVS, where branching is 282 00:15:32,820 --> 00:15:36,280 considered something that only true gurus do. 283 00:15:36,279 --> 00:15:40,110 How many of you have ever used CVS? 284 00:15:40,110 --> 00:15:41,039 OK, everybody. 285 00:15:41,039 --> 00:15:43,889 How many of you have really done a branch and ever 286 00:15:43,889 --> 00:15:46,909 merged it in CVS? 287 00:15:46,909 --> 00:15:48,309 Good job. 288 00:15:48,309 --> 00:15:52,369 I mean, it wasn't everybody but it was actually more than 289 00:15:52,370 --> 00:15:53,139 I expected. 290 00:15:53,139 --> 00:15:57,419 How many of you enjoyed the experience? 291 00:15:57,419 --> 00:15:59,929 OK, so there were a couple. 292 00:15:59,929 --> 00:16:02,549 But it is considered hard. 
293 00:16:02,549 --> 00:16:07,399 In CVS, when you merge a branch-- 294 00:16:07,399 --> 00:16:09,439 I've done it as little as possible, but 295 00:16:09,440 --> 00:16:10,470 I've had to do it-- 296 00:16:10,470 --> 00:16:13,639 what you do is you plan ahead for a week and then you 297 00:16:13,639 --> 00:16:16,319 basically set aside one day for doing it. 298 00:16:16,320 --> 00:16:18,379 Am I wrong? 299 00:16:18,379 --> 00:16:21,559 I'm not seeing a lot of people say no, it was easy. 300 00:16:21,559 --> 00:16:23,389 I liked it. 301 00:16:23,389 --> 00:16:25,340 It's horrible. 302 00:16:25,340 --> 00:16:27,790 If you're distributed, you have to realize that every 303 00:16:27,789 --> 00:16:30,639 single person has his own branch. 304 00:16:30,639 --> 00:16:31,850 It's horrible. 305 00:16:31,850 --> 00:16:35,680 It's not something you even have to set up. 306 00:16:35,679 --> 00:16:37,189 It just is. 307 00:16:37,190 --> 00:16:40,060 In fact, in Git, we like branches so much that a lot of 308 00:16:40,059 --> 00:16:42,849 people just have five or ten or fifteen of them. 309 00:16:42,850 --> 00:16:46,750 Just because once you realize that you have to have a 310 00:16:46,750 --> 00:16:49,649 special branch anyway, you might as well have many. 311 00:16:49,649 --> 00:16:53,870 And one of the branches you do some experimental work on and 312 00:16:53,870 --> 00:16:59,230 one of the branches you do maintenance on. 313 00:16:59,230 --> 00:17:02,389 So branching is much more inherent when you do 314 00:17:02,389 --> 00:17:03,639 distribution. 315 00:17:03,639 --> 00:17:05,740 316 00:17:05,740 --> 00:17:10,338 One of the other things that, to me, is very important is 317 00:17:10,338 --> 00:17:15,710 that by being distributed, you also automatically get to be 318 00:17:15,710 --> 00:17:18,348 slightly more trustworthy. 319 00:17:18,348 --> 00:17:20,490 I have a theory of backups. 320 00:17:20,490 --> 00:17:21,559 I don't do them.
321 00:17:21,559 --> 00:17:25,539 I put stuff up on one side and everybody else mirrors it. 322 00:17:25,539 --> 00:17:30,240 And if I crash my own machine I don't really care, because I 323 00:17:30,240 --> 00:17:33,920 can just download my own work right back. 324 00:17:33,920 --> 00:17:36,519 And it works beautifully well, and I don't have to have an 325 00:17:36,519 --> 00:17:37,470 MIS department. 326 00:17:37,470 --> 00:17:41,370 I heartily suggest everybody else do the same. 327 00:17:41,369 --> 00:17:46,369 But this only really works in a distributed environment. 328 00:17:46,369 --> 00:17:51,500 If you use CVS, you can't do this. 329 00:17:51,500 --> 00:17:52,380 What do you use here? 330 00:17:52,380 --> 00:17:53,910 Perforce? 331 00:17:53,910 --> 00:17:55,150 Perforce. 332 00:17:55,150 --> 00:17:58,019 I'm sorry. 333 00:17:58,019 --> 00:18:05,369 I'm sure it's better than CVS. [WHISPERS]. 334 00:18:05,369 --> 00:18:06,989 So that's part of it. 335 00:18:06,990 --> 00:18:11,870 One of the really nice things which is also-- 336 00:18:11,869 --> 00:18:15,009 maybe you don't have this issue inside a company, but we 337 00:18:15,009 --> 00:18:17,769 certainly have it in every single open source community 338 00:18:17,769 --> 00:18:20,250 I've ever seen that uses CVS or Subversion or something 339 00:18:20,250 --> 00:18:24,140 like that-- is you have this notion of commit access. 340 00:18:24,140 --> 00:18:26,280 Because you have a central repository, it means that 341 00:18:26,279 --> 00:18:29,369 everybody who is working on that project needs to write to 342 00:18:29,369 --> 00:18:33,479 the central repository, which means that since you don't 343 00:18:33,480 --> 00:18:36,089 want everybody to write to the central repository because 344 00:18:36,089 --> 00:18:40,829 most people are morons, you create this class of people 345 00:18:40,829 --> 00:18:43,609 who are ostensibly not morons. 
346 00:18:43,609 --> 00:18:47,979 And most of the time, what happens is you make that class 347 00:18:47,980 --> 00:18:52,549 too small because it's really hard to know if a person is 348 00:18:52,549 --> 00:18:56,740 smart or not, and even when you make it too small, you 349 00:18:56,740 --> 00:19:02,029 will have problems. So this whole commit access issue, 350 00:19:02,029 --> 00:19:04,710 which some companies are able to ignore by just giving 351 00:19:04,710 --> 00:19:08,950 everybody commit access, is a huge psychological barrier and 352 00:19:08,950 --> 00:19:12,140 causes endless hours of politics in 353 00:19:12,140 --> 00:19:14,400 most open source projects. 354 00:19:14,400 --> 00:19:18,230 If you have a distributed model, it goes away. 355 00:19:18,230 --> 00:19:19,650 Everybody has commit access. 356 00:19:19,650 --> 00:19:23,100 You can do whatever you want to your project. 357 00:19:23,099 --> 00:19:27,789 You just get your own branch, you do great work or you do 358 00:19:27,789 --> 00:19:28,809 stupid work. 359 00:19:28,809 --> 00:19:29,720 Nobody cares. 360 00:19:29,720 --> 00:19:32,779 It's your copy, it's your branch. 361 00:19:32,779 --> 00:19:37,119 And later on, if it turns out you did a good job, you can 362 00:19:37,119 --> 00:19:39,169 tell people hey, here's my branch. 363 00:19:39,170 --> 00:19:42,640 And by the way, it performs 10 times faster than anybody 364 00:19:42,640 --> 00:19:46,310 else's branch, so nyah nyah nyah, how about 365 00:19:46,309 --> 00:19:48,039 pulling from me? 366 00:19:48,039 --> 00:19:49,059 And people do. 367 00:19:49,059 --> 00:19:50,819 And that's actually how it works, and we 368 00:19:50,819 --> 00:19:52,409 never have any politics. 369 00:19:52,410 --> 00:19:57,420 That's not quite true, but we have other politics. 370 00:19:57,420 --> 00:19:59,900 We don't have to worry about the commit access thing. 
371 00:19:59,900 --> 00:20:05,769 And I think this is a huge issue and that alone should 372 00:20:05,769 --> 00:20:09,589 mean that every single open source system should never use 373 00:20:09,589 --> 00:20:11,319 anything but a distributed model. 374 00:20:11,319 --> 00:20:14,460 You get rid of a lot of issues. 375 00:20:14,460 --> 00:20:17,610 For commercial companies, 376 00:20:17,609 --> 00:20:19,549 distributed models actually help also 377 00:20:19,549 --> 00:20:21,329 with the release process. 378 00:20:21,329 --> 00:20:25,210 You can have a verification team that has its own tree, 379 00:20:25,210 --> 00:20:27,490 and they pull from people and they verify it. 380 00:20:27,490 --> 00:20:30,319 And when they've verified it, they can push it to the 381 00:20:30,319 --> 00:20:34,740 release team and say, hey, we have now verified our version. 382 00:20:34,740 --> 00:20:38,120 And the development people, they can go on playing with 383 00:20:38,119 --> 00:20:39,479 their head. 384 00:20:39,480 --> 00:20:43,170 Instead of having to create tagged branches, whatever you 385 00:20:43,170 --> 00:20:48,600 do to try to keep off each other's toes, again, you keep 386 00:20:48,599 --> 00:20:52,369 off each other's toes by just every single group can have 387 00:20:52,369 --> 00:20:59,379 its own tree and track its work and what they want done. 388 00:20:59,380 --> 00:21:06,100 So distributed is really, really central to any SCM you 389 00:21:06,099 --> 00:21:08,059 should ever use. 390 00:21:08,059 --> 00:21:11,855 So get rid of Perforce now. 391 00:21:11,855 --> 00:21:15,849 [APPLAUSE] 392 00:21:15,849 --> 00:21:20,349 LINUS TORVALDS: It's sad, but it is so, so true. 393 00:21:20,349 --> 00:21:22,734 That was my only real slide about distribution.
394 00:21:22,734 --> 00:21:25,949 395 00:21:25,950 --> 00:21:30,640 I'd love to get questions, because we're now moving into 396 00:21:30,640 --> 00:21:34,000 other areas that-- 397 00:21:34,000 --> 00:21:37,329 AUDIENCE: So how would you do it? 398 00:21:37,329 --> 00:21:41,869 If you had this monstrously awesomely big code base, and 399 00:21:41,869 --> 00:21:45,279 you wanted to use this without stopping business for six 400 00:21:45,279 --> 00:21:48,639 months, how would you do it? 401 00:21:48,640 --> 00:21:52,850 LINUS TORVALDS: Stay by the mic because I couldn't quite 402 00:21:52,849 --> 00:21:54,839 make out your question. 403 00:21:54,839 --> 00:21:56,539 OK, he went away. 404 00:21:56,539 --> 00:21:57,670 How would you do this? 405 00:21:57,670 --> 00:21:59,875 AUDIENCE: [INAUDIBLE]. 406 00:21:59,875 --> 00:22:06,789 LINUS TORVALDS: So an example of actual distribution is you 407 00:22:06,789 --> 00:22:09,819 have a group of five people working on one small, 408 00:22:09,819 --> 00:22:11,819 particular feature. 409 00:22:11,819 --> 00:22:16,119 And that means that for a while, that feature will be 410 00:22:16,119 --> 00:22:18,549 very, very broken, right? 411 00:22:18,549 --> 00:22:21,799 Because nobody actually creates perfect code the first 412 00:22:21,799 --> 00:22:25,990 time around except me, but there's only one of me. 413 00:22:25,990 --> 00:22:31,200 So what happens is they need to have their own tree that 414 00:22:31,200 --> 00:22:35,180 they can work in without affecting other people. 415 00:22:35,180 --> 00:22:37,420 You can do this many different ways. 416 00:22:37,420 --> 00:22:41,110 In CVS, one of the most common ways, because branches are so 417 00:22:41,109 --> 00:22:44,649 painful, is that you don't actually commit. 418 00:22:44,650 --> 00:22:49,380 You never commit until it passes every single test. 
For 419 00:22:49,380 --> 00:22:53,260 example, at your company you have a very strict committing 420 00:22:53,259 --> 00:22:58,129 rule saying you will never, ever commit until it's past 421 00:22:58,130 --> 00:22:59,310 the whole test suite. 422 00:22:59,309 --> 00:23:02,230 And by the way, the fact that the test suite takes two hours 423 00:23:02,230 --> 00:23:05,250 to run, tough. 424 00:23:05,250 --> 00:23:07,490 You cannot afford to commit. 425 00:23:07,490 --> 00:23:09,150 And this is something that happens at 426 00:23:09,150 --> 00:23:10,290 every single company. 427 00:23:10,289 --> 00:23:13,529 I bet it happens even here at Google. 428 00:23:13,529 --> 00:23:16,420 You probably have a strict test suite, and you are not 429 00:23:16,420 --> 00:23:19,150 supposed to commit unless it passes. 430 00:23:19,150 --> 00:23:22,759 And then in practice, people make one-liner changes and 431 00:23:22,759 --> 00:23:24,730 ignore the test suite because they know the one-liner 432 00:23:24,730 --> 00:23:25,980 changes can't possibly break. 433 00:23:25,980 --> 00:23:28,339 434 00:23:28,339 --> 00:23:31,869 This happens. 435 00:23:31,869 --> 00:23:34,579 This is a horrible, horrible model. 436 00:23:34,579 --> 00:23:36,889 It just means that you make huge commits because you 437 00:23:36,890 --> 00:23:40,040 commit something after you've worked on it for two weeks, 438 00:23:40,039 --> 00:23:44,039 and you have three people working in the same sandbox 439 00:23:44,039 --> 00:23:47,980 because before they commit, they can't see the changes 440 00:23:47,980 --> 00:23:49,529 that the other people made. 441 00:23:49,529 --> 00:23:50,230 This is common. 442 00:23:50,230 --> 00:23:52,220 It happens everywhere, it's scary. 443 00:23:52,220 --> 00:23:54,980 The other alternative is to use branches even in a 444 00:23:54,980 --> 00:23:56,089 centralized environment. 
445 00:23:56,089 --> 00:23:59,879 But branches always end up being pretty expensive to do, 446 00:23:59,880 --> 00:24:02,015 so you can't do them for experimental features. 447 00:24:02,015 --> 00:24:05,320 448 00:24:05,319 --> 00:24:07,519 You don't know beforehand if it's something that's going to 449 00:24:07,519 --> 00:24:11,940 take one day or two weeks, but most of the time most 450 00:24:11,940 --> 00:24:16,759 programmers say, hey, I can do this in 48 hours. 451 00:24:16,759 --> 00:24:20,660 And it turns out, yeah, no you couldn't. 452 00:24:20,660 --> 00:24:24,050 But because you feel you can do it in 48 hours, creating a 453 00:24:24,049 --> 00:24:26,379 branch, even in systems that are better at creating 454 00:24:26,380 --> 00:24:29,210 branches than CVS, is a big pain. 455 00:24:29,210 --> 00:24:32,680 So you don't do it because you think you can get it resolved 456 00:24:32,680 --> 00:24:35,360 and you're back to case number one. 457 00:24:35,359 --> 00:24:38,349 But if you decide to create a branch, you will affect 458 00:24:38,349 --> 00:24:40,889 everybody else's repository because in a centralized 459 00:24:40,890 --> 00:24:43,290 environment, branches are global. 460 00:24:43,289 --> 00:24:45,339 So you're kind of screwing with everybody else, but at 461 00:24:45,339 --> 00:24:49,099 least you're not screwing with their main, head branch. 462 00:24:49,099 --> 00:24:54,009 You are adding stuff to their repositories, but hopefully in 463 00:24:54,009 --> 00:24:56,059 a way that they won't notice. 464 00:24:56,059 --> 00:25:01,389 But it does make everybody's repositories bigger. 465 00:25:01,390 --> 00:25:03,440 So either way, you can't win. 
466 00:25:03,440 --> 00:25:06,519 In contrast, in a distributed environment, what you do is 467 00:25:06,519 --> 00:25:09,700 you have five people, they pull the current head, which 468 00:25:09,700 --> 00:25:13,120 is hopefully good and tested, and they start working on it 469 00:25:13,119 --> 00:25:14,669 and they start committing on it. 470 00:25:14,670 --> 00:25:18,970 And you don't need to wait for two weeks until your commits 471 00:25:18,970 --> 00:25:22,670 are stable because your commits are always local. 472 00:25:22,670 --> 00:25:25,420 And what happens is within that group of five people, you 473 00:25:25,420 --> 00:25:27,910 can pull from each other. 474 00:25:27,910 --> 00:25:29,920 That's what distributed means. 475 00:25:29,920 --> 00:25:33,620 There's no central location, it means everybody's the same. 476 00:25:33,619 --> 00:25:35,909 So you can merge between yourselves. 477 00:25:35,910 --> 00:25:40,220 So not only can you commit every single line if you want 478 00:25:40,220 --> 00:25:43,610 to without having to run the two-hour test suite, but you 479 00:25:43,609 --> 00:25:46,769 can then communicate by pulling and merging each 480 00:25:46,769 --> 00:25:51,359 other's work, and one person finds the bug, again commits it, 481 00:25:51,359 --> 00:25:54,189 and tells the other four people, hey, my repository has 482 00:25:54,190 --> 00:25:55,750 a fix for this. 483 00:25:55,750 --> 00:26:00,130 And then when that group is done two weeks later, they can 484 00:26:00,130 --> 00:26:03,700 tell their manager, hey, we've done this. 485 00:26:03,700 --> 00:26:08,259 Can you ask the main group to pull, and they'll get this new 486 00:26:08,259 --> 00:26:12,079 feature and by the way, we've tested it over two weeks and 487 00:26:12,079 --> 00:26:15,449 it works and it performs this much better because we have 488 00:26:15,450 --> 00:26:19,269 actually been able to time it before we even ask anybody 489 00:26:19,269 --> 00:26:21,789 else to look at it.
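[Editor's sketch] The workflow described here (commit locally, pull between peers, then ask the main tree to pull) can be pictured with a toy model. This is only an illustration under simplifying assumptions: a repository is reduced to a set of commit ids, merges and conflicts are ignored, and the names `Repo`, `alice`, `bob` and the commit ids are hypothetical, not anything from Git itself.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    """Toy repository: just a name and the set of commit ids it holds."""
    name: str
    commits: set = field(default_factory=set)

    def commit(self, commit_id: str) -> None:
        # Committing is purely local: no other repository is touched.
        self.commits.add(commit_id)

    def pull(self, other: "Repo") -> set:
        # Pulling copies over whatever commits the other side has that we lack.
        new = other.commits - self.commits
        self.commits |= new
        return new


# Everyone starts from the same tested head of the main tree.
main = Repo("main", {"base"})
alice = Repo("alice", set(main.commits))
bob = Repo("bob", set(main.commits))

alice.commit("feature-wip-1")   # unfinished work, visible only in alice's repo
bob.commit("bugfix-1")          # bob finds and fixes a bug locally

alice.pull(bob)                 # peers exchange work without involving main
main_before = set(main.commits)  # main has not seen any of this yet

# Later the group asks the main tree to pull the finished result.
alice.commit("feature-done")
pulled_by_main = main.pull(alice)
```

In this sketch the main tree stays untouched until it chooses to pull, which is the property the talk attributes to local commits.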
490 00:26:21,789 --> 00:26:25,269 And that's a hugely better model for doing development. 491 00:26:25,269 --> 00:26:29,079 And this is the model that the kernel uses. 492 00:26:29,079 --> 00:26:32,879 It turns out in many places, we don't need all that power, 493 00:26:32,880 --> 00:26:35,160 even in the kernel. 494 00:26:35,160 --> 00:26:45,730 So people usually don't pull within one group, but it 495 00:26:45,730 --> 00:26:46,430 does happen. 496 00:26:46,430 --> 00:26:48,820 For example, the networking people sometimes affect the 497 00:26:48,819 --> 00:26:51,339 NFS people, and the fact that they can 498 00:26:51,339 --> 00:26:55,119 synchronize actually helps. 499 00:26:55,119 --> 00:26:58,289 So this is a real, practical advantage. 500 00:26:58,289 --> 00:26:59,339 Somebody else has a question. 501 00:26:59,339 --> 00:27:02,839 AUDIENCE: So it feels like the politics has just been moved 502 00:27:02,839 --> 00:27:04,699 to an indirect political question. 503 00:27:04,700 --> 00:27:07,069 If everyone's got access and they're all playing with their 504 00:27:07,069 --> 00:27:08,839 branches and they have their sandbox and they're having 505 00:27:08,839 --> 00:27:11,169 fun, at the end of the day there has to be merging and 506 00:27:11,170 --> 00:27:13,990 resolving unless you have 80 billion flavors 507 00:27:13,990 --> 00:27:14,930 of every Linux kernel. 508 00:27:14,930 --> 00:27:16,180 LINUS TORVALDS: Absolutely. 509 00:27:16,180 --> 00:27:18,340 510 00:27:18,339 --> 00:27:21,919 There will be 1,000 or maybe 20,000 different branches, but 511 00:27:21,920 --> 00:27:24,820 in practice you won't ever see them because you won't care. 512 00:27:24,819 --> 00:27:27,220 You will see like a few main branches, maybe 513 00:27:27,220 --> 00:27:28,579 you'll see only one. 514 00:27:28,579 --> 00:27:31,279 In the case of the kernel, a lot of people they only really 515 00:27:31,279 --> 00:27:32,789 look at my branch.
516 00:27:32,789 --> 00:27:34,799 So even though there are lots of branches, 517 00:27:34,799 --> 00:27:36,339 you can ignore them. 518 00:27:36,339 --> 00:27:40,179 What happens is the way merging is done is the way 519 00:27:40,180 --> 00:27:44,970 real security is done, by a network of trust. If you have 520 00:27:44,970 --> 00:27:48,670 ever done any security work and it did not involve the 521 00:27:48,670 --> 00:27:51,769 concept of network of trust, it wasn't security work. 522 00:27:51,769 --> 00:27:54,230 It was masturbation. 523 00:27:54,230 --> 00:27:57,190 I don't know what you were doing, but trust me, it's the 524 00:27:57,190 --> 00:28:00,559 only way you can do security, it's the only way you can do 525 00:28:00,559 --> 00:28:01,779 development. 526 00:28:01,779 --> 00:28:05,619 The way I work, I don't trust everybody. 527 00:28:05,619 --> 00:28:09,849 In fact, I am a very cynical and untrusting person. 528 00:28:09,849 --> 00:28:13,919 I think most of you are completely incompetent. 529 00:28:13,920 --> 00:28:16,390 The whole point of being distributed is I don't 530 00:28:16,390 --> 00:28:17,200 have to trust you. 531 00:28:17,200 --> 00:28:19,360 I don't have to give you commit access. 532 00:28:19,359 --> 00:28:24,399 But I know that among the multitude of average people, 533 00:28:24,400 --> 00:28:28,100 there are some people that just stand out, that I trust 534 00:28:28,099 --> 00:28:30,889 because I've been working with them. 535 00:28:30,890 --> 00:28:36,840 I only need to trust 5, 10, 15 people. 536 00:28:36,839 --> 00:28:40,740 If I have a network of trust that covers those 5, 10, 15 537 00:28:40,740 --> 00:28:43,809 people that are outstanding and I know they're 538 00:28:43,809 --> 00:28:48,059 outstanding, I can pull from them. 539 00:28:48,059 --> 00:28:51,419 I don't have to spend a lot of brain power on the question. 
540 00:28:51,420 --> 00:28:54,529 When Andrew sends me patches-- he doesn't actually use Git, 541 00:28:54,529 --> 00:28:57,240 it's some kind of defect-- 542 00:28:57,240 --> 00:29:01,650 other than that, he's a very solid person. 543 00:29:01,650 --> 00:29:04,650 When he asks me to pull, he does it by sending me a 544 00:29:04,650 --> 00:29:05,340 million patches. 545 00:29:05,339 --> 00:29:08,039 Instead, I just do it. 546 00:29:08,039 --> 00:29:11,909 Sometimes I disagree with some of these patches, but at some 547 00:29:11,910 --> 00:29:16,450 point, trust means never having to say you're sorry. 548 00:29:16,450 --> 00:29:17,410 I don't know. 549 00:29:17,410 --> 00:29:20,029 It basically means you have to accept 550 00:29:20,029 --> 00:29:23,099 other people's decisions. 551 00:29:23,099 --> 00:29:26,639 The nice thing about trust is it does network, that's where 552 00:29:26,640 --> 00:29:28,509 the network of trust comes in. 553 00:29:28,509 --> 00:29:31,450 I only need to trust a few people that much. 554 00:29:31,450 --> 00:29:34,620 They have other people, they have determined, hey, that guy 555 00:29:34,619 --> 00:29:36,049 is actually smarter than I am. 556 00:29:36,049 --> 00:29:39,500 That's actually a really good measure of who 557 00:29:39,500 --> 00:29:40,869 you should pull from. 558 00:29:40,869 --> 00:29:43,239 If you have determined that somebody else is smarter than 559 00:29:43,240 --> 00:29:45,539 you, go for it. 560 00:29:45,539 --> 00:29:48,059 You can't lose, right? 561 00:29:48,059 --> 00:29:50,509 Even if it turns out you pulled crap and somebody else 562 00:29:50,509 --> 00:29:53,629 starts complaining, you know who you pulled from and you 563 00:29:53,630 --> 00:29:55,270 can just point to the other person and 564 00:29:55,269 --> 00:29:57,450 say, hey, I just pulled. 565 00:29:57,450 --> 00:29:59,390 Go to him, he knows what he's doing. 566 00:29:59,390 --> 00:30:00,440 So that's how I work. 
567 00:30:00,440 --> 00:30:02,580 That's probably how most of my lieutenants work. 568 00:30:02,579 --> 00:30:05,210 569 00:30:05,210 --> 00:30:09,600 I pull the networking changes from one person, he gets them 570 00:30:09,599 --> 00:30:13,069 from many other people that he's worked with over time. 571 00:30:13,069 --> 00:30:15,259 So this is how it all comes together. 572 00:30:15,259 --> 00:30:18,579 It doesn't have to come together to one point. 573 00:30:18,579 --> 00:30:21,369 In the kernel, it comes together to one point largely 574 00:30:21,369 --> 00:30:23,149 I think for historical reasons. 575 00:30:23,150 --> 00:30:27,480 And actually, I've always tried to encourage people to 576 00:30:27,480 --> 00:30:29,089 have more trees. 577 00:30:29,089 --> 00:30:33,619 So we do have vendor trees, we do have -mm trees, we have 578 00:30:33,619 --> 00:30:37,679 multiple one points, and it happens to be that my one 579 00:30:37,680 --> 00:30:39,759 point is getting maybe more attention 580 00:30:39,759 --> 00:30:41,759 than it always should. 581 00:30:41,759 --> 00:30:44,589 But even if it doesn't come down to one point, it means 582 00:30:44,589 --> 00:30:47,629 that you can take these thousands of branches and 583 00:30:47,630 --> 00:30:50,860 ignore 99.9% of them. 584 00:30:50,859 --> 00:30:53,469 And you know that, hey, there are five branches that are 585 00:30:53,470 --> 00:30:55,759 really interesting to follow because I'm interested in 586 00:30:55,759 --> 00:30:57,309 those sub-areas. 587 00:30:57,309 --> 00:30:59,379 And it all works very naturally. 588 00:30:59,380 --> 00:31:02,460 One of the nice things about this whole network of trust is 589 00:31:02,460 --> 00:31:06,039 it's not just easy to do technically, it's actually how 590 00:31:06,039 --> 00:31:09,329 every single person in this room is very fundamentally 591 00:31:09,329 --> 00:31:11,119 wired to work. 592 00:31:11,119 --> 00:31:12,779 It is how we think.
593 00:31:12,779 --> 00:31:15,440 We don't know 100 people. 594 00:31:15,440 --> 00:31:20,940 We have five, seven, ten close, personal friends. 595 00:31:20,940 --> 00:31:22,710 Well, we're geeks, so we have two. 596 00:31:22,710 --> 00:31:26,440 But I mean, that's basically how humans work, is that we 597 00:31:26,440 --> 00:31:29,289 have these people that we really trust. It's family, 598 00:31:29,289 --> 00:31:31,170 it's close friends. 599 00:31:31,170 --> 00:31:34,600 And it really fits. 600 00:31:34,599 --> 00:31:36,230 You don't even have to have a mental model. 601 00:31:36,230 --> 00:31:39,390 It fits how we are wired up. 602 00:31:39,390 --> 00:31:42,270 So there's huge advantages to this whole network of 603 00:31:42,269 --> 00:31:45,349 trust model. 604 00:31:45,349 --> 00:31:47,329 AUDIENCE: Do you know any companies that are using 605 00:31:47,329 --> 00:31:48,869 distributed systems internally? 606 00:31:48,869 --> 00:31:52,789 It seems like there might be a risk of balkanizing the code 607 00:31:52,789 --> 00:31:56,349 base as people not being in the same sandbox don't 608 00:31:56,349 --> 00:31:56,939 contribute back. 609 00:31:56,940 --> 00:31:58,890 LINUS TORVALDS: So quite frankly, there aren't that 610 00:31:58,890 --> 00:32:02,900 many distributed systems. There is BitKeeper. 611 00:32:02,900 --> 00:32:06,080 It is clearly being used at commercial companies. 612 00:32:06,079 --> 00:32:10,079 We might have somebody in the audience who actually knows. 613 00:32:10,079 --> 00:32:10,569 What? 614 00:32:10,569 --> 00:32:12,193 AUDIENCE: [INAUDIBLE]. 615 00:32:12,193 --> 00:32:15,700 LINUS TORVALDS: So HP is using things like BitKeeper for the 616 00:32:15,700 --> 00:32:16,580 printer project. 617 00:32:16,579 --> 00:32:19,199 I'm sure they have a lot more companies. 618 00:32:19,200 --> 00:32:23,910 In the open source world, there are two distributed 619 00:32:23,910 --> 00:32:26,690 systems that are worth looking at right now.
620 00:32:26,690 --> 00:32:30,039 One of them is obviously Git and you really should pick 621 00:32:30,039 --> 00:32:33,509 that one, but the other one is Mercurial, which actually has 622 00:32:33,509 --> 00:32:36,650 pretty much the same design. 623 00:32:36,650 --> 00:32:40,460 There are huge differences in implementation and there are 624 00:32:40,460 --> 00:32:44,490 some differences in details, but it boils down to a very 625 00:32:44,490 --> 00:32:45,164 similar model. 626 00:32:45,164 --> 00:32:48,069 Git just does it better. 627 00:32:48,069 --> 00:32:51,599 Everything else, it's either centralized or it is too 628 00:32:51,599 --> 00:32:54,750 unstable or too slow to use for anything big. 629 00:32:54,750 --> 00:32:58,039 AUDIENCE: Right, but is there an advantage for a company to 630 00:32:58,039 --> 00:33:00,009 have everybody playing in the same sandbox? 631 00:33:00,009 --> 00:33:02,170 LINUS TORVALDS: I think a lot of companies think there is an 632 00:33:02,170 --> 00:33:03,950 advantage to that. 633 00:33:03,950 --> 00:33:06,789 I know that inside companies, I don't think a lot of 634 00:33:06,789 --> 00:33:10,180 companies use Git knowingly in the sense that it 635 00:33:10,180 --> 00:33:13,529 is a company decision. 636 00:33:13,529 --> 00:33:17,599 I know several companies who use Git internally, not 637 00:33:17,599 --> 00:33:19,899 knowing that they do so because they actually have 638 00:33:19,900 --> 00:33:23,410 their main repository in Subversion and a lot of 639 00:33:23,410 --> 00:33:26,700 developers then import it into Git because Git can actually 640 00:33:26,700 --> 00:33:28,430 merge things for you.
641 00:33:28,430 --> 00:33:34,779 So you can take a Subversion tree, just import it into Git, 642 00:33:34,779 --> 00:33:38,665 let Git do the merge, which would be a major headache to 643 00:33:38,665 --> 00:33:41,549 do in Subversion, create a merge commit, and actually 644 00:33:41,549 --> 00:33:45,710 export it back to Subversion, and nobody else even 645 00:33:45,710 --> 00:33:47,940 knew you used Git. 646 00:33:47,940 --> 00:33:53,100 It's kind of sad, but we have cases of people talking about 647 00:33:53,099 --> 00:33:56,569 doing exactly that inside companies. 648 00:33:56,569 --> 00:34:01,379 Git has not been around in a form where a lot of people 649 00:34:01,380 --> 00:34:06,570 would be comfortable using it for more than 650 00:34:06,569 --> 00:34:08,750 half a year or so. 651 00:34:08,750 --> 00:34:14,110 We have had such huge improvements to the user 652 00:34:14,110 --> 00:34:17,800 interfaces that realistically, a year ago at a commercial 653 00:34:17,800 --> 00:34:20,370 company a lot of people would just have said 654 00:34:20,369 --> 00:34:22,159 it's too hard to use. 655 00:34:22,159 --> 00:34:24,570 I think we're way past that hump. 656 00:34:24,570 --> 00:34:27,199 Git is much easier to use than CVS, really. 657 00:34:27,199 --> 00:34:29,909 658 00:34:29,909 --> 00:34:32,170 It's easier to use than anything else. 659 00:34:32,170 --> 00:34:34,199 Just get over it. 660 00:34:34,199 --> 00:34:36,329 You don't have to use all the powerful tools. 661 00:34:36,329 --> 00:34:42,699 Some of them might be things you want to explain and 662 00:34:42,699 --> 00:34:45,869 introduce to people only after they got over the initial hump 663 00:34:45,869 --> 00:34:48,250 of understanding what distribution really means. 664 00:34:48,250 --> 00:34:53,838 But the basic stuff is really easy to do.
665 00:34:53,838 --> 00:34:57,050 AUDIENCE: One characteristic of a centralized system is 666 00:34:57,050 --> 00:34:59,610 that it's the original developer who has to resolve 667 00:34:59,610 --> 00:35:01,510 any merges, who has to fix merges. 668 00:35:01,510 --> 00:35:02,880 How do you do that in Git and how do you 669 00:35:02,880 --> 00:35:04,070 minimize merge conflicts? 670 00:35:04,070 --> 00:35:06,050 LINUS TORVALDS: Thank you for asking me that question. 671 00:35:06,050 --> 00:35:08,660 Did I tell you to ask that question first? 672 00:35:08,659 --> 00:35:13,480 One of the really nice parts of Git is A, it does make 673 00:35:13,480 --> 00:35:19,030 things much easier to merge than a lot of other systems. 674 00:35:19,030 --> 00:35:24,190 Merging a branch in CVS tends to be really painful. 675 00:35:24,190 --> 00:35:27,840 One of my main statistics is the kernel is actually one of 676 00:35:27,840 --> 00:35:29,360 the biggest open source projects. 677 00:35:29,360 --> 00:35:31,320 We have 22,000 files. 678 00:35:31,320 --> 00:35:33,460 We've used Git for two years. 679 00:35:33,460 --> 00:35:36,990 During those two years, we have averaged 4.5 merges a 680 00:35:36,989 --> 00:35:39,979 day, every single day. 681 00:35:39,980 --> 00:35:41,929 That's not something you'd do in something 682 00:35:41,929 --> 00:35:43,449 where merging was hard. 683 00:35:43,449 --> 00:35:48,139 So Git makes merging easy, but you will inevitably have 684 00:35:48,139 --> 00:35:55,440 cases where two maintainers send me the request to please 685 00:35:55,440 --> 00:35:56,159 pull my stuff. 686 00:35:56,159 --> 00:35:59,099 And I pick one of them at random usually, because their 687 00:35:59,099 --> 00:36:02,469 mail happened to be first in my mailbox, and I pull 688 00:36:02,469 --> 00:36:03,709 their stuff.
689 00:36:03,710 --> 00:36:08,090 And another person had made changes that-- 690 00:36:08,090 --> 00:36:12,000 it doesn't happen that often, but it does happen-- 691 00:36:12,000 --> 00:36:18,389 just clashed so much that I said, I could fix this up but 692 00:36:18,389 --> 00:36:19,829 I really don't want to. 693 00:36:19,829 --> 00:36:23,400 I didn't write the code, it's not my area of expertise, it's 694 00:36:23,400 --> 00:36:26,490 networking or something like that, I can't really judge it, 695 00:36:26,489 --> 00:36:30,199 I can't test it, so asking me to resolve the 696 00:36:30,199 --> 00:36:31,289 merge is just crazy. 697 00:36:31,289 --> 00:36:35,050 It's not how you should do things. 698 00:36:35,050 --> 00:36:37,350 OK, the Windows machine flaked out again. 699 00:36:37,349 --> 00:36:41,119 700 00:36:41,119 --> 00:36:44,619 Remember, distribution means nobody is special. 701 00:36:44,619 --> 00:36:49,429 So instead of me merging, I just push out my first tree 702 00:36:49,429 --> 00:36:52,879 that didn't have any merge issues and I tell the second 703 00:36:52,880 --> 00:36:58,250 person, hey, I tried to pull from you but I had merge 704 00:36:58,250 --> 00:37:00,989 conflicts and they weren't completely trivial, so I 705 00:37:00,989 --> 00:37:04,179 decided you get to do the honors instead. 706 00:37:04,179 --> 00:37:05,389 And they do. 707 00:37:05,389 --> 00:37:07,000 And they know what they're doing 708 00:37:07,000 --> 00:37:08,650 because it's their changes. 709 00:37:08,650 --> 00:37:11,309 So they can do the merge and they probably think I'm a 710 00:37:11,309 --> 00:37:14,340 moron because the merge was so easy and it was obvious I 711 00:37:14,340 --> 00:37:17,090 should have taken their code. 712 00:37:17,090 --> 00:37:20,670 But they do the merge and then they update their tree and 713 00:37:20,670 --> 00:37:22,710 say, hey, can you pull from me now? 714 00:37:22,710 --> 00:37:25,650 And I pull from them and they did all the work for me.
715 00:37:25,650 --> 00:37:27,160 That's what it's all about. 716 00:37:27,159 --> 00:37:30,369 They did all the work for me. 717 00:37:30,369 --> 00:37:33,559 And I take the credit. 718 00:37:33,559 --> 00:37:35,974 Now I just need to figure out step three, profit. 719 00:37:35,974 --> 00:37:39,569 720 00:37:39,570 --> 00:37:43,450 But that's another thing that comes very naturally from 721 00:37:43,449 --> 00:37:44,519 being distributed. 722 00:37:44,519 --> 00:37:49,199 It's not something that is special to Git. 723 00:37:49,199 --> 00:37:55,149 Git makes merging easier than anything else, but Git does it 724 00:37:55,150 --> 00:37:57,030 exactly because Git is distributed. 725 00:37:57,030 --> 00:37:57,825 Yes. 726 00:37:57,824 --> 00:38:00,939 AUDIENCE: So I guess I don't entirely understand why you 727 00:38:00,940 --> 00:38:06,289 think that it's necessary to have a distributed system-- 728 00:38:06,289 --> 00:38:11,509 it seems like you get a lot of the good effects, at least for 729 00:38:11,510 --> 00:38:12,120 corporate development. 730 00:38:12,119 --> 00:38:13,929 For open source development, it seems very useful that 731 00:38:13,929 --> 00:38:15,619 everybody can work on their own. 732 00:38:15,619 --> 00:38:20,429 But when you really have a centralized, corporate tree, 733 00:38:20,429 --> 00:38:23,500 then a centralized system with really cheap branches, 734 00:38:23,500 --> 00:38:25,972 wouldn't that give you pretty much the same effect? 735 00:38:25,972 --> 00:38:28,519 Or is that just impossible to do? 736 00:38:28,519 --> 00:38:29,769 LINUS TORVALDS: No. 737 00:38:29,769 --> 00:38:32,469 738 00:38:32,469 --> 00:38:35,899 I will argue that centralized systems can't work. 739 00:38:35,900 --> 00:38:39,380 But it is clearly true that if you're in a tightly controlled 740 00:38:39,380 --> 00:38:41,510 corporate environment, centralized 741 00:38:41,510 --> 00:38:42,900 systems work better.
742 00:38:42,900 --> 00:38:45,539 And it's unquestionably true that people have been able to 743 00:38:45,539 --> 00:38:49,009 use centralized systems for the last 35 years. 744 00:38:49,010 --> 00:38:51,490 Nobody's really arguing that centralized 745 00:38:51,489 --> 00:38:53,939 systems cannot work. 746 00:38:53,940 --> 00:38:57,530 They cannot work as well as distributed systems. One of 747 00:38:57,530 --> 00:39:01,440 the issues you tend to have is centralized systems inevitably 748 00:39:01,440 --> 00:39:05,690 have problems when you have groups in different locations. 749 00:39:05,690 --> 00:39:08,280 It tends to work really well if you have a really beefy 750 00:39:08,280 --> 00:39:09,420 backbone fiber. 751 00:39:09,420 --> 00:39:12,740 And I guess for Google, you probably do have some kind of 752 00:39:12,739 --> 00:39:13,689 network going. 753 00:39:13,690 --> 00:39:14,679 I don't know. 754 00:39:14,679 --> 00:39:18,089 And maybe it's not as big of an issue as it is for other 755 00:39:18,090 --> 00:39:21,769 projects, but trust me. 756 00:39:21,769 --> 00:39:25,550 Not having to go over the network for everything is a 757 00:39:25,550 --> 00:39:26,820 huge performance saver. 758 00:39:26,820 --> 00:39:29,850 759 00:39:29,849 --> 00:39:32,159 I can't show you demonstrations and it's not a 760 00:39:32,159 --> 00:39:35,129 very interesting demonstration anyway, but this is a laptop 761 00:39:35,130 --> 00:39:38,269 that is what, four or five years old. 762 00:39:38,269 --> 00:39:41,789 It's like a Pentium M 1.6 gigahertz thing. 763 00:39:41,789 --> 00:39:47,210 I could show you me doing a full diff of the kernel on 764 00:39:47,210 --> 00:39:53,079 that laptop in whatever, just over a second. 765 00:39:53,079 --> 00:39:57,400 On my main machine, it takes less than 1/10 of a second. 766 00:39:57,400 --> 00:40:03,170 That's the kind of performance you simply cannot get if you 767 00:40:03,170 --> 00:40:05,260 have to go over a network.
768 00:40:05,260 --> 00:40:08,370 We're talking a couple of packets going over the network 769 00:40:08,369 --> 00:40:11,299 and you just blew the performance. 770 00:40:11,300 --> 00:40:15,820 So if you have a centralized system and you're used to 771 00:40:15,820 --> 00:40:20,140 having something like commit or diffing the whole source 772 00:40:20,139 --> 00:40:24,400 tree taking 30 seconds-- 773 00:40:24,400 --> 00:40:27,460 Maybe 30 seconds doesn't sound that bad to you. 774 00:40:27,460 --> 00:40:30,940 Trust me, when you're used to taking 1/10 of a second, 30 775 00:40:30,940 --> 00:40:33,750 seconds sounds pretty bad. 776 00:40:33,750 --> 00:40:37,489 So there are huge performance issues even if you have a good 777 00:40:37,489 --> 00:40:41,259 network, never mind the fact that most people don't have a 778 00:40:41,260 --> 00:40:42,680 good network. 779 00:40:42,679 --> 00:40:45,549 The other thing is branches, even if you make them 780 00:40:45,550 --> 00:40:49,010 technically very cheap to create, just the fact that you 781 00:40:49,010 --> 00:40:54,050 create them and everybody sees them because everybody will 782 00:40:54,050 --> 00:40:57,050 see them since they're centralized, basically means 783 00:40:57,050 --> 00:41:00,550 that you don't want to make branches willy-nilly. 784 00:41:00,550 --> 00:41:02,710 You will have namespace issues. 785 00:41:02,710 --> 00:41:04,119 What do you call your branch? 786 00:41:04,119 --> 00:41:05,349 Would you call it Test? 787 00:41:05,349 --> 00:41:09,230 Oh by the way, there's 5,000 other branches called Test 1 788 00:41:09,230 --> 00:41:11,409 through 5,000. 789 00:41:11,409 --> 00:41:14,480 So now you have to make up all these naming rules for your 790 00:41:14,480 --> 00:41:17,570 branches because you have a centralized system that has a 791 00:41:17,570 --> 00:41:21,400 centralized branch namespace, which is kind of inevitable 792 00:41:21,400 --> 00:41:23,970 when you have a centralized system.
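[Editor's sketch] The namespace problem described here can be made concrete with a small toy model, contrasted with per-repository branch names. Everything in it is an assumption made for illustration: `CentralServer`, `LocalRepo`, the numbering scheme and the developer names are hypothetical and do not reproduce how CVS, Subversion, or Git store branches.

```python
class CentralServer:
    """Toy centralized server: one branch namespace shared by everyone."""

    def __init__(self):
        self.branches = {}

    def create_branch(self, base_name: str, owner: str) -> str:
        # Names are global, so a name that is already taken forces a suffix.
        name, n = base_name, 0
        while name in self.branches:
            n += 1
            name = f"{base_name}-{n}"
        self.branches[name] = owner
        return name


class LocalRepo:
    """Toy distributed repository: its branch names are private to it."""

    def __init__(self, owner: str):
        self.owner = owner
        self.branches = {"master"}

    def create_branch(self, name: str) -> str:
        # Only this repository's own names can collide.
        if name in self.branches:
            raise ValueError(f"{self.owner} already has a branch named {name}")
        self.branches.add(name)
        return name


developers = ("ann", "raj", "li")

server = CentralServer()
central_names = [server.create_branch("test", dev) for dev in developers]

repos = [LocalRepo(dev) for dev in developers]
local_names = [repo.create_branch("test") for repo in repos]
```

With a shared namespace the three developers end up with `test`, `test-1` and `test-2`; with one namespace per repository each of them simply gets `test`.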
793 00:41:23,969 --> 00:41:27,639 How does that work in distributed environments? 794 00:41:27,639 --> 00:41:30,710 You call your branch test, and it's that easy. 795 00:41:30,710 --> 00:41:32,210 Actually, you shouldn't call it test. 796 00:41:32,210 --> 00:41:34,710 You should basically name your branches the way 797 00:41:34,710 --> 00:41:36,000 you name your functions. 798 00:41:36,000 --> 00:41:37,730 You should call them something short and 799 00:41:37,730 --> 00:41:38,840 sweet and to the point. 800 00:41:38,840 --> 00:41:40,510 What is that branch doing? 801 00:41:40,510 --> 00:41:44,630 Git, by default, gives you one branch that is called master. 802 00:41:44,630 --> 00:41:46,210 It's short and sweet and to the point. 803 00:41:46,210 --> 00:41:47,789 It's the master branch. 804 00:41:47,789 --> 00:41:52,000 But you can make a branch that is called Experimental Feature 805 00:41:52,000 --> 00:41:54,699 X and it will be obvious. 806 00:41:54,699 --> 00:41:57,449 But this is something you simply cannot do in a 807 00:41:57,449 --> 00:41:58,389 centralized environment. 808 00:41:58,389 --> 00:42:01,389 You cannot call branches Experimental Feature X. You 809 00:42:01,389 --> 00:42:04,739 have to make up stupid, idiotic names. 810 00:42:04,739 --> 00:42:07,929 I worked for a company that had nice-- 811 00:42:07,929 --> 00:42:12,750 as nice as you probably can make them-- scripts around CVS 812 00:42:12,750 --> 00:42:14,789 that helped you make branches. 813 00:42:14,789 --> 00:42:16,279 You could actually make branches 814 00:42:16,280 --> 00:42:17,710 with a simple command. 815 00:42:17,710 --> 00:42:19,679 It didn't take that long. 816 00:42:19,679 --> 00:42:24,730 It picked a name for you, exactly because it would pick 817 00:42:24,730 --> 00:42:25,579 the number.
818 00:42:25,579 --> 00:42:29,829 So you'd give it a base name and you would say, this is my 819 00:42:29,829 --> 00:42:32,539 branch for doing so and so and it would call your 820 00:42:32,539 --> 00:42:36,099 branch So and So-56. 821 00:42:36,099 --> 00:42:38,630 And it would tag where you started that branch because in 822 00:42:38,630 --> 00:42:41,890 CVS you need to do that, too. 823 00:42:41,889 --> 00:42:43,969 It took a while, but it worked. 824 00:42:43,969 --> 00:42:47,209 You can do these things in centralized systems, but you 825 00:42:47,210 --> 00:42:49,099 don't need to. 826 00:42:49,099 --> 00:42:53,469 If your system is decentralized, it just works. 827 00:42:53,469 --> 00:42:55,939 That is how it should work. 828 00:42:55,940 --> 00:43:00,730 So I'm not going to force you to switch over to 829 00:43:00,730 --> 00:43:02,369 decentralized, I'm just going to call you 830 00:43:02,369 --> 00:43:04,670 ugly and stupid. 831 00:43:04,670 --> 00:43:06,789 That's the deal. 832 00:43:06,789 --> 00:43:09,516 Anyway, we are on the performance slide. 833 00:43:09,516 --> 00:43:11,746 AUDIENCE: Can I ask a question? 834 00:43:11,746 --> 00:43:12,579 LINUS TORVALDS: Yes. 835 00:43:12,579 --> 00:43:14,340 AUDIENCE: Two questions, actually. 836 00:43:14,340 --> 00:43:19,510 So one is how many files will Git take. 837 00:43:19,510 --> 00:43:22,300 And then the second one, let's say if you have a humongous 838 00:43:22,300 --> 00:43:25,170 tree under Git, would it be possible to check 839 00:43:25,170 --> 00:43:26,610 out part of the tree? 840 00:43:26,610 --> 00:43:27,860 LINUS TORVALDS: Great questions. 841 00:43:27,860 --> 00:43:30,090 842 00:43:30,090 --> 00:43:33,420 Those questions actually kind of dovetail into a different 843 00:43:33,420 --> 00:43:36,480 issue, even though they are performance related.
844 00:43:36,480 --> 00:43:40,349 One of the things that Git is really special about, and this 845 00:43:40,349 --> 00:43:43,710 is special even with regard to things like Mercurial, which is 846 00:43:43,710 --> 00:43:48,970 otherwise fairly similar, Git tracks your content. 847 00:43:48,969 --> 00:43:51,439 It never, ever tracks a single file. 848 00:43:51,440 --> 00:43:55,450 You cannot track a file in Git. 849 00:43:55,449 --> 00:43:58,230 What you can do is you can track a project that has a 850 00:43:58,230 --> 00:44:02,179 single file, but if your project has a single file, 851 00:44:02,179 --> 00:44:04,730 sure, do that, and you can do it. 852 00:44:04,730 --> 00:44:11,090 But if you track 10,000 files, Git never, ever sees those as 853 00:44:11,090 --> 00:44:12,740 individual files. 854 00:44:12,739 --> 00:44:16,099 Git thinks of everything as the full content. 855 00:44:16,099 --> 00:44:19,519 All history in Git is based on the content of all of the 856 00:44:19,519 --> 00:44:21,869 history of the whole project. 857 00:44:21,869 --> 00:44:26,349 This has implications for performance. 858 00:44:26,349 --> 00:44:29,199 When you use CVS it's perfectly fine. 859 00:44:29,199 --> 00:44:32,259 It's stupid, but it's perfectly fine to have one 860 00:44:32,260 --> 00:44:36,470 huge repository that has a million files in it because at 861 00:44:36,469 --> 00:44:39,139 the end of the day, CVS actually thinks of all those 862 00:44:39,139 --> 00:44:41,319 million files each as a single file. 863 00:44:41,320 --> 00:44:44,570 And you can actually ask CVS to only update that one file 864 00:44:44,570 --> 00:44:47,300 because CVS really thinks in those terms. And that's 865 00:44:47,300 --> 00:44:49,960 actually true of pretty much everything else too. 866 00:44:49,960 --> 00:44:51,720 It is actually even true of BitKeeper. 867 00:44:51,719 --> 00:44:56,109 That was one of the mistakes in BitKeeper.
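[Editor's sketch] One way to picture whole-project tracking is a toy history in which every commit records the entire tree, so a per-file view has to be derived by walking project history. This is a simplified illustration, not Git's actual object format: the history is linear, `snapshot_id` and `file_history` are hypothetical helpers, and the file names and contents are made up.

```python
import hashlib


def snapshot_id(tree: dict) -> str:
    """Hash the whole project state (path -> content) into one id."""
    h = hashlib.sha1()
    for path in sorted(tree):
        h.update(path.encode() + b"\0" + tree[path].encode() + b"\0")
    return h.hexdigest()


def file_history(commits: list, path: str) -> list:
    """Derive one file's history by walking the project history and
    keeping only the commits where that path's content changed."""
    kept, previous = [], None
    for message, tree in commits:          # oldest commit first
        current = tree.get(path)
        if current != previous:
            kept.append(message)
        previous = current
    return kept


# Each commit records the entire tree, not a per-file delta.
commits = [
    ("initial import", {"usb.c": "v1", "net.c": "v1"}),
    ("fix networking", {"usb.c": "v1", "net.c": "v2"}),
    ("rework usb", {"usb.c": "v2", "net.c": "v2"}),
]

ids = [snapshot_id(tree) for _, tree in commits]
usb_history = file_history(commits, "usb.c")
```

Here the identity of each commit depends on the state of the whole project, and the cost of `file_history` grows with the length of the project history rather than with the number of changes to that one file.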
868 00:44:56,110 --> 00:44:59,590 The problem with thinking in terms of single files is that 869 00:44:59,590 --> 00:45:02,079 quite often, especially if you're a high level maintainer 870 00:45:02,079 --> 00:45:06,840 like me, I have 22,000 files to track, I don't care about 871 00:45:06,840 --> 00:45:08,010 one of them. 872 00:45:08,010 --> 00:45:13,150 I might care about a sub-collection of them that 873 00:45:13,150 --> 00:45:14,780 contains maybe 1,000 files. 874 00:45:14,780 --> 00:45:18,470 I might care about the USB subsystem, but I never care 875 00:45:18,469 --> 00:45:20,239 about the single file. 876 00:45:20,239 --> 00:45:23,619 So Git tracks everything as a collection of files, and if 877 00:45:23,619 --> 00:45:26,549 you ask for the history of a single file, Git will 878 00:45:26,550 --> 00:45:28,670 literally start from the global history 879 00:45:28,670 --> 00:45:30,345 and simplify it. 880 00:45:30,344 --> 00:45:33,509 It's a fairly efficient system. 881 00:45:33,510 --> 00:45:34,820 It's a very efficient system. 882 00:45:34,820 --> 00:45:37,420 You would normally not even realize that it does that. 883 00:45:37,420 --> 00:45:42,159 But it does mean that if you try to track a million files 884 00:45:42,159 --> 00:45:46,170 in one repository, when you then ask for a single file 885 00:45:46,170 --> 00:45:48,619 history it's going to be slower. 886 00:45:48,619 --> 00:45:51,469 So it has different scaling properties than a lot of other 887 00:45:51,469 --> 00:45:55,669 systems for this very fundamental design reason. 888 00:45:55,670 --> 00:45:58,389 We have used big repositories. 889 00:45:58,389 --> 00:46:08,769 We've imported things like something like 3/4 of the 890 00:46:08,769 --> 00:46:11,360 Subversion history of the whole KDE project. 
891 00:46:11,360 --> 00:46:12,620 And the KDE people are-- 892 00:46:12,619 --> 00:46:19,109 893 00:46:19,110 --> 00:46:23,150 I like KDE but trust me, they put every single component in 894 00:46:23,150 --> 00:46:25,630 one repository. 895 00:46:25,630 --> 00:46:27,920 Not very smart. 896 00:46:27,920 --> 00:46:31,170 What you ended up with, you had a repository that took I 897 00:46:31,170 --> 00:46:36,769 think eight gigabytes under the CVS tree and Subversion 898 00:46:36,769 --> 00:46:40,090 blew it up to like three times that size. 899 00:46:40,090 --> 00:46:42,059 Maybe it wasn't quite eight gigabytes in 900 00:46:42,059 --> 00:46:43,539 CVS, but it was big. 901 00:46:43,539 --> 00:46:46,559 It was more than four gigabytes. 902 00:46:46,559 --> 00:46:48,710 Git would actually compress it down to 903 00:46:48,710 --> 00:46:50,159 something like 1.3 gigabytes. 904 00:46:50,159 --> 00:46:53,649 So Git is actually very efficient at taking this 905 00:46:53,650 --> 00:46:57,369 project and just smushing it together and most things 906 00:46:57,369 --> 00:46:58,980 perform very well. 907 00:46:58,980 --> 00:47:00,490 But certain things did not. 908 00:47:00,489 --> 00:47:02,909 The things that do not perform very well, if you put a 909 00:47:02,909 --> 00:47:06,279 million files in one repository, initial clones, 910 00:47:06,280 --> 00:47:08,780 when you get it, you get it all. 911 00:47:08,780 --> 00:47:13,540 You put it in one repository, Git thinks of it as one thing. 912 00:47:13,539 --> 00:47:14,929 Don't do that. 913 00:47:14,929 --> 00:47:17,539 If you have multiple components, do them as 914 00:47:17,539 --> 00:47:19,000 separate repositories. 915 00:47:19,000 --> 00:47:22,179 You can actually have what we call a super project that 916 00:47:22,179 --> 00:47:25,210 contains pointers to other projects and the user 917 00:47:25,210 --> 00:47:28,059 interfaces there are somewhat lacking. 
918 00:47:28,059 --> 00:47:31,920 But you keep separate projects separate, and then you avoid 919 00:47:31,920 --> 00:47:35,200 the problem of, OK, you have to get it all. 920 00:47:35,199 --> 00:47:37,099 Because with Git, you do have to get it all. 921 00:47:37,099 --> 00:47:40,041 AUDIENCE: Why don't they all share code? 922 00:47:40,041 --> 00:47:42,447 [INAUDIBLE]? 923 00:47:42,447 --> 00:47:45,410 LINUS TORVALDS: If they all shared code. 924 00:47:45,409 --> 00:47:47,829 What you can do with Git, if you actually have a lot of 925 00:47:47,829 --> 00:47:50,690 shared stuff, since Git actually internally uses a 926 00:47:50,690 --> 00:47:54,970 content-addressable file system, if there are files 927 00:47:54,969 --> 00:47:57,799 with identical content, Git will actually use the exact 928 00:47:57,800 --> 00:48:01,580 same object for them and save you tons of space. 929 00:48:01,579 --> 00:48:06,380 You can have these shared objects and still have them as 930 00:48:06,380 --> 00:48:07,670 separate entities. 931 00:48:07,670 --> 00:48:10,220 You can still see them as separate repositories that 932 00:48:10,219 --> 00:48:13,259 just have a shared file system backing the 933 00:48:13,260 --> 00:48:16,190 data, you can do that. 934 00:48:16,190 --> 00:48:18,480 If you actually have shared code in the sense that you, 935 00:48:18,480 --> 00:48:22,320 for example, have a library that is used by five different 936 00:48:22,320 --> 00:48:26,870 things, that's when you use the super project support, 937 00:48:26,869 --> 00:48:30,289 where you have one Git repository that just tracks 938 00:48:30,289 --> 00:48:31,739 all the other Git repositories. 939 00:48:31,739 --> 00:48:34,449 It may contain stuff like a shared build 940 00:48:34,449 --> 00:48:36,339 infrastructure, too. 941 00:48:36,340 --> 00:48:39,690 But then the individual pieces are individual. 942 00:48:39,690 --> 00:48:42,000 These are like CVS modules. 
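The content-addressable storage mentioned in this answer can be sketched in a few lines of Python. This is a minimal illustration, not Git's implementation: the `ObjectStore` class and the sample file contents are hypothetical, and only the blob naming rule (SHA-1 over a `blob <size>` header, a NUL byte, and the content) follows what Git does for file contents.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object name Git gives a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class ObjectStore:
    """Toy content-addressed store (hypothetical): the key is derived from the content."""

    def __init__(self):
        self.objects = {}  # object name -> raw content

    def put(self, content: bytes) -> str:
        oid = git_blob_id(content)
        self.objects.setdefault(oid, content)  # identical content is stored only once
        return oid


store = ObjectStore()
a = store.put(b"int main(void) { return 0; }\n")
b = store.put(b"int main(void) { return 0; }\n")  # same bytes, e.g. a copy in another project
c = store.put(b"int main(void) { return 1; }\n")
print(a == b, a == c, len(store.objects))
```

Because the name depends only on the bytes, two files with identical content in different places resolve to the same stored object, which is the space saving described above.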
943 00:48:42,000 --> 00:48:47,989 In CVS, modules aren't really individual but that's because 944 00:48:47,989 --> 00:48:52,809 in CVS, the directory is a thing of its own anyway. 945 00:48:52,809 --> 00:48:58,079 So CVS modules are kind of a combination of this and just 946 00:48:58,079 --> 00:48:59,250 tracking them all. 947 00:48:59,250 --> 00:49:02,349 But you can basically think of it as CVS modules. 948 00:49:02,349 --> 00:49:05,460 And we do support it but I do have to admit, that code is 949 00:49:05,460 --> 00:49:09,429 fairly recent and that's one area where our user interfaces 950 00:49:09,429 --> 00:49:13,889 right now are definitely lacking some. 951 00:49:13,889 --> 00:49:16,409 There was probably some other part to that question that I 952 00:49:16,409 --> 00:49:18,469 completely forgot. 953 00:49:18,469 --> 00:49:19,719 AUDIENCE: [INAUDIBLE]. 954 00:49:19,719 --> 00:49:22,547 955 00:49:22,547 --> 00:49:26,159 LINUS TORVALDS: I can't hear that. 956 00:49:26,159 --> 00:49:29,429 AUDIENCE: The question was, can you have just part of the 957 00:49:29,429 --> 00:49:32,750 files pulled out of the repository, not the entire 958 00:49:32,750 --> 00:49:35,197 repository? 959 00:49:35,197 --> 00:49:38,109 LINUS TORVALDS: You can export things as tarballs, you can 960 00:49:38,110 --> 00:49:40,170 export things as individual files. 961 00:49:40,170 --> 00:49:44,510 You can rewrite the whole history to say, I want a new 962 00:49:44,510 --> 00:49:47,320 version of that repository that only contains that part. 963 00:49:47,320 --> 00:49:48,210 You can do that. 964 00:49:48,210 --> 00:49:50,090 It's a fairly expensive operation. 965 00:49:50,090 --> 00:49:53,550 It's something you would do, for example, if you import an 966 00:49:53,550 --> 00:49:57,430 old repository into one huge Git repository and then you 967 00:49:57,429 --> 00:50:01,299 can split it later on to be multiple, smaller ones. 968 00:50:01,300 --> 00:50:03,650 You can do it. 
969 00:50:03,650 --> 00:50:05,630 What I'm trying to say, you should generally 970 00:50:05,630 --> 00:50:08,030 try to avoid it. 971 00:50:08,030 --> 00:50:12,440 It's not that Git can't handle huge projects, it's that Git 972 00:50:12,440 --> 00:50:17,429 won't perform as well as it would otherwise and you will 973 00:50:17,429 --> 00:50:22,710 have issues that you wish you didn't have. 974 00:50:22,710 --> 00:50:26,199 I'm skipping this and going back to the performance issue. 975 00:50:26,199 --> 00:50:29,199 One of the things I want to say about performance is a lot 976 00:50:29,199 --> 00:50:32,119 of people seem to think that performance is about doing the 977 00:50:32,119 --> 00:50:34,230 same thing, just doing it faster. 978 00:50:34,230 --> 00:50:35,440 And that's not true. 979 00:50:35,440 --> 00:50:37,720 That's not what performance is all about. 980 00:50:37,719 --> 00:50:42,049 If you can do something really fast really well, people start 981 00:50:42,050 --> 00:50:44,860 using it differently. 982 00:50:44,860 --> 00:50:48,510 One of the things I wanted to make sure is that merges go 983 00:50:48,510 --> 00:50:51,920 really, really quickly because I want people to merge often 984 00:50:51,920 --> 00:50:55,789 and merge early because it turns out it 985 00:50:55,789 --> 00:50:57,139 becomes easier to merge. 986 00:50:57,139 --> 00:51:00,949 If you merge every day, suddenly you never get to the 987 00:51:00,949 --> 00:51:02,730 point where you have huge conflicts 988 00:51:02,730 --> 00:51:04,289 that are hard to resolve. 989 00:51:04,289 --> 00:51:10,259 If you actually make branching and merging easy, you actually 990 00:51:10,260 --> 00:51:13,450 avoid a whole class of problems that you otherwise 991 00:51:13,449 --> 00:51:15,719 have a really, really hard time avoiding. 
992 00:51:15,719 --> 00:51:20,359 So for example, let's go back to one of the things where I 993 00:51:20,360 --> 00:51:22,160 think the designers of Subversion 994 00:51:22,159 --> 00:51:24,869 were complete morons. 995 00:51:24,869 --> 00:51:25,859 Strong opinions. 996 00:51:25,860 --> 00:51:27,655 That's me, right? 997 00:51:27,655 --> 00:51:31,960 There's a few of them in the room today, I suspect. 998 00:51:31,960 --> 00:51:34,550 You're stupid. 999 00:51:34,550 --> 00:51:39,990 Subversion, for example, talks very loudly about how they do 1000 00:51:39,989 --> 00:51:44,889 CVS right by making branching really cheap. 1001 00:51:44,889 --> 00:51:47,829 It's probably on their main web page where they probably 1002 00:51:47,829 --> 00:51:51,769 say that branching in Subversion is an O(1) operation. 1003 00:51:51,769 --> 00:51:54,480 You can do as many cheap branches as you want. 1004 00:51:54,480 --> 00:51:56,809 Never mind that the O(1) is actually a 1005 00:51:56,809 --> 00:51:58,259 pretty large O, I think. 1006 00:51:58,260 --> 00:52:03,510 But even if it takes a millionth of a second to do 1007 00:52:03,510 --> 00:52:06,900 branching, who cares? 1008 00:52:06,900 --> 00:52:09,300 It's the wrong thing you're measuring. 1009 00:52:09,300 --> 00:52:11,340 Nobody is interested in branching. 1010 00:52:11,340 --> 00:52:14,230 Branches are completely useless unless you merge them, 1011 00:52:14,230 --> 00:52:19,300 and CVS cannot merge anything at all. 1012 00:52:19,300 --> 00:52:23,170 You can merge things once, but because CVS then forgets what 1013 00:52:23,170 --> 00:52:27,190 you did, you can never, ever merge anything again without 1014 00:52:27,190 --> 00:52:30,210 getting horrible, horrible conflicts. 1015 00:52:30,210 --> 00:52:34,019 Merging in Subversion is a complete disaster. 1016 00:52:34,019 --> 00:52:36,759 The Subversion people kind of acknowledge this and they have 1017 00:52:36,760 --> 00:52:40,190 a plan and their plan sucks, too. 
1018 00:52:40,190 --> 00:52:43,960 It is incredible how stupid these people are. 1019 00:52:43,960 --> 00:52:47,690 They've been looking at the wrong problem all the time. 1020 00:52:47,690 --> 00:52:51,950 Branching is not the issue, merging is, and merging they 1021 00:52:51,949 --> 00:52:56,079 didn't do squat for five years after the fact. 1022 00:52:56,079 --> 00:52:57,779 That is sad. 1023 00:52:57,780 --> 00:53:00,670 So performance is important, but you need to 1024 00:53:00,670 --> 00:53:03,360 look at what matters. 1025 00:53:03,360 --> 00:53:07,000 Performance for making a branch under Git, literally 1026 00:53:07,000 --> 00:53:11,219 you create a new file that is 41 bytes in size. 1027 00:53:11,219 --> 00:53:13,980 How fast do you think that is? 1028 00:53:13,980 --> 00:53:16,710 I don't think you can measure it. 1029 00:53:16,710 --> 00:53:18,320 If you use Windows you can probably 1030 00:53:18,320 --> 00:53:19,650 measure it because file-- 1031 00:53:19,650 --> 00:53:20,910 but whatever. 1032 00:53:20,909 --> 00:53:23,809 It is so fast you can't really measure it. 1033 00:53:23,809 --> 00:53:25,900 That's creating a branch. 1034 00:53:25,900 --> 00:53:28,329 Nobody cares. 1035 00:53:28,329 --> 00:53:30,420 It's not an issue. 1036 00:53:30,420 --> 00:53:32,690 That's not it. 1037 00:53:32,690 --> 00:53:36,909 The only thing that matters is how fast can you merge? 1038 00:53:36,909 --> 00:53:43,079 In Git I merge 22,000 files several times a day and I get unhappy 1039 00:53:43,079 --> 00:53:45,719 if a merge takes more than five seconds. 1040 00:53:45,719 --> 00:53:49,579 And all of those five seconds are just the downloading of 1041 00:53:49,579 --> 00:53:53,829 the deltas between the two trees. 1042 00:53:53,829 --> 00:53:58,190 The merge itself takes less than half a second, and I 1043 00:53:58,190 --> 00:54:00,909 don't have to think about it. 
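The 41-byte figure is the 40 hexadecimal digits of a SHA-1 commit name plus a trailing newline. The sketch below reproduces that arithmetic in Python by writing such a file into a temporary directory laid out like a loose branch ref. The branch name `my-feature` and the commit name are made up for illustration, and no real repository is touched.

```python
import hashlib
import os
import tempfile

# Hypothetical commit name: any 40-character hex SHA-1 would do here.
commit_id = hashlib.sha1(b"example commit").hexdigest()

with tempfile.TemporaryDirectory() as repo:
    heads = os.path.join(repo, "refs", "heads")
    os.makedirs(heads)
    branch_file = os.path.join(heads, "my-feature")
    # newline="\n" keeps the line ending a single byte on every platform.
    with open(branch_file, "w", newline="\n") as f:
        f.write(commit_id + "\n")
    size = os.path.getsize(branch_file)

print(size)  # 40 hex digits plus one newline
```

Creating the branch costs one small file write, regardless of how large the project or its history is.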
1044 00:54:00,909 --> 00:54:04,219 What takes longer than the merge is after every merge by 1045 00:54:04,219 --> 00:54:06,719 default, Git will do a diff stat of everything that 1046 00:54:06,719 --> 00:54:09,569 changed as a result of that merge because I 1047 00:54:09,570 --> 00:54:10,990 do care about that. 1048 00:54:10,989 --> 00:54:13,579 When I merged from somebody, I trust them. 1049 00:54:13,579 --> 00:54:16,139 But on the other hand, hey, they might have stopped using 1050 00:54:16,139 --> 00:54:17,940 their medication. 1051 00:54:17,940 --> 00:54:22,019 I mean, I trust them, but let's just be honest here. 1052 00:54:22,019 --> 00:54:23,630 They might have been OK yesterday, 1053 00:54:23,630 --> 00:54:25,380 today not a good day. 1054 00:54:25,380 --> 00:54:28,500 So I do a diff stat and Git does that by default. 1055 00:54:28,500 --> 00:54:30,530 You can turn it off if you really want to, but you 1056 00:54:30,530 --> 00:54:31,330 probably shouldn't. 1057 00:54:31,329 --> 00:54:34,230 It's fast enough anyway. 1058 00:54:34,230 --> 00:54:37,210 If it's a big merge, the diff stat usually takes a second or 1059 00:54:37,210 --> 00:54:40,990 two because creating a diff and actually doing all the 1060 00:54:40,989 --> 00:54:44,619 stats on how many lines changed, that actually is much 1061 00:54:44,619 --> 00:54:47,309 more expensive than doing the merge itself. 1062 00:54:47,309 --> 00:54:49,519 That is the kind of performance that actually 1063 00:54:49,519 --> 00:54:51,230 changes how you work. 1064 00:54:51,230 --> 00:54:54,990 It's no longer doing the same thing faster, it's allowing 1065 00:54:54,989 --> 00:54:58,439 you to work in a completely different manner, and that is 1066 00:54:58,440 --> 00:55:01,659 why performance matters and why you really shouldn't look 1067 00:55:01,659 --> 00:55:03,379 at anything but Git. 1068 00:55:03,380 --> 00:55:08,849 Hg's Mercurial is pretty good, but Git is better. 
1069 00:55:08,849 --> 00:55:12,269 I think I'm running out of time. 1070 00:55:12,269 --> 00:55:14,679 OK, this one is still interesting. 1071 00:55:14,679 --> 00:55:16,299 We never got to the implementation part, you 1072 00:55:16,300 --> 00:55:17,970 really don't care. 1073 00:55:17,969 --> 00:55:20,129 I will say this much about implementation: the 1074 00:55:20,130 --> 00:55:22,000 implementation is really simple. 1075 00:55:22,000 --> 00:55:26,699 The core data structures are really, really, really simple. 1076 00:55:26,699 --> 00:55:32,949 If you then look at the source code and realize it's 80,000 1077 00:55:32,949 --> 00:55:37,369 lines and mostly in C. And the kind of C I write most people 1078 00:55:37,369 --> 00:55:40,409 don't understand, but I commented. 1079 00:55:40,409 --> 00:55:43,639 The source code may sometimes look complicated because we 1080 00:55:43,639 --> 00:55:46,799 are very performance-centric. 1081 00:55:46,800 --> 00:55:48,180 I am, I really care. 1082 00:55:48,179 --> 00:55:51,679 And sometimes to make things go really fast, you have to 1083 00:55:51,679 --> 00:55:54,690 use more complicated algorithms than just checking 1084 00:55:54,690 --> 00:55:56,059 one file at a time. 1085 00:55:56,059 --> 00:55:59,299 When you're doing 22,000-file merges, you don't want to 1086 00:55:59,300 --> 00:56:00,630 check one file at a time. 1087 00:56:00,630 --> 00:56:03,320 You want to check the whole tree in one go and say, 1088 00:56:03,320 --> 00:56:06,470 they're the same, I didn't need to do anything. 1089 00:56:06,469 --> 00:56:08,759 So Git does things like that and that kind of blows the 1090 00:56:08,760 --> 00:56:12,650 source code up a bit because doing it well is complicated. 1091 00:56:12,650 --> 00:56:16,360 But the basics are really, really simple, and one of the 1092 00:56:16,360 --> 00:56:20,039 basics is this trust and reliability thing. 
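The idea of checking a whole tree in one go can be illustrated with hashed directory trees: if two subtrees have the same hash, nothing underneath them needs to be examined. The following Python sketch assumes a toy representation (nested dicts for directories, bytes for files) and a made-up hashing scheme; it is not Git's tree format, and for brevity it recomputes hashes on every call, where a real system would store them.

```python
import hashlib


def tree_id(node) -> str:
    """Hash a nested dict (directory) or bytes (file) bottom-up."""
    if isinstance(node, bytes):
        return hashlib.sha1(b"file\0" + node).hexdigest()
    entries = "".join(f"{name}:{tree_id(child)}\n" for name, child in sorted(node.items()))
    return hashlib.sha1(b"dir\0" + entries.encode()).hexdigest()


def changed_paths(old, new, prefix=""):
    """List paths that differ, skipping any subtree whose hash is identical."""
    if tree_id(old) == tree_id(new):
        return []  # whole subtree identical: nothing below needs a look
    if isinstance(old, bytes) or isinstance(new, bytes):
        return [prefix.rstrip("/")]
    out = []
    for name in sorted(set(old) | set(new)):
        if name not in old or name not in new:
            out.append(prefix + name)  # added or removed entry
        else:
            out.extend(changed_paths(old[name], new[name], prefix + name + "/"))
    return out


# Hypothetical project layout, loosely modeled on a kernel tree.
base = {"drivers": {"usb": {"core.c": b"a"}, "scsi": {"sd.c": b"b"}}, "README": b"r"}
theirs = {"drivers": {"usb": {"core.c": b"a2"}, "scsi": {"sd.c": b"b"}}, "README": b"r"}
print(changed_paths(base, theirs))
```

Here the `drivers/scsi` directory and `README` are dismissed with a single hash comparison each, and only the path that actually changed is reported.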
1093 00:56:20,039 --> 00:56:25,599 Every single piece of data, when Git tracks your content, 1094 00:56:25,599 --> 00:56:28,429 we compress it, we delta it against everything else. 1095 00:56:28,429 --> 00:56:33,190 But we also do a SHA-1 hash of the content, and we actually 1096 00:56:33,190 --> 00:56:35,230 check it when we use it. 1097 00:56:35,230 --> 00:56:39,099 If you have disk corruption, if you have DRAM corruption, 1098 00:56:39,099 --> 00:56:43,139 if you have any kind of problems at all, Git will 1099 00:56:43,139 --> 00:56:44,109 notice them. 1100 00:56:44,110 --> 00:56:47,829 It's not a question of if, it's a guarantee. 1101 00:56:47,829 --> 00:56:51,099 You can have people who try to be malicious. 1102 00:56:51,099 --> 00:56:53,559 They won't succeed. 1103 00:56:53,559 --> 00:56:57,849 You need to know exactly 20 bytes, you need to know the 1104 00:56:57,849 --> 00:57:04,480 160-bit SHA-1 name of your top of tree, and if you know that, 1105 00:57:04,480 --> 00:57:06,949 you can trust your tree all the way 1106 00:57:06,949 --> 00:57:08,519 down, the whole history. 1107 00:57:08,519 --> 00:57:11,449 You can have 10 years of history, you can have 100,000 1108 00:57:11,449 --> 00:57:14,439 files, you can have millions of revisions, and you can 1109 00:57:14,440 --> 00:57:20,559 trust every single piece of it because Git is so reliable and 1110 00:57:20,559 --> 00:57:24,059 all the basic data structures are really, really simple. 1111 00:57:24,059 --> 00:57:26,299 And we check checksums. 1112 00:57:26,300 --> 00:57:32,650 And we don't just check some piddly UDP packet checksum 1113 00:57:32,650 --> 00:57:36,180 that is a 16-bit sum of all the bytes. 1114 00:57:36,179 --> 00:57:39,429 We check a checksum that is considered 1115 00:57:39,429 --> 00:57:41,609 cryptographically secure. 
1116 00:57:41,610 --> 00:57:45,289 Nobody has been able to break SHA-1, but the point is the 1117 00:57:45,289 --> 00:57:47,949 SHA-1, as far as Git is concerned, isn't even a 1118 00:57:47,949 --> 00:57:49,569 security feature. 1119 00:57:49,570 --> 00:57:53,530 It's purely a consistency check. 1120 00:57:53,530 --> 00:57:58,380 The security parts are elsewhere, so a lot of people 1121 00:57:58,380 --> 00:58:03,240 assume that since Git uses SHA-1 and SHA-1 is used for 1122 00:58:03,239 --> 00:58:06,039 cryptographically secure stuff, they think that, OK, 1123 00:58:06,039 --> 00:58:08,840 it's a huge security feature. 1124 00:58:08,840 --> 00:58:11,329 It has nothing at all to do with security, it's just the 1125 00:58:11,329 --> 00:58:14,090 best hash you can get. 1126 00:58:14,090 --> 00:58:17,684 Having a good hash is good for being able to trust your data. 1127 00:58:17,684 --> 00:58:21,360 It happens to have some other good features, too. 1128 00:58:21,360 --> 00:58:23,630 It means that when we hash objects, we know that the 1129 00:58:23,630 --> 00:58:26,599 hashes are actually well-distributed and we don't 1130 00:58:26,599 --> 00:58:28,529 have to worry about certain distribution issues. 1131 00:58:28,530 --> 00:58:32,130 So internally, it means from an implementation standpoint 1132 00:58:32,130 --> 00:58:36,599 we can trust that the hashes are so good that we can use 1133 00:58:36,599 --> 00:58:39,750 hashing algorithms and know that there are no bad cases. 1134 00:58:39,750 --> 00:58:43,000 So there are some reasons to like the 1135 00:58:43,000 --> 00:58:44,829 cryptographic side, too. 1136 00:58:44,829 --> 00:58:50,039 But it's really about the ability to trust your data. 
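The claim that one trusted 20-byte name is enough to check an entire history follows from each object's name being a hash of its content, with each commit record naming its content and its parent. The Python sketch below shows that chain on a toy model. The record format, the `commit` and `verify` helpers, and the sample data are all assumptions made for illustration and are much simpler than Git's real objects.

```python
import hashlib


def object_id(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def commit(store, content: bytes, parent):
    """Store a content blob and a commit record naming that blob and its parent."""
    blob = object_id(content)
    store[blob] = content
    record = f"content {blob}\nparent {parent or 'none'}\n".encode()
    cid = object_id(record)
    store[cid] = record
    return cid


def verify(store, tip: str) -> bool:
    """Re-hash every object reachable from the trusted tip name."""
    cid = tip
    while cid != "none":
        record = store.get(cid)
        if record is None or object_id(record) != cid:
            return False
        fields = dict(line.split(" ", 1) for line in record.decode().splitlines())
        blob = store.get(fields["content"])
        if blob is None or object_id(blob) != fields["content"]:
            return False
        cid = fields["parent"]
    return True


store = {}
c1 = commit(store, b"version 1\n", None)
c2 = commit(store, b"version 2\n", c1)
ok_before = verify(store, c2)

# Simulate corruption or tampering with old history on an untrusted copy.
store[object_id(b"version 1\n")] = b"version 1, altered\n"
ok_after = verify(store, c2)
print(ok_before, ok_after)
```

Only the tip name `c2` has to come from a trusted place. Any change to older content alters a hash somewhere along the chain, so the check fails.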
1137 00:58:50,039 --> 00:58:55,590 I guarantee you, if you put your data in Git, you can 1138 00:58:55,590 --> 00:58:58,850 trust the fact that five years later, after it was converted 1139 00:58:58,849 --> 00:59:03,579 from your hard disk to DVD to whatever new technology and 1140 00:59:03,579 --> 00:59:06,989 you copied it along, five years later you can verify 1141 00:59:06,989 --> 00:59:10,609 that the data you get back out is the exact same 1142 00:59:10,610 --> 00:59:11,680 data you put in. 1143 00:59:11,679 --> 00:59:13,909 And that's something you really should look for in a 1144 00:59:13,909 --> 00:59:16,029 source control management system. 1145 00:59:16,030 --> 00:59:20,370 One of the reasons I care is for the kernel, we had a break 1146 00:59:20,369 --> 00:59:27,549 in on one of the BitKeeper sites where people tried to 1147 00:59:27,550 --> 00:59:30,410 corrupt the kernel source code repositories. 1148 00:59:30,409 --> 00:59:31,969 And BitKeeper actually caught it. 1149 00:59:31,969 --> 00:59:35,759 BitKeeper did not have a really fancy hash at all. 1150 00:59:35,760 --> 00:59:40,370 I think it's a 16-bit CRC, something like that. 1151 00:59:40,369 --> 00:59:46,750 But it was good enough that you could actually see clumsy. 1152 00:59:46,750 --> 00:59:49,829 It was not cryptographically secure, but it was hard enough 1153 00:59:49,829 --> 00:59:55,694 in practice to overcome that it was caught immediately. 1154 00:59:55,695 --> 00:59:58,800 1155 00:59:58,800 --> 01:00:04,470 When that happens once to you, you got burnt once, you don't 1156 01:00:04,469 --> 01:00:06,139 ever want to get burnt again. 1157 01:00:06,139 --> 01:00:08,769 Maybe your projects aren't that important. 1158 01:00:08,769 --> 01:00:12,219 My projects, they're important. 1159 01:00:12,219 --> 01:00:14,779 There's a reason I care. 1160 01:00:14,780 --> 01:00:17,470 This is also one of the reasons to go back to the 1161 01:00:17,469 --> 01:00:20,289 distribution angle a bit. 
1162 01:00:20,289 --> 01:00:23,670 When you do Google, for example, Google code, you have 1163 01:00:23,670 --> 01:00:28,099 your source repositories that you help people maintain, and 1164 01:00:28,099 --> 01:00:29,449 I think you do so under Subversion. 1165 01:00:29,449 --> 01:00:32,750 1166 01:00:32,750 --> 01:00:36,030 I would never, ever trust Google to maintain my source 1167 01:00:36,030 --> 01:00:37,130 code for me. 1168 01:00:37,130 --> 01:00:38,030 I'm sorry. 1169 01:00:38,030 --> 01:00:39,325 You're just not that trustworthy. 1170 01:00:39,324 --> 01:00:42,029 1171 01:00:42,030 --> 01:00:46,820 The reason I really prefer a distributed system is I can 1172 01:00:46,820 --> 01:00:51,600 keep my source code behind three firewalls on a system 1173 01:00:51,599 --> 01:00:55,529 that does not allow SSH in at all. 1174 01:00:55,530 --> 01:00:59,450 When I'm here, I cannot read my email because my email goes 1175 01:00:59,449 --> 01:01:03,549 onto my machine, and the only way I can get into that 1176 01:01:03,550 --> 01:01:06,390 machine is when I'm physically on that network. 1177 01:01:06,389 --> 01:01:09,779 So maybe I'm cuckoo, maybe I'm a bit crazy and I care about 1178 01:01:09,780 --> 01:01:12,330 security more than most people do. 1179 01:01:12,329 --> 01:01:17,699 But this whole notion that I would give the master copy of 1180 01:01:17,699 --> 01:01:21,259 source code that I trust and I care about so much, and I 1181 01:01:21,260 --> 01:01:25,080 would give it to a third party is ludicrous. 1182 01:01:25,079 --> 01:01:29,170 Not even Google, not a way in hell would I do that. 1183 01:01:29,170 --> 01:01:33,159 I allow Google to have a copy of it, but I want to have 1184 01:01:33,159 --> 01:01:36,929 something that I know nobody touched. 
1185 01:01:36,929 --> 01:01:39,884 And by the way, I'm not a great MIS person, so the disk 1186 01:01:39,885 --> 01:01:43,310 corruption issue is definitely a case that I might worry 1187 01:01:43,309 --> 01:01:45,190 about because I don't do backups. 1188 01:01:45,190 --> 01:01:50,639 So it's OK if I can then download it again from 1189 01:01:50,639 --> 01:01:52,339 multiple trusted parties. 1190 01:01:52,340 --> 01:01:54,680 I can verify them against each other, that 1191 01:01:54,679 --> 01:01:55,699 part is really easy. 1192 01:01:55,699 --> 01:01:58,789 I can verify them against hopefully that 20 bytes that I 1193 01:01:58,789 --> 01:02:00,239 really, really cared about. 1194 01:02:00,239 --> 01:02:02,459 Hopefully I have that in a few places. 1195 01:02:02,460 --> 01:02:07,199 20 bytes is easier to track than 180 megabytes and 1196 01:02:07,199 --> 01:02:09,439 corruption is less likely to hit those 20 bytes. 1197 01:02:09,440 --> 01:02:14,039 If I have those 20 bytes, I can download a Git repository 1198 01:02:14,039 --> 01:02:19,389 from a completely untrusted source and I can guarantee 1199 01:02:19,389 --> 01:02:22,139 that they didn't do anything bad to it. 1200 01:02:22,139 --> 01:02:24,519 That's a huge thing, and that's something that when you 1201 01:02:24,519 --> 01:02:28,949 do hosted repositories for other people, if you use 1202 01:02:28,949 --> 01:02:32,259 Subversion you're just not doing it right. 1203 01:02:32,260 --> 01:02:36,230 You're not allowing them to sleep well at night. 1204 01:02:36,230 --> 01:02:42,289 Of course, if you do it for 75,000 projects, most of them 1205 01:02:42,289 --> 01:02:43,559 are probably pretty small and not very 1206 01:02:43,559 --> 01:02:44,809 important, so it's OK. 1207 01:02:44,809 --> 01:02:47,420 1208 01:02:47,420 --> 01:02:50,599 That should make people feel better. 1209 01:02:50,599 --> 01:02:51,659 I have a few more slides. 1210 01:02:51,659 --> 01:02:54,609 I think we're over time. 
1211 01:02:54,610 --> 01:02:56,860 I'm not even going to bother showing them, they're not that 1212 01:02:56,860 --> 01:02:58,110 interesting I think. 1213 01:02:58,110 --> 01:03:00,820 1214 01:03:00,820 --> 01:03:07,630 I talked a bit about this, about content versus 1215 01:03:07,630 --> 01:03:08,840 individual files. 1216 01:03:08,840 --> 01:03:10,090 Git tracks content. 1217 01:03:10,090 --> 01:03:12,940 1218 01:03:12,940 --> 01:03:18,240 There is the only sample command line in the whole 1219 01:03:18,239 --> 01:03:19,549 presentation. 1220 01:03:19,550 --> 01:03:24,430 Gitk is the graphical viewer of history of a Git project. 1221 01:03:24,429 --> 01:03:28,279 It's a [UNINTELLIGIBLE] 1222 01:03:28,280 --> 01:03:32,980 script that is really only doing viewing of stuff that 1223 01:03:32,980 --> 01:03:35,889 Git is really good at showing you. 1224 01:03:35,889 --> 01:03:39,559 And this is the kind of command line I use as a 1225 01:03:39,559 --> 01:03:41,309 top-level maintainer. 1226 01:03:41,309 --> 01:03:46,759 I want to be able to say what changed since a particular 1227 01:03:46,760 --> 01:03:49,270 version, maybe since a particular date, I can do that 1228 01:03:49,269 --> 01:03:53,250 easily, in those two directories or in those two 1229 01:03:53,250 --> 01:03:55,420 directories and that file. 1230 01:03:55,420 --> 01:04:01,010 And what this will show me is the global history as it 1231 01:04:01,010 --> 01:04:04,480 pertains to those parts of the repository. 1232 01:04:04,480 --> 01:04:08,920 It is more expensive to compute than the global, 1233 01:04:08,920 --> 01:04:16,180 global history, but if my laptop was actually connected 1234 01:04:16,179 --> 01:04:18,569 to the A/V system, I could show you. 1235 01:04:18,570 --> 01:04:22,430 Even on that laptop, it comes up in seconds. 1236 01:04:22,429 --> 01:04:24,349 It is that expensive, but we are that good. 
1237 01:04:24,349 --> 01:04:27,110 1238 01:04:27,110 --> 01:04:29,470 This is something that is really, really unique to Git. 1239 01:04:29,469 --> 01:04:30,879 Nobody else can do it. 1240 01:04:30,880 --> 01:04:34,119 And it's a hugely important feature. 1241 01:04:34,119 --> 01:04:37,299 Maybe it's not so important to individual developers because 1242 01:04:37,300 --> 01:04:41,070 individual developers often do think in terms of single 1243 01:04:41,070 --> 01:04:48,590 files, but it is important for the people who merge stuff, it 1244 01:04:48,590 --> 01:04:52,059 is important for people like me and the people I work with 1245 01:04:52,059 --> 01:04:55,199 directly because they never basically care 1246 01:04:55,199 --> 01:04:56,230 about a single file. 1247 01:04:56,230 --> 01:04:58,500 And they do care about these kinds of features. 1248 01:04:58,500 --> 01:05:01,820 Somebody sends a bug report, which bug reports are usually 1249 01:05:01,820 --> 01:05:02,760 not very good. 1250 01:05:02,760 --> 01:05:05,530 But maybe the bug report is good enough that you can 1251 01:05:05,530 --> 01:05:08,150 pinpoint, OK, SCSI subsystem. 1252 01:05:08,150 --> 01:05:10,090 That's the command line. 1253 01:05:10,090 --> 01:05:14,700 You can't say which file, but you can do this and say, OK, 1254 01:05:14,699 --> 01:05:21,230 that will cut it down from the 15,000 commits we've had since 1255 01:05:21,230 --> 01:05:24,619 last week, it will cut it down to 50. 1256 01:05:24,619 --> 01:05:26,380 That's a huge deal. 1257 01:05:26,380 --> 01:05:30,039 That is something that nobody else can do, I guarantee you. 1258 01:05:30,039 --> 01:05:32,619 1259 01:05:32,619 --> 01:05:35,000 So that's the reason you want to use Git. 1260 01:05:35,000 --> 01:05:36,170 That's what it all boils down to. 
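Cutting the global history down to the commits that matter for a few directories can be pictured as a filter over the project-wide commit list. The Python sketch below is a deliberate simplification: the `limit_history` helper, the commit records, and the file paths are hypothetical, and each commit carries an explicit list of changed files, whereas Git works this out by comparing the trees of a commit and its parents.

```python
def limit_history(commits, paths):
    """Keep only commits touching a file at or under one of the given paths."""
    def touches(changed_file):
        return any(
            changed_file == p or changed_file.startswith(p.rstrip("/") + "/")
            for p in paths
        )
    return [c for c in commits if any(touches(f) for f in c["files"])]


# Hypothetical project-wide history; file names are for illustration only.
history = [
    {"id": "c1", "files": ["drivers/scsi/sd.c"]},
    {"id": "c2", "files": ["drivers/usb/core/hub.c", "include/linux/usb.h"]},
    {"id": "c3", "files": ["include/scsi/scsi.h", "kernel/sched.c"]},
    {"id": "c4", "files": ["Documentation/README"]},
]

scsi_only = limit_history(history, ["drivers/scsi", "include/scsi"])
print([c["id"] for c in scsi_only])
```

The result is still a history of the whole project, just restricted to the commits that touched the named directories, which is the kind of view a subsystem-level bug report calls for.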
1261 01:05:36,170 --> 01:05:39,430 It's safe, it is so fast that you can do things that nobody 1262 01:05:39,429 --> 01:05:43,769 else can do, it does things that nobody else can do, even 1263 01:05:43,769 --> 01:05:46,940 slowly, and it's distributed. 1264 01:05:46,940 --> 01:05:49,179 So go and spread the word. 1265 01:05:49,179 --> 01:05:50,849 We have one more question, I guess. 1266 01:05:50,849 --> 01:05:51,980 What is the timing like? 1267 01:05:51,980 --> 01:05:53,690 I don't know. 1268 01:05:53,690 --> 01:05:55,159 AUDIENCE: Quickly. 1269 01:05:55,159 --> 01:05:59,079 So one of the reasons why we would switch from Perforce is 1270 01:05:59,079 --> 01:06:01,500 scalability and performance. 1271 01:06:01,500 --> 01:06:05,710 Otherwise, people would just say, keep using it. 1272 01:06:05,710 --> 01:06:08,869 Would we be exchanging one set of scalability performance 1273 01:06:08,869 --> 01:06:13,719 problems for other scalability performance problems? 1274 01:06:13,719 --> 01:06:16,449 LINUS TORVALDS: I already mentioned the fact that I 1275 01:06:16,449 --> 01:06:20,259 don't know how you maintain stuff in Perforce, but when 1276 01:06:20,260 --> 01:06:23,370 and if you do a switchover to Git, what you want to make 1277 01:06:23,369 --> 01:06:26,809 sure is because of this content model, you need to do 1278 01:06:26,809 --> 01:06:30,509 it at sane content boundaries. 1279 01:06:30,510 --> 01:06:32,890 The content boundaries usually are actually pretty 1280 01:06:32,889 --> 01:06:33,710 self-obvious. 1281 01:06:33,710 --> 01:06:35,000 I mean, they really are. 1282 01:06:35,000 --> 01:06:39,730 You have the compiler, you have the main source, you have 1283 01:06:39,730 --> 01:06:41,230 the documentation. 1284 01:06:41,230 --> 01:06:43,880 Well, you probably have the documentation spread out, but 1285 01:06:43,880 --> 01:06:48,030 you may have some user visible documentation. 1286 01:06:48,030 --> 01:06:49,190 Or maybe Google doesn't. 
1287 01:06:49,190 --> 01:06:52,579 But a lot of companies have a separate set of documentation 1288 01:06:52,579 --> 01:06:55,009 that they give to customers, and then they have the 1289 01:06:55,010 --> 01:06:58,630 documentation that goes into each individual package, is 1290 01:06:58,630 --> 01:07:00,590 package-based. 1291 01:07:00,590 --> 01:07:04,250 So one of the things you do have to think about with Git 1292 01:07:04,250 --> 01:07:09,079 is you want to make sure it is in a somewhat sane hierarchy. 1293 01:07:09,079 --> 01:07:11,750 1294 01:07:11,750 --> 01:07:14,690 Git can easily handle larger projects. 1295 01:07:14,690 --> 01:07:20,090 You can have 10,000 files and that's not a problem. 1296 01:07:20,090 --> 01:07:25,850 The kernel is 22, we've done tests with 100, it's fine. 1297 01:07:25,849 --> 01:07:28,480 It's faster than anything else. 1298 01:07:28,480 --> 01:07:33,440 With a million files, I suspect other systems will be 1299 01:07:33,440 --> 01:07:35,289 faster at some things. 1300 01:07:35,289 --> 01:07:36,849 And that's the kind of situation I don't 1301 01:07:36,849 --> 01:07:38,819 want you to get into. 1302 01:07:38,820 --> 01:07:42,950 But if you do that basic setup correctly, it will be 1303 01:07:42,949 --> 01:07:47,429 basically faster at pretty much everything, than anything 1304 01:07:47,429 --> 01:07:48,539 anybody else will. 1305 01:07:48,539 --> 01:07:52,009 I am very confident about Git performance. 1306 01:07:52,010 --> 01:07:56,185 One of the things we don't necessarily do really well is 1307 01:07:56,184 --> 01:07:58,409 the CVS Annotate. 1308 01:07:58,409 --> 01:08:01,799 People use CVS Annotate a lot if they use CVS. I'm told it 1309 01:08:01,800 --> 01:08:05,380 sucks under Perforce, too, so you probably don't use the 1310 01:08:05,380 --> 01:08:09,280 Perforce version of Annotate, I'm not sure. 1311 01:08:09,280 --> 01:08:11,660 But CVS users are used to CVS Annotate. 
1312 01:08:11,659 --> 01:08:15,519 It's the one operation that CVS can do faster than Git, 1313 01:08:15,519 --> 01:08:19,140 because CVS does track things one file at a time. 1314 01:08:19,140 --> 01:08:20,069 Git doesn't. 1315 01:08:20,069 --> 01:08:26,920 Git has an annotate, but if you moved a function from one 1316 01:08:26,920 --> 01:08:32,359 file to another, Git will literally tell you the history 1317 01:08:32,359 --> 01:08:35,010 of that function even across that move. 1318 01:08:35,010 --> 01:08:38,970 Not a file move, a function within a file. 1319 01:08:38,970 --> 01:08:41,970 It will go and dig back and say, hey, those two lines 1320 01:08:41,970 --> 01:08:44,689 actually came from that other file five years ago. 1321 01:08:44,689 --> 01:08:47,140 That is, again, something nobody else can do and it 1322 01:08:47,140 --> 01:08:48,560 boils down to the same thing. 1323 01:08:48,560 --> 01:08:51,380 It's the content that matters, it's not actually the files. 1324 01:08:51,380 --> 01:08:54,350 But it does make it a much more expensive operation, so 1325 01:08:54,350 --> 01:08:57,620 if you go back five years maybe it takes 30 seconds. 1326 01:08:57,619 --> 01:09:01,460 On the kernel, it takes a second for any file I have. We 1327 01:09:01,460 --> 01:09:05,220 started from no history two years ago because we just made 1328 01:09:05,220 --> 01:09:08,079 the decision that let's not make it more complicated than 1329 01:09:08,079 --> 01:09:08,659 it needs to be. 1330 01:09:08,659 --> 01:09:10,579 So right now, we only have two years of 1331 01:09:10,579 --> 01:09:12,359 history in the kernel. 1332 01:09:12,359 --> 01:09:16,149 We have more history in other projects that we've done 1333 01:09:16,149 --> 01:09:16,930 timings on. 1334 01:09:16,930 --> 01:09:20,700 So we've done timings on importing the KDE and things 1335 01:09:20,699 --> 01:09:21,949 like that with more history.
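[Editor's sketch, not part of the talk: the annotate behaviour described above, following lines that moved between files, corresponds to the `-M` and `-C` options of `git blame`. The helper name and the example file path are illustrative assumptions; the flags themselves are standard Git. `-M` detects lines moved or copied within a file; `-C` also looks in other files changed in the same commit, and repeating it widens the search at extra cost, which fits the speaker's remark that this makes annotate more expensive.]

```python
def blame_cmd(path, copy_detection=1):
    """Build a `git blame` argv that follows content across moves and copies.

    copy_detection is the number of times `-C` is passed (0 to 3):
      0: only `-M`, lines moved or copied within the same file
      1: also other files modified in the same commit
      2: also the commit that created the file
      3: also any commit (slowest)
    """
    if not 0 <= copy_detection <= 3:
        raise ValueError("copy_detection must be between 0 and 3")
    return ["git", "blame", "-M"] + ["-C"] * copy_detection + ["--", path]


if __name__ == "__main__":
    # Hypothetical file path, chosen only for illustration.
    print(" ".join(blame_cmd("kernel/sched.c", copy_detection=2)))
```

[Running the printed command inside a repository annotates each line with the commit and original file it came from, so lines carried over from another file show that file's name.]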
1336 01:09:21,949 --> 01:09:25,470 1337 01:09:25,470 --> 01:09:30,164 There are performance issues, but most of them are, Git is 1338 01:09:30,164 --> 01:09:32,229 one or two orders of magnitude faster. 1339 01:09:32,229 --> 01:09:36,520 So most of them are the good kind. 1340 01:09:36,520 --> 01:09:39,359 And if you find something, we actually have a really, really 1341 01:09:39,359 --> 01:09:40,640 good community. 1342 01:09:40,640 --> 01:09:45,810 The Git mailing list is fairly high signal to noise. 1343 01:09:45,810 --> 01:09:49,039 It does get a fair amount of emails, but it's actually a 1344 01:09:49,039 --> 01:09:52,970 very pleasant mailing list. If anybody is interested, read 1345 01:09:52,970 --> 01:09:55,440 the sources first, but start looking at the 1346 01:09:55,439 --> 01:09:58,119 mailing list archives. 1347 01:09:58,119 --> 01:10:00,739 We have our flames, we have our pointless discussions, but 1348 01:10:00,739 --> 01:10:02,170 most of it is actually very good. 1349 01:10:02,170 --> 01:10:05,190 1350 01:10:05,189 --> 01:10:06,689 OK. 1351 01:10:06,689 --> 01:10:07,939 Thanks. 1352 01:10:07,939 --> 01:10:14,619