Facebook's Engineering Culture

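Someone published How Facebook Ships Code, and I found the part about Facebook's engineer-driven culture especially interesting, so I translated it (I just googled it and other translations are already online; people are fast).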

* as of June 2010, the company has nearly 2000 employees, up from roughly 1100 employees 10 months ago. Nearly doubling staff in under a year!
As of June 2010, FB had about 2,000 employees, up from roughly 1,100 ten months earlier.

* the two largest teams are Engineering and Ops, with roughly 400-500 team members each. Between the two they make up about 50% of the company.
The two largest teams are Engineering and Ops, with roughly 400-500 members each; together they make up about half of the company.

* product manager to engineer ratio is roughly 1-to-7 or 1-to-10
The ratio of product managers to engineers is roughly 1:7 to 1:10.

* all engineers go through 4 to 6 week “Boot Camp” training where they learn the Facebook system by fixing bugs and listening to lectures given by more senior/tenured engineers. estimate 10% of each boot camp’s trainee class don’t make it and are counseled out of the organization.
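New engineers go through a 4-to-6-week Boot Camp to get familiar with FB, fixing bugs and attending lectures given by senior engineers. About 10% of each class do not make it through and are counseled out.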

* after boot camp, all engineers get access to live DB (comes with standard lecture about “with great power comes great responsibility” and a clear list of “fire-able offenses”, e.g., sharing private user data)
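After Boot Camp, all engineers get access to the production system (the live DB). This comes with the standard "with great power comes great responsibility" lecture and a clear list of things that will get you fired, such as sharing private user data.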

* [EDIT thx fryfrog] “There are also very good safe guards in place to prevent anyone at the company from doing the horrible sorts of things you can imagine people have the power to do being on the inside. If you have to “become” someone who is asking for support, this is logged along with a reason and closely reviewed. Straying here is not tolerated, period.”

* any engineer can modify any part of FB’s code base and check-in at-will
Any engineer can modify any part of FB's code base and check in at will.

* very engineering driven culture. ”product managers are essentially useless here.” is a quote from an engineer. engineers can modify specs mid-process, re-order work projects, and inject new feature ideas anytime.
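An extremely engineering-driven culture. As one engineer put it: "product managers are essentially useless here." Engineers can still change specs halfway through the process, reorder project priorities, and inject their own new ideas at any time.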

* during monthly cross-team meetings, the engineers are the ones who present progress reports. product marketing and product management attend these meetings, but if they are particularly outspoken, there is actually feedback to the leadership that “product spoke too much at the last meeting.” they really want engineers to publicly own products and be the main point of contact for the things they built.
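In the monthly cross-team meetings, engineers give the progress reports. Product marketing and product managers attend too, but if they are particularly outspoken, leadership gets feedback that "product spoke too much at the last meeting." Engineers are expected to own their products publicly and be the main point of contact for what they build.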

* resourcing for projects is purely voluntary.
o a PM lobbies group of engineers, tries to get them excited about their ideas.
o Engineers decide which ones sound interesting to work on.
o Engineer talks to their manager, says “I’d like to work on these 5 things this week.”
o Engineering Manager mostly leaves engineers’ preferences alone, may sometimes ask that certain tasks get done first.
o Engineers handle entire feature themselves — front end javascript, backend database code, and everything in between. If they want help from a Designer (there are a limited staff of dedicated designers available), they need to get a Designer interested enough in their project to take it on. Same for Architect help. But in general, expectation is that engineers will handle everything they need themselves.
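Staffing for projects comes entirely from engineers volunteering:

  • A PM lobbies engineers and tries to get them excited about working on their ideas
  • Engineers decide for themselves which PM's work to take on
  • The engineer then tells their manager: "I'd like to work on these 5 things this week"
  • The manager mostly lets engineers do as they choose, and occasionally asks that certain tasks get done first
  • Engineers handle the whole feature themselves, from JavaScript to the database and everything in between. If they want help from a designer (FB has only a few dedicated designers), they have to get a designer interested enough to join the project; the same goes for architect help. In general, though, engineers do all the work themselves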

* arguments about whether or not a feature idea is worth doing or not generally get resolved by just spending a week implementing it and then testing it on a sample of users, e.g., 1% of Nevada users.
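Whether a feature is worth doing is mostly not settled by arguing. Someone spends a week building it, and it is then tested on a small sample of users (for example, 1% of Nevada users) to decide.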
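As a rough illustration of what "testing on 1% of Nevada users" could look like, here is a minimal sketch of deterministic sampling by hashing a user ID. The article does not describe Facebook's actual mechanism; the function names, the `region` field, and the hashing scheme are all assumptions made for the example.

```python
import hashlib


def in_sample(user_id: str, experiment: str, percent: float) -> bool:
    """Return True if the user falls into the experiment's sample.

    Hypothetical helper: hashes (experiment, user_id) into one of 10,000
    buckets so the same user always gets the same answer for an experiment.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # bucket in 0..9999
    return bucket < percent * 100          # percent=1.0 selects buckets 0..99


def eligible(user: dict, experiment: str, region: str, percent: float) -> bool:
    """Restrict the sample to one region, e.g. region="NV" for Nevada."""
    return user["region"] == region and in_sample(user["id"], experiment, percent)


if __name__ == "__main__":
    users = [{"id": f"user{i}", "region": "NV"} for i in range(20_000)]
    chosen = [u for u in users if eligible(u, "new-feature", "NV", 1.0)]
    print(f"{len(chosen)} of {len(users)} Nevada users see the feature")
```

Hashing on the experiment name as well as the user ID means different experiments pick different 1% slices, so the same users are not always the test subjects.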

* engineers generally want to work on infrastructure, scalability and “hard problems” — that’s where all the prestige is. can be hard to get engineers excited about working on front-end projects and user interfaces. this is the opposite of what you find in some consumer businesses where everyone wants to work on stuff that customers touch so you can point to a particular user experience and say “I built that.” At facebook, the back-end stuff like news feed algorithms, ad-targeting algorithms, memcache optimizations, etc. are the juicy projects that engineers want.
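Engineers generally prefer to work on infrastructure, high load and high concurrency, the so-called "really hard technical problems", and other things that earn prestige. It is hard to get an engineer fired up about patching user interfaces. This is the opposite of some consumer businesses, where everyone wants to work on what users touch so they can point at part of a page and say "I built that." At FB, back-end work such as the News Feed algorithm, ad-targeting algorithms, and memcached optimization is what engineers most want to do (qyb: this paragraph was hard to translate; can anyone tell me how to render "juicy project"??)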

* commits that affect certain high-priority features (e.g., news feed) get code reviewed before merge. News Feed is important enough that Zuckerberg reviews any changes to it, but that’s an exceptional case.
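Commits that touch highly sensitive features are always code reviewed before merge. News Feed is important enough that Zuckerberg personally reviews every change to it, though that is an exceptional case.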
* [CORRECTION -- thx epriest] “There is mandatory code review for all changes (i.e., by one or more engineers). I think the article is just saying that Zuck doesn’t look at every change personally.”
* [CORRECTION thx fryfrog] “All changes are reviewed by at least one person, and the system is easy for anyone else to look at and review your code even if you don’t invite them to. It would take intentionally malicious behavior to get un-reviewed code in.”

* no QA at all, zero. engineers responsible for testing, bug fixes, and post-launch maintenance of their own work. there are some unit-testing and integration-testing frameworks available, but only sporadically used.
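FB has no QA, literally zero. Engineers are responsible for testing, bug fixes, and post-launch maintenance of their own work. Unit-testing and integration-testing frameworks do exist, but they are rarely used.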
* [CORRECTION thx fryfrog] “I would also add that we do have QA, just not an official QA group. Every employee at an office or connected via VPN is using a version of the site that includes all the changes that are next in line to go out. This version is updated frequently and is usually 1-12 hours ahead of what the world sees. All employees are strongly encouraged to report any bugs they see and these are very quickly actioned upon.”
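"I should add that FB does have QA, just not an official QA team. Every employee in an office or on the VPN uses a version of the site that includes all the changes next in line to go out. That version is updated frequently and is usually 1-12 hours ahead of production. Every employee is strongly encouraged to report any problem they run into, and these reports are acted on very quickly."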

* re: surprise at lack of QA or automated unit tests — “most engineers are capable of writing bug-free code. it’s just that they don’t have an incentive to do so at most companies. when there’s a QA department, it’s easy to just throw it over to them to find the errors.” [EDIT: please note that this was subjective opinion, I chose to include it in this post because of the stark contrast that this draws with standard development practice at other companies]
* [CORRECTION thx epriest] “We have automated testing, including “push-blocking” tests which must pass before the release goes out. We absolutely do not believe “most engineers are capable of writing bug-free code”, much less that this is a reasonable notion to base a business upon.”
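"FB has automated testing, including push-blocking tests that must pass before a release can go out. We absolutely do not believe that 'most engineers are capable of writing bug-free code', much less that this is a reasonable notion to base a business on."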

* re: surprise at lack of PM influence/control — product managers have a lot of independence and freedom. The key to being influential is to have really good relationships with engineering managers. Need to be technical enough not to suggest stupid ideas. Aside from that, there’s no need to ask for any permission or pass any reviews when establishing roadmaps/backlogs. ”My product director doesn’t even really know all the things I have on my roadmap.” There are relatively few PMs, but they all feel like they have responsibility for a really important and personally-interesting area of the company.
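re: the surprising lack of PM influence/control: product managers in fact have a great deal of independence and freedom. The key to a PM's influence is a good relationship with the engineering managers. PMs need to be technical enough not to suggest stupid ideas; beyond that, they need no permission or review when setting their roadmaps. "My product director doesn't even really know all the things I have on my roadmap." There are relatively few PMs, but each of them feels responsible for an area that is both really important and personally interesting.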

* by default all code commits get packaged into weekly releases (tuesdays)
By default, all code commits are packaged into a weekly release (Tuesdays).

* with extra effort, changes can go out same day
With extra effort, a change can go out the same day.

* tuesday code releases require all engineers who committed code in that week’s release candidate to be on-site
The Tuesday release requires every engineer who committed code to that week's release candidate to be on-site and on call.

* engineers must be present in a specific IRC channel for “roll call” before the release begins or else suffer a public “shaming”
Before the release begins, engineers must be present in an internal IRC channel for roll call, or else face public shaming.

* ops team runs code releases by gradually rolling code out
o facebook has around 60,000 servers
o there are 9 concentric levels for rolling out new code
o [CORRECTION thx epriest] “The nine push phases are not concentric. There are three concentric phases (p1 = internal release, p2 = small external release, p3 = full external release). The other six phases are auxiliary tiers like our internal tools, video upload hosts, etc.”
o the smallest level is only 6 servers
o e.g., new tuesday release is rolled out to 6 servers (level 1), ops team then observes those 6 servers and make sure that they are behaving correctly before rolling forward to the next level.
o if a release is causing any issues (e.g., throwing errors, etc.) then push is halted. the engineer who committed the offending changeset is paged to fix the problem. and then the release starts over again at level 1.
o so a release may go thru levels repeatedly: 1-2-3-fix. back to 1. 1-2-3-4-5-fix. back to 1. 1-2-3-4-5-6-7-8-9.
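The ops team runs the release, rolling the code out gradually to everyone:

  • FB has about 60,000 servers
  • The push has 3 concentric phases: p1 = internal release, p2 = small external release, p3 = full external release. Auxiliary tiers such as video upload hosts make up the other 6 phases, for p1 through p9 in total
  • The smallest level touches only 6 servers (qyb: my guess is this means FB can run all of its services on just 6 servers)
  • The Tuesday release starts at p1; the ops team watches those 6 servers and, once they behave correctly, rolls forward to the next level
  • If a release causes errors, the whole push is halted. The engineer who committed the offending code is paged to fix it, and then the release starts over from p1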
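The halt, fix, and restart loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: the level names, the `is_healthy` and `fix` callbacks, and the retry limit are hypothetical, and the levels are treated as a simple ordered list even though the correction above says only three of the nine phases are concentric.

```python
from typing import Callable, List


def staged_push(levels: List[str],
                is_healthy: Callable[[str, int], bool],
                fix: Callable[[str], None],
                max_attempts: int = 10) -> List[str]:
    """Push level by level; on a failure, fix and start over from the first level.

    Returns an event log so the path through the levels can be inspected.
    """
    log: List[str] = []
    for attempt in range(1, max_attempts + 1):
        for level in levels:
            log.append(f"push {level}")
            if not is_healthy(level, attempt):
                log.append(f"halt {level}")
                fix(level)   # page the engineer who owns the offending change
                break        # abandon this attempt and restart at the first level
        else:
            log.append("done")  # every level was pushed without a halt
            return log
    raise RuntimeError("push abandoned after too many attempts")


if __name__ == "__main__":
    levels = [f"p{i}" for i in range(1, 10)]
    # Hypothetical failures: attempt 1 breaks at p3, attempt 2 breaks at p5.
    failures = {1: "p3", 2: "p5"}
    log = staged_push(levels,
                      is_healthy=lambda level, attempt: failures.get(attempt) != level,
                      fix=lambda level: None)
    print(" ".join(log))
```

With the failures chosen in the demo, the log follows the article's pattern: 1-2-3-fix, back to 1, 1-2-3-4-5-fix, back to 1, then 1 through 9.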

* ops team is really well-trained, well-respected, and very business-aware. their server metrics go beyond the usual error logs, load & memory utilization stats — also include user behavior. E.g., if a new release changes the percentage of users who engage with Facebook features, the ops team will see that in their metrics and may stop a release for that reason so they can investigate.
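The ops team is really... awesome. Their dashboards show not only error logs, system load, and memory usage, but also user behavior. If a release changes the percentage of users who engage with some FB feature, it shows up on the dashboard, and they may halt the release to find out why.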
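A behavior-based stop condition of this kind might be sketched as follows. The article gives no detail on the real metrics, so the function names, the use of a relative change, and the 5% tolerance are all assumptions for illustration.

```python
def engagement_shift(baseline_rate: float, current_rate: float) -> float:
    """Relative change in the share of users engaging with a feature."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (current_rate - baseline_rate) / baseline_rate


def should_halt(baseline_rate: float, current_rate: float,
                tolerance: float = 0.05) -> bool:
    """Halt if engagement moved by more than `tolerance` in either direction.

    The 5% default is a made-up threshold, not a figure from the article.
    """
    return abs(engagement_shift(baseline_rate, current_rate)) > tolerance


if __name__ == "__main__":
    # 40% of users engaged before the push; compare against two readings after it.
    print(should_halt(0.40, 0.41))  # small drift, keep rolling
    print(should_halt(0.40, 0.30))  # large drop, stop and investigate
```

The check is symmetric on purpose: an unexpected jump in engagement can signal a bug just as much as a drop can.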

* during the release process, ops team uses an IRC-based paging system that can ping individual engineers via Facebook, email, IRC, IM, and SMS if needed to get their attention. not responding to ops team results in public shaming.
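During the release, the ops team pages engineers through IRC whenever needed. Developers who do not respond to the ops team promptly are publicly shamed.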

* once code has rolled out to level 9 and is stable, then done with weekly push.
Once the push has completed p9 and is stable, the weekly release is done.

* if a feature doesn’t get coded in time for a particular weekly push, it’s not that big a deal (unless there are hard external dependencies) — features will just generally get shipped whenever they’re completed.

* getting svn-blamed, publicly shamed, or slipping projects too often will result in an engineer getting fired. ”it’s a very high performance culture”. people that aren’t productive or aren’t super talented really stick out. Managers will literally take poor performers aside within 6 months of hiring and say “this just isn’t working out, you’re not a good culture fit”. this actually applies at every level of the company, even C-level and VP-level hires have been quickly dismissed if they aren’t super productive.
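Getting svn-blamed (qyb: my guess is that svn-blamed means someone committed a particularly dumb bug and the svn blame command was used to pull up the author of that commit and post it to an internal mailing list... perhaps FB periodically publishes a list of these engineers, and that is what being svn-blamed means), being publicly shamed, or slipping projects too often: any of these will get an engineer fired. "It's a very high performance culture"; people who are not productive or not very talented stick out. Within six months of hiring, managers will tell a poor performer "this just isn't working out." Even C-level and VP-level hires have been quickly dismissed for falling short of expectations. (qyb: it looks like there are only four levels under Mark: A/B/C/VP)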

* [CORRECTION, thx epriest] “People do not get called out for introducing bugs. They only get called out if they ask for changes to go out with the release but aren’t around to support them in case something goes wrong (and haven’t found someone to cover for you).”
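"Engineers are not called out just for introducing a bug. They are called out only if they asked for a change to go out with the release but are not around to support it when something goes wrong, and have not found someone to cover for them."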
* [CORRECTION, thx epriest] “Getting blamed will NOT get you fired. We are extremely forgiving in this respect, and most of the senior engineers have pushed at least one horrible thing, myself included. As far as I know, no one has ever been fired for making mistakes of this nature.”
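"Getting svn-blamed will not get you fired. We are very forgiving about this. Most senior engineers have pushed at least one horrible thing, myself included. As far as I know, no one has ever been fired for mistakes of this kind."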
* [CORRECTION, thx fryfrog] “I also don’t know of anyone who has been fired for making mistakes like are mentioned in the article. I know of people who have inadvertently taken down the site. They work hard to fix what ever caused the problem and everyone learns from it. The public shaming is far more effective than fear of being fired, in my opinion.”
