Gluster blog stories provide high-level spotlights on our users all over the world
This post is about parsimonious planning, back-of-the-envelope engineering, and extreme productivity.
|By back of the envelope, we don’t mean flow charting. Rather, we mean real world consideration of the quantitative aspects of all of the major segments in a system. Notice that there is little mention of accidental aspects of the system (i.e. Java, C++, AWS, Java, Heroku, …). The envelope doesn’t document technology – rather, it tells us about the actual problem we are solving in a largely technology independent manner.
This post is not about brain teasers: Rather, its about spurring your ability to implement envelope-driven-development in the real world, quickly.
I’m not a hardware guy, not at all. But in any case, I’m going to (attempt to) show you how to quickly walk through 3 simple estimation problems which test your ability to estimate the necessary CPU, I/O, etc… resources for maintaining a simple web site. Of course, the example site can’t be too simple – otherwise, we wouldn’t learn anything. So… we’ll use a “community forum” as the template (i.e. the website is dynamic, and grows over time).
But first, especially if you’re new to this: to get better at back-of-the-envelope calculations – I suggest spending some time on these websites:
Prototypes vs Envelopes
Before we begin, lets consider the benefits of envelope-driven development in light of the infamous “prototype”. We’ve all built a prototype before, and we have probably learned that prototypes don’t always tell us very much about the way our real world application will scale. In addition, they take a while, and involve alot of boilerplate coding that doesn’t advance our intuition about the target system we’re trying to build.
|Data Scale||TB of data and beyond||MB of data|
|Time||Days or less||Weeks, Months|
|Deliverables||Scalability estimates, hot spots for further research. Quantitative metrics for moving forward.||“Working” code (grain of salt!) that usually doesn’t really work at all.|
|Cost($)||100s-1000s||1000s (novice built), 10K-50K+ (expert built)|
So there you have it: Our back of the envelope architecture deals with large data sets (by abstracting them), in a shorter period of time, without overpromising. The only real advantage to building a prototype as a first go is if you really, really, really must see working code for some reason (i.e. you are evaluating the competency of the people building your system, rather than focusing on the system itself). It this is the case – this article is not for you. This post is for those who know, from the get-go, that the system they are building is of primary importance, having direct, immediate value… So, here’s how this is gonna work:
Now: lets say you are building a web-forum, you know, like http://www.skateboard-city.com/ or something… And you want to estimate how big your database might need to get, and its growth rate (so that you can decide where to host it, what scalability issues you might have, etc).
Assuming, say 1000 users, with 10 posts a day… What kind of computing infrastructure do you need to manage the data ?
We can easily simplify the question by bounding this question, by again being parsimonious… Lets just put a cap of 100GB on the amount of disk space. It makes the problem extremely tractable. We know that any modern hard drive can easily handle that.
So the question now becomes: will your forum-thingy ever exceed 100GB of data?
If not, clearly, its data can be contained a small linux box under your desk. Armed with nothing other than the above data points, we can safely (and confidently) determine a back-of-the-envelope answer to this question : with no coding required.
Assumption: Lets assume that a typical forum post might be a paragraph ~ thats about 4 sentences. Okay… So how much data is that?
Your growth rate is about 240 bytes per day, times 10, which amounts to about 2.4 kilo bytes per day, which is about 3/1000 Megabytes. So, it will take you a year to hit the MB mark, and 1000 years to reach the gigabyte mark. Which means that you can save your entire forum on a a tiny hardrive, or even, a USB thumb drive, for years to come (we can safely assume that you also have space to save the basic static content of the site, as well, which won’t be growing).
The answer: The 100GB limit won’t be reached for quite some time. Your safe hosting this on a rinky-dink machine that sits underneath your desktop. At least, your safe in terms of disk space.
Okay : So the last question was pretty easy. 10 posts a day isn’t that much. We know that we can host our fictional forum’s core database, which grows reasonably slowly, on a cheap box under our desktop. But, can we handle the traffic, i.e., the i/o ?
Now, again, here are the freebies:
What is the cost (in terms of bandwidth), for hosting 10,000 hits ? First, lets assume that each page will have 2.4 KB of content. But… since we know that the data on our site will be an underestimate of a page’s size, we might want to bump that up to 10KB, per page, just to be safe.
Is that a good estimate?
We can easily find out by firing up the terminal and checking:
find ./ -name *html | xargs ls -altrh
Again, we’ve just “sureyed” a reasonably sized sample of html pages with no “real” coding required.
On my laptop, most of the HTML files are, in fact, between 10 and 50 KB. So… Lets bump it up another order of magnitude, just to be safe. We will estimate that our forum page sizes are 100KB a pop.
So, spreading 100KB over a period of 24 hours (admittedly, ignoring peak times), we need an upstream connection speed of 10,000*100KB/24 hours = 50,000 KB/ hour = 50 MB / hr.
Okay, so we need to stream around 50MB per hour. Can we easily do this on a small machine on a reasonably fast, conventional network?
Again, we can simplify this by being pessimistic: can our home internet connection handle that sort of load ? Easily. Without googleing for “Megabytes per hour, home internet”, we can estimate this as well. Since our Iphone is capable of uploading 1 MB in under a minute, its obvious that we can handle 60MB in under an hour. Problem solved.
Lets say we want to enable a search engine for our little blog site. How much extra disk space would we need to index all the text on the site, and how will we host the site?
This one is trickier because we need to factor in CPU cycles, Memory, and the architecture of a search engine. Here’s a really aweomse freebie for this exersize that you can tuck away:
A quick reality check : So: Let’s estimate how much memory this might take up: if each word is ~8 characters, at 8 bytes/character, we have about 64 bytes per word. 64 bytes/word * 10 million words ~ 640 million bytes, which is around 640 MB. The actual size of the first google search engines word index was around 256MB, which means they might have used some sort of compression to store the words… OR, that the average word length is less than 8 characters. A quick google search yields 5.10 as the average word length. What happens if we plug this into our estimate? We get slightly closer ~400 MB. So… maybe words on the web are smaller. We’re certainly close enough for this estimate, and its encouraging that, by adding more data to our equation, we get almost 50% closer to the “right” answer.
And, finally, here are the other basic ingredients that we will need to drum up a badass estimate for our now fictional, searchable forum:
Finally! The question:
Do we need to buy extra RAM or buy a multi-core machine CPU to host a performant search-engine for forum ?
Again, we start by translating the question: Will our search engine be CPU/Memory intensive? More specifically… Will our intel quad-core chip and 4GB machine be able to handle the load (lets assume 2GB is reserved for other applications, which means the server is effectively 2GB).
First, we can tackle the CPU cycles :
Another aside, just to be safe – lets estimate the amount of free CPU time which we expect to have for searching: To estimate if CPU load is going to be effected by incoming requests, we have: 10,000 (number of requests daily) / 24 (hour) ~ 50 req/hr. Assuming our web servers can handle one request in under 100 ms, this means that they, on average, will be spinning over requests for 5000ms, per hour, which is 6 seconds, or 1/600th of an hour, which means there should be more than enough CPU bandwidth to go around.
And as far as memory is concerned, that’s easy to 🙂
So… what have we learned from our little envelope friend ?
I had to practice this stuff for the dreaded google-interview once. Thats how I got started down this masochistic road. I didn’t get the job, I think because of the fact that I was struggling to find a recursive solution to the gray code problem… Or something like that… But I learned ton about on the fly estimating when preparing for it, and I got to eat ice cream out of some kind of robot-ice-cream-truck thingy, while witnessing my brain ooze out of the sides of my ears. So I think it was a worthwhile experience.
But oddly enough: they never really even asked me any estimation questions.
I think estimation questions aren’t particularly popular at a lot of software companies… I can’t, for the life of me, imagine why. It seems like estimation is the single most important aspect of our day to day lives as developers. I think this might be because most programmers secretly hate non-discretized (i.e. continuous) functions, which tend to be really important when doing back-of-the-envelope calculations. Or maybe most programmers just don’t like envelopes? Or maybe we, as developers, just like building new stuff so damn much that we would rather do extra work than be parsimonious about executing things efficiently. In any case, even if you like doing extra work… You can still benefit from parsimony: It frees you up to work on truly interesting problems, by decreasing the amount of time that you waste spinning your wheels on lukewarm, prototypical, not-quite-there-yet code.
Now its your turn!!!!!!!!!!!!!!!!!!!!!
Now that you’ve been conviced that back-of-the-envelope engineering is awesome, fun, and valuable, I’m sure you’re wondering: Where can I find more problems? Harder ones? How can I become an expert in this stuff?
Well… when I was in grad school, I remember a guy named Marty Schiller – he was a protein bioinformatician who did real experiments on real cells… and he used to back-of-the-envelope every experiment we did. Why? Because real experiments cost money, and his wife always wanted him to be home on time. He was a parsimonious bastard! And he got good at parsimony by applying it, every day, to his own work.
So… Want to be an envelope expert? Be like marty : Just take 20 minutes every morning to estimate the amount of CPU, memory, and disk I/O resources that your days work will be using. You can even do this if your not an engineer : just use rough benchmarks, i.e., assume that actively browsing the web takes up 3% of the memory on your computer, and that watching a YouTube video consumes 15% of your network download bandwidth, etc… Or better yet, use htop, along with your task manager, to actually see what kind of resources you use for common tasks.
The fringe benefits of this sort of thinking are reaped immediately. You’ll be amazed at the effects this have on your work. You will suddenly understand concurrency, non-blocking IO, database internals, and the underlying functionality of your platform much better, without ever having to buy a book about them. Why? Because when you start back-of-the-enveloping, you force yourself to understand the fundamentals of the system you are dealing with.
One last, final warning:
Don’t let your back of the envelope estimate turn into an executive summary. Really… if the envelope could predict the size of the human genome, convey the architecture of the worlds first laser light generator, and predict the magnitude of the first nuclear explosion (yes, in fact, these are 3 famous, landmark envelope predictions which were all within an order of magnitude) – its probably quite capable of describing the meat of your current software-architecture, regardless of the accidental complexities technolog(ies) involved.
2020 has not been a year we would have been able to predict. With a worldwide pandemic and lives thrown out of gear, as we head into 2021, we are thankful that our community and project continued to receive new developers, users and make small gains. For that and a...
It has been a while since we provided an update to the Gluster community. Across the world various nations, states and localities have put together sets of guidelines around shelter-in-place and quarantine. We request our community members to stay safe, to care for their loved ones, to continue to be...
The initial rounds of conversation around the planning of content for release 8 has helped the project identify one key thing – the need to stagger out features and enhancements over multiple releases. Thus, while release 8 is unlikely to be feature heavy as previous releases, it will be the...