Web 3.0: The "sexy" job of the coming decades

Artwork: Tamar Cohen, Andrew J Buboltz, 2011, silk screen on a page from a high school yearbook, 8.5″ x 12″

Dear geonauts,

Web 3.0 is about Big Data and the market fever for a new generation of professionals: data scientists, Big Data analysts, programmers (coding being their universal skill), and data experts who pull information out of huge quantities of data and turn it into marketable products. Until now this kind of talent was sought mostly by technology companies and startups such as Google, LinkedIn, Facebook, Amazon, Microsoft, Walmart, eBay, and Twitter.

Who are these professionals?

People with an extremely rare mix of skills in their background: programmer, data hacker, analyst, communicator, and consultant. It is a combination that universities around the world do not yet offer as a degree, although such courses are beginning to be created.

(…) Hal Varian, the chief economist at Google, is known to have said: “The sexy (that is, most valued) job in the next 10 years will be statisticians. People think I’m joking, but who would have guessed that computer engineers would have been the sexy job of the 1990s?”

The full article follows, in English:

Data Scientist: The Sexiest Job of the 21st Century, HBR Magazine, October 2012

by Thomas H. Davenport and D.J. Patil


When Jonathan Goldman arrived for work in June 2006 at LinkedIn, the business networking site, the place still felt like a start-up. The company had just under 8 million accounts, and the number was growing quickly as existing members invited their friends and colleagues to join. But users weren’t seeking out connections with the people who were already on the site at the rate executives had expected. Something was apparently missing in the social experience. As one LinkedIn manager put it, “It was like arriving at a conference reception and realizing you don’t know anyone. So you just stand in the corner sipping your drink—and you probably leave early.”

Goldman, a PhD in physics from Stanford, was intrigued by the linking he did see going on and by the richness of the user profiles. It all made for messy data and unwieldy analysis, but as he began exploring people’s connections, he started to see possibilities. He began forming theories, testing hunches, and finding patterns that allowed him to predict whose networks a given profile would land in. He could imagine that new features capitalizing on the heuristics he was developing might provide value to users. But LinkedIn’s engineering team, caught up in the challenges of scaling up the site, seemed uninterested. Some colleagues were openly dismissive of Goldman’s ideas. Why would users need LinkedIn to figure out their networks for them? The site already had an address book importer that could pull in all a member’s connections.

Luckily, Reid Hoffman, LinkedIn’s cofounder and CEO at the time (now its executive chairman), had faith in the power of analytics because of his experiences at PayPal, and he had granted Goldman a high degree of autonomy. For one thing, he had given Goldman a way to circumvent the traditional product release cycle by publishing small modules in the form of ads on the site’s most popular pages.

Through one such module, Goldman started to test what would happen if you presented users with names of people they hadn’t yet connected with but seemed likely to know—for example, people who had shared their tenures at schools and workplaces. He did this by ginning up a custom ad that displayed the three best new matches for each user based on the background entered in his or her LinkedIn profile. Within days it was obvious that something remarkable was taking place. The click-through rate on those ads was the highest ever seen. Goldman continued to refine how the suggestions were generated, incorporating networking ideas such as “triangle closing”—the notion that if you know Larry and Sue, there’s a good chance that Larry and Sue know each other. Goldman and his team also got the action required to respond to a suggestion down to one click.
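As a purely illustrative sketch (not LinkedIn's actual system), the triangle-closing heuristic can be read as: rank the people you are not yet connected to by how many connections you already share. The graph and names below are invented for the example (Python):

from collections import Counter

# Hypothetical connection graph: member -> set of existing connections.
connections = {
    "ana":   {"bruno", "carla", "diego"},
    "bruno": {"ana", "carla", "elisa"},
    "carla": {"ana", "bruno", "elisa"},
    "diego": {"ana"},
    "elisa": {"bruno", "carla"},
}

def people_you_may_know(member, graph, top_n=3):
    """Rank non-connections by how many connections they share with member."""
    mutual = Counter()
    for friend in graph[member]:
        for candidate in graph.get(friend, set()):
            if candidate != member and candidate not in graph[member]:
                mutual[candidate] += 1
    return mutual.most_common(top_n)

print(people_you_may_know("ana", connections))   # [('elisa', 2)]

A production ranking would of course also weight shared schools, employers, and overlapping tenures, as the article describes.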

It didn’t take long for LinkedIn’s top managers to recognize a good idea and make it a standard feature. That’s when things really took off. “People You May Know” ads achieved a click-through rate 30% higher than the rate obtained by other prompts to visit more pages on the site. They generated millions of new page views. Thanks to this one feature, LinkedIn’s growth trajectory shifted significantly upward.

A New Breed

Goldman is a good example of a new key player in organizations: the “data scientist.” It’s a high-ranking professional with the training and curiosity to make discoveries in the world of big data. The title has been around for only a few years. (It was coined in 2008 by one of us, D.J. Patil, and Jeff Hammerbacher, then the respective leads of data and analytics efforts at LinkedIn and Facebook.) But thousands of data scientists are already working at both start-ups and well-established companies. Their sudden appearance on the business scene reflects the fact that companies are now wrestling with information that comes in varieties and volumes never encountered before. If your organization stores multiple petabytes of data, if the information most critical to your business resides in forms other than rows and columns of numbers, or if answering your biggest question would involve a “mashup” of several analytical efforts, you’ve got a big data opportunity.

Much of the current enthusiasm for big data focuses on technologies that make taming it possible, including Hadoop (the most widely used framework for distributed file system processing) and related open-source tools, cloud computing, and data visualization. While those are important breakthroughs, at least as important are the people with the skill set (and the mind-set) to put them to good use. On this front, demand has raced ahead of supply. Indeed, the shortage of data scientists is becoming a serious constraint in some sectors. Greylock Partners, an early-stage venture firm that has backed companies such as Facebook, LinkedIn, Palo Alto Networks, and Workday, is worried enough about the tight labor pool that it has built its own specialized recruiting team to channel talent to businesses in its portfolio. “Once they have data,” says Dan Portillo, who leads that team, “they really need people who can manage it and find insights in it.”
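As a hedged illustration only, the core pattern those frameworks popularized (map a computation over partitions of data, then reduce the partial results) can be sketched in a few lines of plain Python. The log lines and partitioning are invented, and a real Hadoop job would run this across a cluster rather than in a single process:

from collections import Counter
from functools import reduce

# Hypothetical log lines split across partitions, as a distributed
# file system would store them.
partitions = [
    ["login alice", "purchase alice", "login bob"],
    ["login carol", "purchase alice", "login bob"],
]

def map_partition(lines):
    """Map step: count events per user within one partition."""
    return Counter(line.split()[1] for line in lines)

def reduce_counts(a, b):
    """Reduce step: merge partial counts from two partitions."""
    return a + b

totals = reduce(reduce_counts, (map_partition(p) for p in partitions))
print(totals)   # Counter({'alice': 3, 'bob': 2, 'carol': 1})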

Who Are These People?

If capitalizing on big data depends on hiring scarce data scientists, then the challenge for managers is to learn how to identify that talent, attract it to an enterprise, and make it productive. None of those tasks is as straightforward as it is with other, established organizational roles. Start with the fact that there are no university programs offering degrees in data science. There is also little consensus on where the role fits in an organization, how data scientists can add the most value, and how their performance should be measured.

Thomas H. Davenport is a visiting professor at Harvard Business School, a senior adviser to Deloitte Analytics, and a coauthor of Judgment Calls (Harvard Business Review Press, 2012). D.J. Patil is the data scientist in residence at Greylock Partners, was formerly the head of data products at LinkedIn, and is the author of Data Jujitsu: The Art of Turning Data into Product (O’Reilly Media, 2012).



The first step in filling the need for data scientists, therefore, is to understand what they do in businesses. Then ask, What skills do they need? And what fields are those skills most readily found in?

More than anything, what data scientists do is make discoveries while swimming in data. It’s their preferred method of navigating the world around them. At ease in the digital realm, they are able to bring structure to large quantities of formless data and make analysis possible. They identify rich data sources, join them with other, potentially incomplete data sources, and clean the resulting set. In a competitive landscape where challenges keep changing and data never stop flowing, data scientists help decision makers shift from ad hoc analysis to an ongoing conversation with data.
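A minimal, hypothetical sketch of that joining-and-cleaning step, with invented records and fields: combine two incomplete sources on a shared key, fill recoverable gaps, and drop rows that cannot be used.

# Two hypothetical, incomplete sources keyed by customer id.
crm = {
    101: {"name": "Ada",  "city": "Lisboa"},
    102: {"name": "Bento"},                      # city missing
    103: {"name": "Clara", "city": "Porto"},
}
orders = {
    101: {"total": 250.0},
    103: {"total": 90.0},
    104: {"total": 40.0},                        # no CRM record at all
}

# Outer join on customer id, then clean: fill gaps, drop unusable rows.
cleaned = []
for cid in sorted(set(crm) | set(orders)):
    row = {"id": cid, **crm.get(cid, {}), **orders.get(cid, {})}
    row.setdefault("city", "unknown")            # fill a recoverable gap
    row.setdefault("total", 0.0)
    if "name" in row:                            # drop rows with no identity
        cleaned.append(row)

for row in cleaned:
    print(row)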

Data scientists realize that they face technical limitations, but they don’t allow that to bog down their search for novel solutions. As they make discoveries, they communicate what they’ve learned and suggest its implications for new business directions. Often they are creative in displaying information visually and making the patterns they find clear and compelling. They advise executives and product managers on the implications of the data for products, processes, and decisions.

Given the nascent state of their trade, it often falls to data scientists to fashion their own tools and even conduct academic-style research. Yahoo, one of the firms that employed a group of data scientists early on, was instrumental in developing Hadoop. Facebook’s data team created the language Hive for programming Hadoop projects. Many other data scientists, especially at data-driven companies such as Google, Amazon, Microsoft, Walmart, eBay, LinkedIn, and Twitter, have added to and refined the tool kit.

What kind of person does all this? What abilities make a data scientist successful? Think of him or her as a hybrid of data hacker, analyst, communicator, and trusted adviser. The combination is extremely powerful—and rare.

Data scientists’ most basic, universal skill is the ability to write code. This may be less true in five years’ time, when many more people will have the title “data scientist” on their business cards. More enduring will be the need for data scientists to communicate in language that all their stakeholders understand—and to demonstrate the special skills involved in storytelling with data, whether verbally, visually, or—ideally—both.

But we would say the dominant trait among data scientists is an intense curiosity—a desire to go beneath the surface of a problem, find the questions at its heart, and distill them into a very clear set of hypotheses that can be tested. This often entails the associative thinking that characterizes the most creative scientists in any field. For example, we know of a data scientist studying a fraud problem who realized that it was analogous to a type of DNA sequencing problem. By bringing together those disparate worlds, he and his team were able to craft a solution that dramatically reduced fraud losses.

Perhaps it’s becoming clear why the word “scientist” fits this emerging role. Experimental physicists, for example, also have to design equipment, gather data, conduct multiple experiments, and communicate their results. Thus, companies looking for people who can work with complex data have had good luck recruiting among those with educational and work backgrounds in the physical or social sciences. Some of the best and brightest data scientists are PhDs in esoteric fields like ecology and systems biology. George Roumeliotis, the head of a data science team at Intuit in Silicon Valley, holds a doctorate in astrophysics. A little less surprisingly, many of the data scientists working in business today were formally trained in computer science, math, or economics. They can emerge from any field that has a strong data and computational focus.


It’s important to keep that image of the scientist in mind—because the word “data” might easily send a search for talent down the wrong path. As Portillo told us, “The traditional backgrounds of people you saw 10 to 15 years ago just don’t cut it these days.” A quantitative analyst can be great at analyzing data but not at subduing a mass of unstructured data and getting it into a form in which it can be analyzed. A data management expert might be great at generating and organizing data in structured form but not at turning unstructured data into structured data—and also not at actually analyzing the data. And while people without strong social skills might thrive in traditional data professions, data scientists must have such skills to be effective.

How to Find the Data Scientists You Need

Roumeliotis was clear with us that he doesn’t hire on the basis of statistical or analytical capabilities. He begins his search for data scientists by asking candidates if they can develop prototypes in a mainstream programming language such as Java. Roumeliotis seeks both a skill set—a solid foundation in math, statistics, probability, and computer science—and certain habits of mind. He wants people with a feel for business issues and empathy for customers. Then, he says, he builds on all that with on-the-job training and an occasional course in a particular technology.

Several universities are planning to launch data science programs, and existing programs in analytics, such as the Master of Science in Analytics program at North Carolina State, are busy adding big data exercises and coursework. Some companies are also trying to develop their own data scientists. After acquiring the big data firm Greenplum, EMC decided that the availability of data scientists would be a gating factor in its own—and customers’—exploitation of big data. So its Education Services division launched a data science and big data analytics training and certification program. EMC makes the program available to both employees and customers, and some of its graduates are already working on internal big data initiatives.

As educational offerings proliferate, the pipeline of talent should expand. Vendors of big data technologies are also working to make them easier to use. In the meantime one data scientist has come up with a creative approach to closing the gap. The Insight Data Science Fellows Program, a postdoctoral fellowship designed by Jake Klamka (a high-energy physicist by training), takes scientists from academia and in six weeks prepares them to succeed as data scientists. The program combines mentoring by data experts from local companies (such as Facebook, Twitter, Google, and LinkedIn) with exposure to actual big data challenges. Originally aiming for 10 fellows, Klamka wound up accepting 30, from an applicant pool numbering more than 200. More organizations are now lining up to participate. “The demand from companies has been phenomenal,” Klamka told us. “They just can’t get this kind of high-quality talent.”

Why Would a Data Scientist Want to Work Here?

Even as the ranks of data scientists swell, competition for top talent will remain fierce. Expect candidates to size up employment opportunities on the basis of how interesting the big data challenges are. As one of them commented, “If we wanted to work with structured data, we’d be on Wall Street.” Given that today’s most qualified prospects come from nonbusiness backgrounds, hiring managers may need to figure out how to paint an exciting picture of the potential for breakthroughs that their problems offer.

Pay will of course be a factor. A good data scientist will have many doors open to him or her, and salaries will be bid upward. Several data scientists working at start-ups commented that they’d demanded and got large stock option packages. Even for someone accepting a position for other reasons, compensation signals a level of respect and the value the role is expected to add to the business. But our informal survey of the priorities of data scientists revealed something more fundamentally important. They want to be “on the bridge.” The reference is to the 1960s television show Star Trek, in which the starship captain James Kirk relies heavily on data supplied by Mr. Spock. Data scientists want to be in the thick of a developing situation, with real-time awareness of the evolving set of choices it presents.

Considering the difficulty of finding and keeping data scientists, one would think that a good strategy would involve hiring them as consultants. Most consulting firms have yet to assemble many of them. Even the largest firms, such as Accenture, Deloitte, and IBM Global Services, are in the early stages of leading big data projects for their clients. The skills of the data scientists they do have on staff are mainly being applied to more-conventional quantitative analysis problems. Offshore analytics services firms, such as Mu Sigma, might be the ones to make the first major inroads with data scientists.

But the data scientists we’ve spoken with say they want to build things, not just give advice to a decision maker. One described being a consultant as “the dead zone—all you get to do is tell someone else what the analyses say they should do.” By creating solutions that work, they can have more impact and leave their marks as pioneers of their profession.


Care and Feeding

Data scientists don’t do well on a short leash. They should have the freedom to experiment and explore possibilities. That said, they need close relationships with the rest of the business. The most important ties for them to forge are with executives in charge of products and services rather than with people overseeing business functions. As the story of Jonathan Goldman illustrates, their greatest opportunity to add value is not in creating reports or presentations for senior executives but in innovating with customer-facing products and processes.

LinkedIn isn’t the only company to use data scientists to generate ideas for products, features, and value-adding services. At Intuit data scientists are asked to develop insights for small-business customers and consumers and report to a new senior vice president of big data, social design, and marketing. GE is already using data science to optimize the service contracts and maintenance intervals for industrial products. Google, of course, uses data scientists to refine its core search and ad-serving algorithms. Zynga uses data scientists to optimize the game experience for both long-term engagement and revenue. Netflix created the well-known Netflix Prize, given to the data science team that developed the best way to improve the company’s movie recommendation system. The test-preparation firm Kaplan uses its data scientists to uncover effective learning strategies.

There is, however, a potential downside to having people with sophisticated skills in a fast-evolving field spend their time among general management colleagues. They’ll have less interaction with similar specialists, which they need to keep their skills sharp and their tool kit state-of-the-art. Data scientists have to connect with communities of practice, either within large firms or externally. New conferences and informal associations are springing up to support collaboration and technology sharing, and companies should encourage scientists to become involved in them with the understanding that “more water in the harbor floats all boats.”

Data scientists tend to be more motivated, too, when more is expected of them. The challenges of accessing and structuring big data sometimes leave little time or energy for sophisticated analytics involving prediction or optimization. Yet if executives make it clear that simple reports are not enough, data scientists will devote more effort to advanced analytics. Big data shouldn’t equal “small math.”

The Hot Job of the Decade

Hal Varian, the chief economist at Google, is known to have said, “The sexy job in the next 10 years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s?”

If “sexy” means having rare qualities that are much in demand, data scientists are already there. They are difficult and expensive to hire and, given the very competitive market for their services, difficult to retain. There simply aren’t a lot of people with their combination of scientific background and computational and analytical skills.

Data scientists today are akin to Wall Street “quants” of the 1980s and 1990s. In those days people with backgrounds in physics and math streamed to investment banks and hedge funds, where they could devise entirely new algorithms and data strategies. Then a variety of universities developed master’s programs in financial engineering, which churned out a second generation of talent that was more accessible to mainstream firms. The pattern was repeated later in the 1990s with search engineers, whose rarefied skills soon came to be taught in computer science programs.


One question raised by this is whether some firms would be wise to wait until that second generation of data scientists emerges, and the candidates are more numerous, less expensive, and easier to vet and assimilate in a business setting. Why not leave the trouble of hunting down and domesticating exotic talent to the big data start-ups and to firms like GE and Walmart, whose aggressive strategies require them to be at the forefront?

The problem with that reasoning is that the advance of big data shows no signs of slowing. If companies sit out this trend’s early days for lack of talent, they risk falling behind as competitors and channel partners gain nearly unassailable advantages. Think of big data as an epic wave gathering now, starting to crest. If you want to catch it, you need people who can surf.

Rage Against the Algorithms, by NICHOLAS DIAKOPOULOS


How can we know the biases of a piece of software? By reverse engineering it, of course.

 OCT 3 2013, 5:14 PM ET

When was the last time you read an online review about a local business or service on a platform like Yelp? Of course you want to make sure the local plumber you hire is honest, or that even if the date is a dud, at least the restaurant isn’t lousy. A recent survey found that 76 percent of consumers check online reviews before buying, so a lot can hinge on a good or bad review. Such sites have become so important to local businesses that it’s not uncommon for scheming owners to hire shills to boost themselves or put down their rivals.

To protect users from getting duped by fake reviews, Yelp employs an algorithmic review reviewer which constantly scans reviews and relegates suspicious ones to a “filtered reviews” page, effectively de-emphasizing them without deleting them entirely. But of course that algorithm is not perfect, and it sometimes de-emphasizes legitimate reviews and leaves actual fakes intact—oops. Some businesses have complained, alleging that the filter can incorrectly remove all of their most positive reviews, leaving them with a lowly one- or two-star average.
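Yelp's filter is proprietary, so purely as a hedged illustration of the kind of rule-based scoring such a system might start from, here is a toy classifier. Every signal and threshold below is invented; a real filter would learn its weights from labeled examples.

def looks_suspicious(review):
    """Toy heuristic: flag reviews that combine several weak signals.
    Signals and thresholds are invented for illustration only."""
    score = 0
    if review["reviewer_review_count"] < 2:      # brand-new account
        score += 1
    if review["rating"] in (1, 5):               # extreme rating
        score += 1
    if len(review["text"].split()) < 10:         # very short text
        score += 1
    return score >= 2                            # filtered, not deleted

example = {"reviewer_review_count": 1, "rating": 5, "text": "Best plumber ever!"}
print(looks_suspicious(example))  # True -> would land on the 'filtered reviews' page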

This is just one example of how algorithms are becoming ever more important in society, for everything from search engine personalization, discrimination, defamation, and censorship online, to how teachers are evaluated, how markets work, how political campaigns are run, and even how something like immigration is policed. Algorithms, driven by vast troves of data, are the new power brokers in society, both in the corporate world as well as in government.

 They have biases like the rest of us. And they make mistakes. But they’re opaque, hiding their secrets behind layers of complexity. How can we deal with the power that algorithms may exert on us? How can we better understand where they might be wronging us?

Transparency is the vogue response to this problem right now. The big “open data” transparency-in-government push that started in 2009 was largely the result of an executive memo from President Obama. And of course corporations are on board too; Google publishes a biannual transparency report showing how often they remove or disclose information to governments. Transparency is an effective tool for inculcating public trust and is even the way journalists are now trained to deal with the hole where mighty Objectivity once stood.

But transparency knows some bounds. For example, though the Freedom of Information Act facilitates the public’s right to relevant government data, it has no legal teeth for compelling the government to disclose how that data was algorithmically generated or used in publicly relevant decisions (extensions worth considering).

Moreover, corporations have self-imposed limits on how transparent they want to be, since exposing too many details of their proprietary systems may undermine a competitive advantage (trade secrets), or leave the system open to gaming and manipulation. Furthermore, whereas transparency of data can be achieved simply by publishing a spreadsheet or database, transparency of an algorithm can be much more complex, resulting in additional labor costs both in creation as well as consumption of that information—a cognitive overload that keeps all but the most determined at bay. Methods for usable transparency need to be developed so that the relevant aspects of an algorithm can be presented in an understandable way.

Given the challenges to employing transparency as a check on algorithmic power, a new and complementary alternative is emerging. I call it algorithmic accountability reporting. At its core it’s really about reverse engineering—articulating the specifications of a system through a rigorous examination drawing on domain knowledge, observation, and deduction to unearth a model of how that system works.

As interest grows in understanding the broader impacts of algorithms, this kind of accountability reporting is already happening in some newsrooms, as well as in academic circles. At the Wall Street Journal a team of reporters probed e-commerce platforms to identify instances of potential price discrimination in dynamic and personalized online pricing. By polling different websites they were able to spot several, such as Staples.com, that were adjusting prices dynamically based on the location of the person visiting the site. At the Daily Beast, reporter Michael Keller dove into the iPhone spelling correction feature to help surface patterns of censorship and see which words, like “abortion,” the phone wouldn’t correct if they were misspelled. In my own investigation for Slate, I traced the contours of the editorial criteria embedded in search engine autocomplete algorithms. By collecting hundreds of autocompletions for queries relating to sex and violence I was able to ascertain which terms Google and Bing were blocking or censoring, uncovering mistakes in how these algorithms apply their editorial criteria.

All of these stories share a more or less common method. Algorithms are essentially black boxes, exposing an input and output without betraying any of their inner organs. You can’t see what’s going on inside directly, but if you vary the inputs in enough different ways and pay close attention to the outputs, you can start piecing together some likeness for how the algorithm transforms each input into an output. The black box starts to divulge some secrets.
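A minimal sketch of that input-output probing, under the assumption that the system can be queried repeatedly: hold everything constant except one input and record how the output changes. The quote_price function below is a hypothetical stand-in for a remote pricing engine; a real investigation would replace it with instrumented web requests made from different locations.

def quote_price(product_id, zip_code):
    """Black-box stand-in for a dynamic pricing engine (hypothetical).
    Here we pretend it charges more in some regions."""
    base = 19.99
    return round(base * (1.15 if zip_code.startswith("94") else 1.0), 2)

# Probe: hold the product fixed, vary only the visitor's location.
for zip_code in ["02139", "60614", "94105", "94306"]:
    print(zip_code, quote_price("stapler-123", zip_code))
# Differing outputs for an identical product reveal location-based pricing.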

Algorithmic accountability is also gaining traction in academia. At Harvard, Latanya Sweeney has looked at how online advertisements can be biased by the racial association of names used as queries. When searching for “black names” as opposed to “white names,” ads using the word “arrest” appeared more often for the online background check service Instant Checkmate. She thinks the disparity in the use of “arrest” suggests a discriminatory connection between race and crime. Her method, as with all of the other examples above, does point to a weakness though: Is the discrimination caused by Google, by Instant Checkmate, or simply by pre-existing societal biases? We don’t know, and correlation does not equal intention. As much as algorithmic accountability can help us diagnose the existence of a problem, we have to go deeper and do more journalistic-style reporting to understand the motivations or intentions behind an algorithm. We still need to answer the question of why.

And this is why it’s absolutely essential to have computational journalists not just engaging in the reverse engineering of algorithms, but also reporting and digging deeper into the motives and design intentions behind algorithms. Sure, it can be hard to convince companies running such algorithms to open up in detail about how their algorithms work, but interviews can still uncover details about larger goals and objectives built into an algorithm, better contextualizing a reverse-engineering analysis. Transparency is still important here too, as it adds to the information that can be used to characterize the technical system.

Despite the fact that forward thinkers like Larry Lessig have been writing for some time about how code is a lever on behavior, we’re still in the early days of developing methods for holding that code and its influence accountable. “There’s no conventional or obvious approach to it. It’s a lot of testing or trial and error, and it’s hard to teach in any uniform way,” noted Jeremy Singer-Vine, a reporter and programmer who worked on the WSJ price discrimination story. It will always be a messy business with lots of room for creativity, but given the growing power that algorithms wield in society it’s vital to continue to develop, codify, and teach more formalized methods of algorithmic accountability. In the absence of new legal measures, it may just provide a novel way to shed light on such systems, particularly in cases where transparency doesn’t or can’t offer much clarity. 

NICHOLAS DIAKOPOULOS is a Tow Fellow at the Columbia University Graduate School of Journalism. 

 

REINVENTING SOCIETY IN THE WAKE OF BIG DATA, Alex Pentland

A Conversation with Alex (Sandy) Pentland [8.30.12]

With Big Data we can now begin to actually look at the details of social interaction and how those play out, and are no longer limited to averages like market indices or election results. This is an astounding change. The ability to see the details of the market, of political revolutions, and to be able to predict and control them is definitely a case of Promethean fire: it could be used for good or for ill, and so Big Data brings us to interesting times. We’re going to end up reinventing what it means to have a human society.

ALEX ‘SANDY’ PENTLAND is a pioneer in big data, computational social science, mobile and health systems, and technology for developing countries. He is one of the most-cited computer scientists in the world and was named by Forbes as one of the world’s seven most powerful data scientists. He currently directs the MIT Human Dynamics Laboratory.

Sandy Pentland’s Edge Bio Page




REINVENTING SOCIETY IN THE WAKE OF BIG DATA

[SANDY PENTLAND:] Recently I seem to have become MIT’s Big Data guy, with people like Tim O’Reilly and “Forbes” calling me one of the seven most powerful data scientists in the world. I’m not sure what all of that means, but I have a distinctive view about Big Data, so maybe it is something that people want to hear.

I believe that the power of Big Data is that it is information about people’s behavior instead of information about their beliefs. It’s about the behavior of customers, employees, and prospects for your new business. It’s not about the things you post on Facebook, and it’s not about your searches on Google, which is what most people think about, and it’s not data from internal company processes and RFIDs. This sort of Big Data comes from things like location data off of your cell phone or credit card, it’s the little data breadcrumbs that you leave behind you as you move around in the world.

What those breadcrumbs tell is the story of your life. It tells what you’ve chosen to do. That’s very different than what you put on Facebook. What you put on Facebook is what you would like to tell people, edited according to the standards of the day. Who you actually are is determined by where you spend time, and which things you buy. Big data is increasingly about real behavior, and by analyzing this sort of data, scientists can tell an enormous amount about you. They can tell whether you are the sort of person who will pay back loans. They can tell you if you’re likely to get diabetes.

They can do this because the sort of person you are is largely determined by your social context, so if I can see some of your behaviors, I can infer the rest, just by comparing you to the people in your crowd. You can tell all sorts of things about a person, even though it’s not explicitly in the data, because people are so enmeshed in the surrounding social fabric that it determines the sorts of things that they think are normal, and what behaviors they will learn from each other.
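A hedged, toy version of that inference-by-similarity idea: given a few observed behaviors, estimate an unobserved attribute from the most behaviorally similar people. The features, people, and outcome below are invented; real systems use far richer models.

# Hypothetical behaviour features per person: (late-night activity, travel, cafe visits)
observed = {
    "p1": ((0.9, 0.1, 0.7), 1),   # behaviours, known outcome (e.g. repaid a loan)
    "p2": ((0.8, 0.2, 0.6), 1),
    "p3": ((0.1, 0.9, 0.2), 0),
    "p4": ((0.2, 0.8, 0.1), 0),
}

def predict(behaviours, known, k=2):
    """Average the outcome of the k most behaviourally similar people."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(known.values(), key=lambda item: dist(item[0], behaviours))[:k]
    return sum(outcome for _, outcome in nearest) / k

print(predict((0.85, 0.15, 0.65), observed))   # close to p1/p2, so near 1.0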

As a consequence analysis of Big Data is increasingly about finding connections, connections with the people around you, and connections between people’s behavior and outcomes. You can see this in all sorts of places. For instance, one type of Big Data and connection analysis concerns financial data. Not just the flash crash or the Great Recession, but also all the other sorts of bubbles that occur. What these are is systems of people, communications, and decisions that go badly awry. Big Data shows us the connections that cause these events. Big Data gives us the possibility of understanding how these systems of people and machines work, and whether they’re stable.

The notion that it is connections between people that is really important is key, because researchers have mostly been trying to understand things like financial bubbles using what is called Complexity Science or Web Science. But these older ways of thinking about Big Data leave the humans out of the equation. What actually matters is how the people are connected together by the machines and how, as a whole, they create a financial market, a government, a company, and other social structures.

Because it is so important to understand these connections Asu Ozdaglar and I have recently created the MIT Center for Connection Science and Engineering, which spans all of the different MIT departments and schools. It’s one of the very first MIT-wide Centers, because people from all sorts of specialties are coming to understand that it is the connections between people that is actually the core problem in making transportation systems work well, in making energy grids work efficiently, and in making financial systems stable. Markets are not just about rules or algorithms; they’re about people and algorithms together.

Understanding these human-machine systems is what’s going to make our future social systems stable and safe. We are getting beyond complexity, data science and web science, because we are including people as a key part of these systems. That’s the promise of Big Data, to really understand the systems that make our technological society. As you begin to understand them, then you can build systems that are better. The promise is for financial systems that don’t melt down, governments that don’t get mired in inaction, health systems that actually work, and so on, and so forth.

The barriers to better societal systems are not about the size or speed of data. They’re not about most of the things that people are focusing on when they talk about Big Data. Instead, the challenge is to figure out how to analyze the connections in this deluge of data and come to a new way of building systems based on understanding these connections.

Changing The Way We Design Systems

With Big Data traditional methods of system building are of limited use. The data is so big that any question you ask about it will usually have a statistically significant answer. This means, strangely, that the scientific method as we normally use it no longer works, because almost everything is significant!  As a consequence the normal laboratory-based question-and-answering process, the method that we have used to build systems for centuries, begins to fall apart.
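As a rough arithmetic illustration of that point, using nothing but the standard z-test formula: take two groups whose means differ by a practically meaningless 0.01 standard deviations and watch the difference become "statistically significant" once the samples are large enough.

from math import sqrt

# Two populations that differ by a negligible 0.01 standard deviations.
# With enough samples per group the difference still becomes
# "statistically significant" (|z| > 1.96 at the 5% level).
effect = 0.01
for n in (1_000, 10_000, 100_000, 1_000_000):
    z = effect / sqrt(2 / n)   # z-statistic for a difference of means, sd = 1
    print(f"n per group = {n:>9,}  z = {z:5.2f}  significant = {abs(z) > 1.96}")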

Big Data and the notion of Connection Science are outside of our normal way of managing things. We live in an era that builds on centuries of science, and our methods of building systems, governments, organizations, and so on are pretty well defined. There are not a lot of things that are really novel. But with the coming of Big Data, we are going to be operating very much out of our old, familiar ballpark.

With Big Data you can easily get false correlations, for instance, “On Mondays, people who drive to work are more likely to get the flu.” If you look at the data using traditional methods, that may actually be true, but the problem is why is it true? Is it causal? Is it just an accident? You don’t know. Normal analysis methods won’t suffice to answer those questions. What we have to come up with is new ways to test the causality of connections in the real world far more than we have ever had to do before. We can no longer rely on laboratory experiments; we need to actually do the experiments in the real world.

The other problem with Big Data is human understanding. When you find a connection that works, you’d like to be able to use it to build new systems, and that requires having human understanding of the connection. The managers and the owners have to understand what this new connection means. There needs to be a dialogue between our human intuition and the Big Data statistics, and that’s not something that’s built into most of our management systems today. Our managers have little concept of how to use big data analytics, what they mean, and what to believe.

In fact, the data scientists themselves don’t have much intuition either…and that is a problem. I saw an estimate recently that said 70 to 80 percent of the results that are found in the machine learning literature, which is a key Big Data scientific field, are probably wrong because the researchers didn’t understand that they were overfitting the data. They didn’t have that dialogue between intuition and the causal processes that generated the data. They just fit the model and got a good number and published it, and the reviewers didn’t catch it either. That’s pretty bad because if we start building our world on results like that, we’re going to end up with trains that crash into walls and other bad things. Management using Big Data is actually a radically new thing.
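A minimal sketch of that overfitting trap, on invented data that is pure noise by construction: choose, out of many random candidate "features", the one that best matches the training target, and it will look impressive in-sample yet evaporate on held-out data.

import random

random.seed(0)

def corr(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Target and candidate "features" are pure noise: there is nothing to find.
n_train, n_test, n_features = 50, 50, 200
target_train = [random.gauss(0, 1) for _ in range(n_train)]
target_test = [random.gauss(0, 1) for _ in range(n_test)]
features = [([random.gauss(0, 1) for _ in range(n_train)],
             [random.gauss(0, 1) for _ in range(n_test)]) for _ in range(n_features)]

# "Overfit": pick the feature that correlates best with the training target.
best_train, best_test = max(features, key=lambda f: abs(corr(f[0], target_train)))

print("correlation on training data:", round(corr(best_train, target_train), 2))
print("correlation on held-out data:", round(corr(best_test, target_test), 2))
# The training correlation looks impressive; the held-out one is near zero.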

This last year at Davos I ran several sessions around Big Data with the CEOs of leading companies in this area, and it was very clear that there’s a whole new way of doing things that’s just now developing. Some of them, like Palantir and TIBCO, are making progress at this, but to most of the people in the room this was brand new, and they had not gotten up to speed about it at all.

Another important issue with Big Data is that since this data is mostly about people, there are enormous issues about privacy, data ownership, and data control. You can imagine using Big Data to make a world that is incredibly invasive, incredibly ‘Big Brother’… George Orwell was not nearly creative enough when he wrote 1984.

For the last several years I’ve been helping to run sessions at the World Economic Forum around sourcing personal data and ownership of the data, and that’s ended pretty successfully with what I call the New Deal on Data. The Chairman of the Federal Trade Commission, who’s been part of the group, put forward the U.S. “Consumer Data Bill of Rights,” and in the EU, the Justice Commissioner declared a version of this New Deal to be a basic human right.

Both of these regulatory declarations put the individual much more in charge of data that’s about them. This is a major step to making Big Data safer and more transparent, as well as more liquid and available, because people can now choose to share data. It is a vast improvement over having the data being locked away in industry silos where nobody even knows it’s there.

Adam Smith And Karl Marx Were Wrong

These Big Data issues are important, but there are bigger things afoot. As you move into a society driven by Big Data most of the ways we think about the world change in a rather dramatic way. For instance, Adam Smith and Karl Marx were wrong, or at least had only half the answers. Why? Because they talked about markets and classes, but those are aggregates. They’re averages.

While it may be useful to reason about the averages, social phenomena are really made up of millions of small transactions between individuals. There are patterns in those individual transactions that are not just averages, they’re the things that are responsible for the flash crash and the Arab spring. You need to get down into these new patterns, these micro-patterns, because they don’t just average out to the classical way of understanding society. We’re entering a new era of social physics, where it’s the details of all the particles—the you and me—that actually determine the outcome. 

Reasoning about markets and classes may get you half of the way there, but it’s this new capability of looking at the details, which is only possible through Big Data, that will give us the other 50 percent of the story. We can potentially design companies, organizations, and societies that are more fair, stable and efficient as we get to really understand human physics at this fine-grain scale. This new computational social science offers incredible possibilities.

This is the first time in human history that we have the ability to see enough about ourselves that we can hope to actually build social systems that work qualitatively better than the systems we’ve always had. That’s a remarkable change. It’s like the phase transition that happened when writing was developed or when education became ubiquitous, or perhaps when people began being tied together via the Internet.

The fact that we can now begin to actually look at the dynamics of social interactions and how they play out, and are not just limited to reasoning about averages like market indices is for me simply astonishing. To be able to see the details of variations in the market and the beginnings of political revolutions, to predict them, and even control them, is definitely a case of Promethean fire. Big Data can be used for good or bad, but either way it brings us to interesting times. We’re going to reinvent what it means to have a human society.

Creating A Data-Driven Society

One of the great questions is: who is this new Data Driven world going to be for and what is it going to look like? People ask if this is just for the Davos attendees or for everybody. That’s a question of values and ethics, and that’s why people have to be debating this now, and why I’m talking about this—to start the conversation. But I will say that all the conversations I’ve been at in Davos have had an extremely strong egalitarian element. Most people are advocates for the poor. Many are people from developing countries—an enormous number, not just a token scattering. There’s a real focus on building a sustainable future, which means one in which there aren’t large chunks of the population left out in the cold. Obviously not everybody is 100 percent devoted to that agenda, but most are.

A key insight is that your data is worth more if you share it because it enables systems like public health. Data about the way you behave and where you go can be used to stop the spread of infectious disease. If you have children, you don’t want to see them die of an H1N1 pandemic. How are you going to stop that? Well, it turns out that if you can actually watch people’s behavior in real time…something that is quite possible today…you can tell when each individual person is getting sick. This means you can actually see the spread of influenza from person to person on an individual level. And if you can see it, you can stop it. You can begin to build a world where infectious pandemics cease to be as much of a threat.

Similarly, if you’re worried about global warming, we now know how patterns of mobility relate to productivity (and I just showed some examples of those—we are doing a lot of really amazing science around this). This means you can design cities that are far more efficient, far more human, and burn an awful lot less energy. But you need to be able to see the people moving around in order to be able to get these results. That’s another instance where sharing your data is invaluable to you personally. It’s everybody contributing his or her data that’s going to make a greener world, and that is worth far more than the simple cash value of the data.

However today the data is siloed off and unavailable, and that was one of the core reasons I proposed the New Deal on Data to the World Economic Forum. Since then the idea has run through various discussions and turned into the Consumer Data Bill of Rights in the United States, and the declaration on Data Rights in the EU. The core idea is that when data is in silos you can’t make use of it either for evil or for the public good, and we need the public good. We need to stop pandemics. We need to make a greener world. We need to make a fairer world.

Who Owns The Data In A Data-Driven Society?

How do you get the data out of those silos? The first step is you have to figure out who owns that data. Does the telephone company own it, just because it happened to be collected while you were walking around with your phone? Maybe they have some right to use it. But what has come out of the discussions among all the participants, including the telephone companies, is that you’re the only one who has final disposal of it. They would have the ability to keep copies to offer services that you’ve requested, but you, the individual, have to have the final say.

Some situations are, of course, more complex. What about if the data is a transaction with a merchant? Well, they have a right to the data too. But by assigning rights of ownership to people (which is not exactly the same as legal ownership) what you do is you make it possible to break data out of the silos. You’ve turned it into a personal asset that can then be shared for value in return. You can make it a liquid asset that can be used to build government systems, social systems, or for-profit systems. That’s the world we’re moving towards.

Is there opposition to this? Surprisingly little. The incumbents in the Internet are probably the major opposition because (and I don’t mean to pick on them) Facebook and Google grew up in a completely unregulated environment. It is natural for them to think that they have control over the data, but now they’re slowly, slowly coming around to the idea that they’re going to have to compromise on that.

However the people who have the most valuable data are the banks, the telephone companies, the medical companies, and they’re very highly regulated industries. As a consequence they can’t really leverage that data the way they’d like to unless they get buy-in from both the consumer and the regulators. The deal that they’ve been willing to cut is that they will give consumers control over their data in return for being able to make them offers about using their data.

That gets these companies out of the regulator’s pocket. It gives them a white hat, because they explicitly asked you if you wanted to opt in, and it lets them make money, which is what they desperately want. And it appears that if you treat people’s data in this sort of responsible manner, people will willingly share their data. It is a win-win-win solution to the privacy problem, and it’s the companies that grew up in an unregulated environment, or the companies that are in gray markets that are likely to dry up, that are most strongly opposed.

We are beginning to see services that leverage personal data in this sort of respectful manner. Services such as really personal recommendations, identity certification without passwords, and personal public services for transportation, health, and so forth. All these areas are undergoing tectonic changes, and the more that we can use specific data about specific people, the better we can make the system work.

These dramatic improvements in societies’ systems go back to what I was saying earlier. Today societies’ systems are built on big averages and indices, e.g., this class of people do this and this market’s moving that way. But really, it’s all made up of millions and millions of small interactions, and with Big Data we can get down and design things that really work for us on a personal level, rather than just being treated as another type A4 consumer.

Organizations With Hard Information Boundaries Will Tend To Dissolve

I got to these issues through a long and varied history. I started off doing a lot of signal processing and machine vision. I have a background in psychology as well, and am concerned with how data and people come together in social systems. For instance, we developed some of the first wearable computing devices. The Google Glass project comes out of my group…the guys that are building it are my former students. But as a result of these sorts of projects it became obvious to me that the most important thing was not the user interface or the device, it was the data about people. Later, as cell phones became more ubiquitous, it was clear that they were going to be the biggest source of data in the world.

If you could see everybody in the world all the time, where they were, what they were doing, who they spent time with, then you could create an entirely different world. You could engineer transportation, energy, and health systems that would be dramatically better. It’s this history of thinking about signals and people together, and how people work via these computer systems, and what data about human behavior can do, that led me to the realization that we’re at a phase transition. We are moving from the reasoning of the enlightenment about classes and about markets to fine grain understanding of individual interactions and systems built on fine grain data sharing.

This new world could make George Orwell look like an unimaginative third stringer. It became really clear you had to think hard about the privacy and data ownership issues. What George Orwell didn’t realize is that if you can watch the patterns of people interacting, then you can figure out things like who they’re going to vote for and how they’re going to react to various situations like changes of regulation, and so forth. You could build something that, to a first approximation, would be the real evil empire. And, of course, some people are going to try and do that.

At the same time, there are some elements of this new data driven world that are really promising. For instance, the most efficient and robust architectures tend to be ones that have no central points. It means that there’s no single place for a dictator to grab control. They have to actually go to every house to really control the data. In addition, I see government policies going in the right directions, to minimize these sorts of dangers.

Also, inherent in a society built on data sharing is a certain level of transparency and choice for individuals that I believe will tend to work against central control. It tends to dissolve the power of the state and big organizations because you can build things that are far more efficient and robust if they’re distributed and without the hard information boundaries that you see today.

That means that the service-oriented government, as it were, or the service-oriented organization will tend to have better offerings for a lower price, as opposed to the ones that try to own the customer or control the citizen. As a consequence I expect to see that organizations with hard information boundaries will tend to dissolve, because there will be competition from things that are better that don’t have the hard boundaries and don’t try to own your data.

 
 