Toxic In Large Quantities: Personal Information in the Information Age

by Alex Feinman

You, the people, are lagging behind in an important technology area: the ability to make use of your own personal information.

For at least ten years, and probably twice that for the smarter companies -- and at least thirty for the insightful credit card industry -- companies that track their consumers have held a competitive advantage over those that don't. They also have an advantage over you. They know you better than you know yourself, and use that information to their profit.

I mean this literally. Do you know how much you spent on gas last year? Some of you may, if you keep accurate records. But all of your credit card companies do. Do you know which pages you looked at the most on Amazon last year? Do you have a good feeling about what movie you would like to see tonight? Would you be more likely to buy two 3-liter soda bottles or three 2-liter ones if you were throwing a big party? What discount would make you buy generic soda over name-brand? Who is the person you IM with the most? Who is the person who sends you the second-most email? How long on average does it take you to respond to email, and who is the person you respond to the most quickly? Which of your friends has the best credit rating?

These are all answers that other people have that you probably don't. This is the curse, and the blessing, of the information age. For the first time in history, other people know more about you than you yourself do.

To illustrate this point, refer to the two halves of Figure 1, which show how much info is known about you by various parties. In past days, you knew yourself pretty well; your friends and acquaintances might know as much or less; and strangers knew next to nothing about you. This arrangement is intuitive and feels natural. In the modern era, however, this situation is altered -- you know the same about yourself as ever, your friends likewise, but complete strangers know you better than you know yourself.

Figure 1: Strangers know you better than you know yourself.

This trend toward greater information sharing and processing is unlikely to reverse. Two reductiones ad absurdum exist: the panopticon, where everyone can find out everything about everyone else at any time; and the info prison, where we spend 100% of our time safeguarding information and sending out info 'chaff' to confuse the pattern recognizers. Neither is likely to occur. The course we're on now follows down the middle, where information availability coupled with information processing technology availability creates an ever-increasing imbalance of power between the analyzers (Citibank, Google, the NSA, and your stalker) and the analyzees (you), with information really only flowing one-way. While it costs you nothing to have your information collected, it puts you at a competitive disadvantage. As anyone who has been the target of successfully targeted advertising knows, yielding information superiority to anyone is a recipe for being disadvantaged; your superiors are one step ahead of you, able to affect your decision-making and guide your behavior. For "which fast-food place should I eat at?" this is annoying but not terrible; for "which country should we be at war with?" this is world-changing.

How It Works

The first step is collection. Everyone else in the universe is better than you at data collection. Credit card companies maintain records of your transactions and routinely attempt to profile your actions, for security, credit risk assessment, and targeted marketing. Many retailers maintain a database with your past purchase history in it, and will classify you into a stereotype to send you the catalogue with the best chance of a sale. Online, collection gets even easier and more prevalent: websites you visit track what you click on, when, and so forth. Same thing for email -- if you use Google Mail, then Google knows best whom you send email to; Yahoo Mail gives Yahoo! access to all your correspondence. Finally, the network traffic itself is interceptable and monitorable. An ongoing lawsuit by the EFF alleges that a government agency, the NSA, has installed machinery at crucial points in the internet infrastructure to examine and process internet traffic [1].

Once collected, the information needs to be aggregated with comparable information from other people, and refined. Companies with a business model that draws in large numbers of users, therefore, have a competitive advantage here, benefiting large businesses and monopolies, and impairing small businesses and individuals.

The next step in mining for information is inferencing. Suppose you have recorded all of your actions, entered relevant personal information, and collected all your old emails going back twenty years. (I have fifteen years collected, personally.) What the heck do you do with all this information? You need clever inferencing algorithms to sort through the heaps of information and come up with conclusions. Companies have a profit motive for pursuing these algorithms secretly, and they're getting much better at it.
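To give a flavor of how modest the first step can be, here is a minimal Python sketch that answers one of the questions from earlier: who sends you the second-most email? It is nowhere near the clever algorithms the companies are building; it just counts. The sample messages and the function name rank_senders are made up for illustration, and a real archive would be read from disk rather than typed in.

```python
from collections import Counter
from email import message_from_string
from email.utils import parseaddr

# Made-up sample archive; a real one would be loaded from your mail folders.
RAW_MESSAGES = [
    "From: Ann <ann@example.com>\nSubject: lunch\n\nNoon?",
    "From: Bob <bob@example.com>\nSubject: car\n\nFixed.",
    "From: Ann <ann@example.com>\nSubject: re: lunch\n\nOK.",
    "From: ann@example.com\nSubject: photos\n\nAttached.",
    "From: Bob <bob@example.com>\nSubject: re: car\n\nThanks.",
    "From: Cy <cy@example.com>\nSubject: hi\n\nHello.",
]

def rank_senders(raw_messages):
    """Count messages per sender address, most frequent first."""
    counts = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        # parseaddr splits "Ann <ann@example.com>" into (name, address).
        _, address = parseaddr(msg.get("From", ""))
        if address:
            counts[address.lower()] += 1
    return counts.most_common()

ranking = rank_senders(RAW_MESSAGES)
print("Most email from:", ranking[0][0])
print("Second-most email from:", ranking[1][0])
```

Counting is the easy part. The hard part, and the part the companies are paying for, is deciding what the counts mean.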

This process is perhaps most visible at Amazon, which does a creditable job of explaining to you why and how they act on information that they collect or that you give them. Other sites are more subtle; Google aggregates what it finds in your search behavior, email messages, and chat conversations to present relevant ads and to improve search quality, leading to a system that learns who you are and what you are likely to click on. They spend more than you ever will on technology to track whom you talk with and what you talk about, with the aim of placing more and more intelligent ads beside your email. Have you noticed an uncanny accuracy in those ads of late? They're watching what you do, and creating very clever algorithms to figure out what it all means. AdWords provides the profit motive to drive this research -- the algorithms look for patterns of related things, and then see if anyone is offering to sell something along those lines. Find "Saints" and "Sunday" and, depending on whether "church" or "arena" is present, deliver links on prayer books or play books. There's money to be made here for well-targeted advertisements, and so research proceeds.
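As a toy illustration of the "Saints" and "Sunday" example, the matching could be sketched in Python as follows. This is emphatically not how Google does it; the rule table AD_RULES, the function pick_ads, and the ad categories are all invented for the sketch, and a real system would learn its patterns rather than have them typed in by hand.

```python
import re

# Hypothetical hand-written rules:
# (keywords that must all appear, disambiguating keyword, ad category).
AD_RULES = [
    ({"saints", "sunday"}, "church", "prayer books"),
    ({"saints", "sunday"}, "arena", "play books"),
]

def pick_ads(message):
    """Return the ad categories whose keywords all appear in the message."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    ads = []
    for required, context, category in AD_RULES:
        # The rule fires only if every required word and the context word occur.
        if required <= words and context in words:
            ads.append(category)
    return ads

print(pick_ads("See you Sunday at the arena, the Saints are playing!"))
print(pick_ads("We read about the saints at church on Sunday."))
print(pick_ads("Brunch on Sunday?"))
```

Even this crude version shows why the context word matters: the same two keywords lead to entirely different ads depending on what else is in the message.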

Toxic When Concentrated

An unintended side effect of all this collection and inferencing is the warehousing of refined information. Information is a lot like toxic waste. Toxic waste can be something that is innocuous in ambient quantities -- let's say benzene, which occurs naturally in small quantities, or U-238, which is usually mixed in with other stuff -- but has been collected for industrial purposes. The concentration makes it very dangerous, and very hard to get rid of. It's hard to store, it's hard to transport, and it's dangerous to be around.

Spread out a lot and dilute, information is relatively harmless. The fact I spent $5.50 on lunch yesterday is moderately interesting, but not significant. However, when you concentrate lots of information in one spot, or reduce it through algorithms into concentrated information, it becomes VERY dangerous. As an example, think of the recent laptop thefts with millions of bits of sensitive info on them. You may have been one of the lucky millions to receive a form letter as a result of intrusion into TJX computers resulting in the theft of purchasing information. These incidents will grow more and more frequent as long as the value of the concentrated information exceeds the cost to steal it.

As a result, any concentration of information has to be treated as if it were toxic waste -- it needs to be carefully protected, it needs contingency plans in case of leaks or other disasters, and it must be handled by trusted operators following fault-tolerant protocols. Collections of concentrated information are valuable and dangerous, and must be treated carefully or disposed of properly. Personal info is a lot like toxic waste, except it's reasonably easy to get rid of (though not trivial; hitting 'delete' will only stop a novice retriever). But it's treated the same as toxic waste was a hundred years ago, plopped on open servers, or left on drives when they're end-of-lifed, or allowed to wander around on laptops with zero to minimal security. Once it occurs, an info leak is very costly to recover from -- it may be impossible, or it may take many years to really ameliorate the impact of information released into the wild. The rise in identity theft has taught thousands the harsh lesson of how difficult it is to recover from such a theft.

We need rules about how information is aggregated and used, and what must happen to it when it reaches a certain level of concentration; but it's very hard to quantify this. Government secrets are protected through elaborate rituals, usually the moral equivalent of stuffing the information in a black box and taping the lid closed. Information is Secret until it goes through a cleaning process (which, yes, can involve a human with a black pen) to approve it for general release. But that doesn't work for aggregation of innocuous information. In the course of our daily lives we give away innocuous information all day -- one of our most dangerous secrets, our credit card number, is sent over an unsecured, unencrypted telephone line every time we use it in a store, never mind passing through at least one uncontrolled piece of electronics (the reader) on the way.

There's just no way to avoid it. Purchasing patterns are necessarily the business of at least the buyer and the merchant, and possibly countless middlemen on the way. But selling that information is also a profitable business, furthering the spread of personal information -- in recent years, charities have made so much money selling their contact lists to less reputable companies that the allure of selling this refined information is nearly inescapable. Minnesota Public Radio was sued over its potentially misleading practices in sharing its donor-member list with other markets -- such as the Democratic National Committee -- resulting in new requirements for revealing how personal information will be shared with other businesses. Unfortunately, this sharing benefits industry far more than it benefits you -- and while new regulations restrict how personal information can be shared and disclosed, the simple fact is that law cannot eliminate its capture, sharing, and processing.

Addressing the Imbalance

At this point, the genie is out of the bottle: info collectors have the ability to know a lot about you, and use that information to their profit. It is extremely unlikely that this profitable capability will go away any time soon, either through technological privacy guards, or legal restrictions. It is difficult to explain to the average voter how valuable their personal information is, or why getting more accurate advertisements and better deals on the products they desire could be a bad thing, and without that sort of support it is impossible to build legislation strong enough to prevent information theft.

One way to ameliorate the situation is to reduce the gap between what they know and what you know. There are two approaches: the first is to increase what you know about others, and the second is to increase what you know about yourself. The first is a difficult and expensive proposition, and while important, does not really address imbalances of power: while it is important to know who you are dealing with, and to hold companies accountable for undesirable behavior, in the end it is their business to know something about you, while no one is paying you to check up on them. The second, enabling you to understand as much about yourself as strangers do, is a rational goal to aim for. To address the imbalance in information availability we can attempt to achieve some sort of symmetry in data mining by democratizing the process.

This bottom-up refining of information is already underway: the internet, while providing an easy way to observe behavior and gather information, also makes it easy to build and share thousands of refined information sources. Giving people the ability to share content, and then to refine that content, is the driving force behind many "Web 2.0" applications. Wikipedia allows successive generations of editors to refine raw content into useful information, makes cross-referencing straightforward, and allows easy access for all. Car forums give owners the chance to share experiences and data points, refine them into FAQs and archived discussions, and share this information to the benefit of all users. Health community sites let patients compare symptoms, care options, and drug interactions, increasing the chances of catching drug prescription errors and missed diagnoses, and promoting self-awareness.

In contrast, in the personal information market -- purchasing patterns, searching history, and so forth -- there haven't been any useful applications to date. Gathering your own personal information shouldn't be hard, but it is. For example, your credit purchases could be available to you in a machine-readable format, so that you could analyze them yourself, or aggregate them with others to (for example) compare your spending with that of your peers. But getting credit card companies to provide this information can be problematic. Visa knows well how valuable that information is, and will claim it's some sort of hardship for them to provide machine-readable information to you in a timely, zero-cost fashion. It's not technically hard -- that information is stored in their machines in a useful format. It just has an opportunity cost to share, one which they will not pay without recompense. Shared sources of personal information are few and far between, maintaining the advantage these companies have achieved by recording and analyzing your behavior.
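To show how little it would take on your end, here is a minimal Python sketch of what you could do if a statement did arrive in machine-readable form. Everything specific in it is an assumption: the CSV column names, the sample transactions, the peer averages, and the function names are invented for illustration, and no card issuer is known to export exactly this format.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal

# Made-up statement export; the column layout is an assumption.
STATEMENT_CSV = """date,merchant,category,amount
2006-03-02,Shell,gas,31.50
2006-03-09,Stop & Shop,groceries,84.20
2006-03-15,Shell,gas,28.75
2006-04-01,Amazon,books,19.99
"""

# Made-up peer averages, as might come from pooling statements with friends.
PEER_AVERAGES = {"gas": Decimal("50.00"), "groceries": Decimal("90.00")}

def spending_by_category(csv_text):
    """Total the amount column for each category in the statement."""
    totals = defaultdict(Decimal)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["category"]] += Decimal(row["amount"])
    return dict(totals)

def compare_with_peers(mine, peer_averages):
    """Return my total minus the peer average, per category.

    Categories with no peer data are compared against zero.
    """
    return {cat: total - peer_averages.get(cat, Decimal("0"))
            for cat, total in mine.items()}

totals = spending_by_category(STATEMENT_CSV)
for category, diff in compare_with_peers(totals, PEER_AVERAGES).items():
    print(f"{category}: {totals[category]} spent, {diff:+} versus peers")
```

The sketch uses Decimal rather than floating point so that dollar amounts add up exactly. The analysis itself is a dozen lines; the missing ingredient is the data.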

We can only hope that in the future we will develop applications that give us more ability to capture our own personal information, share it with others, and build up an understanding of our behavior.

Addendum: During the writing of this article, a new bill was introduced in the Senate which seeks to mandate reporting of Federal agencies' use of data mining for identifying terrorist or criminal activity [2]. Congressional literature being notoriously difficult to thoroughly assess, I will not engage in an analysis of the bill here, but it is a good indication that Congress is considering the issue of data mining and its potential implications for privacy.

Notes

[1] Bamford, J. (2006) "Big Brother Is Listening", The Atlantic Monthly, April 2006. Accessed at: http://www.theatlantic.com/doc/200604/nsa-surveillance

[2] "EFF's Class-Action Lawsuit Against AT&T for Collaboration with Illegal Domestic Spying Program," Electronic Frontier Foundation. Accessed on 9 Feb 2007 at http://www.eff.org/legal/cases/att/