
Managing a data intensive case: How to find a needle without drowning in a haystack

Vol. 72, No. 3 / May-June 2016

Joel Henry and Michael Pasque1

“The fact is, we’re looking for a very small number of very evil needles in a very large haystack ...”

Charles Clarke, member of British Parliament, and Home Secretary (2004 – 2006)

The practice of law continues to change, albeit slowly. Nowhere does this change impact the legal field more profoundly than in the volume of data created by clients, lawyers, courts, and the public. Society has gone well beyond being “data rich” to being “data swamped.” Computers, phones, cars, even household appliances, generate enormous amounts of data every day. The practice of law requires attorneys to account for much of this data. They must understand not only the data they handle, but also their client’s data. Failure to maintain this understanding can result in state or federal penalties, adverse court decisions, sanctions to both attorney and client, or potentially even disbarment. Yet even the tech-aware attorney can become overwhelmed.

A simple discovery request might include a benign-looking and well-intended request such as: “All emails sent, received, or otherwise exchanged by Acme Corporation employees relating to Joe Smith.” However, when a lawyer sends this request on to a client, or the technician who handles their technology, such a request becomes a nightmare. Each employee sends or receives an average of 125 emails per day. If Joe Smith’s controversy spanned six months and involved ten Acme employees, the volume of email to search would be in the neighborhood of 225,000. This fails to count any email attachments, which also need to be examined.
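The estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch; the 180-day figure is an assumption standing in for "six months" (calendar days), and the function name is made up for illustration.

```python
# Figures from the example in the text. DAYS = 180 is an assumed
# approximation of "six months" counted in calendar days.
EMAILS_PER_EMPLOYEE_PER_DAY = 125
EMPLOYEES = 10
DAYS = 180


def estimated_email_volume(per_day: int, employees: int, days: int) -> int:
    """Rough count of emails to search, ignoring attachments."""
    return per_day * employees * days


total = estimated_email_volume(EMAILS_PER_EMPLOYEE_PER_DAY, EMPLOYEES, DAYS)
print(f"{total:,} emails to search")
```

Doubling either the number of employees or the time span doubles the total, which is why even a modest dispute can produce a very large review set.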

Sheer volume alone hides some nasty challenges. For example, some of those emails may have been archived (stored on backup disks), or come from another email account an employee used such as a web-based email provider. Extracting email from archives and collecting from external email servers will keep your poor technician busy for days on end. And the job isn’t over once emails have been collected. Now they must be reviewed to prevent disclosure of privileged, or otherwise confidential, information (e.g., Social Security numbers, health data, and human resource issues).

Manually reviewing such a mountain of email can’t be performed efficiently, especially when reviewers spend a majority of their time sifting through email that has nothing to do with Mr. Smith. Traditionally, 90 percent of the email collected for review has nothing at all to do with him. One could simply perform a search on the email with the words “Joe” and “Smith,” which would likely reduce that 225,000 email mountain significantly. However, keyword searches, which utilize single words or even combinations of words, both over-collect emails and miss many others. For example, if one employee had a husband named Joe, such a search would return all those emails. Alternatively, if employees referred to Joe Smith as JS, such a search would find none of those emails.
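Both failure modes of keyword searching can be seen in a small sketch. The four messages below are invented for illustration, and the search is assumed to match either word (an "or" search), which is what lets the husband's emails through.

```python
import re

# Hypothetical sample messages, invented purely for illustration.
emails = [
    "Joe Smith missed the safety briefing again.",     # relevant
    "My husband Joe is picking up the kids tonight.",  # not relevant
    "Please loop in JS before the vendor call.",       # relevant, initials only
    "Quarterly budget numbers attached.",              # not relevant
]


def keyword_hits(messages, keywords):
    """Return messages containing any keyword as a whole word, ignoring case."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return [m for m in messages if pattern.search(m)]


hits = keyword_hits(emails, ["Joe", "Smith"])
for message in hits:
    print(message)
```

The search returns the first two messages: it over-collects the email about the husband and misses the email that refers to Joe Smith only as JS.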

Technology provides the ability to continually create and store data that can be revised and shared at a moment’s notice. However, the ability to save all that data and then find something in it becomes a challenge. In the days of paper records, a law office had no choice but to manually review every piece of paper related to the case at hand. With the technology of today, we can now search these documents faster and more accurately. Using technology to assist document review allows us to reduce one of the most fallible links in the discovery chain: humans. Review that requires humans to spend days reading documents results in errors – lots of them. There is no doubt that machines will make errors too, but human review guided by smart technology reduces the probability of missing otherwise potentially relevant or privileged documents.

Legal Framework

The duty to preserve suffers the most profound impact of this data deluge. Just as an attorney must preserve a vehicle involved in an accident, client and other data must be preserved in the electronic data world. Unfortunately many lawyers struggle to understand technology and therefore see data preservation as a form letter to a client or a technologist. This is not the case, as data preservation requires a hold letter that specifies what to preserve.

A legal hold letter may result from current or reasonably anticipated litigation, audit, government investigation, or other such matter that suspends the normal disposition or processing of data. Legal holds may encompass procedures affecting email, document storage, database records, social media content, and even text messages. Accessibility of this data may be reasonable, or not reasonable at all – try retrieving your text messages from six months ago.

The obligation to preserve evidence arises when a party has notice that the evidence is relevant to litigation or when a party should have known that the evidence may be relevant to future litigation. Identifying the boundaries of the duty to preserve involves two related inquiries: When does the duty to preserve attach and what evidence must be preserved?2 It is not enough to wait until litigation has commenced to start preserving data; preservation begins much sooner than that. This becomes very important as companies often have data-retention policies that would eliminate the data before litigation could commence to require preservation.

The duty to preserve extends to those employees likely to have relevant information, including the retention of all relevant documents or tangible things in existence when the duty attaches.3 While this initially seems like a large task, “a party need not preserve all backup tapes even when it reasonably anticipates litigation” as doing so would cripple large clients.4 Instead the duty to preserve extends to those employees likely to have relevant information, and not the entire corporate database.5

More data results in more problems, in that the likelihood of inadvertent disclosure of privileged information to opposing counsel increases. Fortunately, in 2008 the Federal Rules of Evidence were amended, providing much needed help.

Rule 502 generally provides protection against traditional “subject-matter waiver” standards when the protected information is inadvertently provided to the opposing party.6 To clarify, the committee directly addressed the cost/benefit problem, commonly referred to as proportionality, when discussing Rule 502, responding to the “widespread complaint that litigation costs necessary to protect against waiver . . . have become prohibitive due to [disclosure concerns].”7 While some courts previously held that even unintentional disclosure of protected information constituted waiver, Rule 502 now expressly protects this information. This is certainly helpful to those weary of having to manually review every document or email in order to ensure protection of privileged information.

Rules of evidence require authentication, which can be especially challenging with digital evidence. Such authentication can be done using metadata; however, asking for metadata with a discovery request should be done sparingly as gathering metadata associated with every document or email requires significantly more work and expense.

The term “metadata” seems to float around digital evidence and electronic discovery like pollen in the spring. However, what exactly constitutes metadata seems to escape many who use the term. Metadata can be thought of as data about data. For example, documents contain words and sentences; clearly this is data. However, metadata contains data about that document, including when it was created and modified, by whom, and on which computer. Metadata can be thought of as a set of unique library catalog cards, one for each document or email you create, read, modify, or copy. Thus, just as a catalog card provides background information on a book, metadata provides background information on documents, emails, messages, pictures, blog posts, web site visits, and a host of other electronic items and actions.
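The distinction between data and metadata is easy to see in an email, where the headers are the catalog card and the body is the book. The sketch below uses Python's standard email module; the message itself, including the addresses and date, is invented for illustration.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical raw message, invented for illustration only.
raw = """From: alice@acme.example
To: bob@acme.example
Date: Mon, 02 Mar 2015 10:15:00 -0700
Subject: Schedule change
Message-ID: <1234@acme.example>

Joe, the meeting has moved to Thursday.
"""

msg = message_from_string(raw)

# The headers are metadata: who sent it, to whom, and when.
metadata = {
    "from": msg["From"],
    "to": msg["To"],
    "sent": parsedate_to_datetime(msg["Date"]),
    "message_id": msg["Message-ID"],
}

# The payload is the data itself: the words the sender wrote.
body = msg.get_payload()

print(metadata)
print(body)
```

Files on disk carry comparable information, such as modification times, although what is recorded varies by file type and operating system.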

While metadata can be used to authenticate, admissions will always be more common. Only when someone fails to admit involvement in creating, editing or viewing a document, or sending or receiving an email will metadata be needed to authenticate. Of course data processing and electronic discovery vendors will be very happy to provide metadata at attractive rates – attractive to themselves.

Ethical Standards

ABA Model Rule 1.1 of Professional Conduct expresses in a few words the lawyer’s duty to represent all clients competently. In 2013, the ABA accepted a proposal of the ABA Commission on “Ethics 20/20” to modify one of the comments to this rule in order to make clear that a lawyer must continuously maintain familiarity with technological change in order to comprehend the manner in which technology may affect a particular representation. The added language reads, “To maintain the requisite knowledge and skill, a lawyer should keep abreast of changes in the law and its practice, including the benefits and risks associated with relevant technology, engage in continuing study and education and comply with all continuing legal education requirements to which the lawyer is subject.”8 Lack of knowledge as to digital evidence cannot excuse a lawyer from such diligence. What’s a lawyer to do? The California Rules of Professional Conduct may supply an answer. Recent changes in California rules included this ominous passage:

Attorneys who handle litigation may not ignore the requirements and obligations of electronic discovery. Depending on the factual circumstances, a lack of technological knowledge in handling e-discovery may render an attorney ethically incompetent to handle certain litigation matters involving e-discovery, absent curative assistance under rule 3-110(C), even where the attorney may otherwise be highly experienced. It also may result in violations of the duty of confidentiality, notwithstanding a lack of bad faith conduct.9

This may be the future across the nation. Regardless, the volume of data will not shrink in the future, nor will the importance of some of the needles in the growing haystack of data. The outcome of cases will turn on these needles, as will the ability of lawyers and law firms to compete with those who leverage technology to reduce costs.

Even the most tech-savvy lawyer, using software to find that needle, will likely require the expertise of technology professionals — data from multiple sources, in multiple formats, doesn’t simply appear within a software product without retrieval and conversion. The ABA addressed this issue in a formal opinion, providing that a lawyer who engages a non-lawyer (or lawyer) to provide outsourced services is required to ensure that person’s compliance with Rules 5.1 and 5.3.10 Therefore, utilizing help from outside vendors to complete a complicated e-discovery task is not unethical. It is, however, the supervising lawyer’s obligation to ensure that those tasks are delegated to individuals competent to engage in them.

Data Review

Document review continues to employ methods that date back to the invention of paper – manual review. According to the Best Practices Commentary by the Sedona Conference, “[e]ven assuming that the profession had the time and resources to continue to conduct manual review of massive sets of electronic data sets (which it does not), the relative efficacy of that approach versus utilizing newly developed automated methods of review remains very much open to debate.”11 New associates fear and loathe the traditional review of documents, namely sitting in a room with other young lawyers reviewing reams of documents, electronically or in paper form, for days on end. Not only is this effort mentally exhausting, but in the age of computers it requires the review of far more documents than ever before. Even if it were affordable, manual review of that amount of data leads to error rates of 35-40 percent.12

Technology assisted review (TAR) includes a software component that goes beyond simple display of electronic data within the review process. Even though the legal profession views manual review as the gold standard compared to other forms of review, the use of TAR produces accurate, efficient results while still maintaining compliance with applicable discovery rules. “One point must be stressed — it is inappropriate to hold TAR to a higher standard than keywords or manual review. Doing so discourages parties from using TAR for fear of spending more in motion practice than the savings from using TAR for review.”13 Thus TAR should be viewed on an even playing field with manual review.

Just as when you buy a new computer, there are many choices when it comes to TAR. The three primary types of software available utilize Continuous Active Learning (CAL), Simple Active Learning (SAL), or Simple Passive Learning (SPL). A study performed by two of the most renowned experts in e-discovery found that any active learning process (CAL or SAL) substantially outperforms passive learning (SPL), especially when compared to traditional review techniques.14

Active learning differs from passive learning in how documents are chosen for review. Active learning software presents the user with the documents most likely to be relevant, while passive learning software selects documents randomly or allows the user to select them. Active learning leverages user actions to increase the accuracy and efficiency of the searching and tagging performed by a user. Passive learning treats all data the same and, because it relies on random sampling, shows the user documents that better represent the data set as a whole. Both processes continually learn from reviewer actions, and both use statistical measurements to determine when to let the software mark the remaining documents. After the software marks a set of documents, the user reviews these and corrects any mistakes. The cycle continues until the statistical measurements reach pre-set goals, often set as part of the cost/benefit analysis. The software then marks the documents the user has not yet reviewed. In this way many documents get marked but the user reviews only a portion of them.
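The review cycle for active learning can be illustrated with a toy sketch. It is not the algorithm of any TAR product. The eight documents, their labels, and the word-counting "model" are all assumptions made up for illustration, and the loop reviews every document only to show the order in which an active approach would present them.

```python
from collections import Counter

# Hypothetical mini-collection of (text, is_relevant) pairs. The labels stand
# in for the human reviewer's judgment and are invented for illustration.
DOCS = [
    ("joe smith termination letter draft", True),
    ("smith severance terms joe", True),
    ("termination meeting with joe smith", True),
    ("lunch menu for friday", False),
    ("parking garage closed friday", False),
    ("severance budget for smith", True),
    ("holiday party menu", False),
    ("garage parking permits", False),
]


def train(labeled):
    """Toy model: each word gains a point per relevant document that
    contains it and loses a point per non-relevant document."""
    weights = Counter()
    for text, relevant in labeled:
        for word in set(text.split()):
            weights[word] += 1 if relevant else -1
    return weights


def score(text, weights):
    """Sum the learned weights of the distinct words in a document."""
    return sum(weights[word] for word in set(text.split()))


def active_review(docs, seed_index):
    """Review the seed document, then repeatedly present the unreviewed
    document the current model scores highest, retraining after each label."""
    order = [seed_index]
    labeled = [docs[seed_index]]
    remaining = [i for i in range(len(docs)) if i != seed_index]
    while remaining:
        weights = train(labeled)
        # Highest score first; ties fall back to original position.
        remaining.sort(key=lambda i: (-score(docs[i][0], weights), i))
        next_index = remaining.pop(0)
        order.append(next_index)
        labeled.append(docs[next_index])  # reviewer's label feeds the model
    return order


order = active_review(DOCS, seed_index=0)
print(order)
```

Starting from a single relevant seed document, the sketch presents the other three relevant documents before any of the irrelevant ones. A passive approach would draw the next document at random instead, and a real system would stop once its statistical targets were met rather than reviewing everything.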

Two very simple terms can be used to convey the effectiveness of TAR: precision and recall. Precision is the percentage of marked documents that were marked correctly. Recall is the percentage of all relevant documents in the dataset that were in fact marked relevant. Essentially, precision measures how accurately a TAR technology uses reviewer markings to mark unreviewed documents, while recall measures how completely it finds what is there. For example, if a user marks 100 documents relevant, and the software marks 1,000 of the remaining 9,000 documents relevant, then precision would be the percentage of the 1,000 documents the software marked that the reviewer confirms as being relevant. Recall would be the percentage of all the relevant documents among those 9,000 that fall within the 1,000 the software marked. An attorney can utilize these measurements to judge the progress of document review and substantiate both the use and accuracy of TAR.
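Both measurements reduce to simple ratios. The sketch below uses the standard definitions: precision is the share of marked documents that are truly relevant, and recall is the share of truly relevant documents that were marked. The counts of 800 confirmed and 1,600 truly relevant documents are assumptions added to the example above, not figures from any study.

```python
def precision_recall(marked_relevant: set, truly_relevant: set):
    """Return (precision, recall) for a set of documents marked relevant."""
    correct = len(marked_relevant & truly_relevant)
    precision = correct / len(marked_relevant) if marked_relevant else 0.0
    recall = correct / len(truly_relevant) if truly_relevant else 0.0
    return precision, recall


# The software marks 1,000 of the 9,000 unreviewed documents (IDs 0-999).
marked = set(range(1000))

# Assumed for illustration: 1,600 documents are truly relevant (IDs 200-1799),
# so 800 of the software's 1,000 marks are correct.
truth = set(range(200, 1800))

precision, recall = precision_recall(marked, truth)
print(f"precision = {precision:.0%}, recall = {recall:.0%}")
```

Under these assumed numbers precision is 80 percent and recall is 50 percent: most of what the software marked is relevant, but it has found only half of what is there.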

As more time is spent chasing that ever-elusive result of high precision and high recall, costs climb steeply. The time and money spent to gain one more percentage point of recall or precision may not be proportional to the value of the case. Because that balance depends on the matter at hand, courts often reiterate that their rulings regarding proportionality for one particular case should not be applied to other cases.15

In early 2012, federal Magistrate Judge Andrew Peck issued one of the first judicial endorsements of TAR, stating that “judicial opinion now recognizes that [TAR] is an acceptable way to search for relevant ESI in appropriate cases.”16 It has been three years since that ruling was published, and courts have generally approved the use of TAR in discovery.17 It is important to note that while courts approve of the usage of TAR, it is up to the parties to determine when TAR is appropriate. Courts generally have not required TAR in discovery where the party has shown that the benefits do not outweigh the cost.18

Many lawyers fear TAR as they assume they must be statisticians or mathematicians to use it. While it is necessary to understand how the technology works, a lawyer need not understand the detailed statistics or programming behind TAR. The choice in technology lies with the producing party as courts resist dictating the use of TAR or what type of TAR to use. Thus, the system used should be dependent on the specifics of the data within the case at hand, which can only be determined on a case-by-case basis.


Data can be produced in one of four formats: native, near-native, near-paper, or paper. Each has its benefits and drawbacks, so the process used for one case may not fit the next.

A request for production may include metadata and, if so, such metadata should be delivered with the document in native format. This metadata acts much like a digital fingerprint and can be imperative in authenticating the document. Native production refers to the form ordinarily used by the producing party to store and revise the document. For example, with Microsoft Word, documents would be stored in .doc or .docx format. Production in this format delivers metadata but presents challenges when applying a Bates stamp, performing redaction, and controlling privileged or confidential data. Additionally, if the native format requires special software, such as a computer-aided design (CAD) package, the opposing party may have no feasible way to view the native file.

The near-native production keeps as much of the original data as possible, but places all documents in a commonly read file format more easily used by the opposing party. For example, CAD drawings can be converted to PDF files. This also makes Bates stamping and redaction easier to perform.

Near-paper production simply produces an electronic file with each document presented as if it were printed and scanned into an electronic file. For example, a group of emails can be exported from Outlook into a single, multi-page, PDF file. Converting to this format allows for Bates stamping, redaction, removal of confidential information, and control of metadata. Traditionally lawyers would collect the data to be produced, print it, and then scan it back into a PDF, a costly and inefficient process. Software now allows this process to be performed seamlessly by the user when extracting data, thus decreasing cost and making this an often preferred method of production for data.

Lastly, the paper format is just as it sounds, production completely in paper. As in near-paper, this provides the ability to completely control redaction, Bates stamping, and removal of confidential information. However paper will not include any metadata and, unlike all methods above, will not allow the party to electronically search the information.

Technology That Fits

Currently, discovery extends to any non-privileged matter relevant to any party’s claim or defense, making it very broad and far reaching. Proposed changes to the Federal Rules of Civil Procedure include a change to Rule 26, limiting discovery to be “proportional to the needs of the case.” The motivation for this change stems from an effort to decrease cost and increase efficiency in the age of growing data volumes. The current interpretation of Rule 26 has led to increased costs and delays as firms struggle to sift through large amounts of data even though the expense often far outweighs the benefit to either party. Opinions vary on the impact of such changes but most agree that the proportionality standard will require substantiation beyond legal arguments – namely based on technical and resource expenditures. Most attorneys struggle with these types of arguments, requiring technical experts to weigh in as to the limits and accuracy of the software used when courts require a party to prove proportionality.

Many cases and controversies settle without data volume becoming a factor, and without a thorough examination of documents and emails. A case with 1,000 pages of documents and email may appear too small for technology, yet paralegals and lawyers struggle to keep the content of that data organized mentally, especially when handling dozens of matters simultaneously. A case with 10,000 pages requires more than the human memory. However, a matter valued at $200,000 may not warrant a five-figure technology investment. What technology fits a case like this?

Practicing attorneys can answer this question without the need to become technology geeks. Simple tools that merely organize data into a table of contents linking file names to file locations provide very little value. Tools that truly assist in analyzing, relating, and understanding the content in a dataset provide the real assistance needed. Software such as EnCase, Logikcull, iPro, and START:Review fills this space. The current market offers varying solutions, from cloud-based data review to on-site software. Because each vendor has its own methods for performing the complicated analysis required, lawyers can shop around for the right features, at the right price, to find those needles in their haystack of data.


E-discovery is an area of the law that every lawyer should embrace and understand well enough to work in. Big data is not going away, and sooner or later a case will require complicated software to navigate. Understanding how TAR can help you wade through a seemingly impossible task is the first step in embracing e-discovery, and it is part of your obligation to serve your client effectively by providing more bang for the buck.

This article was originally published in the August 2015 (Vol. 40, Issue 9) edition of Montana Lawyer, the official publication of the State Bar of Montana, and is reprinted with the permission of the authors.


1 Joel Henry, Ph.D., J.D., is a professor of computer science at the University of Montana and adjunct professor of law at the university’s Alexander Blewett III School of Law. Michael Pasque is a third-year candidate for J.D., 2016, from the University of Montana Alexander Blewett III School of Law.

2 Zubulake v. UBS Warburg LLC, 220 F.R.D. 212, 216 (S.D.N.Y. 2003).

3 Id. at 217; Fed. R. Civ. P. 34(a).

4 Id. (emphasis added).

5 Id.; Fed. R. Civ. P. 34(a).

6 Fed. R. Evid. 502(a)-(b).

7 Fed. R. Evid. 502 Advisory Committee Notes.

8 ABA Model R. of Professional Conduct, 1.1, cmt. 1 (2013) (emphasis added).

9 State Bar of Calif. Standing Comm. on Prof’l Responsibility & Conduct, Formal Op. Interim No. 11-0004 (2015).

10 ABA Standing Comm. on Ethics & Prof’l Responsibility, Formal Op. 08-451, (2008).

11 The Sedona Conference, The Sedona Conference Best Practices Commentary on the Use of Search and Information Retrieval Methods in E-Discovery, 8 Sedona Conf. J. 189, 199 (2007).

12 Id.

13 Rio Tinto PLC v. Vale S.A., 306 F.R.D. 125, 129 (S.D.N.Y. 2015).

14 Cormack, G. V., & Grossman, M. R., Evaluation of machine-learning protocols for technology-assisted review in electronic discovery, Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, 153-162 (ACM 2014).

15 Moore v. Publicis Groupe, 287 F.R.D. 182, 193 (S.D.N.Y. 2012) adopted sub nom. Moore v. Publicis Groupe SA, 11 CIV. 1279 ALC AJP, 2012 WL 1446534 (S.D.N.Y. Apr. 26, 2012); Rio Tinto, supra note 13.

16 Moore, 287 F.R.D. at 183.

17 Green v. Am. Modern Home Ins. Co., No. 14–CV–04074, 2014 WL 6668422 at 1 (W.D.Ark. Nov. 24, 2014); Aurora Coop. Elevator Co. v. Aventine Renewable Energy–Aurora W. LLC, No. 12 Civ. 0230, Dkt. No. 147 (D.Neb. Mar. 10, 2014); Edwards v. Nat’l Milk Producers Fed’n, No. 11 Civ. 4766, Dkt. No. 154: Joint Stip. & Order (N.D.Cal. Apr. 16, 2013); Bridgestone Am., Inc. v. IBM Corp., No. 13–1196, 2014 WL 4923014 (M.D.Tenn. July 22, 2014); Fed. Hous. Fin. Agency v. HSBC N.A. Holdings, Inc., 11 Civ. 6189, 2014 WL 584300 at 3 (S.D.N.Y. Feb. 14, 2014); EORHB, Inc. v. HOA Holdings LLC, No. Civ. A. 7409, 2013 WL 1960621 (Del.Ch. May 6, 2013); In re Actos (Pioglitazone) Prods. Liab. Litig., No. 6:11–MD–2299, 2012 WL 7861249 (W.D.La. July 27, 2012) (Stip. & Case Mgmt. Order); Global Aerospace Inc. v. Landow Aviation LP, No. CL 61040, 2012 WL 1431215 (Va.Cir.Ct. Apr. 23, 2012).

18 In re Biomet M2a Magnum Hip Implant Prods. Liab. Litig., 2013 WL 1729682 & 2013 WL 6405156; Kleen Prods. LLC v. Packaging Corp. of Am., 2012 WL 4498465.