UC Berkeley Enron Email Analysis






    UC Berkeley Enron Email Analysis Project

    Starting with the Enron Email dataset made available by MIT, SRI, and CMU, we have put together several resources:

    • A set of categories developed in our ANLP (Applied Natural Language Processing) course, to be used for annotating a subset of the Enron email messages.

    • A subset of about 1700 labeled email messages (4.5M). These were chosen in a semi-motivated fashion: we focused on business-related emails, on the California Energy Crisis, and on emails that occurred later in the collection, and tried to avoid very personal messages, jokes, and so on. Students in the ANLP course annotated the selected messages with the category labels. Each message was labeled by two people, but we make no claims about the consistency, comprehensiveness, or generality of these labelings.

    • The Enronic email visualization and clustering tool by Jeff Heer, built on his prefuse toolkit (1.9M jar file).

    • A database representation (219 MB compressed) of the Enron email collection, built by Andrew Fiore and Jeff Heer. This version contains many but not all of the tables used in the search tool, as well as special tables to be used with the Enronic visualization tool. Andrew did a substantial amount of processing on the contents of the database to remove duplicates, normalize names, and so on. It has been tested only on MySQL.
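
Because each labeled message was annotated by two people and no consistency is claimed, anyone using the labels may want to measure agreement for themselves. The following is a minimal sketch of Cohen's kappa in Python. It assumes, hypothetically, that the annotations have already been reduced to one category per message per annotator, given as two parallel lists; the actual file layout of the labeled subset is not described here, and the category names in the usage example are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same messages.

    labels_a[i] and labels_b[i] are the categories the two annotators
    assigned to message i (assumed: exactly one category each).
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")

    n = len(labels_a)

    # Fraction of messages on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label for every message.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical category names, for illustration only.
annotator_1 = ["business", "business", "personal", "personal"]
annotator_2 = ["business", "personal", "personal", "personal"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

A value near 1 indicates strong agreement and a value near 0 indicates agreement no better than chance. If a message can carry several categories at once, this single-label formulation would not apply directly and a per-category comparison would be a more suitable starting point.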