New Hardware for searching 5 million+ volumes of full-text

On November 19, 2009, we put new hardware into production to provide full-text searching against about 4.6 million volumes. Currently we have about 5.3 million volumes indexed. Below is a brief description of our current production hardware. Future posts will give details about performance and background on our experiments with different system architectures and configurations.

Hardware details

Solr Server configuration

  • Dell PowerEdge R710
  • 2 x Quad Core Intel Xeon E5540 2.53GHz processors (Nehalem)
  • 72 GB RAM
  • Red Hat Enterprise Linux 5.4 (kernel: 2.6.18 x86_64)
  • Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
  • Solr 1.3.0.2009.09.03.11.14.39 (1.4-dev 793569)
  • Tomcat 5.5.27

Storage

  • Isilon IQ NAS cluster (20 I/X-series nodes)
  • 480 SATA drives (750 GB or 1 TB each) providing 420 TB raw storage
  • 4 GB RAM per node, giving 80 GB of coherent cache in aggregate
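
As a quick sanity check on the storage figures above, the arithmetic can be sketched in a few lines of Python. The drive mix is not stated in the post; the split below is only what the totals imply if capacities are nominal decimal terabytes (750 GB = 0.75 TB) and every drive is one of the two sizes. The even spread of drives across nodes is likewise an assumption.

```python
# Figures taken from the storage list above.
NODES = 20
DRIVES = 480
RAW_TB = 420
RAM_PER_NODE_GB = 4


def drive_mix(total_drives, raw_tb, small_tb=0.75, large_tb=1.0):
    """Solve: small + large = total_drives
              small * small_tb + large * large_tb = raw_tb
    Assumes nominal (decimal) capacities; the post does not give the split."""
    small = (total_drives * large_tb - raw_tb) / (large_tb - small_tb)
    large = total_drives - small
    return round(small), round(large)


small_drives, large_drives = drive_mix(DRIVES, RAW_TB)
aggregate_cache_gb = NODES * RAM_PER_NODE_GB
drives_per_node = DRIVES // NODES  # assumes drives are spread evenly

print(f"750 GB drives: {small_drives}, 1 TB drives: {large_drives}")
print(f"Aggregate cache: {aggregate_cache_gb} GB")
print(f"Drives per node: {drives_per_node}")
```

Under those assumptions the totals are consistent with an even split between the two drive sizes, and the aggregate cache matches the 80 GB quoted above.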

Network

  • NFS uses a dedicated/private 9K MTU GbE network on a Dell PowerConnect 5448 switch
  • NFS clients are single-homed, and mounts are automatically distributed across all cluster nodes

Current Solr Architecture and Configuration

Search Servers

  • 4 servers with one Tomcat and 3 shards per server; 10 of 12 shards currently in use
  • 16 GB allocated to the JVM
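
For readers unfamiliar with sharded Solr, the sketch below shows how a query might be fanned out across a layout like the one above using Solr's `shards` request parameter for distributed search. The host names, port, and per-shard paths are hypothetical placeholders; the post does not give the real addresses or say how the shards are mounted inside each Tomcat.

```python
from urllib.parse import urlencode

# Hypothetical host names; the real ones are not given in the post.
SEARCH_HOSTS = ["solr-search-1", "solr-search-2", "solr-search-3", "solr-search-4"]
SHARDS_PER_HOST = 3   # from the list above
SHARDS_IN_USE = 10    # 10 of 12 shards currently in use


def shard_addresses(hosts, shards_per_host, in_use, port=8080):
    """Build host:port/path addresses, one per shard (path layout is assumed)."""
    all_shards = [
        f"{host}:{port}/solr/shard{n}"
        for host in hosts
        for n in range(1, shards_per_host + 1)
    ]
    return all_shards[:in_use]


def distributed_query_url(entry_shard, shards, query, rows=10):
    """Send the request to one shard and list every shard in `shards`."""
    params = urlencode({"q": query, "rows": rows, "shards": ",".join(shards)})
    return f"http://{entry_shard}/select?{params}"


shards = shard_addresses(SEARCH_HOSTS, SHARDS_PER_HOST, SHARDS_IN_USE)
url = distributed_query_url(shards[0], shards, "ocr:whale")
print(len(shards), "shards")
print(url)
```

The shard that receives the request forwards the query to every address in `shards` and merges the results, so any one of the shards can act as the entry point. The field name `ocr` in the sample query is also just an illustration.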

Indexing Server

  • 1 server with 12 Tomcats and 12 shards; 10 of 12 Tomcats/shards currently in use
  • 6 GB allocated to each of 10 JVMs
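
The memory figures above can be put side by side with a short calculation. This is only a rough budget: it assumes the 72 GB listed under "Solr Server configuration" applies to the search servers and the indexing server alike, and it treats everything not given to a JVM as available to the operating system (for example, for file system caching).

```python
SERVER_RAM_GB = 72  # assumed to apply to every server described above

# Search servers: one Tomcat (one JVM) per server.
search_jvm_gb = 16
search_headroom_gb = SERVER_RAM_GB - search_jvm_gb

# Indexing server: 10 JVMs in use, 6 GB each.
indexing_jvms_in_use = 10
indexing_jvm_gb = 6
indexing_jvm_total_gb = indexing_jvms_in_use * indexing_jvm_gb
indexing_headroom_gb = SERVER_RAM_GB - indexing_jvm_total_gb

print(f"Search server: {search_jvm_gb} GB JVM heap, "
      f"{search_headroom_gb} GB left for the OS")
print(f"Indexing server: {indexing_jvm_total_gb} GB across JVMs, "
      f"{indexing_headroom_gb} GB left for the OS")
```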

 

What kind of startup arguments are you using for each Tomcat instance?

storage media, longevity, reliability

I had some questions as a member of the Senate Library Committee at the University of Minnesota, and Wendy Lougee has referred me to this page. If I read this correctly, you are storing all the data on magnetic disks without any backup to CDs, DVDs or other media. Is that right? What is the estimated lifetime of data on the media you are using? I inferred from Wendy's remarks that you rewrite the disks frequently to be assured of preservation. How frequently? What happens if you have a major infrastructure disruption which is not brief? Does all the data get lost? If the answer to the last question is yes, are you considering a more permanent backup medium for preserving this heritage through societal disruptions, which, in my opinion, are not unlikely on a timescale of several decades? Thanks for your attention.

Thank you for your questions. We've added some information at http://www.hathitrust.org/technology that should help to answer them.
