NDSU Search uses the following systems:
- Apache Cayenne: ORM used to store crawl information to assist with the process.
- Apache Droids (incubating): Framework to create web crawlers.
- Apache Solr: Web-accessible search engine built on top of the well-respected Apache Lucene. Well-known users include Netflix, Sears, and the White House.
- Apache Tika: Document parsing API that can handle a wide range of document types by using other libraries. These document types include Microsoft Word, Microsoft PowerPoint, ODF, PDF, RTF, and HTML. In addition, the API can extract metadata from media formats such as MP3 and JPG.
- PHP API for interacting with Solr.
How search results work
Where a result appears in the listing depends on the score that Solr assigns to each page. In general, the shorter the area in question, the more it contributes to the score of a page.
In relative order from most important to least important:
- Page or document title
- Link text from other crawled pages that link to the page (inbound links)
- The content of the page with the boilerplate removed
- The content of the page
NDSU Search does not use meta keywords, image alt text, or link title text. It also does not perform optical character recognition (OCR), so images do not contribute to the page content or score. Pages full of links are effectively penalized for two reasons: link text benefits the pages each link points to more than the page the links are on, and lists of links with very little text look like boilerplate and are scored lower.
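The ordering above can be illustrated with a toy score. This is only a sketch and not Solr's actual formula: the field names, the numeric weights, and the square-root length normalization are all assumptions chosen for illustration. Only the relative order of the fields and the idea that shorter areas contribute more come from the description above.

```python
import math

# Hypothetical weights. Only their ORDER (title > inbound link text >
# content without boilerplate > full content) reflects the text above.
FIELD_WEIGHTS = {
    "title": 4.0,
    "inbound_link_text": 3.0,
    "main_content": 2.0,   # page content with boilerplate removed
    "full_content": 1.0,   # page content including boilerplate
}


def toy_score(term: str, fields: dict) -> float:
    """Score one page for one term across its text fields (illustrative only)."""
    term = term.lower()
    score = 0.0
    for name, weight in FIELD_WEIGHTS.items():
        words = fields.get(name, "").lower().split()
        if not words:
            continue
        matches = words.count(term)
        # Dividing by sqrt(length) makes a match in a short field count more.
        score += weight * matches / math.sqrt(len(words))
    return score
```

With these assumed weights, a page titled "Safety Training" scores higher for "safety" than a page that only mentions the term once in a long body of text.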
If a page should be found with a search term, make sure of the following:
- The term is prominent on the page in plain text, not in an image and not only as an abbreviation or acronym of the term or phrase. Placing the term in the title is best.
- Other pages link to that page using that term. Links with the text "here" do not help.
- The term is used in a paragraph of text so that it stands out from the boilerplate.
For site/content maintainers
ETag and Last-Modified headers
If NDSU Search crawls your site, it is very helpful to the process if your site provides updated ETags and/or accurate Last-Modified headers. The search crawl is incremental and stores ETag and Last-Modified headers. Before a page is retrieved, a HEAD request is performed to see if either of those values matches the previously recorded value. If one does match, the page is skipped. If not, the entire page has to be checked again (downloaded and re-parsed), even if nothing actually changed in the page content. Apache HTTPD sets these values automatically for files by default, and NDSU CMS is configured to provide these headers for CMS pages. If you are using another Web provider, check with your server administrator or Web developer to find out if your site uses those headers properly.
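The check described above could be sketched roughly as follows. This is not the crawler's actual code (the crawler is built on Apache Droids); the function names `head_validators` and `needs_refetch` and the dictionary layout of the stored values are hypothetical, used only to show the skip decision.

```python
import urllib.request
from typing import Mapping, Optional

VALIDATORS = ("ETag", "Last-Modified")


def head_validators(url: str, timeout: float = 10.0) -> dict:
    """Send a HEAD request and return the two validator headers (None if absent).

    urlopen raises an HTTPError for 4xx/5xx responses; a real crawler
    would need to handle that case.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return {name: response.headers.get(name) for name in VALIDATORS}


def needs_refetch(stored: Mapping[str, Optional[str]],
                  current: Mapping[str, Optional[str]]) -> bool:
    """Return False (skip the page) if either stored validator still matches."""
    for name in VALIDATORS:
        old = stored.get(name)
        if old is not None and old == current.get(name):
            return False
    # No usable match: the page must be downloaded and re-parsed.
    return True
```

A site that sends neither header always falls through to the final `return True`, which is why every crawl has to download and re-parse such pages in full.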
Document titles (in results)
The titles for search results are read from the document title, if available. If no document title is provided, the file name will appear as the result title in NDSU Search results. For Web pages, the document title means the <title> element. For other documents, the document must provide the title using application-specific techniques.
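For Web pages, the fallback from document title to file name might look like the sketch below. NDSU Search itself parses documents with Apache Tika, so this is only an illustration of the rule using Python's standard library; `result_title` and the example URLs are hypothetical.

```python
from html.parser import HTMLParser
from pathlib import PurePosixPath
from urllib.parse import urlparse


class _TitleParser(HTMLParser):
    """Collects the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def result_title(html: str, url: str) -> str:
    """Return the page title, or the file name from the URL if there is none."""
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so an empty or blank title counts as missing.
    title = " ".join("".join(parser.parts).split())
    if title:
        return title
    return PurePosixPath(urlparse(url).path).name
```

For example, a page at a path ending in `handbook.html` with no `<title>` would be listed as "handbook.html", which is rarely what a searcher expects to see.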
MS Office applications (doc, xls, ppt)
Office 2007, 2010 (Windows) - click the Microsoft Office Button and choose Prepare > Properties. In the Title field, enter the document title. Save as usual (or save as PDF, as appropriate). See also: Microsoft 2007 help.
Office 2011 (Mac) - click File > Properties and enter the document title in the Title field. Save as usual (or save as PDF, as appropriate).
The technique to set the title for PDF files varies depending on which PDF generator or program you use.
MS Word - set the Word document title (see MS Office applications).
Adobe Writer - choose File > Properties and click the Description tab. Enter a title in the Title field. Save the document and reopen to see the result.
Pages not showing up for the search keywords you expect? Consider the text used for inbound links. If the link text is "here," "click here," or some variant thereof instead of "safety training," the page is ranked highly for the keywords "click" and "here" instead of "safety" and "training." Use meaningful link text for better results.
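Site maintainers who want to audit their own pages for this problem could start from a small sketch like the one below. It is not part of NDSU Search; the `generic_links` helper and the set of phrases treated as generic are assumptions (only "here" and "click here" are named above).

```python
from html.parser import HTMLParser

# Assumed list of non-descriptive phrases; extend it to suit your site.
GENERIC_LINK_TEXT = {"here", "click here", "click", "this link", "more", "read more"}


class _LinkCollector(HTMLParser):
    """Collects (href, link text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def generic_links(html: str) -> list:
    """Return the (href, text) pairs whose link text says nothing about the target."""
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()
    return [(href, text) for href, text in collector.links
            if text.lower().strip(" .!:") in GENERIC_LINK_TEXT]
```

Each pair it returns is a link worth rewording so that the link text carries the term the target page should be found under.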