Previous: Obtain the necessary software and code
Convert your dataset to the appropriate format
Determine facets and attributes
When converting a new dataset, you need to classify each feature in the dataset as either an attribute, which appears only in the metadata displayed alongside the item, or a facet, which users can browse the dataset by.
For example, in the Fine Arts Museum collection, "Media," "Location," and "Date" are facets because different photos can share the same location, media type or date, and the user may want to search for all photographs in a certain medium (such as drawing or sculpture). On the other hand, the image record number is an attribute because few users will want to search for all photographs with a certain record number, though they are likely to want that information once they locate a useful photograph.
In short, facets are browsable item characteristics, whereas attributes are shown only after an item has been found.
Create tab-delimited (tsv) files
You will need to create the following tab-delimited text files:
- attrs.tsv
- facets.tsv
- items.tsv
- [facetname]_hierarchy.tsv (for every facet you decide to have)
- [facetname]_item_mapping.tsv (for every facet you decide to have)
- fulltext.tsv
- sortkeys.tsv
(Note: Large tab-delimited files can be easily manipulated using Excel. Also, samples of all the files you are required to generate can be found in the Resources section.)
attrs.tsv and facets.tsv
attrs.tsv should be a list of the attributes you've decided on for your system. Each row in this file should represent a single attribute. For each of these rows, the first column should be the underlying system name for this attribute. The second column should be the display name you'd like users to see. A simplified attrs.tsv file for a collection of articles might look something like:
    item     PMID
    title    Title

This file is most easily generated by hand.
Similarly, facets.tsv should have a row for each of the facets you'd like your system to use. As in attrs.tsv, the first two columns of this file should be the underlying system name and the display name, respectively. The third column is a textual description of, or comment on, the facet; it is not used by the system but should be included for later reference. A simplified facets.tsv file might look something like:

    journal         Journal    Short name of the journal in which article appears
    date_created    Date       Date the article was created

The only thing to note is that the identifiers in the first column of both files should be a single word. In the attrs.tsv file, this identifier should be consistent with the column names of the items table. Remember that attributes show up only in the endgame view, whereas the facets are used for navigation.
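If you would rather script these two files than type them, the following is a minimal sketch using Python's standard csv module. The helper names (write_tsv, check_identifiers) are hypothetical, and the "\n" line ending is an assumption rather than a documented requirement of the system; the sample rows are the ones shown above.

```python
import csv
import io


def write_tsv(rows, handle):
    """Write rows to an open text handle as tab-delimited lines."""
    # Assumption: plain "\n" line endings are acceptable to the loader.
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


def check_identifiers(rows):
    """Return first-column identifiers that are not exactly one word."""
    return [row[0] for row in rows if len(row[0].split()) != 1]


attrs = [
    ("item", "PMID"),
    ("title", "Title"),
]
facets = [
    ("journal", "Journal", "Short name of the journal in which article appears"),
    ("date_created", "Date", "Date the article was created"),
]

# Written to an in-memory buffer here; pass open("attrs.tsv", "w", newline="")
# instead to produce the real file.
buffer = io.StringIO()
write_tsv(attrs, buffer)
attrs_text = buffer.getvalue()

bad_identifiers = check_identifiers(attrs + facets)
```

Running check_identifiers before writing catches multi-word identifiers such as "date created" early, which is the one constraint the text above places on the first column.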
items.tsv
This file should contain one line for each item in your collection. For every row, values should exist for every attribute your system is using. (Note: Column headings are not included in the actual file). A collection with attribute fields RecordID, color, and date might look something like:
    568945    blue     02-03-2001
    938932    red      04-30-1999
    934983    green    02-22-2000

The thing to note here is that the first column of every single row should be a unique identifier for the item.
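The two requirements above (a value for every attribute, and a unique identifier in the first column) are easy to check mechanically. This is only a sketch: the function name check_items is hypothetical, and it assumes every row should have exactly as many columns as there are attributes.

```python
import csv
import io


def check_items(handle, n_attrs):
    """Return (line_number, message) pairs for rows that break the rules."""
    problems = []
    seen = set()
    # QUOTE_NONE treats quote characters as ordinary data.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, row in enumerate(reader, start=1):
        if len(row) != n_attrs:
            problems.append(
                (line_no, "expected %d columns, found %d" % (n_attrs, len(row)))
            )
        if row:
            if row[0] in seen:
                problems.append((line_no, "duplicate identifier %s" % row[0]))
            seen.add(row[0])
    return problems


sample = (
    "568945\tblue\t02-03-2001\n"
    "938932\tred\t04-30-1999\n"
    "934983\tgreen\t02-22-2000\n"
)
sample_problems = check_items(io.StringIO(sample), n_attrs=3)
```

For a real file, pass open("items.tsv", newline="") in place of the in-memory sample; an empty result means no problems were found.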
[facetname]_hierarchy.tsv (for every facet you decide to have)
Each row in these files should represent a node in the facet hierarchy. The first column should be an identifier for that node, to be used in [facetname]_item_mapping.tsv. Subsequent columns should be the values for that node, listed from general to specific, left to right. A portion of this file for the location facet, location_hierarchy.tsv, might look something like:

    1    United States    California    San Francisco
    2    United States    California    Berkeley
    3    United States    Washington    Seattle

The thing to note here is that the first column of every single row should be a unique identifier for that node.
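When the facet values already exist as general-to-specific paths, the hierarchy rows can be generated rather than typed. The sketch below assigns sequential integer identifiers; that numbering scheme is an assumption (the only stated requirement is uniqueness), and build_hierarchy_rows is a hypothetical helper name.

```python
def build_hierarchy_rows(paths):
    """Turn general-to-specific paths into hierarchy rows with unique ids.

    Returns (rows, ids): rows are lists ready to be written as
    tab-delimited lines, and ids maps each path tuple to its node id.
    """
    rows = []
    ids = {}
    for path in paths:
        key = tuple(path)
        if key in ids:
            continue  # a repeated path reuses its existing node
        ids[key] = str(len(ids) + 1)
        rows.append([ids[key]] + list(key))
    return rows, ids


location_paths = [
    ["United States", "California", "San Francisco"],
    ["United States", "California", "Berkeley"],
    ["United States", "Washington", "Seattle"],
    ["United States", "California", "Berkeley"],  # duplicate, ignored
]
location_rows, location_ids = build_hierarchy_rows(location_paths)
```

Keeping the ids dictionary around is convenient because the same identifiers are needed again when writing the item mapping file.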
[facetname]_item_mapping.tsv (for every facet you decide to have)
Each row in these files maps one item to one node in the facet hierarchy: the first column is the item identifier and the second column is the node identifier from [facetname]_hierarchy.tsv. This is perhaps best described with an example. Consider once again the location facet. If we are using the location_hierarchy.tsv file from above, our location_item_mapping.tsv file might look something like:

    75635     1
    434543    1
    645654    3
    534454    2

This would indicate that item 75635 has location values "United States->California->San Francisco." Likewise, item 645654 would have location values "United States->Washington->Seattle." For facets where items might have multiple values ("multi-valued facets"), simply include multiple rows assigning values for that item.
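To sanity-check a mapping file against its hierarchy file, it can help to resolve each item back to its readable path. This is a sketch only; resolve_locations is a hypothetical name, and the rows are assumed to have already been read from the two files as lists of strings.

```python
from collections import defaultdict


def resolve_locations(hierarchy_rows, mapping_rows):
    """Map each item id to the list of facet paths it is assigned to."""
    # node id -> path values, general to specific
    nodes = {row[0]: row[1:] for row in hierarchy_rows}
    by_item = defaultdict(list)
    for item_id, node_id in mapping_rows:
        # A KeyError here means the mapping names a node that is
        # missing from the hierarchy file.
        by_item[item_id].append("->".join(nodes[node_id]))
    return dict(by_item)


hierarchy = [
    ["1", "United States", "California", "San Francisco"],
    ["2", "United States", "California", "Berkeley"],
    ["3", "United States", "Washington", "Seattle"],
]
mapping = [
    ["75635", "1"],
    ["434543", "1"],
    ["645654", "3"],
    ["534454", "2"],
]
resolved = resolve_locations(hierarchy, mapping)
```

An item in a multi-valued facet simply ends up with more than one path in its list, one per mapping row.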
fulltext.tsv
This file can be generated by hand. It supports full-text searching and is necessary only if you plan to choose MySQL full-text searching later, as opposed to Lucene. For every item in your collection, provide any text associated with that item. The format of the file should be as follows:

    001    all the text associated with item 001
    002    all the text associated with item 002
    003    all the text associated with item 003

sortkeys.tsv
This file lets the system know which attributes or facets you want to sort by. The first column should provide the display value for the sorting option. The second column should provide the name of the facet or attribute in the underlying system. That is, every value in the second column should be found in either facets.tsv or attrs.tsv. A sortkeys.tsv file for a system that files publications might look something like this:

    Journal    journal
    Date       date_created
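The rule that every second-column value must appear in facets.tsv or attrs.tsv can be checked with a few lines of code. This is a sketch under the assumption that the three files have already been read into lists of rows; unknown_sort_keys is a hypothetical helper name.

```python
def unknown_sort_keys(sortkeys, attrs, facets):
    """Return system names in sortkeys that neither attrs nor facets define."""
    known = {row[0] for row in attrs} | {row[0] for row in facets}
    return [row[1] for row in sortkeys if row[1] not in known]


attrs = [["item", "PMID"], ["title", "Title"]]
facets = [
    ["journal", "Journal", "Short name of the journal in which article appears"],
    ["date_created", "Date", "Date the article was created"],
]
sortkeys = [["Journal", "journal"], ["Date", "date_created"]]

missing = unknown_sort_keys(sortkeys, attrs, facets)
```

An empty list means every sort option refers to a known facet or attribute.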
Next: Prepare to run installation scripts
Questions? Comments? Contact Kevin Li (kevinli@sims.berkeley.edu)