University of California, Berkeley

Abstract:

Usability evaluation is an increasingly important part of the iterative design process. Automated usability evaluation has great promise as a way to augment existing evaluation techniques, but is greatly underexplored. We present a new taxonomy for automated usability analysis and illustrate it with an extensive survey of evaluation methods. We present analyses of existing techniques, and suggest which areas of automated usability evaluation are most promising for future research.

Categories and Subject Descriptors: H.1.2 [Information Systems]: User/Machine Systems - human factors; human information processing; H.5.2 [Information Systems]: User Interfaces - benchmarking; evaluation/methodology; graphical user interfaces (GUI)
General Terms: Human Factors
Additional Key Words and Phrases: Automated usability evaluation, graphical user interfaces, Web interfaces, taxonomy


State of the Art in Automated Usability Evaluation of User Interfaces (DRAFT)

Melody Y. Ivory, EECS Department
Marti A. Hearst, SIMS

Introduction

Usability is the extent to which a computer system can be used by users to achieve specified goals with effectiveness, efficiency and satisfaction in a given context of use. Usability evaluation (UE) is a methodology for measuring these usability aspects of a system's user interface and identifying specific problems with the interface [Dix et al. 1993, Nielsen 1993]. Usability evaluation is an important part of the overall user interface iterative design process, which consists of cycles of designing, prototyping and evaluation [Dix et al. 1993, Nielsen 1993]. Usability evaluation is itself a process that entails many activities: specifying evaluation goals, identifying target users, selecting usability metrics, selecting an evaluation method and tasks, designing experiments, collecting usability data, and analyzing and interpreting data.

A wide range of usability evaluation techniques have been proposed, and a subset of these are currently in common use. Some evaluation techniques, such as formal user testing, can only be applied after the interface design has been implemented. Others, such as heuristic evaluation, can be applied in the early stages of design. Each technique has its own requirements, and generally different techniques uncover different usability problems.

Usability findings can vary widely when different evaluators study the same user interface, even if they use the same evaluation technique [Jeffries et al. 1991, Molich et al. 1998, Molich et al. 1999, Nielsen 1993]. Two studies in particular, the first and second comparative user testing studies (CUE-1 [Molich et al. 1998] and CUE-2 [Molich et al. 1999]), demonstrated less than a 1% overlap in findings among the 4 and 8 independent usability testing teams that evaluated two user interfaces. This result implies a lack of systematicity or predictability in the findings of usability evaluations. Furthermore, usability evaluation typically covers only a subset of the possible actions users might take. For these reasons, usability experts often recommend using several different evaluation techniques [Dix et al. 1993, Nielsen 1993].

How can systematicity of results and fuller coverage in usability assessment be achieved? One solution is to increase the number of usability teams evaluating the system, and increase the number of study participants. An alternative is to make use of automated usability evaluation (AUE) methods.

There are successful precedents for the use of (partially) automated usability evaluation. For example, GOMS analysis [John and Kieras 1996] is an analytical modeling technique for predicting task execution and learning times. This technique was used to compare telephone operators' performance on an existing and a proposed user interface [Gray et al. 1992]. A GOMS analysis accurately predicted that the proposed workstation would result in a performance degradation. The evaluators estimated additional operating costs of $2.4 million a year due to this loss.

Another example of a successfully employed AUE technique is operationalized guidelines. Interface designers can find design guidelines difficult to use because they are too ambiguous or too voluminous [Borges et al. 1996, Lowgren and Nordqvist 1992]. As an alternative, operationalized guidelines automatically identify certain types of guideline violations and in some cases generate corrected interface designs [Balbo 1995, Lowgren and Nordqvist 1992, Sears 1995].

In addition to uncovering varied types of errors and increasing the coverage of features evaluated, automated usability evaluation should enable comparisons between alternative designs and prediction of time and error costs across an entire design. Automated methods should also reduce the need for evaluation expertise among individual developers, improve the consistency of the problems found, and reduce the cost of usability evaluation as compared to standard techniques. Some automated evaluation techniques can be embedded within the design phase of UI development, as opposed to being applied after implementation. This is important because evaluation with most traditional methods can be done only after the interface has been built, when changes are more costly [Nielsen 1993].

It is important to note that we consider automated techniques to be a useful complement and addition to standard evaluation techniques such as heuristic evaluation and user testing - not a substitute. Different techniques uncover different kinds of problems [Jeffries et al. 1991, Molich et al. 1998, Molich et al. 1999], and subjective measures such as user satisfaction are unlikely to be determined in an automated manner.

Despite the potential advantages, the space of automated usability evaluation is only spottily explored. In this article we discuss the state of the art in automated usability evaluation. Section 2 presents a taxonomy for classifying UE automation, and Section 3 provides an overview of the application of the classification to 108 usability methods. Sections 4-8 describe these methods in more detail, including assessments of the automated methods. The results of this survey suggest promising ways to expand existing methods to better support automated usability evaluation.

Taxonomy of Automated Usability Evaluation

 

In this discussion, we make a distinction between WIMP (Windows, Icons, Pointer, and Mouse) interfaces and Web interfaces, in part because the nature of these interfaces differs and in part because the usability methods discussed have often only been applied to one type or the other. WIMP interfaces tend to be more functional than Web interfaces; users complete tasks, such as opening or saving a file, by following specific sequences of operations. Although there are some functional Web applications, most Web interfaces offer limited functionality (i.e., selecting links or completing forms) and serve primarily to provide information. In fact, the navigation structure over site information is a major component of the interface.

Several surveys of UE methods for WIMP interfaces exist; Hom and Zhang provide detailed discussions of inspection, inquiry and testing methods. Several taxonomies of UE methods have also been proposed. The most commonly used taxonomy is one that distinguishes between predictive (e.g., GOMS analysis and cognitive walkthrough) and experimental (e.g., user testing) techniques [Coutaz 1994]. Whitefield et al. [1991] present another classification scheme based on the presence or absence of a user and a computer. Neither of these taxonomies reflects the automation aspects of UE methods.

The sole existing survey of automated usability evaluation, by Balbo [Balbo 1995], uses a taxonomy that distinguishes among four levels of automation: non-automatic, automatic capture, automatic analysis and automatic critique.

Balbo uses these categories to classify 13 common and uncommon UE methods. However, most of the methods surveyed require extensive human effort, because they rely on formal user testing and/or require extensive evaluator interaction. For example, Balbo classifies several techniques for processing log files as automatic analysis methods despite the fact that these approaches require formal testing or informal use to generate the log files. What Balbo calls an automatic critique method may require the evaluator to create a complex UI model as input. Thus, this classification scheme is somewhat misleading, since it ignores the non-automated requirements of the UE methods.

We expand this taxonomy to include consideration of a method's non-automated testing requirements, both in terms of users and evaluators. We augment each of Balbo's automation categories with an attribute called testing level, which indicates the human testing effort required for execution of the method: minimal (no interface usage or model is required), model (a model of the interface or its users must be developed), informal (informal interface use is required) or formal (formal user testing is required).

Finally, we group existing UE methods into five general classes: testing, inquiry, inspection, analytical modeling and simulation.

Both testing and inspection are formative (i.e., they identify specific usability problems) unlike inquiry methods, which are summative (i.e., they provide general assessments of usability). Analytical modeling and simulation are engineering approaches to UE that enable evaluators to predict usability with user and interface models. Software engineering practices have had a major influence on the first three classes, while the latter two, analytical modeling and simulation, are quite similar to performance evaluation techniques used to analyze the performance of computer systems [Ivory and HearstIvory and Hearst1999, JainJain1991]. Table 1 maps these method classes into automation and testing combinations.

Table 1: Methods associated with combinations of automation type and testing level.

In summary, the taxonomy consists of: a UE method type (testing, inquiry, inspection, analytical modeling and simulation); an automation type (none, capture, analysis and critique); and a testing level (minimal, informal, model and formal). In the remainder of this article, we use this taxonomy to analyze UE methods.

Summary of Automated Usability Evaluation Methods

 

We surveyed 58 UE methods applied to WIMP interfaces, and 50 methods applied to Web UIs. Of these 108 methods, only 31 apply to both Web and WIMP UIs. Table 2 combines survey results for both types of interfaces, showing automation type and testing level. For some methods we discuss more than one approach; hence, we show the number of methods surveyed in parentheses beside the testing level. There are major differences in automation among the five types of methods. Overall, automation patterns are similar for WIMP and Web interfaces, with the exception that analytical modeling and simulation methods are far less explored in the Web domain than for WIMP interfaces (1 vs. 15 methods). Appendix A shows the information in Table 2 separated by UI type.

Table 2: Automation characteristics of WIMP and Web UE methods. A number in parentheses indicates the number of methods surveyed for a particular method and automation type. The testing level for each method is represented as: minimal (blank), formal (F), informal (I) and model (M). The * for the FIM entry indicates that either formal or informal testing is required. In addition, a model may be used in the analysis.

Table 2 shows that AUE in general is greatly underexplored. Non-automatic methods represent 65% of the methods surveyed, while automated methods collectively represent only 35%. Of this 35%, automatic capture methods represent 18%, automatic analysis methods represent 15% and automatic critique methods represent 2%. All but two of the automatic capture and log file analysis methods require some level of testing; the two exceptions, genetic algorithms and information scent modeling, employ simulation to generate usage data. Hence, only 20% of the automated methods can be employed without formal testing or informal use.

To be fully automated, an AUE method would provide the highest level of automation (i.e., critique) and require no user testing. Our survey found that this level of automation has been accomplished using only one method: guideline reviews [Lowgren and Nordqvist 1992]. Operationalized guidelines automatically detect and report usability violations and then make suggestions for fixing them (discussed further in Section 5).

Of those methods that support the next level of automation (i.e., analysis), Table 2 shows that analytical modeling and simulation methods represent the majority. These methods support automatic analysis without requiring formal testing or informal use. Most of these methods embed analysis within the design phase of UI development, as opposed to employment after development.

The next sections discuss the various UE types and their automation levels in more detail. Most methods are applicable to both WIMP and Web interfaces; however, we make distinctions where necessary about a method's applicability. We also present our assessments of automatic capture, analysis and critique techniques using four criteria: effectiveness, ease of use, effort to learn, and applicability.

We discuss these criteria for the automated methods within each class of techniques.

User Testing Methods

 

Usability testing with real participants is one of the most fundamental usability evaluation methods [NielsenNielsen1993]. It provides an evaluator with direct information about how people use computers and what some of the problems are with the interface being tested. During usability testing, participants use the system or a prototype to complete a pre-determined set of tasks while the tester records the results of the participants' work. The tester then uses these results to determine how well the interface supports users' task completion as well as other measures, such as number of errors and task completion time.

Automation has been used predominantly in two ways within user testing: automatic capture of use data and automatic analysis of this data according to some metrics or a model (referred to as log file analysis in Table 2). In rare cases methods support both automatically capturing and analyzing usage data [Macleod and Rengger 1993].

Automatic Capture Methods

Many usability testing methods require the recording of the actions a user makes while exercising an interface. This can be done by an evaluator taking notes while the participant uses the system, either live or by repeatedly viewing a videotape of the session, a time-consuming activity. As an alternative, automatic capture techniques can log user activity automatically. An important distinction can be made between information that is easy to record but difficult to interpret (e.g., keystrokes) and information that is meaningful but difficult to automatically label, such as task completion.

Within the testing category of UE, automatic capture of usage data is supported by two methods: performance measurement and remote testing. Both require the instrumentation of a user interface.

Performance measurement techniques automatically record time stamps along with usage data, thus enabling the evaluator to consider quantitative usage data during analysis. Without automation, the evaluator must use a timer to record timing information in addition to usage observations, leading to less accurate measurements.

Most video recording and event logging tools record time stamps along with usage data [Hammontree et al. 1992, Bevan and Macleod 1994, Macleod and Rengger 1993]. Some video recording tools (e.g., [Hammontree et al. 1992]) record events at the keystroke level. Recording data at this level produces voluminous log files and makes it difficult to map recorded usage into high-level tasks. As an alternative, the European MUSiC (Metrics for Usability Standards in Computing) project developed DRUM [Macleod and Rengger 1993, Bevan and Macleod 1994] to allow evaluators to specify events (e.g., tasks or feature usage) to log during user testing. DRUM synchronizes the occurrence of events within a user interface with videotaped footage, thus speeding up video analysis. (DRUM also generates performance measures from logged events; see Section 4.2.)

Remote testing is a method that enables testing when the tester and participant are not co-located. In this case the tester is not able to observe the testing process directly, but can gather data about the process over a computer network. Same-time different-place and different-time different-place are the two major remote testing approaches [Hartson et al. 1996]. In same-time different-place or remote-control testing, the tester observes the participant's screen through network transmissions (e.g., using PC Anywhere or Timbuktu) and may be able to hear what the participant says during the test via a speaker telephone or the computer.

Journaled sessions [Nielsen 1993] are a form of different-time different-place testing in which software guides the participant through a testing session and logs the results. Evaluators can use this approach with prototypes to get feedback early in the design process, as well as with released products. In the early stages, evaluators distribute disks containing a prototype of a software product and embedded code for recording users' actions. Users experiment with the prototype and return the disks to evaluators upon completion. It is also possible to embed dialog boxes within the prototype in order to record user comments or observations during usage. For released products, evaluators use this method to capture statistics about the frequency with which the user has used a feature or the occurrence of events of interest (e.g., error messages). This information is valuable for optimizing frequently-used features and the overall usability of future releases.

Remote testing approaches allow for wider testing than traditional methods; however, most techniques have restrictions on the types of UIs to which they can be applied. This is mainly determined by the underlying hardware (e.g., PC Anywhere only operates on PC platforms). Evaluators may also experience technical difficulties with hardware and/or software components, especially for same-time different-place testing.

The Web enables remote testing and performance measurement on a much larger scale than is economically feasible with WIMP interfaces. Similar to journaled sessions, Web servers maintain usage logs and automatically generate a log file entry for each request. These entries include the IP address of the requester, request time, name of the requested Web page, and in some cases the URL of the referring page (i.e., where the user came from). Server logs track only unique navigational events, since they cannot record user interactions that occur on the client side only (e.g., use of within-page anchor links, the back button or cached pages). Furthermore, the validity of the data is questionable due to caching by proxy servers and browsers [Etgen and Cantor 1999, Scholtz and Laskowski 1998]. Client-side logging captures more accurate, comprehensive usage data than server-side logs because it enables all browser events to be recorded. However, it requires every Web page to be instrumented to log usage data, use of an instrumented browser, or use of a special proxy server.
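
To make the limitations of such entries concrete, the sketch below (our own illustration, not part of any tool surveyed) parses a server log line in the widely used common log format with a referrer field; note that nothing in the record identifies the user's task or any client-side activity.

import re
from datetime import datetime

# Typical "combined" access log line: IP, timestamp, request, status, size, referrer.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) \S+" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referrer>[^"]*)")?'
)

def parse_entry(line):
    """Extract the requester's IP address, request time, requested page and referrer."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    return fields

# Hypothetical entry; real logs vary with server configuration.
sample = ('192.0.2.1 - - [10/Oct/1999:13:55:36 -0700] '
          '"GET /products.html HTTP/1.0" 200 2326 "http://www.example.com/index.html"')
print(parse_entry(sample))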

The NIST WebMetrics tool suite [Scholtz and LaskowskiScholtz and Laskowski1998] supports remote testing of a Web site. WebVIP (Visual Instrumentor Program) is a visual tool that enables the evaluator to add event handling code to Web pages. This code automatically records the page identifier and a time stamp in an ASCII file every time a user selects a link. Using this client-side data, the evaluator can accurately measure time spent on tasks or particular pages as well as study use of the back button and user paths. Despite its advantages over server-side logging, WebVIP requires the evaluator to make a copy of an entire Web site, which could lead to invalid path specifications and difficulties getting the copied site to function properly. The evaluator must also add logging code to each individual link on a page. Since WebVIP only collects data on selected HTML links, it does not record interactions with other Web objects, such as forms. It also does not record usage of external or non-instrumented links.

Similar to WebVIP, the Web Event-logging Tool (WET) [Etgen and Cantor 1999] supports the capture of client-side data, including clicks on Web objects, window resizing, typing in a form object and form resetting. WET interacts with Microsoft Internet Explorer and Netscape Navigator to record browser event information, including the type of event, a time stamp, and the document-window location. This gives the evaluator a more complete view of the user's interaction with a Web interface than WebVIP. WET does not require as much effort to employ as WebVIP, nor does it suffer from the same limitations. To use this tool, the evaluator specifies events (e.g., clicks, changes, loads and mouseovers) and event handling functions in a text file on the Web server; sample files are available to simplify this step. The evaluator must also add a single call to the text file within the HEAD tag of each Web page to be logged. Currently, the log file analysis for both WebVIP and WET is manual. Future work has been proposed to automate this analysis.

The NIST WebMetrics tool suite also includes WebCAT (Category Analysis Tool), which aids in Web site category analysis using a technique sometimes known as card sorting [Nielsen 1993]. In non-automated card sorting, the evaluator (or a team of evaluators) writes concepts on pieces of paper, and users group the topics into piles. The evaluator manually analyzes these groupings to determine a good category structure. WebCAT allows the evaluator to test proposed topic categories for a site via a category matching task; this task can be completed remotely by users. Results are compared to the designer's category structure, and the evaluator can use the analysis to inform the best information organization for a site. WebCAT enables wider testing and faster analysis, and helps make the technique scale for a large number of topic categories.

Automatic capture methods represent important first steps toward informing UI improvements - they provide input data for analysis and, in the case of remote testing, enable the evaluator to collect data from a larger number of users than traditional methods. Without this automation, evaluators would have to manually record usage data, expend considerable time reviewing videotaped testing sessions or, in the case of the Web, rely on questionable server logs. Methods such as DRUM and WET capture high-level events that correspond to specific tasks or UI features. DRUM also supports automated analysis of captured data, discussed below.

It is difficult to assess the ease of use and learning of these approaches, especially DRUM and remote testing approaches that require integration of hardware and software components, such as video recorders and logging software. For Web site logging, WET appears to be easier to use and learn than WebVIP. It requires the creation of an event handling file and the addition of a small block of code in each Web page header, while WebVIP requires the evaluator to add code to every link on all Web pages. WET also enables the evaluator to capture more comprehensive usage data than WebVIP. WebCAT appears straightforward to use and learn for topic category analysis.

Only the remote testing techniques have restrictions on the types of UIs to which they can be applied. This is mainly determined by the underlying hardware (e.g., PC Anywhere only operates on PC platforms).

Automatic Analysis Methods

 

Log file analysis methods support automatic analysis of data captured during formal testing or informal use. Since Web servers automatically log client requests, log file analysis is a heavily used methodology for evaluating Web interfaces [Drott 1998, Fuller and de Graaff 1996, Hochheiser and Shneiderman 1999, Sullivan 1997]. Our survey reveals five general approaches for analyzing WIMP and Web log files: metric-based, pattern-matching, task-based, task-based pattern-matching, and inferential.

Metric-based Analysis of Log Files. Metric-based approaches generate quantitative performance measurements. Two examples for WIMP interfaces are DRUM and the MIKE UIMS (User Interface Management System) [Olsen, Jr. and Halversen 1988]. DRUM captures usage data and derives the following measurements: task effectiveness (i.e., how correctly and completely tasks are completed), user efficiency (i.e., effectiveness divided by task completion time), productive period (i.e., the portion of time the user did not have problems) and learnability (i.e., a comparison of the user's and an expert user's efficiency for a task). DRUM also detects critical incidents specified by the evaluator, such as errors during task completion.
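
The metric definitions above reduce to simple ratios. The following sketch is our own minimal formulation of those ratios for illustration only; it is not DRUM's implementation, whose precise definitions are not reproduced here.

def task_effectiveness(quantity: float, quality: float) -> float:
    """Effectiveness as the product of how completely (quantity) and how
    correctly (quality) a task was done, each expressed as a fraction 0..1."""
    return quantity * quality

def user_efficiency(effectiveness: float, task_time_s: float) -> float:
    """Effectiveness achieved per unit of task time."""
    return effectiveness / task_time_s

def productive_period(task_time_s: float, problem_time_s: float) -> float:
    """Portion of the session in which the user was not having problems."""
    return (task_time_s - problem_time_s) / task_time_s

def relative_learnability(user_eff: float, expert_eff: float) -> float:
    """User efficiency relative to an expert performing the same task."""
    return user_eff / expert_eff

# Hypothetical example: a participant completes 90% of a task correctly in
# 300 s, 60 s of which were spent recovering from problems; an expert needs 120 s.
eff = task_effectiveness(0.9, 1.0)
print(user_efficiency(eff, 300.0))
print(productive_period(300.0, 60.0))
print(relative_learnability(user_efficiency(eff, 300.0), user_efficiency(1.0, 120.0)))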

The MIKE UIMS enables an evaluator to assess the usability of a UI specified as a model that can be rapidly changed and compiled into a functional UI. MIKE captures usage data and generates a number of general, physical, logical and visual metrics, including performance time, command frequency, the number of physical operations required to complete a task, and required changes in the user's focus of attention on the screen. MIKE also calculates these metrics separately for command selection (e.g., traversing a menu, typing a command name or hitting a function button) and command specification (e.g., entering arguments for a command) to help the evaluator locate specific problems within the UI.

For the Web, site analysis tools developed by Service Metrics [Service Metrics 1999] and others [Bacheldor 1999] allow evaluators to pinpoint performance bottlenecks, such as slow server response time, that may negatively impact the usability of a Web site. Service Metrics' tools can collect such measures from multiple geographical locations under various access conditions. These approaches focus on server and network performance, but provide little insight into the usability of the Web site itself.

Pattern-Matching Analysis of Log Files. Pattern-matching approaches, such as MRP (Maximum Repeating Pattern) [Siochi and Hix 1991], analyze user behavior captured in logs. MRP detects and reports repeated user actions (e.g., consecutive invocations of the same command and errors) that may indicate usability problems. Studies with MRP showed the technique to be useful for detecting problems with expert users, but additional data prefiltering was required for detecting problems with novice users. Whether the evaluator performed this prefiltering or it was automated is unclear in the literature.
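
A drastically simplified version of this idea - flagging runs of consecutively repeated actions in an action log - is sketched below; MRP itself detects more general maximal repeating patterns, which are not reproduced here.

from itertools import groupby

def repeated_action_runs(actions, min_repeats=3):
    """Report actions that occur in consecutive runs of at least min_repeats,
    a possible sign of confusion or error recovery."""
    runs = []
    index = 0
    for action, group in groupby(actions):
        count = sum(1 for _ in group)
        if count >= min_repeats:
            runs.append((action, count, index))
        index += count
    return runs

# Hypothetical action log.
log = ["open", "scroll", "scroll", "undo", "undo", "undo", "undo", "save"]
for action, count, start in repeated_action_runs(log):
    print(f"'{action}' repeated {count} times starting at event {start}")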

Task-based Analysis of Log Files. Task-based approaches analyze discrepancies between the designer's task model and actual use. For example, the IBOT system [Zettlemoyer et al. 1999] automatically analyzes log files to detect task completion events. The IBOT system interacts with Windows operating systems to capture low-level window events (e.g., keyboard and mouse actions) and screen buffer information (i.e., a screen image that can be processed to automatically identify widgets). The system then combines this information into higher-level abstractions (e.g., menu select and menubar search operations). Evaluators can use the system to compare user and designer behavior on high-level tasks and to recognize patterns of inefficient or incorrect behaviors during task completion. Without such a tool, the evaluator has to study the log files and do the comparison manually. Future work has been proposed to support automated critique.

The QUIP (Quantitative User Interface Profiling) tool [Helfrich and Landay 1999] provides one of the most advanced approaches to task-based log file analysis and visualization for Java-based UIs. Unlike other approaches, QUIP aggregates traces of multiple user interactions and compares the task flows of these users to the designer's task flow (see the diagonal shading in Figure 1). QUIP encodes quantitative time-based and trace-based information - such as the average time between actions (color of each link) and the proportion of users who performed a particular sequence of actions (width of each link) - into directed graphs. Currently, the evaluator must analyze the graphs to identify usability problems. The evaluator must also instrument the UI to collect the necessary usage data.

Figure 1: QUIP usage profile contrasting the task flows of 2 users with the designer's task flow (diagonal shading). Each node represents a user action, and directed arcs indicate actions taken by users. The width of each arc denotes the fraction of users completing the action, while the color of each arc reflects the average time between actions (darker colors correspond to longer times).
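
The aggregation step behind such a profile can be sketched as follows. This is our own illustration of the general idea (edges annotated with the fraction of users and the average transition time), not QUIP's implementation, and the trace data are hypothetical.

from collections import defaultdict

def aggregate_traces(traces):
    """Aggregate multiple users' (action, timestamp) traces into a directed
    graph keyed by (from_action, to_action). Each edge stores the fraction of
    users taking that transition (assuming each user takes it at most once)
    and the mean time between the two actions."""
    counts = defaultdict(int)
    total_time = defaultdict(float)
    for trace in traces:
        for (a, t1), (b, t2) in zip(trace, trace[1:]):
            counts[(a, b)] += 1
            total_time[(a, b)] += t2 - t1
    n_users = len(traces)
    return {edge: {"fraction_of_users": counts[edge] / n_users,
                   "mean_seconds": total_time[edge] / counts[edge]}
            for edge in counts}

# Two hypothetical traces for the same task (timestamps in seconds).
user_a = [("start", 0), ("open_dialog", 4), ("enter_name", 9), ("save", 15)]
user_b = [("start", 0), ("enter_name", 12), ("save", 20)]
for edge, stats in aggregate_traces([user_a, user_b]).items():
    print(edge, stats)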

Task-based Pattern-matching Analysis of Log Files. ÉMA (Automatic Analysis Mechanism for the Ergonomic Evaluation of User Interfaces) [Balbo 1996] and USINE (USer Interface Evaluator) [Lecerof and Paternó 1998] combine task-based and pattern-matching techniques.

ÉMA uses a manually-created data-flow task model and standard behavior heuristics to flag usage patterns that may indicate usability problems. ÉMA extends the MRP approach (repeated command execution) to detect additional patterns, including immediate task cancellation, shifts in direction during task completion, and discrepancies between task completion and the task model. ÉMA outputs results in an annotated log file, which the evaluator must inspect to identify usability problems. Application of this technique to the evaluation of ATM (Automated Teller Machine) usage identified problems that corresponded with those found using standard heuristic evaluations.

USINE [Lecerof and Paternó 1998] employs the ConcurTaskTrees notation [Paternó et al. 1997] to express temporal relationships among UI tasks (e.g., enabling, disabling, and synchronization). Using this additional information, USINE looks for precondition errors (i.e., task sequences that violate temporal relationships) and also reports quantitative metrics (e.g., task completion time) and information about task patterns, missing tasks and user preferences reflected in the usage data. Studies with a graphical interface showed that USINE's results correspond with empirical observations and highlight the source of some usability problems. To use the system, evaluators must create task models using the ConcurTaskTrees editor as well as a table specifying mappings between log entries and the task model. USINE processes log files and outputs detailed reports and graphs to highlight usability problems.
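
The precondition-error check can be illustrated with a heavily simplified sketch in which each task is enabled by at most one predecessor; ConcurTaskTrees supports much richer temporal operators, and USINE's actual analysis is not reproduced here. All task names are hypothetical.

def precondition_errors(log, enabling):
    """Flag tasks performed before the task that enables them has occurred.
    enabling maps a task to the single task that must precede it (a drastic
    simplification of ConcurTaskTrees temporal relationships)."""
    performed = set()
    errors = []
    for step, task in enumerate(log):
        required = enabling.get(task)
        if required is not None and required not in performed:
            errors.append((step, task, required))
        performed.add(task)
    return errors

# 'submit' is enabled by 'fill_form'; 'fill_form' is enabled by 'open_form'.
rules = {"submit": "fill_form", "fill_form": "open_form"}
session = ["open_form", "submit", "fill_form", "submit"]
print(precondition_errors(session, rules))
# [(1, 'submit', 'fill_form')] -> the user tried to submit too early.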

Inferential Analysis of Log Files. Inferential analysis of Web log files includes both statistical and visualization techniques. Statistical approaches, such as traffic-based analysis (e.g., pages-per-visitor or visitors-per-page) and time-based analysis (e.g., click paths and page-view durations) [Drott 1998, Fuller and de Graaff 1996, Sullivan 1997, Theng and Marsden 1998], report a number of measures derived from usage data. Some methods require manual pre-processing or filtering of the logs before analysis. Furthermore, the evaluator must interpret reported measures in order to identify usability problems. This analysis is largely inconclusive for server logs, since they provide only a partial trace of user behavior and timing estimates may be skewed by network latencies. Server log files also lack valuable information about what tasks users want to accomplish [Byrne et al. 1999]. Nonetheless, inferential analysis techniques have been useful for improving usability and enable ongoing, cost-effective evaluation throughout the life of a site [Fuller and de Graaff 1996, Sullivan 1997].
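
As an illustration of what such traffic- and time-based measures involve (and of why the last page of a visit has no reliable duration), a minimal sketch over already-parsed log entries might look like this; the entries shown are hypothetical.

from collections import defaultdict

def traffic_and_time_measures(entries):
    """entries: list of (visitor_id, page, timestamp_seconds) tuples.
    Returns pages-per-visitor, visitors-per-page, and estimated page-view
    durations (time until the same visitor's next request; the final page of
    a visit gets no duration, a known weakness of server-log analysis)."""
    by_visitor = defaultdict(list)
    visitors_per_page = defaultdict(set)
    for visitor, page, ts in entries:
        by_visitor[visitor].append((ts, page))
        visitors_per_page[page].add(visitor)
    pages_per_visitor = {v: len(reqs) for v, reqs in by_visitor.items()}
    durations = []
    for visitor, reqs in by_visitor.items():
        reqs.sort()
        for (t1, page), (t2, _) in zip(reqs, reqs[1:]):
            durations.append((visitor, page, t2 - t1))
    return pages_per_visitor, {p: len(v) for p, v in visitors_per_page.items()}, durations

entries = [("u1", "/index.html", 0), ("u1", "/products.html", 40),
           ("u2", "/index.html", 5), ("u1", "/order.html", 95)]
print(traffic_and_time_measures(entries))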

Visualization is also used for inferential analysis of Web and WIMP log files [Guzdial et al. 1994, Hochheiser and Shneiderman 1999]. Starfield visualization [Hochheiser and Shneiderman 1999] is one approach that enables evaluators to interactively explore server log data in order to gain an understanding of human factors issues related to visitation patterns. This approach combines the simultaneous display of a large number of individual data points (e.g., URLs requested versus time of requests) in an interface that supports zooming, filtering and dynamic querying [Ahlberg and Shneiderman 1994]. Visualizations provide a high-level view of usage patterns (e.g., usage frequency, correlated references, bandwidth usage, HTTP errors and patterns of repeated visits over time) that the evaluator must explore to identify usability problems. As such, it would be beneficial to employ a statistical inferential approach, such as time-based log file analysis, prior to exploring visualizations.

Dome Tree visualization [Chi et al. 2000] provides a more insightful representation of simulated (see Section 8) and actual Web usage captured in log files. This approach maps a Web site into a three-dimensional surface representing the hyperlinks (see the top part of Figure 2). The locations of links on the surface are determined by a combination of content similarity, link usage and the link structure of Web pages. The visualization highlights the most commonly traversed subpaths. An evaluator can explore these usage paths to gain insight about the information "scent" (i.e., common topics among Web pages on the path), as depicted in the bottom window of Figure 2. This additional information helps the evaluator infer what the information needs of site users are and, more importantly, whether the site satisfies those needs. The Dome Tree visualization also reports a crude path traversal time based on the sizes of pages (i.e., number of bytes in HTML and image files) along the path. As is the case for Starfield visualization, it would be beneficial to employ a statistical inferential approach prior to site exploration with this approach.

Figure 2: Dome Tree visualization of a Web site with a usage path displayed. The bottom part of the figure displays information about the usage path, including an estimated navigation time and information scent (i.e., common keywords along the path).

Although the log file analysis techniques vary widely on the four assessment criteria, all approaches offer substantial benefits over the alternative - time-consuming, unaided analysis of potentially large amounts of raw data. Task-based and task-based pattern-matching techniques like USINE may be the most effective (i.e., they provide clear insight for improving usability via task analysis); however, they require additional effort and learning time over simpler pattern-matching approaches, mainly for the development of task models. Although easier to use and learn, pattern-matching approaches only detect problems for pre-specified usage patterns. Metric-based approaches in the WIMP domain have been effective at associating measurements with specific interface aspects (e.g., commands and tasks), which can then be used to identify usability problems. However, the evaluator must conduct more analysis than is required for task-based approaches that compare designer and user task flows. Metric-based techniques in the Web domain focus on server and network performance, which provides little usability insight. Similarly, inferential analysis of Web server logs provides inconclusive usability information.

Most of the techniques surveyed could be applied to WIMP and Web UIs beyond those on which they have been demonstrated, with the exception of the MIKE UIMS, which requires the WIMP UI to be developed within a special environment.

Inspection Methods

 

A large number of detailed usability guidelines have been developed for interface design [Open Software Foundation 1991, Smith and Mosier 1986]. A usability inspection is an informal evaluation methodology whereby an evaluator examines the usability aspects of a UI design with respect to its conformance to a set of guidelines. Unlike other UE methods, inspections rely solely on the evaluator's judgment as a source of evaluation feedback. Common non-automated inspection techniques are heuristic evaluation [Nielsen 1993] and cognitive walkthroughs [Lewis et al. 1990].

Automation has been used predominantly within the inspection class to check guideline conformance. Operationalized guidelines automatically detect and report usability violations and in some cases make suggestions for fixing them.

Automatic Capture Methods

During a cognitive walkthrough, an evaluator attempts to simulate a user's problem-solving process while examining UI tasks. At each step of a task, the evaluator assesses whether a user would succeed or fail to complete the step. Hence, the evaluator produces extensive documentation during this analysis. There was an early attempt to "automate" cognitive walkthroughs by prompting evaluators with walkthrough questions and enabling evaluators to record their analyses in HyperCard [Rieman et al. 1991]. Unfortunately, evaluators found this approach too cumbersome and time-consuming to employ.

Automatic Analysis Methods

Several automatic tools use guidelines to evaluate the layout of graphical screens. Parush et al. [1998] developed and validated a tool for computing the complexity of Visual Basic dialog boxes. It considers changes in the size of screen elements, the alignment and grouping of elements, and the utilization of screen space in its calculations. User studies demonstrated that the tool's results can be used to decrease screen search time and ultimately to improve screen layout. AIDE (semi-Automated Interface Designer and Evaluator) [Sears 1995] is a more advanced tool that helps designers assess and compare different design options using quantitative task-sensitive and task-independent metrics, including efficiency (i.e., distance of cursor movement), vertical and horizontal alignment of elements, horizontal and vertical balance, and designer-specified constraints (e.g., position of elements). AIDE also employs an optimization algorithm to automatically generate initial UI layouts. Studies with AIDE showed that it provides valuable support for analyzing the efficiency of a UI and incorporating task information into designs.
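
To give a flavor of such layout metrics, the sketch below computes a crude left-edge alignment score and a horizontal balance ratio for widgets described as rectangles. These formulations are our own illustrations and do not reproduce AIDE's or Parush et al.'s actual measures.

def alignment_score(widgets):
    """Fraction of widgets whose left edge lines up with another widget's
    left edge -- a crude stand-in for the alignment measures that layout
    analysis tools compute. Widgets are (x, y, width, height) tuples."""
    lefts = [x for x, y, w, h in widgets]
    aligned = sum(1 for x in lefts if lefts.count(x) > 1)
    return aligned / len(widgets)

def horizontal_balance(widgets, screen_width):
    """Ratio of widget area left vs. right of the screen's vertical midline
    (1.0 = perfectly balanced)."""
    mid = screen_width / 2
    left = sum(min(max(mid - x, 0), w) * h for x, y, w, h in widgets)
    right = sum(min(max(x + w - mid, 0), w) * h for x, y, w, h in widgets)
    return min(left, right) / max(left, right) if max(left, right) else 1.0

# Hypothetical dialog with three widgets on a 240-pixel-wide screen.
dialog = [(10, 10, 100, 20), (10, 40, 100, 20), (150, 10, 60, 20)]
print(alignment_score(dialog), horizontal_balance(dialog, 240))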

Without automated analysis tools, designers must simultaneously consider task information and design principles during an inspection. However, designers have historically experienced difficulties following design guidelines [Borges et al. 1996, Lowgren and Nordqvist 1992]. One study has also demonstrated that designers are biased towards aesthetically pleasing interfaces, regardless of efficiency [Sears 1995]. Screen layout tools, especially validated ones, assist designers with objective evaluation of WIMP UIs. Such tools have been effective at identifying visual problems (e.g., inefficient screen usage, misaligned elements, or size imbalance among elements), but they cannot detect logic and semantic problems that arise during usage. Although such tools appear easy to use and learn, their application is dependent on the development platform employed.

Two automatic analysis tools use guidelines for Web site usability checks. The Web Static Analyzer Tool (SAT) [Scholtz and Laskowski 1998], part of the NIST WebMetrics suite of tools, assesses static HTML according to a number of guidelines, such as whether all graphics contain ALT tags and the average number of words in link text. Future plans for this tool include adding the ability to inspect the entire site more holistically in order to identify potential problems in interactions between pages. Bobby [CAST 2000] is another HTML analysis tool that checks Web pages for their accessibility to people with disabilities. Conforming to the guidelines embedded in these tools can potentially eliminate usability problems that arise due to poor HTML syntax (e.g., missing page elements). However, some problems, such as adequate color contrast for color-blind users or functional scripts, are difficult to detect automatically.
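
Checks of this kind are straightforward to express with an HTML parser. The sketch below tests two of the guidelines mentioned above - IMG elements lacking ALT text and the average number of words in link text - and is our own illustration rather than SAT's or Bobby's implementation; the sample page is hypothetical.

from html.parser import HTMLParser

class GuidelineChecker(HTMLParser):
    """Count IMG tags without an ALT attribute and record words per link."""
    def __init__(self):
        super().__init__()
        self.images_without_alt = 0
        self.link_word_counts = []
        self._in_link = False
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img" and "alt" not in dict(attrs):
            self.images_without_alt += 1
        elif tag == "a":
            self._in_link, self._link_words = True, 0

    def handle_data(self, data):
        if self._in_link:
            self._link_words += len(data.split())

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.link_word_counts.append(self._link_words)
            self._in_link = False

page = '<a href="/download">click here</a> <img src="logo.gif"> <a href="/help">help</a>'
checker = GuidelineChecker()
checker.feed(page)
print("images without ALT:", checker.images_without_alt)
print("average words per link:",
      sum(checker.link_word_counts) / len(checker.link_word_counts))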

A similar automated analysis tool, The Rating Game [Stein 1997], attempts to measure the quality of a set of pages using a set of easily measurable features. These include an information feature (word to link ratio), a graphics feature (number of graphics on a page), a gadgets feature (number of applets, controls and scripts on a page), and so on.

Two authoring tools from Middlesex University, HyperAT [Theng and Marsden 1998] and Gentler [Thimbleby 1997], perform a similar structural analysis at the site level. The goal of the Hypertext Authoring Tool (HyperAT) is to support the creation of well-structured hyperdocuments. It provides a structural analysis which focuses on verifying that the breadths and depths within a page and at the site level fall within thresholds. (HyperAT also supports inferential analysis of server log files similar to other log file analysis techniques; see Section 4.2.) Gentler [Thimbleby 1997] provides similar structural analysis but focuses on maintenance of existing sites rather than design of new ones.
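
The underlying structural measures are easy to state: breadth is the number of links leaving a page, and depth is the number of clicks needed to reach a page from the home page. A minimal sketch (our own, with illustrative rather than validated thresholds) follows.

from collections import deque

def breadth_and_depth(links, home):
    """links: dict mapping each page to the pages it links to.
    Returns per-page breadth (out-degree) and depth (clicks from the home page)."""
    breadth = {page: len(targets) for page, targets in links.items()}
    depth, queue = {home: 0}, deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return breadth, depth

# Hypothetical site structure.
site = {"home": ["products", "about"], "products": ["item1", "item2"],
        "about": [], "item1": [], "item2": []}
breadth, depth = breadth_and_depth(site, "home")
# Flag pages outside evaluator-chosen thresholds (illustrative values only).
print([p for p, b in breadth.items() if b > 10], [p for p, d in depth.items() if d > 3])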

The Rating Game, HyperAT and Gentler compute and report a number of statistics about a page (e.g., number of links, graphics and words). However, the effectiveness of these structural analyses is questionable, since the thresholds have not been empirically validated. Although there have been some investigations into breadth and depth tradeoffs for the Web [Larson and Czerwinski 1998, Zaphiris and Mtei 1997], general thresholds still remain to be established. These approaches are easy to use and learn, and they apply to all Web UIs.

Automatic Critique Methods

Critiques give designers clear directions for conforming to violated guidelines and consequently improving usability. As mentioned above, following guidelines has historically been problematic, especially for a large number of guidelines. Automatic critique approaches, especially ones that modify a UI, provide the highest level of support for adhering to guidelines.

The KRI/AG tool (Knowledge-based Review of user Interface) [Lowgren and Nordqvist 1992] is an automatic critique system that checks the guideline conformance of X Window UI designs created using the TeleUSE UIMS (User Interface Management System) [Lee 1997]. KRI/AG contains a knowledge base of guidelines and style guides, including the Smith and Mosier guidelines [Smith and Mosier 1986] and the Motif style guides [Open Software Foundation 1991]. It uses this information to automatically critique a UI design and generate comments about possible flaws in the design. SYNOP [Balbo 1995] is a similar automatic critique system that performs a rule-based (i.e., expert system) critique of a control system application. SYNOP also modifies the UI model based on its evaluation.

Both of these approaches are highly effective at informing UI improvements for those guidelines that can be operationalized. These include checking for the existence of labels for text fields, listing menu options in alphabetical order, and setting default values for input fields. However, they cannot assess UI aspects that cannot be operationalized, such as whether the labels used on elements will be understood by users. Another drawback of these approaches is that they require considerable modeling and learning effort. They also suffer from limited applicability.

Inquiry Methods

 

Similar to user testing approaches, inquiry methods require feedback from users and are often employed during user testing. The focus, however, is not on studying specific tasks or measuring performance. Rather, the goal of these methods is to gather subjective impressions (i.e., preferences or opinions) about various aspects of a UI. Evaluators usually employ inquiry methods, such as surveys, questionnaires, and interviews, to gather supplementary data after a system is released. Inquiry methods are summative in nature and provide feedback on the overall quality of the interface, such as whether users like it, generally have problems with it, and so on. This is useful information for improving the interface for future releases. These methods vary based on whether the evaluator interacts with a user or a group of users, and on whether users report their experiences using questionnaires, surveys or usage logs, possibly in conjunction with screen snapshots.

Automation has been used predominantly for automatically capturing usage data during formal testing or informal use.

Interactive surveys and questionnaires can be embedded into a user interface to semi-automate the usage capture process. The Web inherently facilitates automatic capture of survey and questionnaire data using forms. These approaches enable the evaluator to collect subjective usability data and possibly make improvements throughout the life of an interface.

As previously discussed, automatic capture methods represent an important first step toward informing UI improvements. Automated inquiry methods make it possible to collect data quickly from a larger number of users than is typically possible with non-automated methods. However, automated inquiry methods suffer from the same limitation as non-automated approaches - they may not clearly indicate usability problems due to the subjective nature of user responses. Furthermore, they do not enable automated analysis or critique of interfaces. The real value of these techniques is that they are easy to use and widely applicable.

Analytical Modeling Methods

 

Analytical modeling complements traditional evaluation techniques like user testing. Given some representation or model of the UI and/or the user, these methods inexpensively generate quantitative usability predictions. Automation has been used predominantly to analyze task completion (e.g., execution and learning time) within WIMP UIs and Web site structure (e.g., breadth and depth). Analytical modeling inherently supports automatic analysis. Our survey did not reveal any analytical modeling techniques that support automated critique. Most analytical modeling and simulation approaches for WIMP and Web UIs are based on the model human processor (MHP) proposed by Card, Moran, and Newell [1983]. GOMS analysis is one of the most widely accepted analytical modeling methods based on the MHP [John and Kieras 1996]. Other methods based on the MHP employ simulation and are discussed in the next section.

The GOMS family of analytical methods uses a task structure consisting of Goals, Operators, Methods and Selection rules. Using this task structure along with validated time parameters for each operator, the methods predict task execution and learning times for error-free expert performance. The four approaches in this family are the original GOMS method proposed by Card, Moran and Newell (CMN-GOMS) [Card et al. 1983], the simpler keystroke-level model (KLM), the natural GOMS language (NGOMSL) and the critical path method (CPM-GOMS) [John and Kieras 1996]. These approaches differ in the task granularity modeled (e.g., keystrokes or a high-level procedure) and in the support for alternative methods (i.e., selections) and multiple goals.
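
For the simplest member of the family, the keystroke-level model, the prediction is just the sum of validated operator times over the operator sequence for a task. The sketch below uses approximate, commonly cited operator values for illustration only; published values vary with users and devices, and the method sequence shown is hypothetical.

# Approximate, commonly cited KLM operator times in seconds (illustrative only).
OPERATOR_TIMES = {
    "K": 0.28,  # press a key or button
    "P": 1.10,  # point with a mouse to a target on the screen
    "H": 0.40,  # move hands between keyboard and mouse
    "M": 1.35,  # mental preparation
}

def klm_execution_time(operators):
    """Predict error-free expert execution time as the sum of operator times."""
    return sum(OPERATOR_TIMES[op] for op in operators)

# Hypothetical "save file via menu" method: think, point to File, click,
# point to Save, click.
print(klm_execution_time(["M", "P", "K", "P", "K"]))  # -> about 4.1 s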

Two of the major roadblocks to using GOMS have been the tedious task analysis and the need to calculate execution and learning times; traditionally, these must be specified and computed manually. USAGE (the UIDE System for semi-Automated GOMS Evaluation) [Byrne et al. 1994] and CRITIQUE (the Convenient, Rapid, Interactive Tool for Integrating Quick Usability Evaluations) [Hudson et al. 1999] are tools that address these limitations by automatically generating a task model and quantitative predictions for the model. Both of these tools accomplish this within a user interface development environment (UIDE). GLEAN (GOMS Language Evaluation and ANalysis) [Kieras et al. 1995] is another tool that generates quantitative predictions for a given GOMS task model (discussed in more detail in Section 8). These tools reduce the effort required to employ GOMS analysis and generate predictions that are consistent with models produced by experts. The major hindrance to their wide application is that they operate on limited platforms (e.g., Sun machines), model low-level goals (e.g., at the keystroke level for CRITIQUE), and do not support multiple ways of accomplishing tasks because they use an idealized expert user model.

Programmable user models (PUM) [Young et al. 1989] is an entirely different analytical modeling technique for automatic analysis. In this approach, the designer is required to write a program that acts like a user using the interface design; the designer must specify explicit sequences of operations for each task. These are executed within an architecture that imposes approximations of psychological constraints, such as limitations on the amount of information a user can be asked to remember. Difficulties experienced by the designer while programming the architecture can then be used to improve the UI. Once the designer successfully programs the architecture, the model can be executed to generate quantitative performance predictions similar to those of GOMS analysis. By making a designer aware of considerations and constraints affecting usability from the user's perspective, this approach provides clear insight into specific problems with a UI.

Analytical modeling approaches enable the evaluator to produce relatively inexpensive results to inform design choices. GOMS has been shown to be applicable to all types of UIs and effective at predicting usability problems. However, these predictions are limited to error-free expert performance. The development of USAGE and CRITIQUE has reduced the learning time and effort required to apply GOMS analysis, but these tools suffer from the limitations previously discussed. PUM, however, still requires considerable effort and learning time to employ, since it is a programming approach. Although it appears that this technique is applicable to all WIMP UIs, its effectiveness is not discussed in detail in the literature.

Analytical modeling of Web UIs lags far behind efforts for WIMP interfaces. Many Web authoring tools, such as Microsoft FrontPage and Macromedia Dreamweaver, provide limited support for usability evaluation in the design phase (e.g., predicting download time and checking HTML syntax). This addresses only a small fraction of usability problems. While potentially beneficial, our survey did not uncover any analytical modeling techniques to address this gap in Web site evaluation. Approaches like GOMS analysis will not map as well to the Web domain, because it is difficult to predict how a user will accomplish the goals in a task hierarchy given that there are many different ways to navigate a typical site. Another problem is GOMS' reliance on an expert user model, which does not fit the diverse user community of the Web. Hence, new analytical modeling approaches are required to evaluate the usability of Web sites.

Simulation Methods

 

Simulation complements traditional UE methods and, like analytical modeling, inherently supports automatic analysis. Using models of the user and/or the UI, these approaches simulate the user interacting with the interface and report the results of this interaction. Simulation is also used to automatically generate usage data for analysis with log file analysis techniques [Chi et al. 2000] or for event playback in a UI [Kasik and George 1996]. Hence, simulation also supports automatic capture. Evaluators can run simulations with different parameters in order to study various UI design tradeoffs and thus make more informed decisions about UI implementation.

Automatic Capture

Kasik and George [1996] developed an automatic capture technique for driving replay tools (i.e., tools that execute a log file) for Motif-based UIs. The goal of this work is to use a small number of input parameters to inexpensively generate a large number of usage traces (or test scripts) that an evaluator can then use to find weak spots, failures and other usability problems during the design phase. The system enables a designer to generate an expert user trace and then insert deviation commands at different points within the trace. It uses a genetic algorithm to determine user behavior during deviation points, in effect simulating a novice user learning by experimentation. The genetic algorithm considers past history in generating future random numbers, which enables the emulation of user learning. Altering key features of the genetic algorithm enables the evaluator to simulate other user models. Although currently not supported by this tool, traditional random number generation could also be employed to explore the outer limits of a UI (i.e., completely random user behavior).
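
The underlying idea - an expert trace with designer-chosen deviation points at which alternative actions are substituted - can be sketched as below. For brevity the sketch substitutes plain random selection for the tool's genetic algorithm, so it does not emulate learning; all names and actions are hypothetical.

import random

def deviated_traces(expert_trace, deviation_points, alternatives, n_traces=5, seed=0):
    """Generate usage traces by replacing the action at each designer-chosen
    deviation point with a randomly selected alternative action. (The actual
    tool drives these choices with a genetic algorithm so that later choices
    depend on earlier ones; plain random selection is used here for brevity.)"""
    rng = random.Random(seed)
    traces = []
    for _ in range(n_traces):
        trace = list(expert_trace)
        for point in deviation_points:
            trace[point] = rng.choice(alternatives)
        traces.append(trace)
    return traces

expert = ["open_dialog", "enter_name", "press_ok"]
for trace in deviated_traces(expert, deviation_points=[1],
                             alternatives=["press_cancel", "press_help", "enter_name"]):
    print(trace)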

Without such an automated capture technique, the evaluator must anticipate all possible usage scenarios or rely on user testing or informal use to generate usage traces. Testing and informal use limit UI coverage to a small number of tasks or to UI features that are employed in regular use. Automated capture techniques, such as the genetic algorithm approach, enable the evaluator to produce a larger number of usage scenarios and widen UI coverage with minimal effort. This system appears to be relatively straightforward to use, since it interacts directly with a running application and does not require modeling. Interaction with the running application also ensures that generated usage traces are plausible. Experiments demonstrated that it is possible to generate a large number of usage traces within an hour. However, an evaluator must manually analyze the execution of each trace in order to identify problems. The authors propose future work to automatically verify that a trace produced the correct result. Currently, this tool is only applicable to Motif-based UIs.

Chi et al. [2000] developed a similar automatic capture approach for generating navigation paths for Web UIs. This approach creates a model of an existing site that embeds information about the similarity of content among pages, captured usage data, and linking structure. The evaluator specifies starting points in the site and information needs (i.e., target pages) as input to the simulator. The simulation models a number of agents (i.e., hypothetical users) traversing the links and content of the site model. At each page, the model considers information scent (i.e., common keywords between an agent's goal and the content on linked pages) in making navigation decisions. Navigation decisions are controlled probabilistically such that most agents traverse higher-scent links (i.e., the closest match to the information goal) and some agents traverse lower-scent links. Simulated agents stop when they reach the target pages or after an arbitrary amount of effort (e.g., a maximum number of links or browsing time). The simulator records navigation paths and reports the proportion of agents that reached target pages.
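
A toy version of this agent simulation is sketched below: agents choose outgoing links with probability proportional to a scent score and stop at the target page or after a step budget. The scent values are supplied by hand here; the published model derives them from keyword overlap, usage and link structure, which is not reproduced. Site and page names are hypothetical.

import random

def simulate_agents(links, scent, start, target, n_agents=1000, max_steps=10, seed=1):
    """links: page -> list of linked pages; scent[(page, link)]: how well the
    link matches the information goal (higher = better match). Returns the
    proportion of agents reaching the target and the recorded paths."""
    rng = random.Random(seed)
    reached, paths = 0, []
    for _ in range(n_agents):
        page, path = start, [start]
        for _ in range(max_steps):
            if page == target:
                reached += 1
                break
            choices = links.get(page, [])
            if not choices:
                break
            weights = [scent.get((page, nxt), 0.1) for nxt in choices]
            page = rng.choices(choices, weights=weights)[0]
            path.append(page)
        paths.append(path)
    return reached / n_agents, paths

site = {"home": ["products", "about"], "products": ["widget"], "about": []}
scent = {("home", "products"): 0.8, ("home", "about"): 0.2, ("products", "widget"): 0.9}
rate, _ = simulate_agents(site, scent, "home", "widget")
print(f"{rate:.0%} of simulated agents reached the target page")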

The authors use these usage paths as input to the Dome Tree visualization methodology, an inferential log file analysis approach discussed in Section 4. The authors compared actual and simulated navigation paths for Xerox's corporate site and discovered a close match when scent is clearly visible (i.e., not buried under graphics or text). Since the site model does not consider actual page elements, the simulator cannot account for the impact of various page aspects, such as the amount of text or reading complexity, on navigation choices. Hence, this approach may enable only crude approximations of user behavior for sites with complex pages.

Automatic Analysis

All of the WIMP simulations that support automatic analysis rely on some variation of a human information processor model similar to the MHP previously discussed. Pew and Mavor [Pew and MavorPew and Mavor1998] provide a detailed discussion of this type of modeling and an overview of many of these approaches, including five that we discuss: ACT-R (Adaptive Control of Thought) [AndersonAnderson1990], COGNET (COGnition as a NEtwork of Tasks) [Zachary, Mentec, and RyderZachary et al.1996], EPIC (Executive-Process Interactive Control) [Kieras, Wood, and MeyerKieras et al.1997], HOS (Human Operator Simulator) [Glenn, Schwartz, and RossGlenn et al.1992] and Soar [Polk and RosenbloomPolk and Rosenbloom1994]. Here, we also consider CCT (Cognitive Complexity Theory) [Kieras and PolsonKieras and Polson1985], ICS (Interacting Cognitive Subsystems) [BarnardBarnard1987] and GLEAN (GOMS Language Evaluation and ANalysis) [Kieras, Wood, Abotel, and HornofKieras et al.1995]. Rather than describe each method individually, we summarize the major characteristics of these simulation methods in Table 3 and discuss them below.

Table 3: Characteristics of simulation methods surveyed.

Modeled Tasks
The surveyed models simulate three types of tasks: a user performing cognitive tasks such as problem solving and learning (COGNET, ACT-R, Soar, ICS); a user immersed in a human-machine system such as an aircraft or tank (HOS); and a user interacting with a typical UI (EPIC, GLEAN, CCT).

Modeled Components
Some simulations focus solely on cognitive processing (ACT-R, COGNET) while others incorporate perceptual and motor processing as well (EPIC, ICS, HOS, Soar, GLEAN, CCT).

Component Processing
Task execution is modeled as serial processing (ACT-R, GLEAN, CCT), parallel processing (EPIC, ICS, Soar), or semi-parallel processing, i.e., serial processing with rapid attention switching among the modeled components that gives the appearance of parallelism (COGNET, HOS).

Model Representation
To represent the underlying user, simulation methods use either task hierarchies (as in a GOMS task structure: HOS, CCT), production rules (CCT, ACT-R, EPIC, Soar, ICS), or declarative/procedural programs (GLEAN, COGNET). CCT uses both a task hierarchy and production rules to represent the user and system models respectively.

Predictions
The surveyed methods return a number of simulation results, including predictions of task performance (EPIC, CCT, COGNET, GLEAN, HOS, Soar), memory load (ICS, CCT), learning (ACT-R, Soar, ICS, GLEAN, CCT), and behavior predictions such as action traces (ACT-R, COGNET, EPIC).

Simulation methods vary widely in their ability to illustrate usability problems. Their effectiveness is largely determined by the characteristics discussed above (modeled tasks, modeled components, component processing, model representation and predictions). The methods potentially most effective at illustrating usability problems model UI interaction with all components (perception, cognition and motor) processing in parallel, employ production rules, and report task performance, memory load, learning and simulated user behavior. Such methods offer the most flexibility and the closest approximation of actual user behavior. The use of production rules is important because it relaxes the requirement for an explicit task hierarchy, thus allowing the modeling of more dynamic behavior, such as Web site navigation.
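As a rough illustration of why production rules avoid a fixed task hierarchy, the toy sketch below simply fires whichever rule matches the current working memory; the rules and facts are invented and far simpler than those used in the surveyed architectures.

# Toy production-rule sketch: rules fire whenever their conditions match
# working memory, so behavior emerges from the current state rather than
# from a fixed task hierarchy. Rules and facts are invented examples.
working_memory = {"goal:buy-printer", "page:home", "link:products-visible"}

RULES = [
    # (name, condition set, facts to add, facts to remove)
    ("follow-products-link",
     {"goal:buy-printer", "link:products-visible"},
     {"page:products", "link:printers-visible"},
     {"page:home", "link:products-visible"}),
    ("follow-printers-link",
     {"goal:buy-printer", "link:printers-visible"},
     {"page:printers", "goal:satisfied"},
     {"page:products", "link:printers-visible"}),
]

def run(memory, rules, max_cycles=10):
    """Repeatedly fire the first rule whose conditions are all in memory."""
    for _ in range(max_cycles):
        fired = False
        for name, cond, add, remove in rules:
            if cond <= memory:
                print("firing:", name)
                memory = (memory - remove) | add
                fired = True
                break
        if not fired or "goal:satisfied" in memory:
            break
    return memory

if __name__ == "__main__":
    print(run(set(working_memory), RULES))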

EPIC is the only simulation analysis method that embodies most of these ideal characteristics. It employs production rules and models UI interaction with all components (perception, cognition and motor) processing in parallel. It reports task performance and simulated user behavior, but does not report memory load or learning estimates. Studies with EPIC demonstrated that predictions for telephone operator and menu searching tasks closely match observed data. EPIC and all of the other methods require considerable learning time and effort to employ; they are, however, applicable to a wide range of UIs.

Our survey revealed only one simulation approach for automatic analysis of Web interfaces: WebCriteria's Site Profile [Web CriteriaWeb Criteria1999]. Unlike the other simulation approaches, this approach requires an implemented interface for evaluation. Site Profile performs analysis in four phases: gather, model, analyze and report. During the gather phase, a spider traverses a site (200-600 unique pages) to collect Web site data. This data is then used to construct a nodes-and-links model of the site. For the analysis phase, it uses a standard Web user model (called Max [Lynch, Palmiter, and TiltLynch et al.1999]) to simulate a user's information seeking behavior; this model is based on prior research with GOMS analysis. Given a starting point in the site, a path and a target, Max "follows" the path from the starting point to the target and logs measurement data. These measurements are used to compute an accessibility metric, which is then used to generate a report. This approach can be used to compare Web sites, provided that an appropriate navigation path is supplied for each.

The usefulness of this approach is questionable, since it currently computes only accessibility (navigation time) for the shortest path between specified start and destination pages, using a single user model. Other measurements, such as freshness and page composition, are also of questionable value for improving a Web site. The method does not entail any learning time or effort on the part of the evaluator, since WebCriteria performs the analysis. The method is applicable to all Web UIs.
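The sketch below approximates what such an accessibility computation could look like over a nodes-and-links site model: a shortest path is found and a navigation time is estimated from per-page and per-click constants. The graph and the time constants are invented and are not the Max model's actual parameters.

# Illustrative approximation of a navigation-time estimate over a
# nodes-and-links site model; the constants below are assumptions.
from collections import deque

LINKS = {
    "home":     ["products", "support"],
    "products": ["specs"],
    "support":  ["faq", "contact"],
    "specs": [], "faq": [], "contact": [],
}
SECONDS_PER_PAGE = 12.0    # assumed scan-and-choose time per page
SECONDS_PER_CLICK = 1.5    # assumed pointing/click time

def shortest_path(start, target):
    """Breadth-first search over the nodes-and-links model."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in LINKS.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

def navigation_time(start, target):
    path = shortest_path(start, target)
    if path is None:
        return None
    clicks = len(path) - 1
    return len(path) * SECONDS_PER_PAGE + clicks * SECONDS_PER_CLICK

if __name__ == "__main__":
    print(shortest_path("home", "contact"))      # ['home', 'support', 'contact']
    print(navigation_time("home", "contact"))    # 39.0 seconds under these assumptions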

Expanding Existing Automated Usability Evaluation Methods

Automated usability evaluation has many potential benefits, including reducing the costs of non-automated methods, aiding comparisons between alternative designs, and improving consistency in the problems found. Research to further develop analytical modeling, simulation and log file analysis techniques could result in several promising AUE techniques, as discussed below.

Our survey showed log file analysis to be a viable methodology for automated analysis of usage data. However, it still requires formal testing or informal use to generate that data. One way to expand the use and benefits of this methodology is to leverage a small amount of test data to generate a larger set of plausible usage data. This is even more important for Web interfaces, since server logs do not capture a complete record of user interactions. We discussed two simulation approaches, genetic algorithms and information scent modeling, that automatically generate plausible usage data. Genetic algorithms determine user behavior at deviation points in an expert user script, while the information scent model selects navigation paths based on how well linked pages match an information goal. Hence, both of these approaches generate plausible usage traces without user testing or informal use. These techniques also provide valuable insight into leveraging real usage data from usability tests or informal use; for example, real traces could serve as input scripts for genetic algorithms, with the evaluator adding deviation points to them as well.

Real and simulated usage data could also be used to evaluate comparable WIMP UIs, such as word processors and image editors. Task sequences could comprise a usability benchmark (i.e., a program for measuring UI performance). After mapping task sequences into specific UI operations in each interface, the benchmark could be executed within each UI to collect measurements. This is a promising open area of research for evaluating comparable WIMP UIs.
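A benchmark of this kind might be organized as in the following sketch, where abstract task steps are mapped onto each interface's concrete operations and a flat cost model stands in for real measurements; the tasks, mappings and costs are hypothetical.

# Hypothetical sketch of a task-sequence benchmark for comparable WIMP UIs.
# A real benchmark would drive the actual UIs and record real measurements.
ABSTRACT_TASKS = ["open_document", "make_heading", "insert_image", "save"]

# Each UI maps an abstract task to its own sequence of concrete operations.
UI_MAPPINGS = {
    "WordProcessorA": {
        "open_document": ["menu:File", "item:Open", "dialog:choose"],
        "make_heading":  ["select:line", "dropdown:Style", "item:Heading1"],
        "insert_image":  ["menu:Insert", "item:Image", "dialog:choose"],
        "save":          ["key:Ctrl+S"],
    },
    "WordProcessorB": {
        "open_document": ["toolbar:Open", "dialog:choose"],
        "make_heading":  ["select:line", "menu:Format", "item:Paragraph",
                          "tab:Styles", "item:Heading1", "button:OK"],
        "insert_image":  ["toolbar:Image", "dialog:choose"],
        "save":          ["menu:File", "item:Save"],
    },
}

def run_benchmark(ui_name, seconds_per_operation=1.2):
    """Count operations per task and estimate time with a flat cost model."""
    mapping = UI_MAPPINGS[ui_name]
    ops = sum(len(mapping[task]) for task in ABSTRACT_TASKS)
    return {"ui": ui_name, "operations": ops,
            "estimated_seconds": ops * seconds_per_operation}

if __name__ == "__main__":
    for ui in UI_MAPPINGS:
        print(run_benchmark(ui))

In practice the cost model would be replaced by measurements collected while executing the mapped operations within each interface.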

Given a wider sampling of usage data, task-based pattern-matching log file analysis is a promising research area to pursue. Task-based approaches that follow the USINE model in particular (i.e., comparing a task model expressed in terms of temporal relationships to usage traces) provide the most support among the surveyed methods for understanding user behavior, preferences and errors. Although the authors claim that this approach works well for WIMP UIs, it needs to be adapted for Web UIs, where tasks may not be clearly defined. Additionally, since USINE already reports substantial analysis data, this data could be compared to usability guidelines in order to support automated critique.
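The following simplified sketch conveys the basic idea of comparing a usage trace against a task model with temporal relationships. It supports only a sequential-enabling operator, whereas USINE works with a much richer set of temporal operators, so it should be read as an illustration of the idea rather than the actual algorithm.

# Simplified sketch of checking a usage trace against a sequential task model.
TASK_MODEL = ["login", "search_product", "add_to_cart", "checkout"]  # t1 >> t2 >> ...

def check_trace(trace):
    """Return tasks completed in order plus actions flagged as errors."""
    expected = 0
    completed, errors = [], []
    for action in trace:
        if expected < len(TASK_MODEL) and action == TASK_MODEL[expected]:
            completed.append(action)
            expected += 1
        elif action in TASK_MODEL:
            errors.append(f"'{action}' performed out of sequence")
        else:
            errors.append(f"'{action}' not part of the task model")
    return completed, errors

if __name__ == "__main__":
    trace = ["login", "add_to_cart", "search_product", "add_to_cart", "checkout"]
    done, errs = check_trace(trace)
    print("completed:", done)
    print("errors:", errs)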

Our survey also showed that evaluation within a user interface development environment (UIDE) is a promising approach for automated analysis. The AIDE approach provides the most support for evaluating and improving UI designs and could be expanded to Web interfaces. Guidelines could also be incorporated into AIDE analysis to support automatic critique. Although UIDE analysis is promising, it is not widely used in practice, perhaps because most of these tools are research systems that have not been incorporated into popular commercial tools. Applying such analysis approaches outside of user interface development environments is an open research problem.
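As an illustration of how guideline checks could be folded into metric-based UIDE analysis, the sketch below flags two simple layout issues over a widget description. The widgets, metrics and thresholds are invented for illustration and are not AIDE's actual metrics.

# Invented sketch of guideline-style layout checks over a widget description.
WIDGETS = [  # name, x, y, width, height
    ("ok_button",     300, 400, 80, 25),
    ("cancel_button", 392, 400, 80, 25),
    ("name_label",     20,  40, 60, 20),
    ("name_field",     90,  38, 200, 24),
]

def check_alignment(widgets, tolerance=4):
    """Flag widget pairs whose tops are almost, but not exactly, aligned."""
    problems = []
    for i, (n1, _, y1, _, _) in enumerate(widgets):
        for n2, _, y2, _, _ in widgets[i + 1:]:
            if 0 < abs(y1 - y2) <= tolerance:
                problems.append(f"{n1} and {n2} are misaligned by {abs(y1 - y2)}px")
    return problems

def check_minimum_size(widgets, min_height=24):
    """Flag widgets that fall below an assumed minimum height."""
    return [f"{name} height {h}px is below {min_height}px"
            for name, _, _, _, h in widgets if h < min_height]

if __name__ == "__main__":
    for problem in check_alignment(WIDGETS) + check_minimum_size(WIDGETS):
        print("guideline issue:", problem)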

In addition, our survey showed that existing simulations based on a human information processor model have widely different uses (e.g., modeling a user interacting with a UI or solving a problem). Thus, it is difficult to draw concrete conclusions about the effectiveness of these approaches. Simulation in general is a promising research area to pursue for AUE, especially for evaluating alternative designs.

Several simulation techniques employed in the performance analysis of computer systems, in particular trace-driven discrete-event simulation and Monte Carlo simulation [JainJain1991], would enable designers to perform what-if analyses with UIs. Trace-driven discrete-event simulations employ real usage data to model a system as it evolves over time. Analysts use this approach to simulate many aspects of computer systems, such as the processing subsystem, the operating system and various resource scheduling algorithms. The surveyed user interface approaches use discrete-event simulation; these simulators could be altered to process log files as input instead of explicit task or user models, potentially producing more realistic and accurate simulations.
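A trace-driven discrete-event simulation of a UI could, in its simplest form, look like the sketch below: logged user events drive a simulated interface that takes an assumed amount of time to respond to each event class, and the simulator reports completion time and user waiting time. The log, the response times and the metrics are illustrative only.

# Minimal trace-driven discrete-event simulation sketch.
import heapq

LOG = [  # (timestamp_s, event) as might be replayed from a usage log
    (0.0, "menu_open"), (1.2, "item_select"), (1.8, "dialog_open"),
    (6.5, "text_entry"), (7.0, "button_ok"),
]
RESPONSE_TIME = {"menu_open": 0.1, "item_select": 0.1, "dialog_open": 5.5,
                 "text_entry": 0.05, "button_ok": 0.3}  # assumed per-event costs

def simulate(log):
    """Process logged events in timestamp order; the simulated UI handles one
    event at a time, so a slow response delays handling of later events."""
    events = list(log)
    heapq.heapify(events)                  # future-event list keyed by timestamp
    clock, busy_until, waits = 0.0, 0.0, []
    while events:
        t, event = heapq.heappop(events)
        start = max(t, busy_until)         # user event waits if the UI is busy
        waits.append(start - t)
        clock = start + RESPONSE_TIME[event]
        busy_until = clock
    return {"completion_time_s": round(clock, 2),
            "max_user_wait_s": round(max(waits), 2)}

if __name__ == "__main__":
    print(simulate(LOG))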

Monte Carlo simulations enable an evaluator to model a system probabilistically (i.e., a probability distribution over possible events determines what event occurs next). Monte Carlo simulation could contribute substantially to automated UE by eliminating the need for explicit task hierarchies or user models. Most simulations in this domain rely on a single user model, typically an expert user; Monte Carlo simulation would instead enable designers to perform what-if analyses and study design alternatives with many user models. The approach employed by Chi et al. [Chi, Pirolli, and PitkowChi et al.2000] to simulate Web site navigation is a close approximation of Monte Carlo simulation.
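A minimal Monte Carlo sketch is shown below: a probability distribution over possible next events determines what a simulated user does, so no explicit task hierarchy is required. The states, transitions and probabilities are invented; varying them would correspond to varying the user model.

# Illustrative Monte Carlo simulation of user sessions.
import random

TRANSITIONS = {   # state -> (possible next states, probabilities)
    "start":  (["browse", "search"],           [0.4, 0.6]),
    "browse": (["browse", "item", "quit"],     [0.5, 0.3, 0.2]),
    "search": (["item", "search", "quit"],     [0.6, 0.3, 0.1]),
    "item":   (["purchase", "browse", "quit"], [0.3, 0.5, 0.2]),
}

def run_session(rng, max_steps=50):
    state = "start"
    for _ in range(max_steps):
        if state in ("purchase", "quit"):
            break
        nxt, probs = TRANSITIONS[state]
        state = rng.choices(nxt, weights=probs)[0]
    return state

if __name__ == "__main__":
    rng = random.Random(7)
    runs = 10_000
    purchases = sum(run_session(rng) == "purchase" for _ in range(runs))
    print(f"estimated purchase rate: {purchases / runs:.1%}")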

Conclusions

In this article we provided an overview of automated usability evaluation and presented a taxonomy for comparing various methods. We also presented an extensive survey of AUE methods for WIMP and Web interfaces, finding that AUE methods represent only 35% of the methods surveyed. Of these methods, only 20% are free from requirements of formal testing or informal use, and all of those, with the exception of operationalized guidelines, are based on analytical modeling or simulation.

It is important to keep in mind that AUE does not capture important qualitative and subjective information (such as user preferences and misconceptions) that can only be uncovered via user testing, heuristic evaluation and other standard inquiry methods. Nevertheless, simulation and analytical modeling should be useful for helping designers choose among design alternatives before committing to expensive development.

Furthermore, evaluators could use automated approaches in tandem with non-automated methods, such as heuristic evaluation and user testing. For example, an evaluator doing a heuristic evaluation could observe automatically-generated usage traces executing within a UI.

In sum, automated usability evaluation can reduce the cost of non-automated methods, aid comparisons between alternative designs, and improve the consistency of evaluation results. Further research on analytical modeling, simulation and log file analysis techniques could yield several promising AUE techniques.

Automation Characteristics of WIMP and Web Interfaces

 

The following tables depict automation characteristics for WIMP and Web interfaces separately. We combined this information in Table 2.

Table 4: Automation characteristics of WIMP UE methods. A number in parentheses indicates the number of methods surveyed for a particular method and automation type. The testing level for each method is represented as: minimal (blank), formal (F), informal (I) and model (M). The * for the FIM entry indicates that either formal or informal testing is required; in addition, a model may be used in the analysis.

Table 5: Automation characteristics of Web UE methods. A number in parentheses indicates the number of methods surveyed for a particular method and automation type. The testing level for each method is represented as: minimal (blank), formal (F), informal (I) and model (M).

Acknowledgments

This research was sponsored in part by the Lucent Technologies Cooperative Research Fellowship Program, a GAAN fellowship and Kaiser Permanente. We thank James Hom and Zhijun Zhang for allowing us to use their extensive archives of usability methods for this survey. We also thank Zhijun Zhang for participating in several interviews on usability evaluation. We thank Bonnie John and Scott Hudson for helping us locate information on GOMS and other simulation methods for this survey, and James Landay and Mark Newman for helpful feedback and data.

References

Ahlberg and ShneidermanAhlberg and Shneiderman1994
Ahlberg, C. and Shneiderman, B. 1994. Visual information seeking: Tight coupling of dynamic query filters with starfield displays. In Human Factors in Computing Systems. Conference Proceedings CHI'94 (1994), pp. 313-317.

AndersonAnderson1990
Anderson, J. R. 1990. The Adaptive Character of Thought. Lawrence Erlbaum Associates, Hillsdale, NJ.

BacheldorBacheldor1999
Bacheldor, B. 1999. Push for performance. Information Week September 20, 18-20.

BalboBalbo1995
Balbo, S. 1995. Automatic evaluation of user interface usability: Dream or reality. In Proceedings of QCHI 95 (1995).

BalboBalbo1996
Balbo, S. 1996. ÉMA: Automatic analysis mechanism for the ergonomic evaluation of user interfaces. Technical Report 96/44, CSIRO Division of Information Technology.

BarnardBarnard1987
Barnard, P. J. 1987. Cognitive resources and the learning of human-computer dialogs. In J. M. Carroll Ed., Interfacing Thought: Cognitive Aspects of Human-Computer Interaction, pp. 112-158. The MIT Press.

Bevan and MacleodBevan and Macleod1994
Bevan, N. and Macleod, M. 1994. Usability measurement in context. Behaviour and Information Technology 13, 1,2, 132-145.

Borges, Morales, and RodriguezBorges et al.1996
Borges, J. A., Morales, I., and Rodriguez, N. J. 1996. Guidelines for designing usable world wide web pages. In Proceedings of ACM CHI 96 Conference on Human Factors in Computing Systems, Volume 2 of SHORT PAPERS: Working Together Near and Far (1996), pp. 277-278.

Byrne, John, Wehrle, and CrowByrne et al.1999
Byrne, M. D., John, B. E., Wehrle, N. S., and Crow, D. C. 1999. The tangled web we wove: A taxonomy of WWW use. In Proceedings of ACM CHI 99 Conference on Human Factors in Computing Systems, Volume 1 of Organizing Information on the Web (1999), pp. 544-551.

Byrne, Wood, Sukaviriya, Foley, and KierasByrne et al.1994
Byrne, M. D., Wood, S. D., Sukaviriya, P. N., Foley, J. D., and Kieras, D. 1994. Automating interface evaluation. In Proceedings of ACM CHI'94 Conference on Human Factors in Computing Systems, Volume 1 of Automatic Support in Design and Use (1994), pp. 232-237.

Card, Moran, and NewellCard et al.1983
Card, S. K., Moran, T. P., and Newell, A. 1983. The Psychology of Human-Computer Interaction. Lawrence Erlbaum Associates, Hillsdale, NJ.

CASTCAST2000
CAST. 2000. Bobby. http://www.cast.org/bobby/.

Chi, Pirolli, and PitkowChi et al.2000
Chi, E. H., Pirolli, P., and Pitkow, J. 2000. The scent of a site: A system for analyzing and predicting information scent, usage, and usability of a web site. In Proceedings of ACM CHI 00 Conference on Human Factors in Computing Systems, To appear (2000).

CoutazCoutaz1994
Coutaz, J. 1994. Evaluation techniques: Exploring the intersection of HCI and software engineering. In Proceedings of the International Conference on Software Engineering (1994).

Dix, Finlay, Abowd, and BealeDix et al.1993
Dix, A., Finlay, J., Abowd, G., and Beale, R. 1993. Human-Computer Interaction. Prentice Hall.

DrottDrott1998
Drott, M. C. 1998. Using web server logs to improve site design. In ACM 16th International Conference on Systems Documentation, Getting Feedback on your Web Site (1998), pp. 43-50.

Etgen and CantorEtgen and Cantor1999
Etgen, M. and Cantor, J. 1999. What does getting WET (web event-logging tool) mean for web usability. In Proceedings of The Future of Web Applications: Human Factors & the Web (June 1999). Available at http://www.nist.gov/itl/div894/vvrg/hfweb/proceedings/etgen-cantor/index.html.

Fuller and de GraaffFuller and de Graaff1996
Fuller, R. and de Graaff, J. J. 1996. Measuring user motivation from server log files. In Proceedings of the Human Factors and the Web 2 Conference, Designing for the Web (October 1996). Available from http://www.microsoft.com/usability/webconf.htm.

Glenn, Schwartz, and RossGlenn et al.1992
Glenn, F. A., Schwartz, S. M., and Ross, L. V. 1992. Development of a human operator simulator version v (hos-v): Design and implementation. U.S. Army Research Institute for the Behavioral and Social Sciences, PERI-POX, Alexandria, VA.

Gray, John, and AtwoodGray et al.1992
Gray, W. D., John, B. E., and Atwood, M. E. 1992. The precis of Project Ernestine, or, an overview of a validation of GOMS. In Proceedings of ACM CHI'92 Conference on Human Factors in Computing Systems, Models of the User II (1992), pp. 307-312.

Guzdial, Santos, Badre, Hudson, and GrayGuzdial et al.1994
Guzdial, M., Santos, P., Badre, A., Hudson, S., and Gray, M. 1994. Analyzing and visualizing log files: A computational science of usability. GVU Center TR GIT-GVU-94-8, Georgia Institute of Technology.

Hammontree, Hendrickson, and HensleyHammontree et al.1992
Hammontree, M. L., Hendrickson, J. J., and Hensley, B. W. 1992. Integrated data capture and analysis tools for research and testing on graphical user interfaces. In Proceedings of ACM CHI'92 Conference on Human Factors in Computing Systems, Demonstration: Analysis Tools/Multimedia Help (1992), pp. 431-432.

Hartson, Castillo, Kelsa, and NealeHartson et al.1996
Hartson, H. R., Castillo, J. C., Kelsa, J., and Neale, W. C. 1996. Remote evaluation: The network as an extension of the usability laboratory. In M. J. Tauber, V. Bellotti, R. Jeffries, J. D. Mackinlay, and J. Nielsen Eds., Proceedings of the Conference on Human Factors in Computing Systems : Common Ground (New York, April 13-18 1996), pp. 228-235. ACM Press.

Helfrich and LandayHelfrich and Landay1999
Helfrich, B. and Landay, J. A. 1999. QUIP: quantitative user interface profiling. Unpublished manuscript. Available at http://home.earthlink.net/%7Ebhelfrich/quip/index.html.

Hochheiser and ShneidermanHochheiser and Shneiderman1999
Hochheiser, H. and Shneiderman, B. 1999. Understanding patterns of user visits to web sites: Interactive starfield visualizations of WWW log data. Technical Report CS-TR-3989 (Feb.), University of Maryland, College Park.

Holtzblatt and JonesHoltzblatt and Jones1993
Holtzblatt, K. and Jones, S. 1993. Contextual inquiry: A participatory technique for system design. In D. Schuler and A. Namioka Eds., Participatory Design: Principles and Practice (Hillsdale, NJ, 1993), pp. 180-193. Lawrence Earlbaum.

HomHom1998
Hom, J. 1998. The usability methods toolbox. http://www.best.com/~jthom/usability/usable.htm.

Hudson, John, Knudsen, and ByrneHudson et al.1999
Hudson, S. E., John, B. E., Knudsen, K., and Byrne, M. D. 1999. A tool for creating predictive performance models from user interface demonstrations. In Proceedings of the ACM Symposium on User Interface Software and Technology, To appear (1999).

Human Factors EngineeringHuman Factors Engineering1999
Human Factors Engineering. 1999. Usability evaluation methods. http://www.cs.umd.edu/~zzj/UsabilityHome.html.

Ivory and HearstIvory and Hearst1999
Ivory, M. Y. and Hearst, M. A. 1999. Comparing performance and usability evaluation: New methods for automated usability assessment. Unpublished manuscript. Available at http://www.cs.berkeley.edu/~ivory/research/web/papers/pe-ue.pdf.

JainJain1991
Jain, R. 1991. The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. Wiley-Interscience, New York, NY, USA.

Jeffries, Miller, Wharton, and UyedaJeffries et al.1991
Jeffries, R., Miller, J. R., Wharton, C., and Uyeda, K. M. 1991. User interface evaluation in the real world: A comparison of four techniques. In Proceedings of ACM CHI'91 Conference on Human Factors in Computing Systems, Practical Design Methods (1991), pp. 119-124.

John and KierasJohn and Kieras1996
John, B. E. and Kieras, D. E. 1996. The GOMS family of user interface analysis techniques: Comparison and contrast. ACM Transactions on Computer-Human Interaction 3, 4, 320-351.

Kasik and GeorgeKasik and George1996
Kasik, D. J. and George, H. G. 1996. Toward automatic generation of novice user test scripts. In Proceedings of ACM CHI 96 Conference on Human Factors in Computing Systems, Volume 1 of PAPERS: Evaluation (1996), pp. 244-251.

Kieras and PolsonKieras and Polson1985
Kieras, D. and Polson, P. G. 1985. An approach to the formal analysis of user complexity. International Journal of Man-Machine Studies 22, 4, 365-394.

Kieras, Wood, Abotel, and HornofKieras et al.1995
Kieras, D. E., Wood, S. D., Abotel, K., and Hornof, A. 1995. GLEAN: A computer-based tool for rapid GOMS model usability evaluation of user interface designs. In Proceedings of the ACM Symposium on User Interface Software and Technology, Evaluation (1995), pp. 91-100.

Kieras, Wood, and MeyerKieras et al.1997
Kieras, D. E., Wood, S. D., and Meyer, D. E. 1997. Predictive engineering models based on the EPIC architecture for a multimodal high-performance human-computer interaction task. ACM Transactions on Computer-Human Interaction 4, 3 (Sept.), 230-275.

Larson and CzerwinskiLarson and Czerwinski1998
Larson, K. and Czerwinski, M. 1998. Web page design: Implications of memory, structure and scent for information retrieval. In Proceedings of ACM CHI 98 Conference on Human Factors in Computing Systems, Volume 1 of Web Page Design (1998), pp. 25-32.

Lecerof and PaternóLecerof and Paternó1998
Lecerof, A. and Paternó, F. 1998. Automatic support for usability evaluation. IEEE Transactions on Software Engineering 24, 10 (October), 863-888.

LeeLee1997
Lee, K. 1997. Motif FAQ. http://www-bioeng.ucsd.edu/~fvetter/misc/Motif-FAQ.txt.

Lewis, Polson, Wharton, and RiemanLewis et al.1990
Lewis, C., Polson, P. G., Wharton, C., and Rieman, J. 1990. Testing a walkthrough methodology for theory-based design of walk-up-and-use interfaces. In Proceedings of ACM CHI 90 Conference on Human Factors in Computing Systems (1990), pp. 235-242.

Lowgren and NordqvistLowgren and Nordqvist1992
Lowgren, J. and Nordqvist, T. 1992. Knowledge-based evaluation as design support for graphical user interfaces. In Proceedings of ACM CHI'92 Conference on Human Factors in Computing Systems, Tools and Techniques (1992), pp. 181-188.

Lynch, Palmiter, and TiltLynch et al.1999
Lynch, G., Palmiter, S., and Tilt, C. 1999. The max model: A standard web site user model. In Proceedings of The Future of Web Applications: Human Factors & the Web (June 1999). Available at http://www.nist.gov/itl/div894/vvrg/hfweb/proceedings/lynch/index.html.

Macleod and RenggerMacleod and Rengger1993
Macleod, M. and Rengger, R. 1993. The development of DRUM: A software tool for video-assisted usability evaluation. In Proceedings of the HCI'93 Conference on People and Computers VIII, User Evaluation (1993), pp. 293-309.

Molich, Bevan, Butler, Curson, Kindlund, Kirakowski, and MillerMolich et al.1998
Molich, R., Bevan, N., Butler, S., Curson, I., Kindlund, E., Kirakowski, J., and Miller, D. 1998. Comparative evaluation of usability tests. In Proceedings of UPA98 (June 1998), pp. 189-200.

Molich, Thomsen, Karyukina, Schmidt, Ede, van Oel, and ArcuriMolich et al.1999
Molich, R., Thomsen, A. D., Karyukina, B., Schmidt, L., Ede, M., van Oel, W., and Arcuri, M. 1999. Comparative evaluation of usability tests. In Proceedings of ACM CHI'99 Conference on Human Factors in Computing Systems, Panels (May 1999), pp. 83-86.

NielsenNielsen1993
Nielsen, J. 1993. Usability Engineering. Academic Press, Boston, MA.

Olsen, Jr. and HalversenOlsen, Jr. and Halversen1988
Olsen, Jr., D. R. and Halversen, B. W. 1988. Interface usage measurements in a user interface management system. In Proceedings of the ACM SIGGRAPH Symposium on User Interface Software (1988), pp. 102-108.

Open Software FoundationOpen Software Foundation1991
Open Software Foundation. 1991. OSF/Motif Style Guide. Number Revision 1.1 (for OSF/Motif release 1.1). Prentice Hall, Englewood Cliffs, NJ.

Parush, Nadir, and ShtubParush et al.1998
Parush, A., Nadir, R., and Shtub, A. 1998. Evaluating the layout of graphical user interface screens: Validation of a numerical, computerized model. International Journal of Human Computer Interaction 10, 4, 343-360.

Paternó, Mancini, and MeniconiPaternó et al.1997
Paternó, F., Mancini, C., and Meniconi, S. 1997. ConcurTaskTrees: Diagrammatic notation for specifying task models. In Proceedings of INTERACT '97 (1997), pp. 362-369. Sydney: Chapman and Hall.

Pew and MavorPew and Mavor1998
Pew, R. W. and Mavor, A. S. Eds. 1998. Modeling Human and Organizational Behavior: Application to Military Simulations. National Academy Press, Washington. Available at http://books.nap.edu/html/model.

Polk and RosenbloomPolk and Rosenbloom1994
Polk, T. A. and Rosenbloom, P. S. 1994. Task-independent constraints on a unified theory of cognition. In F. Boller and J. Grafman Eds., Handbook of Neuropsychology, Volume 9. Amsterdam, Netherlands: Elsevier.

Rieman, Davies, Hair, Esemplare, Polson, and LewisRieman et al.1991
Rieman, J., Davies, S., Hair, D. C., Esemplare, M., Polson, P., and Lewis, C. 1991. An automated cognitive walkthrough. In Proceedings of ACM CHI'91 Conference on Human Factors in Computing Systems, Demonstrations: Interface Design Issues (1991), pp. 427-428.

Scholtz and LaskowskiScholtz and Laskowski1998
Scholtz, J. and Laskowski, S. 1998. Developing usability tools and techniques for designing and testing web sites. In Proceedings of the 4th Conference on Human Factors & the Web (1998). Available at http://www.research.att.com/conf/hfweb/proceedings/scholtz/index.html.

SearsSears1995
Sears, A. 1995. AIDE: A step toward metric-based interface development tools. In Proceedings of the ACM Symposium on User Interface Software and Technology, Evaluation (1995), pp. 101-110.

Service MetricsService Metrics1999
Service Metrics. 1999. Service metrics solutions. http://www.servicemetrics.com/solutions/solutionsmain.asp.

Siochi and HixSiochi and Hix1991
Siochi, A. C. and Hix, D. 1991. A study of computer-supported user interface evaluation using maximal repeating pattern analysis. In Proceedings of ACM CHI'91 Conference on Human Factors in Computing Systems, User Interface Design Process and Evaluation (1991), pp. 301-305.

Smith and MosierSmith and Mosier1986
Smith, S. L. and Mosier, J. N. 1986. Guidelines for designing user interface software. Technical Report ESD-TR-86-278, The MITRE Corporation, Bedford, MA 01730.

SteinStein1997
Stein, L. D. 1997. The rating game. http://stein.cshl.org/~lstein/rater/.

SullivanSullivan1997
Sullivan, T. 1997. Reading reader reaction: A proposal for inferential analysis of web server log files. In Proceedings of the Human Factors and the Web 3 Conference, Practices & Reflections (June 1997). Available from http://www.uswest.com/web-conference/index.html.

Theng and MarsdenTheng and Marsden1998
Theng, Y. L. and Marsden, G. 1998. Authoring tools: Towards continuous usability testing of web documents. In Proceedings of the 1st International Workshop on Hypermedia Development (1998).

ThimblebyThimbleby1997
Thimbleby, H. 1997. Gentler: A tool for systematic web authoring. International Journal of Human-Computer Studies 47, 1, 139-168.

Web CriteriaWeb Criteria1999
Web Criteria. 1999. Max, and the objective measurement of web sites. http://www.webcriteria.com.

Whitefield, Wilson, and DowellWhitefield et al.1991
Whitefield, A., Wilson, F., and Dowell, J. 1991. A framework for human factors evaluation. Behaviour and Information Technology 10, 1, 65-79.

Young, Green, and SimonYoung et al.1989
Young, R. M., Green, T. R. G., and Simon, T. 1989. Programmable user models for predictive evaluation of interface designs. In Proceedings of ACM CHI'89 Conference on Human Factors in Computing Systems, New Directions in Theory for Human-Computer Interaction (1989), pp. 15-19.

Zachary, Mentec, and RyderZachary et al.1996
Zachary, W., Mentec, J.-C. L., and Ryder, J. 1996. Interface agents in complex systems. In C. N. Ntuen and E. H. Park Eds., Human Interaction With Complex Systems: Conceptual Principles and Design Practice. Kluwer Academic Publishers.

Zaphiris and MteiZaphiris and Mtei1997
Zaphiris, P. and Mtei, L. 1997. Depth vs. breadth in the arrangement of Web links. http://www.otal.umd.edu/SHORE/bs04.

Zettlemoyer, Amant, and DulbergZettlemoyer et al.1999
Zettlemoyer, L. S., Amant, R. S., and Dulberg, M. S. 1999. IBOTS: Agent control through the user interface. In Proceedings of the 1999 International Conference on Intelligent User Interfaces, Information Retrieval Agents (1999), pp. 31-37.


Footnotes

...Ivory
Supported by the Lucent Technologies Cooperative Research Fellowship Program.

...use.
From ISO 9241 (Ergonomic requirements for office work with visual display terminals).

...files
Created by the QC/Replay tool for X Windows (http://www.centerline.com/productline/qcreplay/qcreplay.html).

...released.
Contextual inquiry [Holtzblatt and JonesHoltzblatt and Jones1993] is an exception to this; it is a needs assessment method used early in the design process.
 

