This paper demonstrates the importance of assessing the performance of fundamental frequency (f0) estimation algorithms with metrics that capture the temporal characteristics of f0 traces, particularly metrics calculated at the note level. Capturing temporal characteristics is especially important for tasks that model human engagement with music, such as the study of expressive music performance. Note-level descriptors better reflect the human experience of listening to music and thus provide a more perceptually relevant evaluation of algorithms than frame-level metrics. This paper quantifies the accuracy differences between a simple mean-based frame-level accuracy measurement and four metrics that capture more perceptually relevant aspects of the evolution of f0 traces over time (perceived pitch, vibrato rate, vibrato depth, and jitter) for two score-informed f0 estimation algorithms. The algorithms' accuracies are compared on multi-track recordings of either four vocalists or four instrumentalists (violin, saxophone, clarinet, and bassoon), both on the original anechoic recordings and on mixes with artificial reverberation added. The algorithms performed within the margin of error of each other on frame-level accuracy but differed significantly on all of the perceptually relevant metrics. The paper concludes by proposing new evaluation metrics that capture the temporal characteristics of fundamental frequency traces.