Reducing Attributes and Rows
MIDS W209: Information Visualization

Partially based on slides from Tamara Munzner

What We Are Going to Learn

  • Reduce
    • Items
    • Attributes
  • Aggregation
    • Item
    • Spatial
    • Time
  • Dimensionality Reduction
  • Embed, Focus and Context
  • Exploratory Data Analysis
University Of California at Berkeley logo

Reduce Items and Attributes

Reduce Items and Attributes

  • Reduce/increase: inverses
  • Filter
    • Pro: straightforward and intuitive
      • To understand and compute
    • Con: out of sight, out of mind
  • Aggregation
    • Pro: inform about whole set
    • Con: difficult to avoid losing signal
  • Not mutually exclusive
    • Combine filter, aggregate
    • Combine reduce, change, facet
Filter by items and by attributes; aggregate by items and by attributes; reduce filter and aggregate

Item Filtering

Crossfiltering

  • Item filtering
  • Coordinated views/controls combined
  • All scented histogram bisliders update when any ranges change

Faceted Search

Idiom: Scented Widgets

Scented Widgets Paper

http://vis.berkeley.edu/papers/scented_widgets/

Navio

Navio Demo
https://navio.dev

Attribute Filtering

DOSFA Paper

http://www.cs.ubc.ca/~tmm/courses/cpsc533c-04-spr/readings/dimorder.pdf

Navio Load Notebook

UMAP Playground

Dimensionality Reduction

Aggregation: Hierarchical Cluster Explorer


Item Aggregation

Idiom: Histogram

  • Static item aggregation
  • Task: find distribution
  • Data: table
  • Derived data
    • New table: keys are bins, values are counts
  • Bin size crucial
  • Pattern can change dramatically depending on discretization
  • Opportunity for interaction: control bin size on the fly
Histogram
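A minimal Python sketch of the derived table described above (keys are bins, values are counts), showing how the visible pattern depends on bin size. The `histogram` helper and the `ages` values are hypothetical, made up for illustration.

```python
from collections import Counter

def histogram(values, bin_width, start=0.0):
    """Derive a new table: keys are bin lower edges, values are counts."""
    counts = Counter()
    for v in values:
        # Index of the bin containing v, converted back to the bin's lower edge.
        index = int((v - start) // bin_width)
        counts[start + index * bin_width] += 1
    return dict(sorted(counts.items()))

# Hypothetical ages (assumption, not from the slides).
ages = [21, 22, 23, 24, 36, 37, 38, 39, 51, 52, 53, 54]

coarse = histogram(ages, bin_width=30)  # wide bins merge two of the groups
fine = histogram(ages, bin_width=5)     # narrow bins separate all three groups

print(coarse)
print(fine)
```

With the wide bins the data looks like one small and one large group; with the narrow bins three equal groups appear. Exposing `bin_width` as an interactive control is one way to let the viewer check both.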

Idiom: Boxplot

  • Static item aggregation
  • Task: find distribution
  • Data: table
  • Derived data
    • Five quantitative attributes
      • Median: central line
      • Lower and upper quartile: boxes
      • Lower and upper fences: whiskers
        • Values beyond which items are outliers
    • Outliers beyond fence cutoffs explicitly shown
Boxplot
[40 years of boxplots. Wickham and Stryjewski. 2012. had.co.nz]
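The derived attributes of a boxplot can be sketched in a few lines of Python. This assumes Tukey's common convention of placing the fences 1.5 interquartile ranges beyond the quartiles; the slides do not state which fence rule is used, and the sample values are hypothetical.

```python
import statistics

def boxplot_stats(values, k=1.5):
    """Compute the five derived attributes plus the items beyond the fences."""
    # "inclusive" interpolates between data points, treating them as the full population.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr  # assumption: Tukey's 1.5 * IQR rule
    upper_fence = q3 + k * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }

# Hypothetical measurements with one extreme value.
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```

Only the value 30 falls beyond the upper fence, so it would be drawn as an explicit outlier mark.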

Box Plot

http://blockbuilder.org/mbostock/4061502 by mbostock

Violin Plot

http://blockbuilder.org/asielen/92929960988a8935d907e39e60ea8417 by asielen

Idiom: 2D Density Plots

  • Scatterplot meets heatmap
    • Derived data:
      • Tessellate space into areas
      • Count the number of elements falling in each area
    • Mark: dots (boxes)
    • Channels:
      • Position: location of areas
      • Color (brightness): number of elements
      • Marks (re-)ordered by cluster hierarchy traversal
    • Tasks: summarize distribution
    • Scalability:
      • Millions of rows (might require preprocessing)
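The derived data for a 2D density plot can be sketched as follows, assuming a simple square grid as the tessellation (hexagons are another common choice). The function names and the points are hypothetical.

```python
from collections import Counter

def density_grid(points, cell_size):
    """Tessellate the plane into square cells and count the points in each one."""
    counts = Counter()
    for x, y in points:
        cell = (int(x // cell_size), int(y // cell_size))
        counts[cell] += 1
    return counts

def brightness(counts):
    """Map each cell's count to a 0..1 value relative to the fullest cell."""
    peak = max(counts.values())
    return {cell: n / peak for cell, n in counts.items()}

# Hypothetical points (assumption, not from the slides).
points = [
    (0.1, 0.2), (0.4, 0.3), (0.2, 0.9),
    (1.5, 1.5), (1.7, 1.2), (1.9, 1.9), (1.1, 1.8),
    (2.5, 0.5),
]

counts = density_grid(points, cell_size=1.0)
shade = brightness(counts)
print(shade)
```

Because only the per-cell counts are drawn, the number of marks depends on the grid resolution rather than on the number of rows, which is what makes the idiom scale.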

Interactive Density Plot

Idiom: Hierarchical Parallel Coordinates

  • Dynamic item aggregation
  • Derived data: hierarchical clustering
  • Encoding:
    • Cluster band with variable transparency, line at mean, width by min/max values
    • Color by proximity in hierarchy
[Hierarchical Parallel Coordinates for Exploration of Large Datasets. Fua, Ward, and Rundensteiner. Proc. IEEE Visualization Conference (Vis ’99), pp. 43–50, 1999.]

Spatial Aggregation

Geo Level

  • Country
  • State
  • City
  • Neighborhood

Aggregation Problems

  • MAUP: Modifiable Areal Unit Problem
  • Gerrymandering (manipulating voting district boundaries) is only one example!
  • Zone effects
  • Scale effects
Gerrymandering
[http://www.e-education.psu.edu/geog486/l4_p7.html, Fig 4.cg.6]
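A toy Python sketch of the zone and scale effects: the same nine votes produce different district-level results depending only on how the boundaries are drawn. The votes and both zonings are hypothetical, made up for illustration.

```python
from collections import Counter

def district_winners(votes, zones):
    """Aggregate individual votes into one winner per zone."""
    winners = []
    for zone in zones:
        tally = Counter(votes[i] for i in zone)
        winners.append(tally.most_common(1)[0][0])
    return winners

# Hypothetical voters: party A has 4 votes, party B has 5.
votes = ["A", "A", "B", "A", "A", "B", "B", "B", "B"]

# Zone effect: same number and size of districts, different boundaries.
zoning_1 = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
zoning_2 = [[0, 2, 5], [1, 6, 7], [3, 4, 8]]

# Scale effect: a single district covering everyone.
one_zone = [list(range(9))]

print(district_winners(votes, zoning_1))
print(district_winners(votes, zoning_2))
print(district_winners(votes, one_zone))
```

Under the first zoning A wins two of three districts, under the second B does, and at the coarsest scale B wins outright, although no individual vote changed.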

Overlapping

  • ZIP codes
  • Disputed borders

Regions

  • Aggregate by commonalities
    • e.g. Agricultural vs. industrial regions
    • e.g. Historically right- vs. left-wing
  • Aggregate by the data attributes

Geo patterns vs. political patterns

  • Risaralda example

Time Aggregation

Date Part vs. Truncate

  • Date part: extract a part of the date
  • Date truncate: cut the date at a certain level
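The difference between the two operations can be sketched with Python's standard `datetime` module. The helper names and the timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical event timestamps (assumption, not from the slides).
events = [
    datetime(2020, 7, 4, 23, 15),
    datetime(2020, 7, 6, 9, 30),
    datetime(2020, 8, 1, 9, 45),
    datetime(2021, 7, 5, 14, 0),
]

def month_part(ts):
    """Date part: keep only the month number, discarding year, day, and time."""
    return ts.month

def truncate_to_month(ts):
    """Date truncate: keep year and month, reset everything finer to its start."""
    return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)

by_part = Counter(month_part(ts) for ts in events)
by_truncated = Counter(truncate_to_month(ts) for ts in events)

print(by_part)
print(by_truncated)
```

The date part merges July 2020 and July 2021 into a single "July" group, which suits recurring patterns; truncation keeps them as separate points on the timeline.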

Date Truncate

  • Different levels can hide seasonality.
  • Sometimes, too much detail is unnecessary.

Truncate dates

Date Part

  • Useful for highlighting human patterns
    • Weekends
    • Night time
    • Holidays
    • Summer vs. winter

Aggregate by date parts

Window Average/Median

Covid Moving Average by state
NY Times, "How Coronavirus Cases Have Risen Since States Reopened," July 9, 2020
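A minimal sketch of a trailing window average and window median in Python. The `rolling` helper is hypothetical, and the daily counts are invented to mimic a one-day reporting spike; they are not the data behind the chart above.

```python
import statistics

def rolling(values, window, reducer):
    """Apply reducer to each trailing window; output starts once a window is full."""
    return [
        reducer(values[i - window + 1 : i + 1])
        for i in range(window - 1, len(values))
    ]

# Hypothetical daily counts with a single spike (assumption).
daily = [0, 0, 9, 0, 0]

window_mean = rolling(daily, 3, statistics.mean)
window_median = rolling(daily, 3, statistics.median)

print(window_mean)
print(window_median)
```

The mean spreads the spike across every window that contains it, while the median ignores it entirely; which behavior is preferable depends on whether the spike is signal or a reporting artifact.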

Dimensionality Reduction

Dimensionality Reduction

  • Attribute aggregation
    • Derive low-dimensional target space from high-dimensional measured space
      • Capture most of variance with minimal error
    • Use when you can’t directly measure what you care about
    • True dimensionality of dataset conjectured to be smaller than dimensionality of measurements
    • Latent factors, hidden variables
Taking tumor measurement data in 9D measured space and running dimensionality reduction derives that data in a 2D target space where it is easier to see groupings of benign and malignant tumors

Dimensionality Reduction for Documents

Dimensionality vs. Attribute Reduction

  • Vocab use in field not consistent
    • Dimension/attribute
  • Attribute reduction: reduce set with filtering
    • Includes orthographic projection
  • Dimensionality reduction (DR): create smaller set of new dimensions/attributes
    • Typically implies dimensional aggregation, not just filtering
    • Vocabulary: projection/mapping

Estimating True Dimensionality

  • How do you know when you would benefit from DR?
    • Consider error for low-dim projection vs. high-dim projection
  • No single correct answer; many metrics proposed
    • Cumulative variance that is not accounted for
    • Strain: match variations in distance (vs. actual distance values)
    • Stress: difference between interpoint distances in high and low dimensions
Stress Function
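The stress formula from the slide image is not reproduced in the text, so the sketch below assumes one common normalized form: the square root of the summed squared differences between high- and low-dimensional interpoint distances, divided by the summed squared high-dimensional distances. The points are hypothetical.

```python
import math
from itertools import combinations

def stress(high, low):
    """Normalized stress between two layouts of the same points (assumed form)."""
    numerator = 0.0
    denominator = 0.0
    for i, j in combinations(range(len(high)), 2):
        d_high = math.dist(high[i], high[j])
        d_low = math.dist(low[i], low[j])
        numerator += (d_high - d_low) ** 2
        denominator += d_high ** 2
    return math.sqrt(numerator / denominator)

# Hypothetical 3D points that all lie in the plane z = 0.
high = [(0, 0, 0), (3, 0, 0), (0, 4, 0)]
low_2d = [(0, 0), (3, 0), (0, 4)]  # drops the unused z axis
low_1d = [(0,), (3,), (0,)]        # also drops y, which carries real structure

print(stress(high, low_2d))
print(stress(high, low_1d))
```

The 2D layout preserves every distance (stress 0), which suggests the data is truly two-dimensional; the 1D layout distorts two of the three distances and the stress rises accordingly.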

Estimating True Dimensionality

  • Scree plots as simple way: error against number of attributes
    • Original dataset: 294 dimensions
    • Estimate: Almost all variance preserved with fewer than 20 dimensions
Scree Plots
[Fig 2. DimStiller: Workflows for dimensional analysis and reduction. Ingram et al. Proc. VAST 2010, pp. 3–10]
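The numbers behind a scree plot can be sketched as the share of variance left unexplained after keeping the first k dimensions. The per-dimension variances below are hypothetical and much smaller than the 294-dimension example; the helper names are assumptions.

```python
from itertools import accumulate

def unexplained_variance(variances):
    """Error curve for a scree plot: variance share NOT captured by the top k dimensions."""
    ordered = sorted(variances, reverse=True)
    total = sum(ordered)
    return [1 - running / total for running in accumulate(ordered)]

def dims_needed(variances, max_error=0.1):
    """Smallest k whose unexplained variance share is at most max_error."""
    for k, error in enumerate(unexplained_variance(variances), start=1):
        if error <= max_error:
            return k
    return len(variances)

# Hypothetical variance per synthesized dimension (assumption).
variances = [50, 30, 15, 3, 1, 1]

errors = unexplained_variance(variances)
print(errors)
print(dims_needed(variances, max_error=0.1))
```

Plotting `errors` against k gives the scree curve; the "elbow" where it flattens is the usual informal estimate of the true dimensionality.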

Dimensionality Reduction and Visualization

  • Why do people do DR?
    • Improve performance of downstream algorithm
      • Avoid curse of dimensionality
    • Data analysis
      • If looking at the output: visual data analysis
  • Abstract tasks when visualizing DR data
    • Dimension-oriented tasks
      • Naming synthesized dimensions, mapping synthesized dimensions to original dimensions
    • Cluster-oriented tasks
      • Verifying clusters, naming clusters, matching clusters and classes
[Visualizing Dimensionally-Reduced Data: Interviews with Analysts and a Characterization of Task Sequences. Brehmer, Sedlmair, Ingram, and Munzner. Proc. BELIV 2014.]

Linear Dimensionality Reduction

  • Principal components analysis (PCA)
    • Finding axes: first with most variance, second with next most, etc.
    • Describe location of each point as linear combination of weights for each axis
      • Mapping synthesized dimensions to original dimensions
Linear Dimensionality Reduction
[http://en.wikipedia.org/wiki/File:GaussianScatterPCA.png]
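A sketch of PCA restricted to two attributes, where the principal axes have a closed form and no linear algebra library is needed. For more attributes one would normally use an eigendecomposition or SVD routine instead. The function name and the sample points are hypothetical.

```python
import math

def pca_2d(points):
    """Return (axis1, axis2, variances, scores) for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 sample covariance matrix.
    sxx = sum(dx * dx for dx, _ in centered) / (n - 1)
    syy = sum(dy * dy for _, dy in centered) / (n - 1)
    sxy = sum(dx * dy for dx, dy in centered) / (n - 1)

    # Angle of the direction with the most variance.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    axis1 = (math.cos(theta), math.sin(theta))
    axis2 = (-math.sin(theta), math.cos(theta))

    # Eigenvalues of the covariance matrix: variance along each new axis.
    half_trace = (sxx + syy) / 2
    spread = math.hypot((sxx - syy) / 2, sxy)
    variances = (half_trace + spread, half_trace - spread)

    # Each point described by its weights along the two new axes.
    scores = [
        (dx * axis1[0] + dy * axis1[1], dx * axis2[0] + dy * axis2[1])
        for dx, dy in centered
    ]
    return axis1, axis2, variances, scores

# Hypothetical points lying exactly on the line y = x.
axis1, axis2, variances, scores = pca_2d([(0, 0), (1, 1), (2, 2), (3, 3)])
print(axis1, variances)
```

Here the first axis points along the diagonal and carries all the variance, so the second weight of every point is zero. The components of `axis1` are the coefficients that map the synthesized dimension back to the original attributes.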

Nonlinear Dimensionality Reduction

  • Pro: can handle curved rather than linear structure
  • Con: lose all ties to original dimensions/attributes
    • New dimensions often cannot be easily related to originals
      • Mapping synthesized dims to original dims task is difficult
  • Many techniques proposed
  • Many literatures: visualization, machine learning, optimization, psychology, etc.
  • Techniques: t-SNE, MDS (multidimensional scaling), charting, isomap, LLE, etc.
  • t-SNE: excellent for clusters
    • But some trickiness remains: see "How to Use t-SNE Effectively" (http://distill.pub/2016/misread-tsne/)
  • MDS: confusingly, entire family of techniques, both linear and nonlinear
    • Minimize stress or strain metrics
    • Early formulations equivalent to PCA

t-SNE Explorations

http://distill.pub/2016/misread-tsne/

Interactive t-SNE

Project by Fabián Peña
MLExplore.js: Exploring High-Dimensional Data by Interacting and Interpreting t-SNE and K-Means

Embed, Focus+Context

Embed: Focus+Context

  • Combine information within single view
  • Elide
    • Selectively filter and aggregate
  • Superimpose layer
    • Local lens
  • Distortion design choices
    • Region shape: radial, rectilinear, complex
    • How many regions: one, many
    • Region extent: local, global
    • Interaction metaphor
elide data, superimpose data, distort geometry

Idiom: DOITrees Revisited

  • Elide
    • Some items dynamically filtered out
    • Some items dynamically aggregated together
    • Some items shown in detail
[DOITrees Revisited: Scalable, Space-Constrained Visualization of Hierarchical Data. Heer and Card. Proc. Advanced Visual Interfaces (AVI), pp. 421–424, 2004.]

Idiom: Fisheye Lens

  • Distort geometry
    • Shape: radial
    • Focus: single extent
    • Extent: local
    • Metaphor: draggable lens
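A radial, local fisheye lens can be sketched with the Sarkar and Brown distortion function g(x) = (d + 1)x / (dx + 1), where x is the normalized distance from the focus. The slides do not say which function the demo uses, so this choice, the function name, and the default distortion factor are assumptions.

```python
import math

def fisheye(point, focus, radius, distortion=3.0):
    """Push a point away from the focus if it lies inside the lens radius."""
    dx = point[0] - focus[0]
    dy = point[1] - focus[1]
    dist = math.hypot(dx, dy)
    # Local extent: the focus itself and everything outside the lens stay put.
    if dist == 0 or dist >= radius:
        return point
    x = dist / radius  # normalized distance in (0, 1)
    magnified = (distortion + 1) * x / (distortion * x + 1)
    scale = magnified * radius / dist
    return (focus[0] + dx * scale, focus[1] + dy * scale)

inside = fisheye((5.0, 0.0), focus=(0.0, 0.0), radius=10.0)
outside = fisheye((20.0, 0.0), focus=(0.0, 0.0), radius=10.0)
print(inside, outside)
```

A point halfway to the lens edge moves to 80% of the radius, so the area near the focus is magnified and the area near the rim is compressed. Dragging the lens amounts to recomputing positions with a new `focus`.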

Fisheye

https://bost.ocks.org/mike/fisheye/ by mbostock

Idiom: Stretch and Squish Navigation

System: TreeJuxtaposer
  • Distort geometry
    • Shape: rectilinear
    • Foci: multiple
    • Impact: global
    • Metaphor: stretch and squish, borders fixed
[https://youtu.be/GdaPj8a9QEo]
[TreeJuxtaposer: Scalable Tree Comparison Using Focus+Context With Guaranteed Visibility. Munzner, Guimbretiere, Tasiran, Zhang, and Zhou. ACM Transactions on Graphics (Proc. SIGGRAPH) 22:3 (2003), 453–462.]

Distortion Costs and Benefits

  • Benefits
    • Combine focus and context information in single view
  • Costs
    • Length comparisons impaired
      • Network/tree topology comparisons unaffected: connection, containment
    • Effects of distortion unclear if original structure unfamiliar
    • Object constancy/tracking may be impaired
[https://www.youtube.com/watch?v=hm2oFBqVM9o]
[Living Flows: Enhanced Exploration of Edge-Bundled Graphs Based on GPU-Intensive Edge Rendering. Lambert, Auber, and Melançon. Proc. Intl. Conf. Information Visualisation (IV), pp. 523–530, 2010.]

Exploratory Data Analysis (EDA)

What's in the Data?

Tukey

Exposure, the effective laying open of the data to display the unanticipated, is to us a major portion of data analysis. Formal statistics has given almost no guidance to exposure; indeed, it is not clear how the informality and flexibility appropriate to the exploratory character of exposure can be fitted into any of the structures of formal statistics so far proposed.

Nothing—not the careful logic of mathematics, not statistical models and theories, not the awesome arithmetic power of modern computers—nothing can substitute here for the flexibility of the informed human mind. Accordingly, both approaches and techniques need to be structured so as to facilitate human involvement and intervention.

Summary Statistics

  • Useful to look at clean data that you understand and trust
  • Can be misleading
  • Remember the datasaurus!
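A small illustration of how summary statistics can mislead, in the spirit of the datasaurus: two hypothetical datasets (not the datasaurus data) share the same mean and standard deviation yet have different shapes.

```python
import statistics
from collections import Counter

# Hypothetical values chosen so that both lists have mean 5 and standard deviation 2.
spread_out = [2, 4, 4, 4, 5, 5, 7, 9]
spiky = [1, 5, 5, 5, 5, 5, 5, 9]

def summarize(values):
    """Mean and population standard deviation, the usual first-look summary."""
    return (statistics.mean(values), statistics.pstdev(values))

print(summarize(spread_out), summarize(spiky))
print(Counter(spread_out))
print(Counter(spiky))
```

The summaries are identical, but the counts show that one dataset is spread over several values while the other is a single spike with two extremes. A histogram would reveal this at a glance.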

Data Munging

60%

Data Munging

Data Munging (cont.)

Data Quality Hurdles

  • Missing data
  • Erroneous values
  • Type conversion
  • Entity resolution
  • Data integration

More Bad Data

Data Filtering

5.77967973162 3.26834145824 0.06418251738 4.38979192127 4.68302244707 4.82366715649 4.68587041117 0.04360063509 5.90498807235 4.3618070355 0.0017977901 4.9891841837 4.56259294774 5.44050157565 5.19592386044 15.6959515181 3.22732340991 5.57228018649 3.7148892443 5.00286245308 4.68302244707 4.82366715649 4.68587041117 0.04360063509 5.90498807235 4.68302244707
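Problem values are hard to spot in a raw list like the one above. One possible way to flag them, sketched here, uses the median absolute deviation with the conventional 0.6745 scaling and 3.5 cutoff; the rule, the function name, and the thresholds are assumptions, not something the slides prescribe.

```python
import statistics

# The values listed on the slide above.
readings = [
    5.77967973162, 3.26834145824, 0.06418251738, 4.38979192127,
    4.68302244707, 4.82366715649, 4.68587041117, 0.04360063509,
    5.90498807235, 4.3618070355, 0.0017977901, 4.9891841837,
    4.56259294774, 5.44050157565, 5.19592386044, 15.6959515181,
    3.22732340991, 5.57228018649, 3.7148892443, 5.00286245308,
    4.68302244707, 4.82366715649, 4.68587041117, 0.04360063509,
    5.90498807235, 4.68302244707,
]

def flag_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds the threshold (assumed rule)."""
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    return sorted(v for v in values if 0.6745 * abs(v - center) / mad > threshold)

flagged = flag_outliers(readings)
print(flagged)
```

The rule flags the one large value and the handful of near-zero values, which are exactly the items a histogram of the same data would make obvious.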

The First Sign That a Visualization Is Good Is That It Shows You a Problem in Your Data

Wattenberg

Data Transformations and Iteration

Looks Like This

Think of It as a Data Cube

Common Transformations

  • Normalize
  • Log
  • Power
  • Binning
  • Grouping
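The transformations listed above can each be sketched in a line or two of Python. The helper names, the bin edges, and the sample values are hypothetical; min-max scaling is used for "normalize" as one of several possible meanings.

```python
import bisect
import math
from collections import Counter

def min_max_normalize(values):
    """Normalize: rescale to the 0..1 range (assumes at least two distinct values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def log10_transform(values):
    """Log: compress large ranges (requires strictly positive values)."""
    return [math.log10(v) for v in values]

def power_transform(values, exponent=0.5):
    """Power: exponent 0.5 is a square root, a milder compression than log."""
    return [v ** exponent for v in values]

def bin_labels(values, edges, labels):
    """Binning: replace each value with the label of the interval it falls in."""
    return [labels[bisect.bisect_right(edges, v)] for v in values]

# Hypothetical skewed values spanning three orders of magnitude.
values = [1, 10, 100, 1000]

normalized = min_max_normalize(values)
logged = log10_transform(values)
rooted = power_transform(values)
binned = bin_labels(values, edges=[10, 100], labels=["low", "mid", "high"])
grouped = Counter(binned)  # Grouping: count items per bin label

print(normalized, logged, rooted, binned, grouped)
```

After min-max scaling three of the four values still sit near zero, while the log transform spaces them evenly, which is why log scales are a common first try for skewed data.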

Histograms, histograms, histograms

A cornerstone in the EDA toolbox!

“Above all else show the data.” - Tufte

Correlation

Hypothesis Generation

Mantras

  • Be skeptical: What assumptions have been made?
  • Explore iteratively: Start simple, keep asking questions.
  • Avoid fixation: Use a variety of graphics to inspect more angles.

Paradoxes

Which One Has the Real Data?

Graphical Inference for Infovis

Iteration Demo

Check on NaNs

Pollyanna?


What We Learned

  • Reduce
    • Items
    • Attributes
  • Aggregation
    • Item
    • Spatial
    • Time
  • Dimensionality Reduction
  • Embed, Focus and Context
  • Exploratory Data Analysis