Data Science Professionals: A Global Community of Sharing

Sylvie Lafortune

Sylvie Lafortune reports on the 37th annual conference of the International Association for Social Science Information Services and Technology (IASSIST), held over 30 May – 3 June 2011 in Vancouver, British Columbia, Canada.

The IASSIST [1] Conference is a long-standing annual event which brings together researchers, statistical analysts as well as computer and information professionals interested in all aspects of research data, from discovery to reuse. This 37th meeting spanned five days during which participants could attend workshops, IASSIST business meetings and a myriad of presentations. This year, the event focused on the sharing of tools and techniques which ‘improves capabilities across disciplines and along the entire data life cycle’. It was hosted jointly by the libraries of Simon Fraser University and the University of British Columbia in the magnificent city of Vancouver.

For those readers who seldom work with data, I am including this interesting definition provided by Katherine McNeill, Massachusetts Institute of Technology (MIT), during her talk at the Pecha Kucha [2]: ‘Data are a bit like garlic. They look like a vegetable, their taste has a bit of a kick and you need a special tool to get the most out of them.’

Pre-conference Committee Meetings and Workshops

30 – 31 May 2011

A number of committee meetings of organisations such as the Data Documentation Initiative (DDI [3]) and the International Federation of Data Organizations for the Social Science took place on the first day. Nine workshops were offered on the second day, providing participants with the opportunity to learn in a hands-on setting. Some of the sessions included basic survey design, data management planning, mapping census data with ArcGIS tools, an introduction to Nesstar tools (data publishing and online analysis system) and R, the open source statistical analysis software.

Plenaries, Presentations, Pecha Kucha and Posters

1 – 3 June 2011

The main conference took place over three days and featured three plenaries, 28 concurrent sessions with more than 110 presentations, a 45-minute Pecha Kucha and a poster session ... a considerable tour de force! First-time IASSISTers like myself had to navigate through presentation titles such as ‘Creating a Linchpin for Financial Data: The need for a Legal Entity Identifier (LEI)’, ‘Arisddi, A Resource Instantiation System for DDI’ and ‘Keeping the User out of the ditch: The Importance of Front-End Alignment’ and, unfortunately, make difficult choices.

Opening Plenary: A Look at Census Taking in Canada

Ian McKinnon, Chair of the National Statistics Council

Ian McKinnon, Chair of the National Statistics Council [4], opened the conference by providing his perspective on the events surrounding the cancellation of the long form census in Canada and the subsequent implementation of the voluntary National Household Survey. McKinnon refers to this period, which led to the resignation of Chief Statistician, Munir Sheikh, in the summer of 2010, as the most ‘turbulent’ in the history of the census in Canada [5]. No one can predict to what extent the response bias (moving from a 20% sample mandatory census to a voluntary survey) will affect the quality of the data.

However, McKinnon commented on some of the socio-economic indicators which are more obviously at risk: urban aboriginals, new immigrants, young people transitioning from school to work, etc. Furthermore, public services such as local transit, health agencies and school boards will no longer be able to rely with confidence on data from the census. McKinnon emphasised the growing consciousness of governments around the world of concerns about the intrusiveness of the public census, and commented on the future of good quality socio-economic data collected by governments. He indicated that countries such as Canada will have to balance change with consistent reliable data points.

Co-organisers of the 37th IASSIST Conference: Mary Luebbe, Data Services Librarian, University of British Columbia Libraries and Walter Piovesan, Director, Research Data Services, Simon Fraser University.

McKinnon concluded by stating that models of administrative data collection such as population and dwelling registers could be increasingly adopted worldwide. How this will affect research in the Humanities and Social Sciences remains to be seriously examined.

Noteworthy Trends and Developments in Social Science Data

Data Discovery Tools

Researchers are increasingly demanding high-quality data and documentation which they can access themselves. So it is not surprising that many sessions focused on data standards and discovery tools. In fact, work in these areas seems to be so fast-paced that three sessions were needed to cover recent developments in the implementation of the Data Documentation Initiative (DDI), an international standard for describing data from the social sciences. Presentations ranged from new applications used to convert MS Word-based questionnaires to DDI, to emerging open source metadata management platforms for data repositories. Another session covered current developments in the description of qualitative data and the use of the QuDEX schema, an XML standard for qualitative data exchange.

Still another session had presentations on less conventional data discovery tools such as a data-exposing, -sharing and -analysis tool based on Linked Open Data (Leibniz Institute for the Social Sciences), and a visualisation application for spatial information that extends the traditional analysis methods used in the social sciences (Australian Data Archive). A final session emphasised work currently done on controlled vocabularies and ontologies to enrich metadata. Several presentations discussed the use of the Resource Description Framework (RDF) and text-mining algorithms such as Recommind Mindserver to automate categorisation tasks.

Data Management

With the establishment of new data management regulations in many countries, a number of libraries and organisations supporting research are now in the process of creating data management services to meet the needs of their researchers. A first session discussed current developments and pilot projects in various institutions. Activities are generally centred on developing consulting services for researchers, building support at the institutional level and creating partnerships, promotion and infrastructure.

A second session featured a panel discussion on the National Science Foundation Data Management Plan requirement established in 2010. Panelists emphasised the fact that, like many government-initiated data management plans, the NSF plan is broadly worded and although this is meant to encourage the development of community-driven standards, it often leads to uncertainty as to how to respond in the short term.

A third session included presentations on data management planning (DMP) approaches in countries which are leaders in this process: the UK, US and Australia. In the UK, a Web-based data management planning tool called DMP Online was developed by the Digital Curation Centre. In the US, the Inter-university Consortium for Political and Social Research (ICPSR) published a list of elements for creating a DMP following an extensive gap analysis of existing recommendations for data management plans. In Australia, the University of Technology Sydney developed a DMP based on a user-needs approach rather than relying on a compliance-based system. This work produced data management checklists and guidelines, guides to data archives, metadata approaches as well as protocols and tools for promotion.

Finally, a fourth session on DMP had presentations on three specific projects aiming to curate, manage and share data. The first project involves building an open data repository for social science experimental data at the Institution for Social and Policy Studies (ISPS) at Yale University. The second project is called the Research Data MANTRA Project (2010-2011). It has produced extensive online learning materials which reflect best practice in research data management in three disciplines: social science, clinical psychology and geosciences. An important goal of this project is to train postgraduate students and early career researchers in DMP. The third project is a case study for qualitative research data management, sharing and reuse. This is an interesting piece of work led by the Map & Data Library at the University of Toronto which involves original research data consisting of 30 interviews with the then leading pioneers in Canadian Sociology, all born before 1930.

Data without Boundaries (DwB) Project

Governments are important producers of highly detailed microdata, but access to these data, even from within a country, often involves complex procedures to protect the anonymity of respondents. With international studies currently on the rise, how can researchers access data across boundaries in a reasonable and timely way? A session chaired by Roxane Silberman of the French National Data Committee for Humanities and Social Sciences and coordinator of the DwB Project was devoted to this EU 7th Framework Programme project, which deals with ‘data across boundaries’ issues within the European context.

The DwB Project involves 27 partners from 12 European countries and the goal is to create an integrated model for accessing official data from all countries. On a practical level, this means implementing an easy and single point of access for researchers, to be linked to the CESSDA [6] portal. This project has enormous implications for other countries and for research in the humanities and social sciences.

Teaching Data in the Library and across the University

Traditionally, data service providers have been involved in helping users with data access and analysis. As such, many have developed very particular skills in a range of statistical methods, programs and data sources. One well-attended session looked at the emerging trend of teaching data in universities.

Four presentations focused on best practice in teaching with data. The first discussed the role of the library in promoting statistical and data literacy and the experiences of embedded librarians, such as the business and data librarians at the University of North Carolina at Greensboro. The second looked at models and opportunities such as course-specific instruction, workshops and instructional partnerships across the campus. The third looked more specifically at strategies for teaching spatial data resources and software. The final presentation discussed data and statistical skills for UK social science students and how they can be acquired through the use of data resources like the Census and the Economic and Social Data Service.

Data Citation

Another interesting trend in the world of data is the push to encourage good data citation practice. Why should datasets be cited? To enhance access, sharing and management of research data. One session organised by the IASSIST Special Interest Group on Data Citation (SIGDC) discussed tracking data use and reuse in the literature, building data citations for discovery, and the efforts of the Inter-university Consortium for Political and Social Research (ICPSR) to influence the whole community of data users to make data citation common practice. Another session focused on the work of DataCite, an international consortium which, according to its site, ‘supports researchers by helping them to find, identify, and cite research datasets with confidence’.

Second Plenary: Research Data Infrastructure

Chuck Humphrey, Data Library, University of Alberta

Introduced as a ‘data warrior’, Chuck Humphrey’s reputation comes not only from his passion for data, but from the countless hours of work he has contributed to national and international committees dealing with issues of data throughout his career. His presentation focused on the current state of research data infrastructure on a global scale and where the social sciences fit within the broader e-Science landscape. His starting point was: ‘What does infrastructure mean today?’ Through his study of one event and four initiatives [7], all of which have recently taken place or started, he has come to the conclusion that

there is no consensus on defining infrastructure;
research data infrastructure has become associated with e-Science and cyberinfrastructure;
infrastructure can be defined within a two-dimensional framework: technological-social and local-global; and
the difficult task is achieving the social change that is a vital part of infrastructure.

Humphrey’s final conclusion was that the social sciences are definitely on main street as evidenced by the many exciting initiatives happening internationally around social science research data infrastructure.

Closing Plenary: Open Data in Vancouver

The closing plenary was given by Ms. Andrea Reimer, Councillor for the City of Vancouver. An advocate of open government and open data, Reimer started by sharing her conviction that the quality of decisions made by governments is only as good as the quality of information available to the people. She described her involvement in Vancouver’s ‘Open City 3’ initiative, where the citizens of Vancouver are encouraged to get involved, be active and empowered. How can this be achieved?

Councillor Reimer started by getting a motion on open data approved at City Council. This motion included three parts:

Open and accessible data;
Open standards; and
Open source software.

What were the outcomes of open data for the city? She has observed increasing interaction with citizens to create innovative services. For example, apps were created by Vancouverites for locating parking spaces and getting information on waste disposal. What could be improved? Reimer admitted that promoting the use of data remained a challenge. Even more difficult was getting people to move beyond a passive use of information to an active one.

Conclusion

What have I learned from this first experience at an IASSIST conference? IASSISTers are passionate people! Although this was a large-scale event, there was much sharing and learning going on. Sessions were wide-ranging in topics and levels of required technical knowledge, so there was as much content for the neophyte as for the experienced data professional.

References

[1] See the IASSIST Web site for a complete description of its membership, activities and publications http://iassistdata.org/
[2] A Pecha Kucha is a style of presentation that involves showing 20 pictures for 20 seconds each.
[3] The Data Documentation Initiative (DDI) is an effort to create an international standard in XML for metadata describing social science data. For more information on DDI, refer to the organisation’s Web site http://www.ddialliance.org/
[4] For information on the National Statistics Council in Canada: http://unstats.un.org/unsd/dnss/docViewer.aspx?docID=193#start
[5] For more details on the cancellation of the census long form in Canada: http://www.savethecensus.ca
[6] Council of European Social Science Data Archives: http://www.cessda.org/
[7] The event is the SPARC Digital Repositories Meeting, Nov. 2010. The four initiatives are: OECD Global Science Forum on Data and Social Science Infrastructure; Canadian Association of Research Libraries Research Data Management Infrastructure; Canadian Research Data Strategy Working Group Data Summit; Canadian International Polar Year (IPY) Data Assembly Centre Network.

Author Details

Sylvie Lafortune
Government Information and Data Librarian
Laurentian University
Sudbury
Ontario
Canada

Email: slafortune@laurentian.ca
Web site: http://www.laurentian.ca/library/

Sylvie Lafortune is an Associate Librarian at Laurentian University of Sudbury, one of the two bilingual universities in Canada. Her main area of responsibility is Government Information, Data and GIS. She is currently Chair of the Department of Library and Archives at her institution. For the past few years, she has enjoyed being the faculty advisor for the World University Service of Canada Local Committee at Laurentian.