1. Big Data Position1

Introduction and definition
Big Data refers to datasets that cannot be captured, stored, managed and analysed by means of conventional database software. Big Data is therefore a subjective rather than a strictly technical definition, because it does not involve a quantitative threshold (e.g. in terms of terabytes), but instead a moving technological one. Keeping that in mind, the definition of Big Data in many sectors ranges from a few terabytes2 to multiple petabytes3. The definition of Big Data does not merely involve the use of very large data sets; it also concerns a computational turn in thought and research (Burkholder, L, ed. 1992). As Latour (2009) states, when the tools change, the entire social theory that goes with them changes as well. In this view Big Data has emerged as a system of knowledge that is already changing the objects of knowledge themselves, as it has the capability to inform how we conceive human networks and communities. Big Data creates a radical shift in how we think about research itself. As argued by Lazer et al. (2009), not only are we offered the possibility to collect and analyse data at an unprecedented depth and scale, but there is also a change in the processes of research, the constitution of knowledge, the engagement with information and the nature and categorisation of reality. The potential stemming from the availability of a massive amount of data is exemplified by Google. It is widely believed that the success of the Mountain View company is due to its brilliant algorithms, e.g. PageRank. In reality the main novelties introduced in 1998, which gave rise to second-generation search engines, were the recognition that hyperlinks are an important measure of popularity and the use of the text of hyperlinks (anchortext) in the web index, giving it a weight close to that of the page title. First-generation search engines used only the text of the web pages, while Google added two data sets (hyperlinks and anchortext), so that even a less than perfect algorithm exploiting this additional data would obtain roughly the same results as PageRank. Another example is Google's AdWords keyword auction model. Overture had previously shown that ranking advertisers for a given keyword purely on their bids was a viable mechanism. Google improved the tool by adding data on the clickthrough rate (CTR) of each advertiser's ad, so that advertisers were ranked by the product of their bid and their CTR (a minimal sketch of this bid-times-CTR ranking appears after the list of governance applications below).

Why it matters in governance
Big Data also has a huge impact on governance and policy making, since its benefits apply to a wide variety of domains:
- Health care: making care more preventive and personalised by relying on home-based continuous monitoring, thereby reducing hospitalisation costs while increasing quality; detection of infectious disease outbreaks and epidemic developments
- Education: by collecting data on students' performance, it becomes possible to design more effective teaching approaches. The collection of these data is made possible by the massive Web deployment of educational activities
- Urban planning: huge high fidelity geographical datasets describing people and places are generated from administrative systems, cell phone networks, or other similar sources.
- Intelligent transportation based on the analysis and visualization of road network data, so as to implement congestion pricing systems and reduce traffic
- The use of ubiquitous data collection through sensor networks in order to improve environmental modelling
- Analysis and clarification of energy use patterns through data analytics and smart meters, which can support energy-saving policies and help avoid blackouts
- Integrated analysis of contracts to find relations and dependencies among financial institutions, so as to assess systemic financial risk
- Analysis of conversations in social media and networks, as well as of financial transactions carried out by alleged terrorists, which can be used for homeland security
- Assessment of computer security by means of the analysis of logged information, i.e. Security Information and Event Management
- Better tracking of food and pharmaceutical production and distribution chains
- Collection of data on water and sewer usage in order to reduce water consumption by detecting leaks
- Use of sensors, GPS, cameras and communication systems for crisis detection, management and response
- Use of sensor data for carbon footprint management
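The bid-times-CTR ranking described in the introduction is easy to illustrate. The sketch below, using invented advertiser names, bids and clickthrough rates, ranks ads by the product of bid and CTR rather than by bid alone; it is a toy illustration of the general idea, not of Google's actual auction mechanics.

```python
# Minimal illustration of ranking advertisers by bid * clickthrough rate (CTR)
# rather than by bid alone. Advertisers and figures are invented for the example.

ads = [
    {"advertiser": "A", "bid": 2.50, "ctr": 0.010},
    {"advertiser": "B", "bid": 1.20, "ctr": 0.040},
    {"advertiser": "C", "bid": 3.00, "ctr": 0.005},
]

# Expected revenue per impression is bid * CTR; sort descending on that score.
ranked = sorted(ads, key=lambda ad: ad["bid"] * ad["ctr"], reverse=True)

for position, ad in enumerate(ranked, start=1):
    score = ad["bid"] * ad["ctr"]
    print(f"{position}. advertiser {ad['advertiser']}: score {score:.4f}")
```

Note how advertiser B, despite the lowest bid, wins the top slot because its high CTR gives the largest expected value per impression.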
Policy Applications of Big Data Tools
There is a growing body of evidence highlighting the applications of Big Data not only in traditional hard science and business, but also in policy making, owing to the predictive power of the data. Let us consider some applications:
- Predictability of human behaviour and social events. A research team from Northwestern University4 was able to predict people's location based on mobile phone information generated from past movements. Moreover, Pentland from MIT5 conducted research showing that mobile phones can be used as sensors for predicting human behaviour, as they can quantify human movements and thereby explain, for example, changes in commuting patterns caused by unemployment. More recently, another research team from Northeastern University was able to predict the voting outcome of a famous US television programme (American Idol) from Twitter activity during the time span between the show airing and the end of the subsequent voting period6
- Public health. Online data can be used for syndromic surveillance, also called infodemiology7. As an example, Google Flu Trends is a tool based on the prevalence of Google queries for flu-like symptoms. As shown by Ginsberg et al. (2008)8, it is possible to use search queries to detect influenza epidemics in areas with a large population of web search users. In fact, according to the US Centers for Disease Control and Prevention (CDC)9, a greater availability of data coming from online queries can help detect epidemic outbursts before laboratory analysis. A related tool is Google Dengue Trends. Along the same lines, an analysis of health-related tweets in the US by Paul and Dredze (2011)10 found a high correlation between the modelled and the actual flu rate (a minimal sketch of this kind of correlation check appears after this list of applications). In the same way, Twitter data can be analysed to study the geographic spread of a virus or disease11. Finally, Healthmap12 combines data from online news, eyewitness reports, expert-curated discussions and official reports to give a thorough view of the current global state of infectious diseases, visualised on a map
- Global food security. The Food and Agriculture Organization of the UN (FAO) is chartered with ensuring that the world's knowledge of food and agriculture is available to those who need it, when they need it and in a form they can access and use13. The human population will approach 9 billion by 2050, so it will be necessary to put in place policies aimed at ensuring a sufficient and fair distribution of resources: world food production will have to increase by 60%, by raising agricultural production and fighting water scarcity. The online data portal to be launched by FAO will enhance planners' and decision makers' capacity to estimate agricultural production potentials and variability under different climate and resource scenarios
- Environmental analysis. At the most recent United Nations climate conference (COP 17), held in 2011, the European Environment Agency (EEA), the geospatial software company Esri and Microsoft presented the Eye on Earth network14, an online site and group of services that scientists, researchers and policy makers can use to share and analyse environmental and geospatial data. Three other projects launched by these institutions at COP 17 are WaterWatch (using the EEA's water data), AirWatch (based on the EEA's air quality data) and NoiseWatch, which combines environmental data with user-generated information provided by citizens. Moreover, during the 2010 United Nations climate meeting (COP 16) Google launched its own satellite and mapping service, Google Earth Engine15, which combines a computing platform, an open API and 25 years of satellite imagery. All these tools will be available to scientists, researchers and governmental agencies for analysing environmental conditions in order to make sustainability decisions. Using these tools, the government of Mexico created a map of the country's forests incorporating 53,000 Landsat images, which the federal authority and NGOs can use to make decisions about land use and sustainable agriculture.
- Crisis management and anticipation. In the aftermath of the Haiti earthquake16, a European Commission Joint Research Centre team used the damage reports mapped on the Ushahidi-Haiti platform17 to show that crowdsourced data can help predict the spatial distribution of structural damage in Port-au-Prince. Their model, based on 1,645 crowdsourced SMS reports, almost perfectly predicts the structural damage of the most affected areas reported in the World Bank-UNOSAT-JRC damage assessment, which was performed by 600 experts from 23 countries over 66 days on the basis of high-resolution aerial imagery of structural damage. As for future developments, some researchers18 highlight the fact that Big Data can be used for crisis management and anticipation by building crisis observatories, i.e. laboratories devoted to collecting and processing enormous volumes of data on both natural systems and human techno-socio-economic systems, so as to gain early warnings of impending events. With that capacity it would be possible to set up Crisis Observatories for financial and economic crises, armed conflicts, crime and corruption, social crises, health risks and disease spreading, and environmental changes.
- Global Development. An inspiring example is Global Pulse19, a Big Data based innovation programme fostered by the UN Secretary-General and aimed at harnessing today's new world of digital data and real-time analytics in order to foster international development, protect the world's most vulnerable populations, and strengthen resilience to global shocks. The programme rests on three main pillars: research on new data indicators providing real-time understanding of communities' welfare as well as real-time feedback on policies; the creation of a toolkit of free open-source software for mining real-time data useful for shared evidence-based decisions; and the establishment of country-level innovation centres (Pulse Labs) where real-time data are applied to development challenges. The programme encompasses five main projects carried out with several partners:
- “Daily Tracking of Commodity Prices: the e-Bread Index”20, which investigates how scraping online prices could provide real-time insights into price dynamics
- “Unemployment through the Lens of Social Media”21, which relates unemployment statistics to unemployment-related conversations on the open social web
- “Twitter and the Perception of Crisis Related Stress”22, which investigates which indicators can help in understanding people's concerns about food, fuel, finance and housing
- “Monitoring Food Security Issues through New Media”23, which finds emerging trends related to food security using text analysis, semantic clustering and network theory
- “Global Snapshot of Wellbeing – Mobile Survey”24, aimed at experimenting with new tools capable of replicating the standards of traditional household surveys in real time on a global scale
- Intelligence and security. As examples of governments' commitment to Big Data for national security we can cite the Cyber-Insider Threat (CINDER)25 program, which aims at developing new ways of detecting cyber espionage activities in military computer networks and at increasing the accuracy, rate and speed with which cyber threats are detected. Another example is the Anomaly Detection at Multiple Scales (ADAMS)26 program led by the Defense Advanced Research Projects Agency (DARPA), which addresses the problem of anomaly detection and characterisation in massive data sets. The program will initially be applied to insider-threat detection, in which individual actions are recognised as anomalous against a background of routine network activity. Finally, the Center of Excellence on Visualization and Data Analytics (CVADA) of the Department of Homeland Security (DHS) is leading a research effort on data that first responders can use to tackle natural disasters and terrorist attacks, that law enforcement can apply to border security concerns, and that can help detect explosives and cyber threats.
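The public-health example above rests on a simple statistical check: how strongly a query-based or tweet-based signal tracks officially reported flu rates. The sketch below computes the Pearson correlation between two invented weekly series; it illustrates the kind of validation reported by Ginsberg et al. and by Paul and Dredze, not their actual pipelines.

```python
import numpy as np

# Invented weekly series: a search/tweet-based flu signal and the officially
# reported influenza-like-illness (ILI) rate for the same weeks.
query_signal = np.array([0.8, 1.1, 1.9, 2.7, 3.5, 3.1, 2.2, 1.4])
reported_ili = np.array([0.9, 1.0, 1.7, 2.9, 3.6, 3.0, 2.0, 1.5])

# Pearson correlation coefficient between the two series.
r = np.corrcoef(query_signal, reported_ili)[0, 1]
print(f"Correlation between online signal and reported flu rate: {r:.3f}")
```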
An Interesting Application: Smart Cities
A Smart City is a public administration or authority delivering ICT-based services and infrastructure which are easy to use, efficient, responsive, open and environmentally sustainable. We can identify six main dimensions27:
- Smart economy, characterised by a high standard of living and competitive elements: innovation and entrepreneurship, high productivity, flexibility of the labour market, internationalism, ability to transform;
- Smart mobility, i.e. efficient public transportation system, local and international accessibility, availability of ICT-infrastructure, sustainability and safety;
- Smart environment (sustainability of natural resources): low pollution, protection of environment, natural attractiveness;
- Smart people, given by high level of human and intellectual capital, high level of qualification, lifelong learning, social and ethnic diversity, flexibility, creativity;
- Smart living (high quality of life): presence of cultural facilities, healthy environmental conditions, individual safety, housing quality, education facilities, touristic attractiveness and social cohesion;
- Smart governance given by citizens’ participation in decision-making, the presence of public and social services and of transparent and open governance.
The combination of all the benefits stemming from Big Data in governance makes clear that the integration of heterogeneous data from various domains bears high potential to provide insights on cities. New technologies will unlock massive amounts of data about all aspects of the city as well as its citizens. For instance, new systems tracking energy use at fixed locations (such as homes and offices) are being implemented by means of smart metering and the integration of the various information systems used to record pricing and activity. Another possibility is the extraction of positional and frequency data from social media such as Twitter, Facebook, Flickr and Foursquare. All these data will be used to fulfil Smart City targets. Consider for instance the transportation system, where diagnosing and anticipating abnormal events such as traffic congestion requires the integration of various data, such as traffic data, weather data, road conditions and traffic light strategies. Further possibilities are offered by e-inclusion technologies and open data for governance. One important large-scale example of the development of the Smart City concept is the New York City project "Roadmap for a Digital Future"28, which outlines a path to build on New York City's successes and establish it as the world's top-ranked Digital City, based on indices of internet access, open government, citizen engagement and digital industry growth.

Recent Trends
Big Data is a fast growing phenomenon: as Google CEO Eric Schmidt pointed out in 2010, as much information is now created in two days as was created from the appearance of man up to 2003. Nowadays29 it is possible to store all the world's music on a disk drive worth $600, while 30 billion pieces of content are shared on Facebook every month. According to forecasts, global data will grow at a rate of around 40% a year, while total IT spending will grow by just 5%. In 2010 alone, users and companies stored more than 13 exabytes of new data, over 50,000 times the data held in the Library of Congress. Big Data is also a potential booster for the economy, bearing a $300 billion potential annual value to US health care, a $600 billion potential annual consumer surplus from using personal location data globally, and a €250 billion potential annual value to European public administration. The European Commission is accordingly expected to adopt an Open Data Strategy, i.e. a set of measures aimed at increasing government transparency and creating a €32 billion a year market for public data. Finally, as reported last year by the McKinsey Global Institute30, the United States will need 140,000 to 190,000 more workers with deep analytical expertise and 1.5 million more data-literate managers. The McKinsey Global Institute also estimates the potential value of global personal location data at $700 billion to end users, and foresees up to a 50% decrease in product development and assembly costs. What is the growth engine of Big Data? On one side, more "old world" data is produced through open governance and digitisation. On the other side, "new world" data are continuously collected in domains such as "in silico" medicine, "in silico" engineering and internet science. Brand new fields of science are being created: computational chemistry, biology, economics, engineering, mechanics, neuroscience, geophysics, and so on.
The same is true in the humanities, with the birth of computational social science based on mobile phone and social network digital traces. A wide array of actors, including humanities and social science academics, marketers, governmental organisations, educational institutions and motivated individuals, are now engaged in producing, sharing, interacting with, and organising data. All these developments are enabled by the rise of new technologies and sources for data collection: web logs; RFID; sensor networks; social networks and social data (the so-called social data revolution); Internet text and documents; Internet search indexing; call detail records; scientific data from astronomy, atmospheric science, genomics, biogeochemistry and biology; military surveillance; medical records; photography and video archives; large-scale eCommerce.

Inspiring cases
- The Ion Proton™ Sequencer31.
- The NIH Human Connectome Project32
- The Models of Infectious Disease Agent Study33
- MyTransport.sg34
- UN Global Pulse35
Tools on the market
Freely available tools
There are not many freely available tools for Big Data analysis on the market. The presence of freely available tools bears many benefits:
- developers and analysts can use them to experiment with emerging types of data structures so as to develop new and different analytical procedures
- developers and IT professionals contribute their findings and know-how back into the industry to drive knowledge exchange
These freely available tools make it possible to overcome data limitations, simplify the analytical process and visualise results. The functionalities provided by this software include:
- a massively parallel processing (MPP) database product for large-scale analytics and next-generation data warehousing
- data-parallel implementations of statistical and machine learning methods (a minimal data-parallel sketch follows this list)
- visual data mining and modelling
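As a concrete illustration of the data-parallel approach mentioned in the list above, the following sketch splits a dataset into chunks, computes partial sums on separate worker processes and combines them into a global mean. It is a minimal example of the pattern, not the implementation used by MADlib or Greenplum.

```python
# Minimal data-parallel computation: partial statistics are computed on chunks
# of the data in separate processes and then combined into a global result.
from multiprocessing import Pool

def partial_stats(chunk):
    """Return (sum, count) for one chunk of the data."""
    return sum(chunk), len(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))                   # stand-in for a large dataset
    chunks = [data[i::4] for i in range(4)]         # split across 4 workers

    with Pool(processes=4) as pool:
        partials = pool.map(partial_stats, chunks)  # map step

    total = sum(s for s, _ in partials)             # reduce step
    count = sum(n for _, n in partials)
    print(f"Global mean computed in parallel: {total / count:.1f}")
```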
Particularly important in this respect are the free Big Data tools developed by Greenplum for data scientists and developers: MADlib, the Alpine In-Database Miner36 and the Greenplum HD Community Edition37. Other partially free software with important Big Data applications includes KNIME38, Weka / Pentaho39, Rapid-I RapidAnalytics40 and Rapid-I RapidMiner41. Finally there is R42, which, although not built for Big Data, has interesting applications in this realm.
Enterprise-level software
Enterprise-level software is adopted for the following functionalities:
- open source software based on Apache Hadoop (whose MapReduce programming model is sketched after the product list below)
- data storage platforms and other information infrastructure solutions
- shared-nothing massively parallel processing (MPP) database architectures
- dataflow engines, software interconnect technologies
- data discovery and exploration tools
- built-in text analytics, enterprise-grade security and administrative tools
- real-time analytic processing (RTAP) platforms
- software-as-a-service (SaaS)
- visualization features supporting exploratory and discovery analytics
- on-line analytical processing (OLAP)
- BI/DW (business intelligence and data warehousing)
- EDW (enterprise data warehousing)
Examples of this software include: Tableau BI platform43; SAS Data Integration Studio44; SAS High Performance Analytics45; SAS On Demand46; SAND Analytic Platform47; SAP BEx48; SAP NetWeaver49; SAP In-Memory Appliance (SAP HANA)50; ParAccel Analytic Database (PADB)51; IBM Netezza52; IBM InfoSphere BigInsights53; IBM InfoSphere Streams54; Kognitio WX255; Kognitio Pablo56; EMC Greenplum Database57; Greenplum HD58; EMC Greenplum Data Computing Appliance59; Greenplum Chorus60; Cloudera Enterprise61; StatSoft Statistica62.
Other software which has not been built specifically for Big Data applications, but can nonetheless be used for Big Data analytics, includes Mathematica63, MatLab64 and Stata65.
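Many of the products listed above build on Apache Hadoop, whose core abstraction is the MapReduce programming model. The sketch below reproduces that model in plain Python on a toy corpus (a map step emitting key-value pairs, followed by a reduce step that aggregates by key); it shows the shape of a MapReduce job, not Hadoop's own API.

```python
# Toy MapReduce-style word count: the map step emits (word, 1) pairs and the
# reduce step sums the counts per word, mirroring the Hadoop programming model.
from collections import defaultdict

documents = [
    "big data in governance",
    "big data for smart cities",
    "open data and governance",
]

# Map: each document is turned into (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/reduce: group pairs by key and sum the values.
counts = defaultdict(int)
for word, value in mapped:
    counts[word] += value

for word, count in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(word, count)
```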
Key challenges and gaps
In order to enjoy all the potential stemming from Big Data it is necessary to remove the technological barriers preventing the exchange of data, information and knowledge between disciplines, as well as to integrate activities which are based on different ontological foundations. Even though Big Data has already provided many benefits, many challenges still have to be coped with. For instance, Gartner (2011)66 argues that the challenges are not only given by the volume of data, but also by its variety (heterogeneity of data types and representations, semantic interpretation) and velocity (rate of data arrival and action timing). According to recent research, the required advancements include67:
- Data modelling challenges: data models coherent with data representation needs; data models able to describe discipline-specific aspects; data models for the representation and querying of data provenance and contextual information; data models and query languages for representing and managing data uncertainty, and for representing and querying data quality information
- Data management challenges: providing quality, cost-effective, reliable preservation of and access to the data; protecting property rights, privacy and the security of sensitive data; ensuring data search and discovery across a wide variety of sources; connecting data sets from different domains in order to create an open linked data space. Data can be unstructured or semi-structured with no context; formats, the labels used for the same data elements, and data entry conventions and vocabularies differ across sources; data entry errors occur; and data sets can be so large that they cannot be effectively processed by a single machine, requiring data parallelization and task parallelization68 (a minimal data-cleaning sketch follows this list)
- Data service/tool challenges: data tools for most scientific disciplines are inadequate to support research in all its phases, so scientists are less productive than they might be. There is a need for software able to "clean", analyse and visualise huge amounts of data. Moreover, data tools and policies for ensuring cross-collaboration and fertilisation among different disciplines and scientific realms are missing
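A small example of the "cleaning" work referred to in the data management and data tool challenges: the sketch below, using the pandas library and invented records, harmonises labels that differ across sources, parses two different date conventions and drops an entry flagged with an error code. It is a toy illustration of the problem, not a general solution.

```python
from datetime import datetime
import pandas as pd

# Invented records from two sources that use different labels, date formats
# and error conventions for the same kind of measurement.
raw = pd.DataFrame({
    "city":  ["New York", "new york ", "NYC", "Paris"],
    "date":  ["2012-01-05", "05/01/2012", "2012-01-07", "2012-01-08"],
    "value": [10.5, 11.0, -999.0, 9.8],   # -999 is used as a data-entry error code
})

def parse_date(text):
    """Try the two date conventions found in the sources."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None

# Harmonise the labels used for the same entity across sources.
raw["city"] = raw["city"].str.strip().str.lower().replace({"nyc": "new york"})

# Parse heterogeneous date formats into a single representation.
raw["date"] = raw["date"].apply(parse_date)

# Drop records flagged with the error code.
clean = raw[raw["value"] > -999.0]
print(clean)
```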
As for other issues concerning Big Data, Boyd and Crawford (2011) highlight the following:
- Relationship between automatic search and the definition of knowledge. At the beginning of the 20th century Ford introduced mass production, automation and the assembly line, reshaping not only the way things are produced, but also the general understanding of labour, the human relationship to work, and society at large. Fordism consisted in breaking down holistic tasks into atomised and independent ones. In the same way, Big Data is a new system of knowledge characterised by a computational turn in science, leading to a change in the constitution of knowledge, the process of research and the categorisation of reality. But just as Fordism had its limits (it was indeed superseded by the Just in Time paradigm), specialised Big Data tools are not flawless either. Big Data, as a new system of knowledge, can change the very meaning of learning itself, with all the possibilities and limitations embedded in systems of knowing
- Big Data may produce misleading claims of objectivity and accuracy. In science there is a deep cleavage between qualitative and quantitative scientists. Apparently, qualitative scientists would be engaged in creating and interpreting stories, while quantitative scientists would be in the business of producing facts. Needless to say, that is not the case, as all objectivity claims come from subjects who make subjective observations and choices. Moreover, data analysis rests on a host of assumptions (see for instance asymptotic theory in statistics), and even when a model is mathematically valid or an experiment is scientifically valid, the final interpretation is subjective. Other examples are the difficulty of integrating different datasets in a consistent way, the arbitrary choices inherent in data cleaning, and the fact that internet databases may well be affected by biases such as frictions and self-selection. In this view, by enlarging the space of quantification, especially in the social sciences, Big Data might support objectivity and accuracy claims which are not really grounded in good sense and reality.
- A higher quantity of data does not always mean better data. In all sciences there is a massive literature (on interpretation bias, design standardisation, sampling mechanisms and question bias, statistical significance and diagnostics) aimed at ensuring the consistency of data collection and analysis. Curiously, Big Data scientists sometimes assume the quality of their data a priori and completely neglect the methodological issues proper to these sciences. A clear example is given by social media data, which are subject to self-selection bias, as people using social media are not representative of society as a whole. Even the definition of an active user or account on a social media platform might not be innocuous: it is estimated that 40% of Twitter's users are merely "listeners", i.e. they do not proactively take part. Finally, it has to be recognised that in many contexts high-quality research is purposely carried out with a limited amount of data, as for instance in experimental analysis in game theory.
- Big Data and Ethical Issues. The use for research purposes of "public" data on social media websites opens the door to deontological issues. The problem is: can those data be used without any ethical or privacy consideration? Big Data is an emerging field of science, so its ethical implications have yet to be fully considered. How can researchers be sure that their activity is not harmful to some of their subjects? On the one hand it is impossible to ask for data use permission from all the subjects present in a database; on the other hand, the mere fact that the data are available does not justify their use. Accountability to the field of research and accountability to the research subjects are the ethical keys for Big Data. In all traditional fields of science, researchers must follow a series of professional standards aimed at protecting the rights and well-being of human subjects; the corresponding standards for Big Data research are not yet clear.
- Digital divides created by Big Data. It is widely assumed that doing research on Big Data automatically involves quick and easy access to databases. This is not the case: only social media companies have access to really large datasets, they sell those data at a high price, and they offer only small data sets to university-based researchers. Researchers with considerable funding, or based inside those firms, can thus access data that outsiders cannot, and as a consequence their methodologies and claims cannot be verified. In this view Big Data can create a new digital divide between researchers belonging to the top universities and working with the top companies, and scholars belonging to the periphery. The digital divide can also be skills-based: only people with a strong computational background are able to wrangle data through APIs and analyse massive quantities of it. In conclusion, there is a new digital divide between the Big Data rich, who are able to buy and analyse datasets and belong to top universities and companies, and the Big Data poor, who are outsiders
Finally, according to the UN69 the Big Data challenges can be divided along two main dimensions. Data management:
- Privacy. The development of new technologies always raises privacy concerns for individuals, companies and societies. This is a crucial issue, as privacy, safety and diversity are important for defending the freedom of citizens, and companies obviously have the right to retain their confidential information. In the era of Big Data, the primary producers, i.e. the citizens using the services and devices that generate data, are seldom aware that they are doing so, or of how their data will be used. Sometimes it is also unclear to what extent users of social media such as Twitter consent to the analysis of their data. The pool of individual information shared by mobile phone and credit card companies, social media and search engines is simply astonishing. People must be conscious of that, as privacy is a pillar of freedom.
- Access and sharing. A great amount of data is available online for the most disparate uses. Much data, however, is retained by companies which are concerned about their reputation, need to protect their competitiveness, or simply lack the right incentives to share. In addition, a set of technical and regulatory arrangements has to be put in place in order to ensure the inter-comparability of data and the interoperability of systems.
Data analysis:
- Summarising the data. Sometimes the data might simply be false or fabricated, especially with user-generated text-based data (blogs, news, social media messages). In addition, data are sometimes derived from people's perceptions, as in calls to health hotlines and online searches for symptoms. Another case is related to opinion mining and sentiment analysis, in which the true meaning of statements can be misread, so that the human factor remains crucial in the analysis. Another problem is that data are sometimes generated from expressed intentions, in blog posts, online searches or mobile-phone systems for checking market prices, which are not a sure indicator of actual intentions and final decisions. So there is a substantial problem in summarising facts from user-generated text, as it may be difficult to distinguish feelings from facts.
- Interpreting data. A very important concern is sample selection bias, i.e. the fact that the people generating the data are not representative of the entire population; for instance, younger generations make more use of the internet and mobile devices. The conclusions of the analysis are then valid only for the sample at hand and cannot be generalised. Moreover, dealing with huge amounts of data sometimes leads researchers to focus on finding patterns or correlations without concentrating on the underlying dynamics: finding a correlation is one thing, detecting a causal relationship is another, and identifying the direction of the causal relationship without an underlying theory is harder still (a small numerical illustration follows this list). A final issue arises when using data from different sources, which can magnify the existing flaws in each database
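The distinction between correlation and causation can be made concrete with a toy example: two invented series that both merely grow over time show a correlation close to 1, even though neither drives the other. The sketch below computes that correlation and then the correlation of the year-on-year changes, which removes the shared trend; the series names and numbers are illustrative only.

```python
import numpy as np

# Two invented yearly series that both trend upward for unrelated reasons,
# e.g. mobile phone subscriptions and organic food sales in some country.
rng = np.random.default_rng(42)
years = np.arange(1980, 2010)
mobile_subscriptions = 10 + 2.0 * (years - 1980) + rng.normal(0, 0.5, years.size)
organic_food_sales = 5 + 1.5 * (years - 1980) + rng.normal(0, 0.5, years.size)

# The shared time trend produces a correlation close to 1 ...
r_levels = np.corrcoef(mobile_subscriptions, organic_food_sales)[0, 1]
print(f"Correlation of the levels: {r_levels:.2f}")

# ... while the year-on-year changes, which remove the common trend, are
# typically only weakly correlated: the association reflected the trend,
# not one variable driving the other.
r_changes = np.corrcoef(np.diff(mobile_subscriptions), np.diff(organic_food_sales))[0, 1]
print(f"Correlation of the year-on-year changes: {r_changes:.2f}")
```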
Finally, there are the challenges identified by the community white paper drafted with the collaboration of a group of leading researchers across the United States70:
- Heterogeneity and incompleteness. Data must be structured in a homogeneous way prior to the analysis, as algorithms, unlike humans, are not able to grasp nuance. Most computer systems work better if multiple items are stored with an identical size and structure. But efficient representation, access and analysis of semi-structured data are also necessary, because a less structured design is more useful for certain analyses and purposes. Even after cleaning and error correction in the database, some errors and incompleteness will remain, challenging the precision of the analysis.
- Human collaboration. Even though analytical instruments have advanced tremendously, there are still many realms in which the human eye can discover patterns that algorithms cannot. An example can be found in the use of CAPTCHAs, which can distinguish human users from computer programmes. In this view, a Big Data system must involve a human presence. Given the complexity of today's world, there is a need to harness human ingenuity from different domains through crowdsourcing. A Big Data system therefore requires technologies able to support this kind of collaboration, even in the case of conflicting statements and judgments.
Current Big Data Techniques
Big datasets can be analysed by means of several techniques coming from statistics and computer science. The principal categories include:
- Cluster analysis. A statistical technique consisting in splitting a heterogeneous group into smaller subsets of similar elements, where the characteristics defining similarity are not known in advance. A typical example is identifying consumers with similar patterns of past purchases in order to tailor a given marketing strategy more accurately
- Crowdsourcing. A technique for collecting data drawn from a large group or community in response to an open call through a networked medium such as the internet. This category is of crucial importance in our case, as it is an instance of mass collaboration enabled by Web 2.0
- Data mining. A combination of database management, statistics and machine learning methods useful for extracting patterns from large datasets. Examples include mining human resources data to assess employee characteristics, or consumer bundle analysis to model customer behaviour
- Machine learning. A subfield of computer science (within artificial intelligence) concerning the definition and implementation of algorithms that allow computers to evolve their behaviour based on empirical evidence. An example of a machine learning application is natural language processing.
- Natural language processing. A set of computer science and linguistic methods that use algorithms to analyse natural human language. This field, which began as a branch of artificial intelligence, deals with the interaction between computers and human language
- Neural networks. Computational models that are structured and work similarly to the biological neural networks existing among brain cells, and that are used in particular to find non-linear patterns in data. Applications include game-playing and decision making (backgammon, chess, poker) and knowledge discovery in databases
- Network analysis. Part of graph theory and network science, describing the relationships among discrete nodes in a graph or network. In particular, social network analysis studies the structure of relationships among social entities. Applications include the study of the role of trust in exchange relationships and of recruitment into political movements and social organisations
- Predictive modelling. A set of techniques in which a mathematical model is used to predict the probability of an outcome. This technique is widely used in customer relationship management to produce customer-level models that assess the probability that a customer will take a particular action, such as cross-sell, product deep-sell or churn
- Regression. A statistical method for assessing how the value of a dependent variable changes when one or more independent variables change. Examples of applications include changes in consumer behaviour as a function of manufacturing parameters or economic fundamentals
- Sentiment analysis. Natural language processing methods for extracting information such as the polarity, degree and strength of the sentiment expressed about a given feature or aspect of a product. Many companies assess how customers and stakeholders react to their products and actions by applying this analysis to blogs, social networks and other social media
- Spatial analysis. Methods for assessing the geographical, geometric or topological characteristics of a data set. The spatial data are often drawn from geographical information systems (GIS), including addresses or latitude/longitude coordinates, and are incorporated into spatial regressions (e.g. the correlation between commodity prices and location) or simulations
- Simulation. Modelling the behaviour of a complex system in order to perform forecasting and scenario analysis. As an example we can mention Monte Carlo simulations, a class of computational algorithms that rely on repeated random sampling to compute their results (a minimal sketch follows this list)
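As a minimal illustration of the Monte Carlo idea closing the list above, the sketch below estimates pi by repeated random sampling: points are drawn uniformly in the unit square and the fraction falling inside the quarter circle converges to pi/4.

```python
import random

def estimate_pi(n_samples: int, seed: int = 0) -> float:
    """Estimate pi by sampling points uniformly in the unit square and
    counting the fraction that falls inside the quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples

for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9} samples -> pi ~ {estimate_pi(n):.4f}")
```

The estimate improves as the number of samples grows, which is the defining trade-off of Monte Carlo methods: accuracy is bought with computation.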
Current and Future Research
- Technologies for collecting, cleaning, storing and managing data: data warehouses; pivotal transformations; ETL (a minimal ETL sketch follows this list); I/O; efficient archiving, storing, indexing, retrieval and recovery; streaming, filtering, compressed sensing, sufficient statistics; automatic data annotation; Large Database Management Systems; storage architectures; data validity, integrity, consistency and uncertainty management; languages, tools, methodologies and programming environments
- Technologies for summarising data and extracting meaning: reports; dashboards; statistical analysis and inference; Bayesian techniques; information extraction from unstructured, multimodal data; scalable and interactive data visualisation; extraction and integration of knowledge from massive, complex, multi-modal or dynamic data; data mining; scalable machine learning; data-driven high-fidelity simulations; predictive modelling, hypothesis generation and automated discovery
- Technologies for using data as a decision tool: Decision Trees, Pro-Con Analysis, Rule Based Systems, Neural Networks, Tradeoff-based Decisions (which incorporate Reporting, Statistics and Knowledge Based Systems)
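To make the ETL entry in the first bullet concrete, the sketch below runs a toy extract-transform-load pipeline: it extracts records from an in-memory CSV (standing in for a source file), transforms them by converting types and dropping malformed rows, and loads them into a SQLite table. The fields and figures are invented; real pipelines rely on dedicated ETL tools, but the three stages are the same.

```python
import csv
import io
import sqlite3

# Extract: read raw records (here from an in-memory CSV standing in for a source file).
raw_csv = io.StringIO(
    "city,year,population\nRome,2011,2761477\nMilan,2011,1324110\nBad,2011,n/a\n"
)
records = list(csv.DictReader(raw_csv))

# Transform: convert types and drop records that cannot be parsed.
clean = []
for row in records:
    try:
        clean.append((row["city"], int(row["year"]), int(row["population"])))
    except ValueError:
        continue  # skip malformed rows such as population = "n/a"

# Load: write the transformed records into a SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE population (city TEXT, year INTEGER, population INTEGER)")
conn.executemany("INSERT INTO population VALUES (?, ?, ?)", clean)

print(conn.execute("SELECT * FROM population ORDER BY population DESC").fetchall())
```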