Chapter 7: Big Data and NoSQL Databases

7.1 Introduction to Big Data

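Big Data refers to datasets so large, fast-changing, or varied that traditional data-processing tools struggle to store and analyze them. Such datasets are analyzed computationally to reveal patterns, trends, and associations.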

7.1.1 The Three V's of Big Data

Volume: The quantity of generated and stored data
Velocity: The speed at which the data is generated and processed
Variety: The range of data types and formats, from structured tables to semi-structured JSON and unstructured text, images, or video

7.1.2 Big Data Technologies

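Hadoop: An open-source framework for distributed storage (HDFS) and batch processing (MapReduce) across clusters of machines
Apache Spark: A unified analytics engine for large-scale data processing, with in-memory computation
Apache Kafka: A distributed event-streaming platform for publishing and consuming streams of records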

7.2 NoSQL Database Types

NoSQL (Not Only SQL) databases are designed to handle the volume, velocity, and variety of Big Data.

7.2.1 Document Databases

Document databases store data in flexible, JSON-like documents. Documents in the same collection do not have to share the same fields.

Example: MongoDB

db.users.insertOne({
  name: "John Doe",
  age: 30,
  email: "john@example.com",
  interests: ["programming", "data science", "machine learning"]
})
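
To make "flexible" concrete, here is a minimal plain-JavaScript sketch of a collection whose documents carry different fields. It is an illustration only, not MongoDB's implementation: the `matches` helper is a hypothetical stand-in for a simple equality filter, including the rule that a scalar query value matches an array field that contains it.

```javascript
// Toy in-memory "collection": documents need not share the same fields.
const users = [
  { name: "John Doe", age: 30, interests: ["programming", "data science"] },
  { name: "Jane Smith", email: "jane@example.com" }, // no age, no interests
];

// Hypothetical helper: true when every key in `query` equals the document's
// value, or is contained in it when the document's value is an array.
function matches(doc, query) {
  return Object.entries(query).every(([key, expected]) => {
    const actual = doc[key];
    return Array.isArray(actual) ? actual.includes(expected) : actual === expected;
  });
}

const programmers = users.filter((u) => matches(u, { interests: "programming" }));
const withEmail = users.filter((u) => "email" in u);

console.log(programmers.map((u) => u.name)); // [ 'John Doe' ]
console.log(withEmail.map((u) => u.name));   // [ 'Jane Smith' ]
```

A document that lacks a queried field simply does not match, so no schema migration is needed before adding a new field to some documents.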

7.2.2 Key-Value Stores

Key-value stores hold data as a collection of key-value pairs, where each value is retrieved by its unique key.

Example: Redis

SET user:1000 "John Doe"
GET user:1000
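
The access pattern behind these two commands can be sketched in plain JavaScript with a `Map`. The `KeyValueStore` class is hypothetical and only mimics the observable behavior used here: a successful SET reports "OK", and a GET on a missing key returns a null value.

```javascript
// Toy key-value store: values are opaque to the store and looked up only by key.
class KeyValueStore {
  constructor() {
    this.data = new Map();
  }

  set(key, value) {
    this.data.set(key, value);
    return "OK";
  }

  get(key) {
    return this.data.has(key) ? this.data.get(key) : null;
  }
}

const store = new KeyValueStore();
store.set("user:1000", "John Doe");

console.log(store.get("user:1000")); // John Doe
console.log(store.get("user:2000")); // null
```

Because lookups go through the key alone, naming conventions such as `user:1000` carry the structure that a relational schema would otherwise provide.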

7.2.3 Column-family Stores

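Store data in rows identified by a key, with each row's values organized into column families. Rows in the same table do not have to contain the same columns. These databases are also known as wide-column stores.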

Example: Cassandra

CREATE TABLE users (
  user_id uuid PRIMARY KEY,
  name text,
  email text
);
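
One way to picture the storage model is as a map from row key to a map of columns. The plain-JavaScript sketch below is a simplification under that assumption, not Cassandra's actual storage engine, and `writeColumn` is a hypothetical helper.

```javascript
// Toy wide-column table: row key -> (column name -> value).
const usersTable = new Map();

// Hypothetical helper: write one column of one row, creating the row on demand.
function writeColumn(table, rowKey, column, value) {
  if (!table.has(rowKey)) table.set(rowKey, new Map());
  table.get(rowKey).set(column, value);
}

writeColumn(usersTable, "user-1", "name", "John Doe");
writeColumn(usersTable, "user-1", "email", "john@example.com");
writeColumn(usersTable, "user-2", "name", "Jane Smith"); // no email stored for this row

// Each row holds only the columns that were actually written for it.
const columnsOfUser2 = [...usersTable.get("user-2").keys()];
console.log(columnsOfUser2); // [ 'name' ]
```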

7.2.4 Graph Databases

Store data in graph structures made up of nodes, relationships (edges) that connect them, and properties attached to both.

Example: Neo4j

CREATE (john:Person {name: "John Doe", age: 30})
CREATE (jane:Person {name: "Jane Smith", age: 28})
CREATE (john)-[:KNOWS]->(jane)
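
Graph databases pay off when queries follow relationships. The plain-JavaScript sketch below models the same nodes and KNOWS relationship in memory and walks two hops to find friends of friends. The third person ("Alex Lee") and the helper functions are assumptions added for illustration.

```javascript
// Toy property graph: nodes carry properties, edges carry a type and a direction.
const nodes = new Map([
  ["john", { label: "Person", name: "John Doe", age: 30 }],
  ["jane", { label: "Person", name: "Jane Smith", age: 28 }],
  ["alex", { label: "Person", name: "Alex Lee", age: 35 }], // hypothetical extra node
]);
const edges = [
  { from: "john", type: "KNOWS", to: "jane" },
  { from: "jane", type: "KNOWS", to: "alex" },
];

// Follow outgoing edges of one type from a node.
function neighbors(id, type) {
  return edges.filter((e) => e.from === id && e.type === type).map((e) => e.to);
}

// People exactly two KNOWS hops away, excluding the start node and direct contacts.
function friendsOfFriends(id) {
  const direct = new Set(neighbors(id, "KNOWS"));
  const result = new Set();
  for (const friend of direct) {
    for (const fof of neighbors(friend, "KNOWS")) {
      if (fof !== id && !direct.has(fof)) result.add(fof);
    }
  }
  return [...result].map((n) => nodes.get(n).name);
}

console.log(friendsOfFriends("john")); // [ 'Alex Lee' ]
```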

7.3 When to Use NoSQL over Relational Databases

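NoSQL databases tend to be a better fit than relational databases in the following situations:

Large volumes of unstructured or semi-structured data
High scalability and throughput requirements, especially horizontal scaling across many servers
Schemas that change frequently or vary between records
Data that must be distributed across geographic regions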

7.4 CAP Theorem

The CAP theorem states that it's impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:

Consistency: Every read receives the most recent write or an error
Availability: Every request receives a response, without guarantee that it contains the most recent version of the information
Partition tolerance: The system continues to operate despite arbitrary partitioning due to network failures

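In practice, network partitions cannot be ruled out, so the real choice arises during a partition: reject some requests to stay consistent, or keep answering with possibly stale data. NoSQL databases often prioritize availability and partition tolerance over consistency, and many let you tune this trade-off per operation.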
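
The choice between consistency and availability during a partition can be illustrated with a toy model. The plain-JavaScript sketch below is a deliberate simplification, not how any particular database behaves: the `ReplicatedRegister` class, its two replicas, and the "CP" and "AP" mode labels are all assumptions made for this example.

```javascript
// Toy two-replica register. `partitioned` simulates a broken link between replicas.
class ReplicatedRegister {
  constructor(mode) {
    this.mode = mode; // "CP" or "AP" (labels used only in this sketch)
    this.replicas = { a: "v0", b: "v0" };
    this.partitioned = false;
  }

  // Writes arrive at replica `a`.
  write(value) {
    if (this.partitioned) {
      if (this.mode === "CP") {
        return { ok: false, error: "unavailable" }; // refuse rather than diverge
      }
      this.replicas.a = value; // AP: accept locally, replica `b` is now stale
      return { ok: true };
    }
    this.replicas.a = value;
    this.replicas.b = value; // healthy network: replicate immediately
    return { ok: true };
  }

  // Reads are served by replica `b`.
  read() {
    return this.replicas.b;
  }
}

const cp = new ReplicatedRegister("CP");
cp.partitioned = true;
const cpWrite = cp.write("v1");

const ap = new ReplicatedRegister("AP");
ap.partitioned = true;
const apWrite = ap.write("v1");

console.log(cpWrite.ok, cp.read()); // false v0
console.log(apWrite.ok, ap.read()); // true v0
```

In CP mode the write is rejected, so every read still reflects the last accepted write. In AP mode the write succeeds, but a read served by the other replica returns the older value until the partition heals.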

7.5 Real-life Example: Using MongoDB for a Social Media Application

Let's design a NoSQL database using MongoDB for a social media application.

Step 1: Data Modeling

In MongoDB, we'll use a document model to represent users and posts.

User Document:

{
  _id: ObjectId("5f8a7b2b9d3b2a1b1c9d1e1f"),
  username: "johndoe",
  email: "john@example.com",
  profile: {
    name: "John Doe",
    age: 30,
    location: "New York"
  },
  friends: [
    ObjectId("5f8a7b2b9d3b2a1b1c9d1e2f"),
    ObjectId("5f8a7b2b9d3b2a1b1c9d1e3f")
  ],
  created_at: ISODate("2023-09-15T10:30:00Z")
}

Post Document:

{
  _id: ObjectId("5f8a7b2b9d3b2a1b1c9d1e4f"),
  user_id: ObjectId("5f8a7b2b9d3b2a1b1c9d1e1f"),
  content: "Hello, MongoDB!",
  likes: 10,
  comments: [
    {
      user_id: ObjectId("5f8a7b2b9d3b2a1b1c9d1e2f"),
      content: "Great post!",
      created_at: ISODate("2023-09-15T11:00:00Z")
    }
  ],
  created_at: ISODate("2023-09-15T10:45:00Z")
}

Step 2: Implementing CRUD Operations

Create a new user:

db.users.insertOne({
  username: "janedoe",
  email: "jane@example.com",
  profile: {
    name: "Jane Doe",
    age: 28,
    location: "San Francisco"
  },
  friends: [],
  created_at: new Date()
})

Read user data:

db.users.find({ username: "johndoe" })

Update user profile:

db.users.updateOne(
  { username: "johndoe" },
  { $set: { "profile.age": 31 } }
)
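
The dotted key "profile.age" addresses a field inside the embedded profile document, so only that field changes. The plain-JavaScript sketch below imitates this effect on an in-memory object. The `setPath` helper is hypothetical and ignores array indexes and other cases that a real database handles.

```javascript
// Hypothetical helper imitating the effect of { $set: { "profile.age": 31 } }:
// walk the dotted path, create missing parent objects, then assign the value.
function setPath(doc, path, value) {
  const keys = path.split(".");
  let target = doc;
  for (const key of keys.slice(0, -1)) {
    if (target[key] === undefined) target[key] = {}; // create missing parents
    if (typeof target[key] !== "object" || target[key] === null) {
      throw new TypeError(`cannot set a field inside non-object "${key}"`);
    }
    target = target[key];
  }
  target[keys[keys.length - 1]] = value;
  return doc;
}

const user = {
  username: "johndoe",
  profile: { name: "John Doe", age: 30, location: "New York" },
};
setPath(user, "profile.age", 31);

console.log(user.profile.age);      // 31
console.log(user.profile.location); // New York
```

Setting the whole `profile` field instead would replace the embedded document and drop `name` and `location`, which is why the dotted path is used.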

Delete a post:

db.posts.deleteOne({ _id: ObjectId("5f8a7b2b9d3b2a1b1c9d1e4f") })

Step 3: Implementing Social Features

Add a friend:

db.users.updateOne(
  { username: "johndoe" },
  { $addToSet: { friends: ObjectId("5f8a7b2b9d3b2a1b1c9d1e2f") } } // $addToSet avoids duplicate entries
)

Create a new post:

db.posts.insertOne({
  user_id: ObjectId("5f8a7b2b9d3b2a1b1c9d1e1f"),
  content: "NoSQL is awesome!",
  likes: 0,
  comments: [],
  created_at: new Date()
})

Add a comment to a post:

db.posts.updateOne(
  { _id: ObjectId("5f8a7b2b9d3b2a1b1c9d1e4f") },
  { 
    $push: { 
      comments: {
        user_id: ObjectId("5f8a7b2b9d3b2a1b1c9d1e2f"),
        content: "I agree!",
        created_at: new Date()
      } 
    } 
  }
)

Step 4: Querying and Indexing

Find all posts by a user:

db.posts.find({ user_id: ObjectId("5f8a7b2b9d3b2a1b1c9d1e1f") })

Create an index for faster querying:

db.posts.createIndex({ user_id: 1, created_at: -1 })
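
The key order in this index matters: entries are ordered by user_id first and, within each user, by created_at from newest to oldest. The plain-JavaScript sketch below imitates that ordering with a sorted array and a binary search. It is a conceptual model only (real indexes are tree structures), and the sample posts and the `postsByUser` helper are assumptions made for the example.

```javascript
// Sample posts with hypothetical string ids.
const posts = [
  { _id: "p1", user_id: "u2", created_at: new Date("2023-09-15T10:45:00Z") },
  { _id: "p2", user_id: "u1", created_at: new Date("2023-09-14T08:00:00Z") },
  { _id: "p3", user_id: "u1", created_at: new Date("2023-09-16T09:30:00Z") },
];

// Toy "index": user_id ascending, then created_at descending,
// mirroring the key order of { user_id: 1, created_at: -1 }.
const index = posts
  .map((p) => ({ user_id: p.user_id, created_at: p.created_at, _id: p._id }))
  .sort((x, y) =>
    x.user_id < y.user_id ? -1 : x.user_id > y.user_id ? 1 : y.created_at - x.created_at
  );

// Binary search for the first entry of a user, then scan while the user matches.
function postsByUser(userId) {
  let lo = 0;
  let hi = index.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (index[mid].user_id < userId) lo = mid + 1;
    else hi = mid;
  }
  const ids = [];
  for (let i = lo; i < index.length && index[i].user_id === userId; i++) {
    ids.push(index[i]._id);
  }
  return ids;
}

console.log(postsByUser("u1")); // [ 'p3', 'p2' ]
```

All entries for one user sit next to each other and are already newest-first, so a query that filters on user_id and sorts by created_at descending can read them in order without a separate sort step.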

Find recent posts by friends:

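// Start from the current user, join the posts written by anyone in their
// friends array, then keep the 50 most recent.
db.users.aggregate([
  { $match: { _id: ObjectId("5f8a7b2b9d3b2a1b1c9d1e1f") } },
  { $lookup: {
      from: "posts",
      localField: "friends",
      foreignField: "user_id",
      as: "post"
    }
  },
  { $unwind: "$post" },
  { $replaceRoot: { newRoot: "$post" } },
  { $sort: { created_at: -1 } },
  { $limit: 50 }
])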
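
The intent of this query, the newest posts written by the users in one person's friends list, can be checked against a small in-memory model. The plain-JavaScript sketch below uses hypothetical string ids and sample data, and the `friendFeed` function is an illustration of the expected result rather than a replacement for the database query.

```javascript
// Hypothetical sample data using short string ids instead of ObjectIds.
const users = [
  { _id: "u1", username: "johndoe", friends: ["u2", "u3"] },
  { _id: "u2", username: "janedoe", friends: ["u1"] },
  { _id: "u3", username: "alexlee", friends: [] },
  { _id: "u4", username: "stranger", friends: [] },
];
const posts = [
  { _id: "p1", user_id: "u2", created_at: new Date("2023-09-15T10:45:00Z") },
  { _id: "p2", user_id: "u3", created_at: new Date("2023-09-16T09:00:00Z") },
  { _id: "p3", user_id: "u1", created_at: new Date("2023-09-17T12:00:00Z") },
  { _id: "p4", user_id: "u4", created_at: new Date("2023-09-18T12:00:00Z") },
];

// Newest posts whose author appears in the given user's friends list.
function friendFeed(userId, limit) {
  const me = users.find((u) => u._id === userId);
  if (!me) return [];
  const friendIds = new Set(me.friends);
  return posts
    .filter((p) => friendIds.has(p.user_id))
    .sort((a, b) => b.created_at - a.created_at)
    .slice(0, limit)
    .map((p) => p._id);
}

console.log(friendFeed("u1", 50)); // [ 'p2', 'p1' ]
```

The user's own post (p3) and the post by a non-friend (p4) are excluded, which is the behavior a friends feed should have.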

This example demonstrates how a NoSQL database like MongoDB can be used to build a scalable social media application. The flexible schema allows for easy addition of new features, while the document model provides a natural fit for the hierarchical and interconnected nature of social media data.
