Finding how long a search has been popular
Program:
I have designed a program in Python that does the following (in order):


  1. Scrapes Google Trends for the most popular searches

  2. Stores the searches and their traffic in an in-memory database along with a timestamp

  3. Finds the trends that are no longer in Google's top 20 most popular searches, then looks up the rows in which each trend first appeared and last appeared, in order to find how long it was popular (i.e. the duration the trend spent in Google's top 20)


Context:
I have only been programming for a month now, and my experience is entirely limited to Python. As a result, I'm 100% certain that I'm using atrocious practices and that my code is super inefficient. Hence, these are my two big concerns: 1) my program is extremely slow, and 2) it's clunky and looks unpleasant to a professional.



Issues:
Here I will try to be as comprehensive as possible as to what I think is wrong with my program:



It is slow. In particular, in my third script (which corresponds to the third step) there is a giant block of code in the function inst_range. I used two while loops to search from either end of the table. If the program finds the search while counting up the rows, it treats that row as the first instance of the search. Likewise, if it finds the search going backwards through the table, it treats that row as the last instance. Then it computes the time between these two rows to find how long the search has been popular. What is the clearest way I could express this in my code while maintaining speed?



I use bad practices / it looks unappealing. My third script is the main culprit again. It looks very clunky to me. When I look at really good GitHub projects, they are all so readable and nice. How do I apply that to my project? Another thing I worry about is a lot of the complex stuff I don't yet understand, i.e. caching, threading, and different types of sorting, which I should probably be using but don't know about. Sorry to tack this bit on as well, but should I be using more OOP in my program? Is it messed up?



The code:



trend_parser.py (corresponds to step 1)



import re
import requests
from bs4 import BeautifulSoup


def fetch_xml(country_code):
    """Initialises a request to the Google Trends RSS feed.

    Returns:
        string: Unparsed XML stored as a string for use by BeautifulSoup
    """
    try:
        url = f"https://trends.google.com/trends/trendingsearches/daily/rss?geo={country_code}"
        response = requests.get(url)
        return response.content
    except requests.exceptions.ConnectionError:
        print("failed to connect")


def trends_retriever(country_code):
    """Parses the Google Trends RSS feed using BeautifulSoup.

    Returns:
        dict: Trend title for key, trend approximate traffic for value.
    """
    xml_document = fetch_xml(country_code)
    soup = BeautifulSoup(xml_document, "lxml")
    titles = soup.find_all("title")
    approximate_traffic = soup.find_all("ht:approx_traffic")
    return {title.text: re.sub("[+,]", "", traffic.text)
            for title, traffic in zip(titles[1:], approximate_traffic)}


models.py (corresponds to step 2)



from trend_parser import trends_retriever
from flask import Flask
from datetime import datetime
import os

from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config["SQLALCHEMY_TRACK_MODIFICATIONS"] = False
app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:///trends.sqlite3"
app.config["SECRET_KEY"] = os.urandom(16)

db = SQLAlchemy(app)


class trends(db.Model):

    id = db.Column(db.Integer, primary_key=True)
    title = db.Column(db.String, nullable=False)
    traffic = db.Column(db.Integer, nullable=False)
    time = db.Column(db.DateTime, nullable=False)

    def __init__(self):
        """These are the parameters that are passed on to the database"""
        self.title = f"{list(trends_retriever('US').keys())}"
        self.traffic = f"{list(trends_retriever('US').values())}"
        self.time = datetime.now()


database_crawler.py (corresponds to step 3)



from models import trends
import ast
import itertools

queried_titles = [ast.literal_eval(item.title) for item in trends.query.all()]
queried_time = [item.time for item in trends.query.all()]


def title_query():
    for index, item in enumerate(itertools.count()):
        try:
            first_row = set(queried_titles[index])
            second_row = set(queried_titles[index + 1])
            removed_trend = first_row - second_row
            yield removed_trend
        except IndexError:
            break


def inst_range():
    removed_trends = list(itertools.chain(*title_query()))
    row_count, row_count2 = 0, -1
    for item in removed_trends:
        while item not in queried_titles[row_count]:
            row_count += 1
        if item in queried_titles[row_count]:
            first_instance = row_count
            row_count = 0
        while item not in queried_titles[row_count2]:
            row_count2 -= 1
        if item in queried_titles[row_count2]:
            last_instance = len(queried_titles) - abs(row_count2)
            row_count2 = -1
        first_time = queried_time[first_instance]
        last_time = queried_time[last_instance]
        time_difference = (last_time - first_time).total_seconds() / 3600
        yield item, round(time_difference, 2)


print(dict(inst_range()))


Outputs:



Visualisation of database



Output of the third script (value corresponds to duration of staying in the top 20 searches, key corresponds to title of search)



{'Chiefs': 0.24, 'The Mule': 4.28, 'KFC Firelog': 0.62, 'Yeezy': 1.46, 
'Chiko Juan': 0.96, 'Dragon Ball Super: Broly full movie': 0.0, 'Roma':
4.28, 'Imagine Ariana Grande': 4.28, 'Canelo vs Rocky': 4.28, 'Chiefs
schedule': 4.28, 'Nfl playoff picture 2019': 0.0, 'Mick Mulvaney': 4.28,
'Nancy Wilson': 4.28, 'Johnson and Johnson': 4.28, 'JWoww': 4.28,
'Jabari Parker': 4.28, 'Chick-fil-A': 4.28, 'Fuller House': 2.82, 'Artie
Lange': 4.28, 'Al Roker': 4.28, 'Obamacare': 0.0, 'Bill Fralic': 4.28,
'Sandy Hook': 4.28, 'Aus vs Ind': 4.28, "Shareef O'Neal": 4.28,
'Jennifer Lawrence': 0.0, 'Alcorn State': 0.0, 'Spider-Man Spider-
Verse': 0.0, 'NFL scores': 0.0, 'Pete Davidson': 0.0, 'Chivas': 0.0,
'Juventus': 0.0, 'NFL games': 0.0, 'Fresno State football': 0.0,
'Browns': 0.0, 'Lakers vs Hornets': 0.0, 'Matt Damon': 0.0, 'DAZN': 0.0,
'Stoney Westmoreland': 0.0, 'Canelo fight December 15, 2018': 0.0, 'UNC
basketball': 0.0, 'NFL': 0.0, 'Texans': 0.0, 'Nebraska volleyball': 0.0,
'NFL schedule': 0.0, 'Chicago Bears': 0.09, 'Offset Cardi B': 0.09,
'Appalachian State football': 0.09, '76ers vs Cavaliers': 0.09}
[Finished in 0.4s]









      python python-3.x database sqlalchemy






asked 6 hours ago

Ranch Mayonaise
353
          1 Answer
accepted
First of all, to me this code looks clean and easy to read, so overall, nicely done!



          Avoid repeated queries



          This piece of code runs the exact same query twice:




          queried_titles = [ast.literal_eval(item.title) for item in trends.query.all()]
          queried_time = [item.time for item in trends.query.all()]



It would be better to run the query once and iterate over its result twice.
Optionally, if not too inconvenient for the rest of the code,
a further, less important optimization would be to iterate over the results only once, extracting pairs of title and time in a single pass.
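
A minimal sketch of the one-pass version, using a hypothetical FakeRow class standing in for the rows returned by trends.query.all() (only the attributes actually used are modelled):

```python
import ast
from datetime import datetime

class FakeRow:
    """Hypothetical stand-in for a row from trends.query.all()."""
    def __init__(self, title, time):
        self.title = title      # a stringified list, as stored by models.py
        self.time = time

def load_rows(rows):
    """Extract titles and times in one pass over a single query result."""
    queried_titles, queried_time = [], []
    for row in rows:            # in the real code: rows = trends.query.all()
        queried_titles.append(ast.literal_eval(row.title))
        queried_time.append(row.time)
    return queried_titles, queried_time

rows = [FakeRow("['Chiefs', 'Yeezy']", datetime(2018, 12, 15, 10, 0)),
        FakeRow("['Chiefs']", datetime(2018, 12, 15, 11, 30))]
queried_titles, queried_time = load_rows(rows)
```

The database is hit once, and both lists are built from the same result.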



          There is another form of double-querying in the constructor of trends:




          self.title = f"{list(trends_retriever('US').keys())}"
          self.traffic = f"{list(trends_retriever('US').values())}"



trends_retriever('US') fetches and parses an XML document over the network,
so it is best to do it only once.
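
A sketch of the fix, with a stub function standing in for trends_retriever('US') so the example runs without network access; the point is simply that the result is fetched once and reused for both columns:

```python
def fake_trends_retriever():
    # Stub standing in for trends_retriever('US'); no HTTP request here.
    return {'Chiefs': '200000', 'Yeezy': '100000'}

def build_columns(retrieve):
    data = retrieve()                       # one fetch instead of two
    title = f"{list(data.keys())}"
    traffic = f"{list(data.values())}"
    return title, traffic

title, traffic = build_columns(fake_trends_retriever)
```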



          Are some trends removed twice?



inst_range chains the sets of trends returned by title_query into a list.
I didn't analyze this deeply enough to be certain,
but isn't it possible that there are duplicates in the resulting list?
If so, it would be better to chain into a set instead of a list.
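
For example, with toy per-row diffs standing in for the output of title_query, chaining into a set silently drops the duplicate:

```python
import itertools

# Toy per-row diffs; 'Yeezy' drops out of the top 20 twice.
diffs = [{'Yeezy'}, set(), {'Roma', 'Yeezy'}]

# A set keeps each removed trend exactly once.
removed_trends = set(itertools.chain.from_iterable(diffs))
```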



          Do not use exception handling for flow control



          This code catches IndexError:




for index, item in enumerate(itertools.count()):
    try:
        first_row = set(queried_titles[index])
        second_row = set(queried_titles[index + 1])
        removed_trend = first_row - second_row
        yield removed_trend
    except IndexError:
        break



But there is nothing unexpected about the IndexError:
we know for certain it will be raised for the last entry of queried_titles.



          Also note that item is not used in the loop.
          For unused loop variables the convention is to name them _.



          It seems to me that you can replace this loop with for index in range(len(queried_titles) - 1).
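
A sketch of the loop without exception handling, run here on sample data standing in for queried_titles:

```python
sample_titles = [['A', 'B'], ['B', 'C'], ['C']]

def title_query(titles):
    # Compare each row with the next; the range stops naturally
    # at the second-to-last row, so no IndexError can occur.
    for index in range(len(titles) - 1):
        yield set(titles[index]) - set(titles[index + 1])

diffs = list(title_query(sample_titles))
```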



          Simplify logic



I find it hard to understand the logic in inst_range.
For example, row_count is incremented in a loop
until some condition is reached.
During this loop, row_count is used as an index into a list,
but it is never verified that this index is valid.
In other words, it's not obvious that the loop is guaranteed to terminate before row_count goes out of range.
Also, it's not a "count"; it's really an index.
To add to the confusion, the variable is initialized before the outer for-loop,
and conditionally reset to 0.



          Consider this alternative implementation, using more helper functions:



for item in removed_trends:
    first_instance = find_first_instance(queried_titles, item)
    last_instance = find_last_instance(queried_titles, item)
    first_time = queried_time[first_instance]
    last_time = queried_time[last_instance]
    time_difference = (last_time - first_time).total_seconds() / 3600
    yield item, round(time_difference, 2)


With the introduction of the helper functions find_first_instance and find_last_instance, the earlier doubts are simply eliminated.
Of course, those functions still need to be correctly implemented,
but their complexity is now hidden from this higher-level code
and better isolated, so it will probably be easier to read.



          Beware of x in ... conditions on list



          The implementation of inst_range uses many x in ... conditions in lists.
          Keep in mind that searching in a list is an $O(n)$ operation:
          in the worst case, all elements will be checked.
          The performance penalty can become even worse when this is part of a nested loop.



          I haven't looked deeply into your code,
          but probably you can optimize this part by building two dictionaries:




          1. the first time a trend was seen

          2. the last time a trend was seen


          You could build these dictionaries by looping over trends and their times in one pass:




          • if trend not in first, then first[trend] = time

          • last[trend] = time


          This will be more efficient, because key in ... conditions on dictionaries is an $O(1)$ operation, and so is d[key] = ....






          share|improve this answer





















            Your Answer





            StackExchange.ifUsing("editor", function () {
            return StackExchange.using("mathjaxEditing", function () {
            StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
            StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
            });
            });
            }, "mathjax-editing");

            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "196"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            convertImagesToLinks: false,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: null,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f209798%2ffinding-the-duration-of-how-long-a-search-has-been-popular%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            1 Answer
            up vote
            3
            down vote



            accepted










First of all, to me this code looks clean and easy to read, so overall, nicely done!



            Avoid repeated queries



            This piece of code runs the exact same query twice:




            queried_titles = [ast.literal_eval(item.title) for item in trends.query.all()]
            queried_time = [item.time for item in trends.query.all()]



It would be better to run the query once and iterate over its result twice.
Optionally, if it is not too inconvenient for the rest of the code,
a further, less important optimization would be to iterate over the results only once and extract pairs of title and time in a single pass.



            There is another form of double-querying in the constructor of trends:




            self.title = f"{list(trends_retriever('US').keys())}"
            self.traffic = f"{list(trends_retriever('US').values())}"



trends_retriever('US') will fetch and parse an HTML document,
so it would be good to do it only once.
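For example, fetching once and reusing the result (a sketch; `trends_retriever` is the question's own scraper, stubbed here to return a dict of title to traffic, and the class is capitalized for the sketch although the original names it `trends`):

```python
def trends_retriever(region):
    # Stub standing in for the question's scraper, which returns
    # a dict mapping trend title -> traffic.
    return {"trend a": "200K+", "trend b": "100K+"}

class Trends:
    def __init__(self):
        data = trends_retriever('US')  # fetch and parse only once
        self.title = f"{list(data.keys())}"
        self.traffic = f"{list(data.values())}"
```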



            Are some trends removed twice?



inst_range chains into a list the sets of trends returned by title_query.
I didn't analyze deeply enough to be sure,
but isn't it possible that there are duplicates in the resulting list?
If so, it would be better to chain into a set instead of a list.
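One way to do that (a sketch, assuming an iterable of per-row sets of removed trends, as `title_query` yields in the original):

```python
import itertools

def removed_trends_unique(sets_of_removed):
    # Flatten the per-row sets into a single set,
    # so each removed trend appears only once.
    return set(itertools.chain.from_iterable(sets_of_removed))
```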



            Do not use exception handling for flow control



            This code catches IndexError:




for index, item in enumerate(itertools.count()):
    try:
        first_row = set(queried_titles[index])
        second_row = set(queried_titles[index + 1])
        removed_trend = first_row - second_row
        yield removed_trend
    except IndexError:
        break



But there is nothing unexpected about this IndexError:
we know for certain it will be raised for the last entry of queried_titles.



            Also note that item is not used in the loop.
            For unused loop variables the convention is to name them _.



            It seems to me that you can replace this loop with for index in range(len(queried_titles) - 1).
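Put together, the generator could look like this (a sketch over plain lists of row titles):

```python
def removed_trends(queried_titles):
    """Yield the set of trends that disappear between consecutive rows."""
    for index in range(len(queried_titles) - 1):
        first_row = set(queried_titles[index])
        second_row = set(queried_titles[index + 1])
        yield first_row - second_row
```

No exception handling is needed: the range stops one short of the end, so `index + 1` is always valid.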



            Simplify logic



            I find it hard to understand the logic in inst_range.
            For example, row_count is incremented in a loop,
            until some condition is reached.
            During this loop, row_count is used as an index into a list,
            but it's never verified if this index is valid.
            In other words, it's not obvious that the loop is guaranteed to terminate before row_count goes out of range.
            Also it's not a "count", it's really an index.
            To add to the confusion, the variable is initialized before the outer for-loop,
            and conditionally reset to 0.



            Consider this alternative implementation, using more helper functions:



for item in removed_trends:
    first_instance = find_first_instance(queried_titles, item)
    last_instance = find_last_instance(queried_titles, item)
    first_time = queried_time[first_instance]
    last_time = queried_time[last_instance]
    time_difference = (last_time - first_time).total_seconds() / 3600
    yield item, round(time_difference, 2)


With the introduction of the helper functions find_first_instance and find_last_instance, the earlier doubts are simply eliminated.
Of course, those functions need to be implemented correctly,
but their complexity is now hidden from this higher-level code
and better isolated, so it will probably be easier to read.



            Beware of x in ... conditions on list



            The implementation of inst_range uses many x in ... conditions in lists.
            Keep in mind that searching in a list is an $O(n)$ operation:
            in the worst case, all elements will be checked.
            The performance penalty can become even worse when this is part of a nested loop.



            I haven't looked deeply into your code,
            but probably you can optimize this part by building two dictionaries:




            1. the first time a trend was seen

            2. the last time a trend was seen


            You could build these dictionaries by looping over trends and their times in one pass:




            • if trend not in first, then first[trend] = time

            • last[trend] = time


This will be more efficient, because a key in ... check on a dictionary is an $O(1)$ operation, and so is d[key] = ....
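The one-pass approach could look like this (a sketch; `queried_titles` is a list of rows of trend titles and `queried_time` the matching timestamps, as in the original):

```python
def first_and_last_seen(queried_titles, queried_time):
    """Build dicts mapping each trend to the first and last time it was seen."""
    first, last = {}, {}
    for row, time in zip(queried_titles, queried_time):
        for trend in row:
            if trend not in first:  # O(1) membership check on a dict
                first[trend] = time
            last[trend] = time      # overwritten on every sighting
    return first, last
```

With these two dictionaries built once, the duration of any trend is just `last[trend] - first[trend]`, with no repeated scanning of the table.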






                answered 3 hours ago









                janos

                96.9k12124350



