
Text and tabular data processing with regular expressions and NLP techniques

Problem Context

Imagine the following scenario: you are a sales representative for a manufacturing company. When searching for potential customers online, you have to browse through numerous websites where customers describe the products they need, some of which your company may offer. There can be hundreds of such pages, since many business customers publish their requirements on their own sites. If a program could browse these websites for you, filter out irrelevant pages, and even extract details such as the specific products and quantities needed, it could turn a day of human work into a few minutes of computation. The idea is appealing, but implementing it is far harder than conceiving it. To partially solve this problem, I used Python data processing toolkits, regular expressions, and simple Natural Language Processing (NLP) techniques.

Implementation

Many organizations publish their product requirements as tables of specifications and quantities. To process this tabular data effectively, Python packages such as pandas are a natural fit. Below is a function I implemented to handle such pages: it parses the HTML tables on a bid page, picks out the one listing products, trims padding rows and duplicated columns, and converts each row into a dictionary of fields.


    import math

    import pandas as pd

    def clean_table(table):
        # Product rows end at the first row whose third column is NaN
        # (padding rows at the bottom of the scraped HTML table).
        e = 1
        for i in range(1, len(table)):
            value = table.iloc[i, 2]
            if isinstance(value, float) and math.isnan(value):
                e = i
                break
        res = table.iloc[1:e, 3:]
        # Drop duplicated columns, then renumber columns and rows from zero.
        res = res.loc[:, ~res.iloc[1, :].duplicated()]
        res.columns = range(len(res.columns))
        res.index = range(len(res))
        return res

    def extract(bid_json):
        res = None
        try:
            tables = pd.read_html(bid_json['elements_html'])
            if len(tables) == 1:
                res = clean_table(tables[0])
            elif len(tables) > 4:
                # A first cell of '序号' ("serial number") marks an already
                # well-formed product table; otherwise clean it up first.
                res = tables[4] if tables[4].iloc[0, 0] == '序号' else clean_table(tables[4])
            if not isinstance(res, pd.DataFrame) or len(res.columns) == 0:
                res = tables[1] if tables[1].iloc[0, 0] == '序号' else clean_table(tables[1])
        except Exception as exc:
            print(exc)
            return []
        # mapping_dict (defined elsewhere) maps table headers to canonical
        # field names; each product row becomes one dict.
        products_list = []
        for i in range(1, len(res)):
            tmp_dict = {}
            for j in range(len(res.columns)):
                header = res.iloc[0, j]
                if header in mapping_dict:
                    tmp_dict[mapping_dict[header]] = res.iloc[i, j]
            products_list.append(tmp_dict)
        print(products_list)
        return products_list
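
For reference, mapping_dict is defined elsewhere in the project; it translates the column headers of the scraped tables into canonical field names. A minimal sketch of the idea, with made-up headers and field names (the real mapping is project-specific):

    # Illustrative only: the actual header strings and field names are assumptions.
    mapping_dict = {
        '产品名称': 'product_name',  # "product name"
        '数量': 'quantity',          # "quantity"
        '型号': 'model',             # "model"
    }

    products = extract(bid_json)  # bid_json['elements_html'] holds the page's HTML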
  

When classifying a single line of text into a few categories, large language models might be unnecessarily heavy machinery. Traditional techniques such as TF-IDF (Term Frequency-Inverse Document Frequency) features fed into a classical machine learning model often solve the problem more efficiently. A sample snippet demonstrating this approach is shown below; the texts are tokenized with jieba and classified with a Gaussian Naive Bayes model:


    import jieba
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import GaussianNB

    # df (loaded elsewhere): column 1 holds the text, column 2 its category label.
    train_texts = list(df.iloc[:, 1])
    train_labels = list(df.iloc[:, 2])
    # jieba.lcut segments Chinese text; TF-IDF weights the resulting terms.
    self.vectorizer = TfidfVectorizer(use_idf=True, tokenizer=jieba.lcut)
    X = self.vectorizer.fit_transform(train_texts)
    self.clf = GaussianNB()  # GaussianNB needs dense input, hence toarray()
    self.clf.fit(X.toarray(), train_labels)
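
At inference time, the same fitted vectorizer and classifier label a new line of text. A minimal sketch (the input sentence is a made-up example):

    # Vectorize with the already-fitted vectorizer, then predict the category.
    x_new = self.vectorizer.transform(['采购电缆100米'])  # made-up input: "purchasing 100 m of cable"
    predicted_label = self.clf.predict(x_new.toarray())[0]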
  

When the target content reliably follows a fixed pattern, regular expressions are a convenient way to pull out the desired information. On these bid pages, for instance, contact names and submission deadlines appear in predictable forms:


    from bs4 import BeautifulSoup
    import re

    # Strip tags, then drop non-breaking spaces and line breaks left over from the HTML.
    text = (BeautifulSoup(bid_json['elements_html'], 'html.parser').get_text()
            .replace('\xa0', '').replace('\n', '').replace('\r', ''))
    # '姓名' ("name"): capture the run of Chinese characters after the label.
    name = re.findall(r'姓名.*?([\u4e00-\u9fa5]+)', text)
    # '截止时间' ("deadline"): a timestamp starting with '2' (the year) up to
    # '秒' ("seconds"); [::]? accepts an optional half- or full-width colon.
    deadline = re.findall(r'截止时间[::]?(2.*?秒)', text)
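
To illustrate, here is what the two patterns capture on a made-up snippet of page text (张三, "Zhang San", is a placeholder name):

    sample = '联系人姓名：张三 报名截止时间：2024年06月01日10时00分00秒'
    re.findall(r'姓名.*?([\u4e00-\u9fa5]+)', sample)  # ['张三']
    re.findall(r'截止时间[::]?(2.*?秒)', sample)      # ['2024年06月01日10时00分00秒']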