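Finding the most frequent IPs in a very large log file and recording their counts

Original link: https://blog.csdn.net/qq_41011894/article/details/88538872

Given a 100 GB log file, find the IPs with the most visits and return the top 3. Memory is limited to 1 GB, MapReduce may not be used, and the solution must be implemented in Java.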

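Problem analysis

With only 1 GB of memory, we cannot count every line directly in a single HashMap. The MapReduce framework is off limits, but we can borrow its principle. First, split the log into shards by hash code, which guarantees that all occurrences of the same IP land in the same file. The shards should be neither too large nor too small, so let's use 1000 of them (about 100 MB each on average for a 100 GB log). Next, find the most frequent IPs in each shard file and merge those results into one file. Finally, process the merged file to obtain the final result.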
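
The hash-based sharding described above can be sketched as follows. This is a minimal illustration rather than the code used later in the post: the class name `ShardIndex` and the method `shardOf` are hypothetical, and the shard count of 1000 is taken from the text. It uses `Math.floorMod` so that a negative `hashCode()` still yields a valid, non-negative index.

```java
final class ShardIndex {
    private ShardIndex() {}

    /** Maps a key to a shard index in the range [0, shardCount). */
    static int shardOf(String key, int shardCount) {
        // floorMod never returns a negative value for a positive divisor,
        // so no separate sign correction is needed.
        return Math.floorMod(key.hashCode(), shardCount);
    }
}
```

Calling `ShardIndex.shardOf("192.168.1.1", 1000)` twice returns the same index both times, which is the property the whole approach relies on: an IP's count is never split across two shard files.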

Code snippet 1: generating the log

    /**
     * Generates a simulated log file.
     */
    public static void createFile() {
        // Create the Random once instead of once per iteration
        final Random random = new Random();
        // Write 200,000 IP entries to the file
        for (int i = 0; i < 200000; i++) {
            try {
                // FileUtils is used here because it makes appending data convenient
                FileUtils.write(new File("D:\\temp\\log.txt"), "192.168." + random.nextInt(256) + "." + random.nextInt(256) + "\n", "UTF-8", true);
            } catch (IOException e) {
                e.printStackTrace();
            }
            System.out.println("IPs generated: " + (i + 1));
        }
    }


Code snippet 2: splitting the data into shards

 
    /**
     * Splits the data into shards and distributes it across small files.
     */
    private static void cutBlock() {
        int i = 0;
        // Read the log line by line with a BufferedReader; try-with-resources closes it
        try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("D:\\temp\\log.txt"), "UTF-8"))) {
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(++i);
                // hashCode() may be negative; Math.floorMod always yields an index in 0..999
                int fileIndex = Math.floorMod(line.hashCode(), 1000);
                // Append the line to its shard inside the block directory, which map() scans
                FileUtils.write(new File("D:\\temp\\block\\" + fileIndex + ".txt"), line + "\n", "UTF-8", true);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
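
The snippet above reopens a shard file for every single line, which becomes slow on a large log. As a hedged alternative, here is a sketch that keeps one buffered writer per shard open for the whole pass. The class `ShardSplitter`, its `split` method, and the `<index>.txt` file naming are assumptions made for illustration; it uses only `java.nio.file` from the standard library. It also assumes the operating system allows about as many open files as there are shards.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ShardSplitter {
    private ShardSplitter() {}

    /** Streams input line by line and writes each line to shardDir/<index>.txt. */
    static void split(Path input, Path shardDir, int shardCount) throws IOException {
        Files.createDirectories(shardDir);
        BufferedWriter[] writers = new BufferedWriter[shardCount];
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int index = Math.floorMod(line.hashCode(), shardCount);
                if (writers[index] == null) {
                    // Open each shard file lazily and only once, not once per line.
                    writers[index] = Files.newBufferedWriter(
                            shardDir.resolve(index + ".txt"), StandardCharsets.UTF_8);
                }
                writers[index].write(line);
                writers[index].newLine();
            }
        } finally {
            // Flush and close every writer that was opened.
            for (BufferedWriter writer : writers) {
                if (writer != null) {
                    writer.close();
                }
            }
        }
    }
}
```

Only one line of the input is held in memory at a time, plus one small write buffer per shard, so the memory footprint stays far below the 1 GB limit regardless of the log size.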


Code snippet 3: Map phase, counting each small file and taking its top entries

    /**
     * Counts the IPs in each small file and records its top three.
     */
    private static void map() {
        int f = 0;
        // Skip result files left in the same directory by an earlier run
        File[] blocks = new File("D:\\temp\\block").listFiles((dir, name) -> !name.equals("map.txt") && !name.equals("reduce.txt"));
        if (blocks == null) {
            return; // the directory does not exist or cannot be read
        }
        for (File file : blocks) {
            System.out.println("Files processed: " + (++f));
            try {
                // Count how many times each IP occurs
                Map<String, Integer> map = new HashMap<>();
                FileUtils.readLines(file, "UTF-8").forEach(s -> map.merge(s, 1, Integer::sum));
                // Sort the entries in descending order with a Stream and take the three most frequent IPs
                map.entrySet().stream().sorted(Collections.reverseOrder(Map.Entry.comparingByValue())).limit(3).forEachOrdered(e -> {
                    try {
                        // Append the top three entries to the map file
                        FileUtils.write(new File("D:\\temp\\block\\map.txt"), e.getKey() + "\t" + e.getValue() + "\n", "UTF-8", true);
                    } catch (IOException e1) {
                        e1.printStackTrace();
                    }
                });
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
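
Sorting every entry just to keep three of them does more work than necessary. As an alternative sketch, not the approach used in this post, the top entries can be selected with a small min-heap. The class `TopK` and its methods `count` and `topK` are hypothetical names; the code uses only `java.util`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class TopK {
    private TopK() {}

    /** Counts how often each line occurs. */
    static Map<String, Integer> count(Iterable<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            counts.merge(line, 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the k entries with the largest counts, highest first. */
    static List<Map.Entry<String, Integer>> topK(Map<String, Integer> counts, int k) {
        Comparator<Map.Entry<String, Integer>> byCount = Map.Entry.comparingByValue();
        // Min-heap ordered by count: the root is the weakest of the current top k.
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(byCount);
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            heap.offer(entry);
            if (heap.size() > k) {
                heap.poll(); // drop the smallest so at most k entries remain
            }
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort(byCount.reversed());
        return result;
    }
}
```

With n distinct IPs in a shard, the heap never holds more than k + 1 entries, so selection costs O(n log k) instead of the O(n log n) of a full sort. The order among entries with equal counts is not specified.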


Code snippet 4: Reduce phase, sorting the map file to get the final result

 
    /**
     * Sorts the map data and takes the final result.
     */
    private static void reduce() {
        try {
            Map<String, Integer> map = new HashMap<>();
            // Read each IP and its count from the map file
            FileUtils.readLines(new File("D:\\temp\\block\\map.txt"), "UTF-8").forEach(s -> {
                String[] split = s.split("\t");
                map.put(split[0], Integer.valueOf(split[1]));
            });
            // Sort the entries in descending order with a Stream and take the three most frequent IPs
            map.entrySet().stream().sorted(Collections.reverseOrder(Map.Entry.comparingByValue())).limit(3).forEachOrdered(e -> {
                try {
                    // Append the top three entries to the reduce file
                    FileUtils.write(new File("D:\\temp\\block\\reduce.txt"), e.getKey() + "\t" + e.getValue() + "\n", "UTF-8", true);
                } catch (IOException e1) {
                    e1.printStackTrace();
                }
            });
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
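
Why is it safe to keep only three entries per shard? Because every occurrence of an IP lands in the same shard, its per-shard count is already its global count, so any IP in the global top 3 must also be in the top 3 of its own shard. The in-memory sketch below illustrates the same shard, map, and reduce steps on a small list. The class `ShardedTopThree` and its methods are hypothetical, and it uses lists and maps in place of the files used in this post.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class ShardedTopThree {
    private ShardedTopThree() {}

    /** Top n entries of a count map, highest count first. */
    static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(n)
                .collect(Collectors.toList());
    }

    /** Shards the input by hash, takes each shard's top n, then merges the candidates. */
    static List<Map.Entry<String, Integer>> shardedTop(List<String> ips, int shardCount, int n) {
        List<Map<String, Integer>> shards = new ArrayList<>();
        for (int i = 0; i < shardCount; i++) {
            shards.add(new HashMap<>());
        }
        // "Split" and "map": count each IP inside the shard chosen by its hash.
        for (String ip : ips) {
            shards.get(Math.floorMod(ip.hashCode(), shardCount)).merge(ip, 1, Integer::sum);
        }
        // "Reduce": each IP lives in exactly one shard, so candidate keys never collide.
        Map<String, Integer> candidates = new HashMap<>();
        for (Map<String, Integer> shard : shards) {
            for (Map.Entry<String, Integer> entry : top(shard, n)) {
                candidates.put(entry.getKey(), entry.getValue());
            }
        }
        return top(candidates, n);
    }
}
```

On a small input, the result of `shardedTop` can be compared against a direct count of the whole list to check that the two agree.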


Code snippet 5: testing with the main method

    public static void main(String[] args) {
        createFile();
        cutBlock();
        map();
        reduce();
    }


Related topics

 1. How the MapReduce process works

 2. Sorting a HashMap with the Stream API



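Conclusion

That is my walkthrough of this problem. It may not be the best approach. If you found it useful, please give it a like. If you spot mistakes or know a better method, please point them out in the comments so we can learn from each other and improve together.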
