Repository: huangfox/dpkb
Branch: main
Commit: e62b935e9354
Files: 19
Total size: 35.5 KB

Directory structure:
gitextract_e8gsxs7x/
├── README.md
└── columns/
    ├── doris/
    │   ├── Doris全面解析.md
    │   └── Doris最佳实践.md
    ├── flink/
    │   ├── Apache Flink 漫谈系列.md
    │   ├── Flink 相关论文.md
    │   ├── Flink实战系列.md
    │   ├── Flink开源项目汇总.md
    │   ├── Flink架构、源码分析专栏.md
    │   ├── Flink进阶教程.md
    │   └── Flink零基础入门.md
    ├── hive/
    │   └── hive教程.md
    ├── kudu/
    │   ├── Kudu原理论文.md
    │   └── 网易云Kudu技术文章.md
    ├── opensource/
    │   └── 数仓相关开源项目汇总.md
    ├── presto/
    │   ├── Presto最佳实践、调优、踩坑专栏.md
    │   ├── Presto架构、源码分析专栏.md
    │   └── Presto资料汇总、会议资讯专栏.md
    ├── spark/
    │   └── Apache Spark的设计与实现.md
    └── starrocks/
        └── StarRocks技术内幕.md

================================================
FILE CONTENTS
================================================

================================================
FILE: README.md
================================================
# DPKB

A knowledge base for big data, mainly covering:
* Data storage layer and databases (HDFS, Hive, HBase, Kudu, Doris, StarRocks, ClickHouse, TiDB, etc.)
* Data processing layer and OLAP engines (Spark, Flink, Presto, Trino, etc.)
* Data lakes (Iceberg, Hudi, Delta, etc.)
* Big data development and applications (mainly ETL, scheduling, data warehousing, and data applications, e.g. SeaTunnel and DolphinScheduler)
* Data governance (metadata management, data models, data standards, data quality, data security, etc.)

Continuously updated (2024-12)

## 1. Data storage layer and databases (HDFS, Hive, HBase, Kudu, Doris, StarRocks, ClickHouse, TiDB, etc.)

### ▶ HDFS

### ▶ YARN
#### 1) Principles
- [Hadoop Yarn 一文搞懂 Yarn架构原理和工作机制](https://www.cnblogs.com/liangzilx/p/14837562.html)

### ▶ Hive
#### 1) Official site, community, blogs
- [Hive official site](https://hive.apache.org/)
#### 2) Columns
- [Hive tutorial](columns/hive/hive教程.md)
#### 3) Practice at major companies
- [HiveCube 在有赞的实践](https://tech.youzan.com/cube/) 2019-11
- [Hive Metastore Federation 在滴滴的实践](https://blog.didiyun.com/index.php/2019/03/25/hive-metastore-federation/) 2019-03

### ▶ HBase
#### 1) Official site, community, blogs
- [HBase official site](https://hbase.apache.org/)
- [hbasefly](http://hbasefly.com/)
#### 2) Columns
#### 3) Practice at major companies
#### 4) Other
- [HBase Bulkload 实践探讨](https://tech.youzan.com/hbase-bulkloadshi-practice/) 2019-12

### ▶ Kudu
#### 1) Official site, community, blogs
- [Kudu official site](https://kudu.apache.org/)
#### 2) Columns
- [Kudu internals and papers](columns/kudu/Kudu原理论文.md)
- 
[NetEase Cloud Kudu technical articles](columns/kudu/网易云Kudu技术文章.md)
#### 3) Practice at major companies
- [Apache Kudu 在网易的实践](https://www.infoq.cn/article/kgwyqb5wer5wl8cquweq) 2021-08
- [Apache Kudu 在网易实时数仓的实践](https://www.infoq.cn/article/QETxjyIu5tAJTZ9ksMdu) 2020-02
- [Kudu架构介绍及其在小米的应用实践](https://www.modb.pro/db/119708) 2017-06
#### 4) Other
- [我是如何成为Apache Kudu committer & PMC 的?](https://cloud.tencent.com/developer/article/1450749) 2019-06

### ▶ Doris
#### 1) Official site, community, blogs
- [Doris official site](https://doris.apache.org/)
- [Doris GitHub](https://github.com/apache/doris)
- [Doris forum](https://github.com/apache/incubator-doris/discussions)
#### 2) Columns
- [Doris in-depth analysis](columns/doris/Doris全面解析.md)
- [Doris best practices](columns/doris/Doris最佳实践.md)
#### 3) Case studies
- [Apache Doris在美团外卖数仓中的应用实践](https://tech.meituan.com/2020/04/09/doris-in-meituan-waimai.html) 2020-04
- [Apache Doris 在韵达物流领域的应用实践](https://mp.weixin.qq.com/s/Z_PhWk92ctZ7slz4SrVZ9Q) 2021-07
- [Apache Doris 在蜀海供应链的实践](https://mp.weixin.qq.com/s/SHuE-KCsIyh6jfo0DqLD6w) 2021-07
- [京东物流基于 Doris 的亿级数据自助探索应用](https://mp.weixin.qq.com/s/qVFa40yMg0_N9Lsb10ACQA) 2021-07
- [Doris on ES在快手商业化的最佳实践](https://mp.weixin.qq.com/s/5Pc5ewVFWPgauG4hNLH9xw) 2021-08
- [基于Doris的有道精品课数据中台建设实践](https://mp.weixin.qq.com/s/Gz-au9CHJ4lHrs5MkzeAJg) 2020-12
- [美团外卖实时数仓建设实践](https://mp.weixin.qq.com/s/-JPWqa_-at7F5hZ0zekVSQ) 2020-10
- [Doris在作业帮实时数仓中的应用&实践](https://mp.weixin.qq.com/s/hjbMM8CbElO04VLN5cfJtQ) 2020-09
- [基于Apache Doris的小米增长分析平台实践](https://mp.weixin.qq.com/s/WeNAItPJ4b7fsqW4kf0dSA) 2020-08
- [Apache Doris在京东双十一大促中的实践](https://mp.weixin.qq.com/s/8XnwJXm4kzq56SvElwL6kA) 2020-03
- [Apache Doris 在百度商业大规模微服务全链路监控的实践](https://mp.weixin.qq.com/s/k7CcCdHPTK1ZTDs_qKgh5w) 2020-02

### ▶ StarRocks
#### 1) Official site, community, blogs
- [StarRocks](https://www.starrocks.com/zh-CN/index)
- [StarRocks documentation](https://docs.starrocks.com/zh-cn/main/introduction/StarRocks_intro)
- [编程小梦 康凯森](https://blog.bcmeng.com/)
#### 2) Columns
- [StarRocks internals](columns/starrocks/StarRocks技术内幕.md)

### ▶ ClickHouse
#### 1) Official site, community, blogs
- [ClickHouse 
official site](https://clickhouse.com/)
#### 2) Columns
#### 3) Practice at major companies
- [ClickHouse 在有赞的实践之路](https://tech.youzan.com/clickhouse-zai-you-zan-de-shi-jian-zhi-lu/) 2021-01
#### 4) Other

## 2. Data processing layer and OLAP engines (Spark, Flink, Presto, Trino, etc.)

### ▶ Spark
#### 1) Official site, community, blogs
- [Spark official site](https://spark.apache.org/)
#### 2) Columns
- [The design and implementation of Apache Spark](columns/spark/Apache%20Spark的设计与实现.md)
#### 3) Practice at major companies
- [SparkSQL 在有赞的实践](https://tech.youzan.com/sparksql-in-youzan/) 2019-01
- [SparkSQL 在有赞大数据的实践(二)](https://tech.youzan.com/sparksql-in-youzan-2/) 2020-01

### ▶ Flink
#### 1) Official site, community, blogs
- [Flink official site](https://flink.apache.org/)
- [Flink Confluence](https://cwiki.apache.org/confluence/display/FLINK/)
- [Flink Blog](https://flink.apache.org/blog/)
- [Ververica Blog](https://www.ververica.com/blog?hsLang=en)
- [Ververica (Chinese)](https://ververica.cn/developers-resources/)
- [Flink knowledge map](https://ververica.cn/wp-content/uploads/2020/03/Apache-Flink-Stateful-Computations-over-Data-Streams.pdf)
- [Jark's Blog - WuChong - 云邪](http://wuchong.me/)
#### 2) Columns
- [Flink architecture and source code analysis](columns/flink/Flink架构、源码分析专栏.md)
- [Flink in practice series](columns/flink/Flink实战系列.md)
- [Flink open-source projects](columns/flink/Flink开源项目汇总.md)
##### Tutorials
- [Flink SQL Cookbook - Ververica](https://github.com/ververica/flink-sql-cookbook/)
- [Flink for beginners](columns/flink/Flink零基础入门.md)
- [Flink advanced tutorial](columns/flink/Flink进阶教程.md)
- [Apache Flink talk series](columns/flink/Apache%20Flink%20漫谈系列.md)
- [Flink-related papers](columns/flink/Flink%20相关论文.md)
#### 3) Practice at major companies
- [flink-forward-asia-hackathon-2021](https://github.com/flink-china/flink-forward-asia-hackathon-2021/issues)

### ▶ Presto, Trino
#### 1) Official site, community, blogs
- [PrestoDB official site](https://prestodb.io/)
- [Trino official site](https://trino.io/) formerly PrestoSQL
- [Google Presto Group](https://groups.google.com/g/presto-users)
- [Presto Zhihu column](https://www.zhihu.com/column/presto-cn)
- [若飞 tech blog](http://armsword.com/archives/)
#### 2) Columns
- [Presto architecture and source code analysis](columns/presto/Presto架构、源码分析专栏.md)
- [Presto best practices, tuning, and pitfalls](columns/presto/Presto最佳实践、调优、踩坑专栏.md)
- [Presto 
resources, conferences, and news](columns/presto/Presto资料汇总、会议资讯专栏.md)
#### 3) Practice at major companies
- [Presto 在车好多的实践](https://mp.weixin.qq.com/s/Bmqv54sVZgTqQ82I_RfmsA) 2020-12
- [Presto 在滴滴的探索与实践](https://zhuanlan.zhihu.com/p/266162270) 2020-10
- [Presto 在有赞的实践之路](https://tech.youzan.com/presto-zai-you-zan-de-shi-jian-zhi-lu/) 2020-04
- [PrestoCon 2020:云原生数据湖分析DLA的Presto实践](https://zhuanlan.zhihu.com/p/260784762) 2020-03
- [携程 Presto 技术演进之路](https://zhuanlan.zhihu.com/p/41538472) 2018-08
- [Presto 实现原理和美团的使用实践](https://tech.meituan.com/2014/06/16/presto.html) 2014-06
- [Presto 高性能引擎在美图的实践](https://zhuanlan.zhihu.com/p/408957032) 2021-09

## 3. Data lakes (Iceberg, Hudi, Delta, etc.)
- [一文看懂:什么是数据库、数据湖、数据仓库、湖仓一体、智能湖仓?](https://www.smartcity.team/consultingskills/experience/shujukuyushujuhu/#comments) 2021-08

### ▶ Iceberg
#### 1) Official site, community, blogs
- [Iceberg official site](https://iceberg.apache.org/)
#### 2) Applications
- [数据湖 Iceberg | 实时数据仓库的发展、架构和趋势](https://mp.weixin.qq.com/s?__biz=MzIwNTUxNTI1Ng==&mid=2247485623&idx=1&sn=9f03a36dbfc06c712b6132faabaa1dfd&chksm=972ef820a05971360311fd69c686e4b420222cfa639a1bcb5648bece4c3d886ae8f981712d8c&scene=21#wechat_redirect) 2021-03
- [数据湖 Iceberg | Apache Iceberg 快速入门](https://mp.weixin.qq.com/s?__biz=MzIwNTUxNTI1Ng==&mid=2247485637&idx=1&sn=0489f233e3bda2bcef221c9532bb001e&chksm=972ef852a0597144538b7807948443a27e58f99ba33d17a7bcb12ccb8b382fd1d712d6e80cbc&cur_album_id=1746684202856579076&scene=190#rd) 2021-03
- [数据湖 Iceberg | 如何正确使用 Iceberg](https://mp.weixin.qq.com/s?__biz=MzIwNTUxNTI1Ng==&mid=2247485644&idx=1&sn=b2194d8f3c1e7cf7e8e8d9296b9025e2&chksm=972ef85ba059714dc69472e3860497389f2ca4503d2cddeedd348695b5c314da49aad0278978&cur_album_id=1746684202856579076&scene=190#rd) 2021-04
- [数据湖 Iceberg | 在网易云音乐的实践](https://mp.weixin.qq.com/s?__biz=MzIwNTUxNTI1Ng==&mid=2247485718&idx=1&sn=34347ac54e97877e4401ad37f1d15577&chksm=972ef981a059709724b7abab56786ef047a68f31fd829031d2214fa4994b9ec0f1b04e25318c&cur_album_id=1746684202856579076&scene=190#rd) 2021-04

### ▶ Hudi
#### 1) Official site, community, blogs
- [Hudi 
official site](https://hudi.apache.org/)
#### 2) Applications
- [Flink CDC + Hudi + Hive + Presto 构建实时数据湖最佳实践](https://mp.weixin.qq.com/s/079VeDeIM_MQPyiiDX2l_w)

### ▶ Delta

## 4. Big data development and applications (mainly ETL, scheduling, data warehousing, and data applications, e.g. SeaTunnel and DolphinScheduler)

### ▶ SeaTunnel

### ▶ DolphinScheduler

### ▶ Big data architecture
- [SQL on Hadoop 在快手大数据平台的实践与优化](https://www.infoq.cn/article/BN9cJjg1t-QSWE6fqkoR) 2019-06
- [携程机票大数据架构最佳实践](https://dbaplus.cn/news-73-1420-1.html) 2017-08
- [火山引擎DataLeap一站式数据治理解决方案及平台架构](https://www.cnblogs.com/bytedata/p/17745908.html) 2023-10

### ▶ Data warehousing
- [有赞数据仓库实践之路](https://tech.youzan.com/dw-in-youzan/) 2020-03
- [OneData 建设探索之路:SaaS 收银运营数仓建设](https://tech.meituan.com/2019/10/17/meituan-saas-data-warehouse.html) 2019-10
- [面向AI技术的工程架构实践 | 贝壳一站式大数据开发平台实践](https://www.infoq.cn/article/mmnwzdlcyjg83qm0tgqm) 2020-11

### ▶ Reporting platforms
- [有赞 BI 平台实现原理](https://tech.youzan.com/principle-on-bi-platform/) 2021-01

## 5. Data governance (metadata management, data metrics, data standards, data quality, data security, etc.)

### ▶ Data governance
- [美团配送数据治理实践](https://tech.meituan.com/2020/03/12/delivery-data-governance.html) 2020-03
- [全链路数据治理在网易严选的实践](https://www.infoq.cn/article/FOV6aEWRGNOfhD91YVcr) 2020-10
- [数据资产、数据治理 - 有赞](https://tech.youzan.com/shu-ju-zi-chan-zan-zhi-zhi-li/) 2019-11
- [美团酒旅起源数据治理平台的建设与实践](https://tech.meituan.com/2018/12/27/onedata-origin.html) 2018-12
- [滴滴数据仓库指标体系建设实践](https://mp.weixin.qq.com/s/-pLpLD_HMiasyyRxo5oTRQ) 2020-08
- [有赞指标库实践](https://tech.youzan.com/you-zan-zhi-biao-ku-shi-jian/) 2020-03
- [浅谈有赞大数据安全体系](https://tech.youzan.com/you-zan-da-shu-ju-an-quan-ti-xi-jian-she-shi-jian/) 2021-01

### ▶ Metadata management
- [字节跳动构建Data Catalog数据目录系统的实践](https://www.cnblogs.com/bytedata/p/16189474.html) 2022-04
- [有赞数据仓库元数据系统实践](https://tech.youzan.com/youzan-metadata/) 2018-08
- [饿了么元数据管理实践之路](https://dbaplus.cn/news-73-2143-1.html) 2018-07
- [数据治理方案技术调研 Atlas VS Datahub VS Amundsen](https://cloud.tencent.com/developer/article/1746714) 2020-11
- [数据资产治理-元数据采集那点事 - 有赞](https://tech.youzan.com/zi-chan-zhi-li-yuan-shu-ju-cai-ji-na-dian-shi/) 2020-12
- 
[来看看字节跳动内部的数据血缘用例与设计](https://segmentfault.com/a/1190000041452770) 2022-02
- [携程数据血缘构建及应用](https://mp.weixin.qq.com/s/LGK3YPZCe6oPTf48QaAIqA) 2021-09
- [Datahub](https://datahubproject.io/) A Metadata Platform for the Modern Data Stack

## 6. Machine learning and AI

### ▶ Machine learning platforms
- [机器学习平台建设指南](https://mp.weixin.qq.com/s/HEg_6Gly2WMrcPD5Ao2n6g) 2021-04
- [一站式机器学习平台建设实践](https://mp.weixin.qq.com/s/ZDRD0vAxkSqe4UeXi9avKQ) 2020-02
- [汽车之家机器学习平台的架构与实践](https://blog.csdn.net/hellozhxy/article/details/107210015) 2020-07
- [微博推荐算法实践与机器学习平台演进](https://blog.csdn.net/m0_37586850/article/details/116465255) 2021-05
- [爱奇艺机器学习平台的建设实践](https://mp.weixin.qq.com/s/Np4w7RC2JFlB7ZGIduu71w) 2020-11
- [爱奇艺一站式机器学习平台Deepthought的建设与初探](https://mp.weixin.qq.com/s?__biz=MzI0MjczMjM2NA==&mid=2247487206&idx=1&sn=c8db1e12378376722a1521f409149d44&chksm=e97692c5de011bd3f1b42a8112cd04c24907cb101ac5474b0054c95941ff5c4769a42d496f3a&scene=21#wechat_redirect) 2020-06
- [一站式机器学习平台在 vivo AI 的实践](https://www.infoq.cn/article/THlkStomYLRgXL2hzm8w) 2020-02
- [再见,Yarn!滴滴机器学习平台架构演进](https://mp.weixin.qq.com/s/iTfHv8EFx4O4G1sNxsuMkg) 2019-03
- [网易严选机器学习平台建设实践](https://www.6aiq.com/article/1661745581086) 2022
- [Sunfish-有赞智能平台实践](https://tech.youzan.com/sunfish/) 2020-06
- [同程-利用已有的大数据技术,如何构建机器学习平台](https://www.infoq.cn/news/build-machine-learning-platform-bigdata) 2017-11

## 7. LLM applications

### ▶ Text2SQL
- [NL2SQL基础系列(1):业界顶尖排行榜、权威测评数据集及LLM大模型(Spider vs BIRD)全面对比优劣分析](https://blog.csdn.net/sinat_39620217/article/details/137603846)
- [NL2SQL基础系列(2):主流大模型与微调方法精选集,Text2SQL经典算法技术回顾七年发展脉络梳理](https://blog.csdn.net/sinat_39620217/article/details/137603958)
- [NL2SQL进阶系列(1):DB-GPT-Hub、SQLcoder、Text2SQL开源应用实践详解](https://blog.csdn.net/sinat_39620217/article/details/137674671)

## 8. Resources

### ▶ Tech blogs of major companies
- [Meituan tech team](https://tech.meituan.com/)
- [Youzan tech team](https://tech.youzan.com/)
- [DiDi Cloud blog](https://blog.didiyun.com/)

### ▶ Big data websites
- [dbaplus](https://dbaplus.cn/)

### ▶ Related open-source projects
- [Data warehouse open-source projects](columns/opensource/数仓相关开源项目汇总.md)

### ▶ Related papers
- [raft 
Chinese translation](https://github.com/maemual/raft-zh_cn/blob/master/raft-zh_cn.md)

================================================
FILE: columns/doris/Doris全面解析.md
================================================
# Doris In-Depth Analysis

## Principles
- [Apache Doris : 一个开源 MPP 数据库的架构与实践](https://www.jianshu.com/p/d3742af8ecce)

## Storage
- [存储层设计介绍1——存储结构设计解析](https://mp.weixin.qq.com/s/aJ3FwDI6KprYYUwXzhl_-A) 2020-07
- [存储层设计介绍2——写入流程、删除流程分析](https://mp.weixin.qq.com/s/xl4ePcsSVPPNQDGBw-KoKA) 2020-07
- [存储层设计介绍3——读取流程、Compaction流程分析](https://mp.weixin.qq.com/s/U9w3VxCKhTk_3Sglo9J-aA) 2020-08
- [Doris Compaction机制解析](https://mp.weixin.qq.com/s/5D1gAOEiFWM7N6KPwqHHdw) 2021-02
- [Apache Doris Parquet文件读取的设计与实现](https://mp.weixin.qq.com/s/5D6G_kvl9TzYCMIgynhERA) 2019-08
- [Doris核心功能介绍——数据模型和物化视图](https://mp.weixin.qq.com/s/eRUg1du8AQxLvqYjJ621fA) 2020-07

## Compute
- [Apache Doris 查询原理](https://blog.bcmeng.com/post/apache-doris-query.html) 2020-03
- [Doris SQL 原理解析](https://mp.weixin.qq.com/s/v1jI1MxEHPT5czCWd0kRxw) 2021-01
- [Doris Stream Load原理解析](https://mp.weixin.qq.com/s/NUSHwAUsFskSXG5R0mw8kg) 2021-06
- [Apache Doris 索引机制解析](https://mp.weixin.qq.com/s/KdCdXb9Z3MdUZ5S0RV726Q) 2021-09
- [Spark Doris Sink的设计和实现](https://mp.weixin.qq.com/s/uoPLfFBv9Vt2gg9HEriR0Q) 2019-08

## Other
- [Doris基于Hive表的全局字典设计与实现](https://mp.weixin.qq.com/s/YlZnlMTTI8xhULmk1y-N6w) 2020-08

================================================
FILE: columns/doris/Doris最佳实践.md
================================================
# Doris Best Practices

## Tuning
- [Compaction调优(1)](https://mp.weixin.qq.com/s/Kv71HomwNioHQDz8NUec1A) 2021-06
- [Compaction调优(2)](https://mp.weixin.qq.com/s/mJrxpvYIoE9rgP9Hvo1Dnw) 2021-06
- [Compaction调优(3)](https://mp.weixin.qq.com/s/cZmXEsNPeRMLHp379kc2aA) 2021-06
- [Apache Doris Join 实现与调优实践](https://mp.weixin.qq.com/s/pukjERSOW-D-BM4z1G9JlA) 2021-09

## Business use cases
- [Apache Doris 基于 Bitmap的精确去重和用户行为分析](https://mp.weixin.qq.com/s/e0IrXgkinpeEDKi0etfGKA) 2020-01
- [Doris在用户画像人群业务的应用](https://mp.weixin.qq.com/s/HGyIgqCIIXfeJtNdKbj-fQ) 
2020-10

## Integration with other components
- [基于 Iceberg 拓展 Doris 数据湖能力的实践](https://mp.weixin.qq.com/s/Vgo2kWED8oxg45x6zumEYQ) 2021-07
- [Flink 消费 Kafka 实时写入 Apache Doris(KFD)](https://mp.weixin.qq.com/s/nUeHwFBQs50EvPukqnrinQ) 2021-09
- [Spark Doris Connector的最佳实践](https://mp.weixin.qq.com/s/c8zE7ymv6jC1WTlV44dldQ) 2020-04
- [ProxySQL实现Doris FE高可用](https://mp.weixin.qq.com/s/XHgtIzekxkiGCjqcRbqndw) 2020-08

## Other
- [Apache Doris和ClickHouse的深度分析](https://mp.weixin.qq.com/s/fyVSRB3wxmsZUx4kY1eQRQ) 2021-10

================================================
FILE: columns/flink/Apache Flink 漫谈系列.md
================================================
# Apache Flink Talk Series (Alibaba Cloud Realtime Compute for Apache Flink)

## Tutorials
- [Apache Flink 漫谈系列(01) - 序](https://developer.aliyun.com/article/666043?spm=a2c6h.14164896.0.0.541b7cb2dQp6jL)
- [Apache Flink 漫谈系列(02) - 概述](https://developer.aliyun.com/article/666052?spm=a2c6h.14164896.0.0.541b7cb2dQp6jL)
- [Apache Flink 漫谈系列(03) - Watermark](https://developer.aliyun.com/article/666056?spm=a2c6h.14164896.0.0.541b7cb2dQp6jL)
- [Apache Flink 漫谈系列(04) - State](https://developer.aliyun.com/article/667562?spm=a2c6h.14164896.0.0.541b7cb2dQp6jL)
- [Apache Flink 漫谈系列(05) - Fault Tolerance](https://developer.aliyun.com/article/667564?spm=a2c6h.14164896.0.0.541b7cb2dQp6jL)
- [Apache Flink 漫谈系列(06) - 流表对偶(duality)性](https://developer.aliyun.com/article/667566?spm=a2c6h.14164896.0.0.59817cb20Sk3GI)
- [Apache Flink 漫谈系列(07) - 持续查询(Continuous Queries)](https://developer.aliyun.com/article/667700?spm=a2c6h.14164896.0.0.541b7cb2dQp6jL)
- [Apache Flink 漫谈系列(08) - SQL概览](https://developer.aliyun.com/article/670202?spm=a2c6h.14164896.0.0.59817cb20Sk3GI)
- [Apache Flink 漫谈系列(09) - JOIN 算子](https://developer.aliyun.com/article/672760?spm=a2c6h.14164896.0.0.59817cb20Sk3GI)
- [Apache Flink 漫谈系列(10) - JOIN LATERAL](https://developer.aliyun.com/article/674345?spm=a2c6h.14164896.0.0.59817cb20Sk3GI)
- [Apache Flink 漫谈系列(11) - Temporal Table 
JOIN](https://developer.aliyun.com/article/679659?spm=a2c6h.14164896.0.0.59817cb20Sk3GI)
- [Apache Flink 漫谈系列(12) - Time Interval(Time-windowed) JOIN](https://developer.aliyun.com/article/683681?spm=a2c6h.14164896.0.0.541b7cb2dQp6jL)
- [Apache Flink 漫谈系列(13) - Table API 概述](https://developer.aliyun.com/article/685085?spm=a2c6h.14164896.0.0.59817cb20Sk3GI)
- [Apache Flink 漫谈系列(14) - DataStream Connectors之Kafka](https://developer.aliyun.com/article/686809?spm=a2c6h.14164896.0.0.541b7cb2dQp6jL)

## Resources
- [Alibaba Cloud Realtime Compute for Apache Flink](https://developer.aliyun.com/group/sc?spm=a2c6h.12873639.0.0.e12d59b2IvG4B2#/?_k=9flh5j)

================================================
FILE: columns/flink/Flink 相关论文.md
================================================
# Flink-Related Papers
- [Distributed Snapshots: Determining Global States of Distributed Systems](https://www.microsoft.com/en-us/research/uploads/prod/2016/12/Determining-Global-States-of-a-Distributed-System.pdf?ranMID=24542&ranEAID=J84DHJLQkR4&ranSiteID=J84DHJLQkR4-mVoVymFnAblBx3zwyf98Pw&epi=J84DHJLQkR4-mVoVymFnAblBx3zwyf98Pw&irgwc=1&OCID=AID2000142_aff_7593_1243925&tduid=%28ir__1hs2uuow6wkfq3oxkk0sohzzwm2xpc33lxd0o6g200%29%287593%29%281243925%29%28J84DHJLQkR4-mVoVymFnAblBx3zwyf98Pw%29%28%29&irclickid=_1hs2uuow6wkfq3oxkk0sohzzwm2xpc33lxd0o6g200)

================================================
FILE: columns/flink/Flink实战系列.md
================================================
# Flink in Practice Series
- [从零构建Flink SQL计算平台 - 1平台搭建概述](https://www.cnblogs.com/pyx0/p/12348114.html)
- [从零构建Flink SQL计算平台 - 2实现作业提交](https://www.cnblogs.com/pyx0/p/12387509.html)
- [从零构建Flink SQL计算平台 - 3实现校验和调试](https://www.cnblogs.com/pyx0/p/12441367.html)
- [网易游戏基于 Flink 的流式 ETL 建设](http://www.whitewood.me/2020/12/20/%E7%BD%91%E6%98%93%E6%B8%B8%E6%88%8F%E5%9F%BA%E4%BA%8E-Flink-%E7%9A%84%E6%B5%81%E5%BC%8F-ETL-%E5%BB%BA%E8%AE%BE/) 2020-12

================================================
FILE: columns/flink/Flink开源项目汇总.md
================================================
# Flink Open-Source Projects
- 
[flink-sql-gateway](https://github.com/ververica/flink-sql-gateway#readme)
- [flink-jdbc-driver](https://github.com/ververica/flink-jdbc-driver)
- [flinkStreamSQL](https://github.com/DTStack/flinkStreamSQL)
- [flinkx](https://github.com/DTStack/flinkx)
- [waterdrop](https://github.com/InterestingLab/waterdrop)
- [streamx](https://github.com/streamxhub/streamx)
- [flink-streaming-platform-web](https://github.com/zhp8341/flink-streaming-platform-web)
- [dlink](https://github.com/DataLinkDC/dlink)
- [plink](https://github.com/hairless/plink)

================================================
FILE: columns/flink/Flink架构、源码分析专栏.md
================================================
# Flink Architecture and Source Code Analysis

## Stream processing principles
- [Streaming 101: The world beyond batch](https://www.oreilly.com/radar/the-world-beyond-batch-streaming-101/)
- [Streaming 102: The world beyond batch](https://www.oreilly.com/radar/the-world-beyond-batch-streaming-102/)

## DataSet, DataStream

## Table, SQL

## Time, Watermark
- [Flink Watermark 机制浅析](http://www.whitewood.me/2018/06/01/Flink-Watermark-%E6%9C%BA%E5%88%B6%E6%B5%85%E6%9E%90/) 2018-06

## State
- [Flink State 最佳实践](https://ververica.cn/developers/flink-state-best-practices/) 2020-04

## Checkpoint, Savepoint
- Keywords: unaligned barriers
- [分布式快照算法: Chandy-Lamport 算法](https://zhuanlan.zhihu.com/p/53482103) 2020-11
- [Flink Checkpoint 原理流程以及常见失败原因分析](https://tech.youzan.com/flink_checkpoint_mechanism/) 2019-12
- [Flink 轻量级异步快照 ABS 实现原理](http://www.whitewood.me/2018/05/13/Flink-%E8%BD%BB%E9%87%8F%E7%BA%A7%E5%BC%82%E6%AD%A5%E5%BF%AB%E7%85%A7-ABS-%E5%AE%9E%E7%8E%B0%E5%8E%9F%E7%90%86/) 2018-05
- [Flink Checkpoint/Savepoint 差异](http://www.whitewood.me/2018/09/06/Flink-Checkpoint-Savepoint-%E5%B7%AE%E5%BC%82/) 2018-09

## Operators
### Windows
### Joining
### ProcessFunction

## Connector
- [漫谈 Flink Source 接口重构](http://www.whitewood.me/2020/02/11/%E6%BC%AB%E8%B0%88-Flink-Source-%E6%8E%A5%E5%8F%A3%E9%87%8D%E6%9E%84/) 2020-02
- [Flink JDBC Connector:Flink 
与数据库集成最佳实践](https://developer.aliyun.com/article/776069)

## Flink On YARN
- [Flink on YARN(上):一张图轻松掌握基础架构与启动流程](https://developer.aliyun.com/article/719262)
- [Flink on YARN(下):常见问题与排查思路](https://developer.aliyun.com/article/719703)

================================================
FILE: columns/flink/Flink进阶教程.md
================================================
# Flink Advanced Tutorial

Date: 2019
Source: Ververica Chinese community

- [Apache Flink 进阶教程(一):Runtime 核心机制剖析](https://ververica.cn/developers/advanced-tutorial-1-analysis-of-the-core-mechanism-of-runtime/)
- [Apache Flink 进阶教程(二):Time 深度解析](https://ververica.cn/developers/advanced-tutorial-2-time-depth-analysis/)
- [Apache Flink 进阶教程(三):Checkpoint 的应用实践](https://ververica.cn/developers/advanced-tutorial-2-checkpoint-application-practice/)
- [Apache Flink 进阶教程(四):Flink on Yarn/K8s 原理剖析及实践](https://ververica.cn/developers/advanced-tutorial-2-flink-on-yarn-k8s/)
- [Apache Flink 进阶教程(五):数据类型和序列化](https://ververica.cn/developers/advanced-tutorial-2-serialize/)
- [Apache Flink 进阶教程(六):Flink 作业执行深度解析](https://ververica.cn/developers/advanced-tutorial-2-flink-job-execution-depth-analysis/)
- [Apache Flink 进阶教程(七):网络流控及反压剖析](https://ververica.cn/developers/advanced-tutorial-2-analysis-of-network-flow-control-and-back-pressure/)
- [Apache Flink 进阶教程(八):详解 Metrics 原理与实战](https://ververica.cn/developers/advanced-tutorial-2-principles-and-practice-of-metrics/)

================================================
FILE: columns/flink/Flink零基础入门.md
================================================
# Flink for Beginners

Date: 2019
Source: Ververica Chinese community

- [Apache Flink 零基础入门(一&二):基础概念解析](https://ververica.cn/developers/flink-basic-tutorial-1-basic-concept/)
- [Apache Flink 零基础入门(三):开发环境搭建和应用的配置、部署及运行](https://ververica.cn/developers/flink-basic-tutorial-1-environmental-construction/)
- [Apache Flink 零基础入门(四):DataStream API 编程](https://ververica.cn/developers/apache-flink-basic-zero-iii-datastream-api-programming/)
- [Apache Flink 
零基础入门(五):客户端操作](https://ververica.cn/developers/apache-flink-zero-basic-introduction-iv-client-operation/)
- [Apache Flink 零基础入门(六):Flink Time & Window 解析](https://ververica.cn/developers/time-window/)
- [Apache Flink 零基础入门(七):状态管理及容错机制](https://ververica.cn/developers/state-management/)
- [Apache Flink 零基础入门(八):Table API 编程](https://ververica.cn/developers/table-api-programming/)
- [Apache Flink 零基础入门(九):Flink SQL 编程实践](https://ververica.cn/developers/flink-sql-programming-practice/)

================================================
FILE: columns/hive/hive教程.md
================================================
# Hive Tutorial

## Hive Learning Path (Hive学习之路), 2018
- [Hive学习之路 (一)Hive初识](https://www.cnblogs.com/qingyunzong/p/8707885.html)
- [Hive学习之路 (二)Hive安装](https://www.cnblogs.com/qingyunzong/p/8708057.html)
- [Hive学习之路 (三)Hive元数据信息对应MySQL数据库表](https://www.cnblogs.com/qingyunzong/p/8710356.html)
- [Hive学习之路 (四)Hive的连接3种连接方式](https://www.cnblogs.com/qingyunzong/p/8715925.html)
- [Hive学习之路 (五)DbVisualizer配置连接hive](https://www.cnblogs.com/qingyunzong/p/8715250.html)
- [Hive学习之路 (六)Hive SQL之数据类型和存储格式](https://www.cnblogs.com/qingyunzong/p/8733924.html)
- [Hive学习之路 (七)Hive的DDL操作](https://www.cnblogs.com/qingyunzong/p/8723271.html)
- [Hive学习之路 (八)Hive中文乱码](https://www.cnblogs.com/qingyunzong/p/8724155.html)
- [Hive学习之路 (九)Hive的内置函数](https://www.cnblogs.com/qingyunzong/p/8744593.html)
- [Hive学习之路 (十)Hive的高级操作](https://www.cnblogs.com/qingyunzong/p/8746159.html)
- [Hive学习之路 (十一)Hive的5个面试题](https://www.cnblogs.com/qingyunzong/p/8747656.html)
- [Hive学习之路 (十二)Hive SQL练习之影评案例](https://www.cnblogs.com/qingyunzong/p/8727264.html)
- [Hive学习之路 (十三)Hive分析窗口函数(一) SUM,AVG,MIN,MAX](https://www.cnblogs.com/qingyunzong/p/8782794.html)
- [Hive学习之路 (十四)Hive分析窗口函数(二) NTILE,ROW_NUMBER,RANK,DENSE_RANK](https://www.cnblogs.com/qingyunzong/p/8798102.html)
- [Hive学习之路 (十五)Hive分析窗口函数(三) CUME_DIST和PERCENT_RANK](https://www.cnblogs.com/qingyunzong/p/8798382.html)
- [Hive学习之路 (十六)Hive分析窗口函数(四) 
LAG、LEAD、FIRST_VALUE和LAST_VALUE](https://www.cnblogs.com/qingyunzong/p/8798606.html)
- [Hive学习之路 (十七)Hive分析窗口函数(五) GROUPING SETS、GROUPING__ID、CUBE和ROLLUP](https://www.cnblogs.com/qingyunzong/p/8798987.html)
- [Hive学习之路 (十八)Hive的Shell操作](https://www.cnblogs.com/qingyunzong/p/8847532.html)
- [Hive学习之路 (十九)Hive的数据倾斜](https://www.cnblogs.com/qingyunzong/p/8847597.html)
- [Hive学习之路 (二十)Hive 执行过程实例分析](https://www.cnblogs.com/qingyunzong/p/8847651.html)
- [Hive学习之路 (二十一)Hive 优化策略](https://www.cnblogs.com/qingyunzong/p/8847775.html)

================================================
FILE: columns/kudu/Kudu原理论文.md
================================================
# Kudu Internals
- [Apache Kudu Read & Write Paths](https://blog.cloudera.com/apache-kudu-read-write-paths/) 2017-04
- [Kudu存储原理](https://github.com/collabH/repository/blob/master/bigdata/olap/kudu/Kudu%E5%8E%9F%E7%90%86%E5%88%86%E6%9E%90.md)

# Kudu-Related Papers
- [LSM Tree](https://www.cs.umb.edu/~poneil/lsmtree.pdf)
- [Kudu论文解读: Fast Analytics on Fast Data (上)](https://zhuanlan.zhihu.com/p/137238298) 2020-04
- [Kudu论文解读: Fast Analytics on Fast Data (下)](https://zhuanlan.zhihu.com/p/137243163) 2020-04

================================================
FILE: columns/kudu/网易云Kudu技术文章.md
================================================
# NetEase Cloud Kudu Technical Articles
- [【大数据之数据仓库】选型流水记](https://sq.sf.163.com/blog/article/174995941069086720) 2018-07
- [【大数据之数据仓库】kudu客户端java驱动缺陷](https://sq.sf.163.com/blog/article/169595475122905088) 2018-06
- [【大数据之数据仓库】kudu性能测试报告分析](https://sq.sf.163.com/blog/article/174995336187535360) 2018-07
- [分布式存储系统 Kudu 与 HBase 的简要分析与对比](https://sq.163yun.com/blog/article/198870236065431552) 2018-11
- [【kudu pk parquet】runtime filter实践](https://sq.sf.163.com/blog/article/174993565549518848) 2018-07
- [【kudu pk parquet】TPC-H Query2对比解析](https://sq.sf.163.com/blog/article/175000124925075456) 2018-07

================================================
FILE: columns/opensource/数仓相关开源项目汇总.md
================================================
# Data Warehouse Open-Source Projects

## Metadata and data governance
- [atlas](https://github.com/apache/atlas)
- [datahub](https://github.com/linkedin/datahub)

## Data integration
- [DataX](https://github.com/alibaba/DataX)
- [datax-web](https://github.com/WeiYe-Jing/datax-web)

## Data computation
- [streamx](https://github.com/streamxhub/streamx)
- [plink](https://github.com/hairless/plink) Platform for Flink
- [FlinkSQL](https://github.com/ambition119/FlinkSQL)
- [flinkStreamSQL](https://github.com/DTStack/flinkStreamSQL)
- [waterdrop](https://github.com/InterestingLab/waterdrop)

## Scheduling
- [dolphinscheduler](https://github.com/apache/dolphinscheduler)

## Development platforms and other
- [davinci](https://github.com/edp963/davinci)
- [DataSphereStudio](https://github.com/WeBankFinTech/DataSphereStudio) WeBank
- [wormhole](https://github.com/edp963/wormhole) CreditEase
- [big-whale](https://github.com/MeetYouDevs/big-whale)
- [lark](https://github.com/wxgzgl/lark)

================================================
FILE: columns/presto/Presto最佳实践、调优、踩坑专栏.md
================================================
# Presto Best Practices, Tuning, and Pitfalls

## 1. Best practices
- [Presto的ETL之路](https://zhuanlan.zhihu.com/p/53996153) 2019-01
- [Presto的应用场景与企业案例](https://zhuanlan.zhihu.com/p/260653669) 2020-10

### 1.1 Technology selection
- [PrestoDB VS PrestoSQL发展比较](https://zhuanlan.zhihu.com/p/87621360) 2019-10
- [PrestoDB和PrestoSQL比较及选择](http://armsword.com/2020/05/02/the-difference-between-prestodb-and-prestosql/) 2020-05

### 1.2 Practice at major companies
- [Presto在B站的实践](https://www.bilibili.com/read/cv16043517) 2022-04
- [Presto 在字节跳动的内部实践与优化(优化篇)](https://xie.infoq.cn/article/061bb0935a8575e01ea243852) 2021-12
- [Presto at Tencent at Scale - pdf](https://static.sched.com/hosted_files/prestocon2021/ed/Presto%20at%20Tencent%20at%20Scale%20%281%29.pdf) 2021-12
- [Presto在车好多的实践](https://mp.weixin.qq.com/s/Bmqv54sVZgTqQ82I_RfmsA) 2020-12
- [Presto在滴滴的探索与实践](https://zhuanlan.zhihu.com/p/266162270) 2020-10
- [Presto 在有赞的实践之路](https://tech.youzan.com/presto-zai-you-zan-de-shi-jian-zhi-lu/) 2020-04
- 
[PrestoCon 2020:云原生数据湖分析DLA的Presto实践](https://zhuanlan.zhihu.com/p/260784762) 2020-03
- [携程 Presto 技术演进之路](https://zhuanlan.zhihu.com/p/41538472) 2018-08
- [Presto实现原理和美团的使用实践](https://tech.meituan.com/2014/06/16/presto.html) 2014-06
- [阿里数据湖 Presto分析算力隔离技术剖析](https://mp.weixin.qq.com/s/lV_nzLI6_Ott7Abyaik_bw)

## 2. Performance tuning
- [Presto性能调优的五大技巧](https://zhuanlan.zhihu.com/p/162809568) 2020-07
- [Presto内存管理原理和调优](http://armsword.com/2018/05/22/the-memory-management-and-tuning-experience-of-presto/) 2018-05
- [Presto内存管理相关参数设置](http://armsword.com/2019/11/13/the-configuration-settings-of-presto-memory-management/) 2019-11
- [Presto集群内存不足时保护机制](http://armsword.com/2020/02/18/presto-memory-kill-policy/) 2020-02
- [火焰图在Presto YGC优化中的应用](https://mp.weixin.qq.com/s/BZG7Av5f9HH9gueVF8ABvQ) 2020-03
- [使用火焰图定位 OLAP 引擎瓶颈](https://mp.weixin.qq.com/s/pIYdeF0TtbGgV0Va35ejQg) 2021-03
- [How to Make The Presto Query Engine Run Fastest](https://ahana.io/learn/presto/making-the-presto-query-engine-run-faster/)

## 3. Troubleshooting (pitfalls)
- [说下那些导致Presto查询变慢的JVM Bug和解决方法](http://armsword.com/2021/02/07/jvm-bug-causes-Presto-queries-to-slow-down/) 2021-02
- [Presto Master JVM Core问题调研](http://armsword.com/2020/12/10/solve-presto-jvm-coredump/) 2020-12
- [Jetty导致Presto堆外内存泄露的排查过程](http://armsword.com/2020/06/23/jetty-cause-presto-memory-leak/) 2020-06
- [记一次Presto Worker OOM的查找过程](http://armsword.com/2020/06/03/the-solution-of-presto-oom-caused-by-orc-statistics/) 2020-06
- [Presto System load过高问题调研](http://armsword.com/2019/09/18/solve-presto-system-load-too-high/) 2019-09
- [一次 Presto 的连接数超限的问题定位](https://zhuanlan.zhihu.com/p/57956341) 2019-03
- [Presto Codegen问题排查案例](https://zhuanlan.zhihu.com/p/66243773) 2019-05
- [Presto coordinator的CPU持续上涨,原因竟然是这样](https://mayunlei.github.io/2019/05/20/Presto-coordinator%E7%9A%84CPU%E6%8C%81%E7%BB%AD%E4%B8%8A%E6%B6%A8%EF%BC%8C%E5%8E%9F%E5%9B%A0%E7%AB%9F%E7%84%B6%E6%98%AF%E8%BF%99%E6%A0%B7/) 2019-05
- 
[Presto内存泄露问题调查](https://mayunlei.github.io/2019/09/02/Presto%E5%86%85%E5%AD%98%E6%B3%84%E9%9C%B2%E9%97%AE%E9%A2%98%E8%B0%83%E6%9F%A5/) 2019-09

================================================
FILE: columns/presto/Presto架构、源码分析专栏.md
================================================
# Presto Architecture and Source Code Analysis

## 1. Principles and architecture
- [Presto概述:特性、原理、架构](https://zhuanlan.zhihu.com/p/260399749) 2020-10
- [分布式SQL查询引擎Presto原理介绍](http://armsword.com/2017/12/05/presto/) 2017-12
- [深入理解Presto](https://zhuanlan.zhihu.com/p/101366898) 2020-01
- [分布式SQL查询引擎原理(以Presto SQL为例)](https://zhuanlan.zhihu.com/p/293775390) 2020-11
- [深入理解Presto,Presto的内部架构](https://mayunlei.github.io/2020/08/16/%E6%B7%B1%E5%85%A5%E7%90%86%E8%A7%A3Presto-Presto%E7%9A%84%E5%86%85%E9%83%A8%E6%9E%B6%E6%9E%84/) 2020-08
- [Presto 分布式SQL查询引擎及原理分析](https://mp.weixin.qq.com/s?__biz=MzI5MDEzMzg5Nw==&mid=2660400264&idx=1&sn=ebff65980ef45f7dffea1e5ec7d51fdc&chksm=f7425e6ec035d778dcc5704babe5241d8c80f3d21059434b00d8d4c46d9ce0bd232467ec92a6&scene=21#wechat_redirect) 2020-05

## 2. Source code analysis

### 2.1 Preparation
- [如何快速掌握Presto源码:思路和经验](https://zhuanlan.zhihu.com/p/262236892) 2020-10
- [Presto 源码阅读: Overview](https://zhuanlan.zhihu.com/p/51393518) 2018-12
- [Presto的一些基本概念](http://armsword.com/2018/08/11/the-basic-concepts-of-presto/) 2018-08
- [Presto/Trino权威指南及官方设计文档解读](https://www.jianshu.com/p/d3600d2a115d) 2021-05

### 2.2 Data types, Query Execution Model
- [Presto类型系统初探](https://zhuanlan.zhihu.com/p/55299409) 2019-01
- [Presto源码分析之数据类型](https://zhuanlan.zhihu.com/p/52713533) 2018-12
- [Presto Core Data Structures: Slice, Block & Page](https://zhuanlan.zhihu.com/p/60813087) 2019-03
- [Presto源码分析之Slice](https://zhuanlan.zhihu.com/p/52735465) 2018-12
- [Presto Driver,Split and Pipeline](https://www.lewuathe.com/presto-driver,split-and-pipeline.html) 2017-05

### 2.3 SQL parsing, execution plan generation and optimization
- [Presto 源码分析:Coordinator 篇](https://www.infoq.cn/article/VNe0A9yKszPCmp32akCa) 2019-12
- [Presto SQL Parser源码分析](https://zhuanlan.zhihu.com/p/57438825) 2019-02
- [Presto 
源码阅读:Optimizers](https://zhuanlan.zhihu.com/p/52154130) 2019-01
- [Presto逻辑执行计划生成](https://zhuanlan.zhihu.com/p/57395047) 2019-02
- [Presto源码分析之IterativeOptimizer](https://zhuanlan.zhihu.com/p/52879375) 2018-12
- [Presto源码分析之模式匹配](https://zhuanlan.zhihu.com/p/52916774) 2018-12
- [Presto技术源码解析总结-一个SQL的奇幻之旅 上](https://www.jianshu.com/p/3fccfa82e1ec) 2019-04
- [Presto技术源码解析总结-一个SQL的奇幻之旅 下](https://www.jianshu.com/p/d8a3d7488358) 2019-04
- [Presto查询执行过程和索引条件下推分析](https://mp.weixin.qq.com/s?src=11&timestamp=1616394200&ver=2961&signature=E7fzfl-wO5wGpohLLkE8v9hRKn5GR1TbVwU-N6Hl11T0Xl6TtlgCbhJmisPs*Z-hYiprO0yYK91O5GR0m-V-s5kvv6NudfeWMGW4iPXdAdetAfDAo4EITB9l*yZajiJS&new=1) 2020-05

### 2.4 Distributed task scheduling, split generation and scheduling strategy, worker selection strategy
- [Presto运行时浅析](https://zhuanlan.zhihu.com/p/345733460) 2021-01
- [Presto源码阅读——如何获取Hive中的Metadata(HMS+HDFS)](https://blog.csdn.net/huang_quanlong/article/details/80380474) 2018-07
- [Presto如何构建和使用海量Hive Splits](https://zhuanlan.zhihu.com/p/344559757) 2021-01
- [Presto之Task执行框架](https://zhuanlan.zhihu.com/p/54172313) 2019-01
- [Presto 是如何 schedule task 的?](https://zhuanlan.zhihu.com/p/58959725) 2019-03
- [Presto 由Stage到Task的旅程](https://zhuanlan.zhihu.com/p/55785284) 2019-01
- [Presto调度task选择Worker方法](http://armsword.com/2020/04/08/presto-scheduling-task/) 2020-04
- [presto中的AllAtOnce与Phased](https://zhuanlan.zhihu.com/p/61656233) 2019-05
- [Presto 任务调度: 任务分配到哪里](https://mayunlei.github.io/2020/05/30/Presto-%E4%BB%BB%E5%8A%A1%E8%B0%83%E5%BA%A6%EF%BC%9A-%E4%BB%BB%E5%8A%A1%E5%88%86%E9%85%8D%E5%88%B0%E5%93%AA%E9%87%8C/) 2020-05
- [Presto Split 详解](https://blog.csdn.net/zhanyuanlin/article/details/109215177)

### 2.5 Analysis of common Operators, underlying implementation of common SQL
- [Window函数与WindowOperator源码解析](https://zhuanlan.zhihu.com/p/59550902) 2019-03
- [Presto中coalesce函数的实现与Expression Codegen](https://zhuanlan.zhihu.com/p/64131496) 2019-04
- [Presto Limit 类算子分析](https://zhuanlan.zhihu.com/p/62448395) 2019-04
- [Presto分页功能概述](https://zhuanlan.zhihu.com/p/57030465) 2019-02
#### join, shuffle
- [Presto 
数据如何进行shuffle](https://zhuanlan.zhihu.com/p/61565957) 2019-04
- [Presto中的Hash Join](https://zhuanlan.zhihu.com/p/54731892) 2019-03

#### Group-By Aggregation
- [Presto中的分组聚合查询流程](https://zhuanlan.zhihu.com/p/54385845) 2019-01
- [深入理解Presto中的Group By查询](https://zhuanlan.zhihu.com/p/67742519) 2019-09

### 2.6 Functions and UDFs

### 2.7 Connector Mechanism and Common Connectors
- [ORC & Presto](https://zhuanlan.zhihu.com/p/110013789) 2020-02
- [Presto ORC及其性能优化](http://armsword.com/2019/09/30/presto-orc-and-performance-optimization/) 2019-09
- [Presto Hive MetaStore相关代码分析](https://zhuanlan.zhihu.com/p/109033118) 2020-02
- [Presto Connector之SystemTable](https://zhuanlan.zhihu.com/p/60934739) 2019-03
- [如何让Presto可以连接Hbase?文中含Hbase-Connect开发详解](https://www.analysys.cn/article/detail/20019023) 2018-11

### 2.8 Miscellaneous
- [Presto源码分析之TupleDomain](https://zhuanlan.zhihu.com/p/53113638) 2018-12
- [Presto的缓存机制](https://zhuanlan.zhihu.com/p/196398077) 2020-08
- [Presto Caching](https://zhuanlan.zhihu.com/p/147769024) 2020-06
- [Presto Codegen简介与优化尝试](https://zhuanlan.zhihu.com/p/53469238) 2018-12
- [Presto Procedure](https://zhuanlan.zhihu.com/p/59159147) 2019-03
- [How is data inserted into Presto?](https://zhuanlan.zhihu.com/p/59846328) 2019-03
- [Presto兼容Hive SQL的一些改造工作](http://armsword.com/2019/03/31/presto-compatible-hive-syntax/) 2019-03
- [Presto Coordinator分布式改造](https://mayunlei.github.io/2019/11/26/Presto-Coordinator%E5%88%86%E5%B8%83%E5%BC%8F%E6%94%B9%E9%80%A0/) 2019-11
- [Visualize Execution Plan in Presto](https://www.lewuathe.com/visualize-execution-plan-in-presto.html) 2019-09
- [Presto兼容Hive隐式类型转换](https://mp.weixin.qq.com/s/1hn3nVBdBtBeiPl3wxvHfQ) 2021-02
- [Presto 标量函数注册和调用过程简述](https://mp.weixin.qq.com/s/vd65OVeIOH7YFQ0QOAmsUg) 2020-09
- [Presto 函数实现简述](https://mp.weixin.qq.com/s/1Z_qik61N3hKwWqG8QR69w) 2020-07
- [Improved Hive Bucketing](https://trino.io/blog/2019/05/29/improved-hive-bucketing.html)

## 3. Related Papers
- [Official paper: "Presto: SQL on Everything"](https://trino.io/Presto_SQL_on_Everything.pdf) 
[Chinese translation](https://www.jianshu.com/p/de0a1de9f26e)
- [《F1 Query: Declarative Querying at Scale》读后感](https://zhuanlan.zhihu.com/p/53299556) 2018-12
- [《Column-Stores vs. Row-Stores》读后感](https://zhuanlan.zhihu.com/p/54433448) 2019-01 (abei, Zhihu)
- [读后感之《Column-Stores vs. Row-Stores》](https://zhuanlan.zhihu.com/p/54484592) 2019-01 (萌豆, Zhihu)
- [Wander Join:Online Aggregation via Random Walks读后感](https://zhuanlan.zhihu.com/p/55050773) 2020-03
- [《The Snowflake Elastic Data Warehouse》读后感](https://zhuanlan.zhihu.com/p/55577067) 2019-01

================================================
FILE: columns/presto/Presto资料汇总、会议资讯专栏.md
================================================
# Presto Resources, Conferences and News Column

## 1. Official Sites and Technical Blogs

### 1.1 Official Sites
- [PrestoDB official site](https://prestodb.io/)
- [Trino official site](https://trino.io/) (formerly PrestoSQL)
- [PrestoDB Blog](https://prestodb.io/blog/index.html)
- [Trino Blog](https://trino.io/blog/)
- [PrestoDB GitHub](https://github.com/prestodb/presto)
- [Trino GitHub](https://github.com/trinodb/trino)

### 1.2 Discussion (Groups, WeChat Official Accounts, etc.)
- [Google Presto Group](https://groups.google.com/g/presto-users)
- [PrestoDB Slack](https://prestodb.slack.com)
- [Trino Slack](https://trinodb.slack.com)
- WeChat official account: Presto News
- WeChat official account: FFCompute

### 1.3 Technical Blogs
- [Presto Zhihu column](https://www.zhihu.com/column/presto-cn)
- [若飞 technical blog](http://armsword.com/archives/)

## 2. Books
- [《Presto: The Definitive Guide》](https://trino.io/blog/2020/04/11/the-definitive-guide.html)
- [《Presto技术内幕》](https://book.douban.com/subject/26855863/) by the JD.com Presto team

## 3. Conferences and News

### 3.1 Conferences
- [Presto Meetup Oct 2019](https://zhuanlan.zhihu.com/p/88350254) 2019-10
- [PrestoCon 2020](https://prestocon2020.sched.com/)
- [PrestoCon 2021](https://prestocon2021.sched.com/)
- [PrestoCon 2022](https://prestocon2022.sched.com/)

### 3.2 News
- [惊闻Facebook开源大数据引擎Presto团队正在分裂](https://zhuanlan.zhihu.com/p/55628236) 2019-01
- [与 Facebook 分手后 ,PrestoSQL 再度因商标侵权被迫更名](https://www.infoq.cn/article/WmH0WXhqsWqpHDm6PpjC) 2021-01

================================================
FILE: 
columns/spark/Apache Spark的设计与实现.md
================================================
# The Design and Implementation of Apache Spark

> Spark Version: 1.0.2
> Doc Version: 1.0.2.0

- [Introduction](https://spark-internals.books.yourtion.com/index.html)
- [Overview](https://spark-internals.books.yourtion.com/markdown/1-Overview.html)
- [Job logical plan](https://spark-internals.books.yourtion.com/markdown/2-JobLogicalPlan.html)
- [Job physical plan](https://spark-internals.books.yourtion.com/markdown/3-JobPhysicalPlan.html)
- [Shuffle details](https://spark-internals.books.yourtion.com/markdown/4-shuffleDetails.html)
- [Architecture](https://spark-internals.books.yourtion.com/markdown/5-Architecture.html)
- [Cache and Checkpoint](https://spark-internals.books.yourtion.com/markdown/6-CacheAndCheckpoint.html)
- [Broadcast](https://spark-internals.books.yourtion.com/markdown/7-Broadcast.html)
- [SparkInternals - GitHub](https://github.com/JerryLead/SparkInternals)

================================================
FILE: columns/starrocks/StarRocks技术内幕.md
================================================
# StarRocks Internals

- [多表物化视图的设计与实现](https://blog.csdn.net/StarRocks/article/details/127863764) 2022-11